Agent attempts to take snapshots too quickly and for too long if s3 is misconfigured #49

Open
davidnuzik opened this issue Aug 19, 2021 · 0 comments

Summary:
The rancher-system-agent service takes many snapshots in a very short timeframe (seconds) if S3 is misconfigured and the upload to the S3 bucket fails. After this initial burst, it keeps retrying, seemingly indefinitely, at roughly five-minute intervals. The details outlined below provide more information and context.

Environment:
Rancher version: v2.6-head 2778ecaba pulled at 8/19/21 12:51 pm Pacific
Rancher cluster type: single-node docker
Docker version: 20.10

Downstream cluster type: rke2
Downstream K8s version: v1.21.3-rc6+rke2r2

Steps to Reproduce:

  1. In Rancher, create a 1-node RKE2 cluster with a mostly valid S3 config -- correct keys, region, etc. -- but supply a fake bucket name instead of a real one. Trigger a snapshot from the Rancher UI; it will be taken. The rancher-system-agent logs show the S3 upload failing, yet the agent still takes snapshots locally, and for some reason it does so many times in a very short period.
  2. Watch the /var/lib/rancher/rke2/server/db/snapshots directory on the node (see the sketch after this list). When you trigger the snapshot, many snapshot files appear within seconds of each other.
  3. If you simultaneously watch the rancher-system-agent systemd logs, you'll see it loop very quickly over taking all these snapshots. After that initial burst it seems to back off and keep retrying, but more slowly.
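To make the rapid snapshot creation in step 2 easier to observe, here is a minimal sketch -- my own tooling, not part of Rancher or rke2 -- that watches the snapshots directory for new files using the fsnotify library:

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Directory where rke2 writes local etcd snapshots.
	if err := watcher.Add("/var/lib/rancher/rke2/server/db/snapshots"); err != nil {
		log.Fatal(err)
	}

	// Print one line per snapshot file as it appears.
	for event := range watcher.Events {
		if event.Op&fsnotify.Create != 0 {
			log.Printf("new snapshot file: %s", event.Name)
		}
	}
}

Running this while triggering the snapshot from the UI prints a burst of lines within seconds, matching the directory listing shown below.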

Expected Result:
I expected the rancher-system-agent to retry taking a snapshot when it fails to upload to my S3 bucket, but not to retake snapshots so aggressively in such a short loop.

Actual Result:
The rancher-system-agent takes many snapshots in a very short period of time, then continues more slowly but seemingly indefinitely.

Consider this directory listing of /var/lib/rancher/rke2/server/db/snapshots on the node:

root@dave-s3improper-pool1-gzrql:/var/lib/rancher/rke2/server/db/snapshots# ls -laht --time-style=full
total 159M
drwx------ 2 root root 4.0K 2021-08-19 20:20:34.348504465 +0000 .
-rw------- 1 root root  11M 2021-08-19 20:20:34.332504123 +0000 on-demand-dave-s3improper-pool1-gzrql-1629404434
-rw------- 1 root root  11M 2021-08-19 20:15:05.827968467 +0000 on-demand-dave-s3improper-pool1-gzrql-1629404105
-rw------- 1 root root  11M 2021-08-19 20:12:21.249479282 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403941
-rw------- 1 root root  11M 2021-08-19 20:10:58.608063479 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403858
-rw------- 1 root root  11M 2021-08-19 20:10:16.851263319 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403816
-rw------- 1 root root  11M 2021-08-19 20:09:55.614824901 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403795
-rw------- 1 root root  11M 2021-08-19 20:09:44.606587547 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403784
-rw------- 1 root root  11M 2021-08-19 20:09:38.754458268 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403778
-rw------- 1 root root  11M 2021-08-19 20:09:37.478429777 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403777
-rw------- 1 root root  11M 2021-08-19 20:09:36.766413835 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403776
-rw------- 1 root root 9.3M 2021-08-19 20:09:36.022397144 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403775
-rw------- 1 root root 9.3M 2021-08-19 20:09:34.490362635 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403774
-rw------- 1 root root 9.3M 2021-08-19 20:09:33.738345643 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403773
-rw------- 1 root root 9.2M 2021-08-19 20:09:33.030329615 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403772
-rw------- 1 root root 9.2M 2021-08-19 20:09:31.526295427 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403771
-rw------- 1 root root 9.2M 2021-08-19 20:09:30.750277730 +0000 on-demand-dave-s3improper-pool1-gzrql-1629403770
-rw------- 1 root root 2.1M 2021-08-19 20:00:00.034442671 +0000 etcd-snapshot-dave-s3improper-pool1-gzrql-1629403200
drwx------ 4 root root 4.0K 2021-08-19 19:59:35.194204416 +0000 ..

Notice that initially the on-demand snapshot was taken about 12 times within a span of seconds. After that it seems to retry approximately every 5 minutes, which seems more reasonable. Ideally it would retry only about every 5 minutes right from the start, and fail / give up after perhaps a few tries (and log an error).
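
For contrast, here is a minimal sketch of the kind of bounded retry I would expect, using the exponential-backoff helper from k8s.io/apimachinery. This is my choice of helper, not necessarily what the agent uses, and takeSnapshotAndUpload is a hypothetical stand-in for the real snapshot-and-upload step:

package main

import (
	"errors"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// takeSnapshotAndUpload is a hypothetical stand-in for taking a local
// etcd snapshot and uploading it to S3.
func takeSnapshotAndUpload() error {
	return errors.New("bucket: dave-fakebucket does not exist")
}

func main() {
	backoff := wait.Backoff{
		Duration: 5 * time.Minute, // wait ~5 minutes between attempts from the start
		Factor:   1.0,             // keep the interval flat; >1.0 would grow it
		Steps:    3,               // give up after a few tries
	}
	err := wait.ExponentialBackoff(backoff, func() (done bool, err error) {
		if uploadErr := takeSnapshotAndUpload(); uploadErr != nil {
			log.Printf("snapshot upload failed, will retry: %v", uploadErr)
			return false, nil // not done; retry after the backoff interval
		}
		return true, nil // success
	})
	if err != nil {
		log.Printf("giving up on S3 snapshot upload: %v", err)
	}
}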

Looking at the rancher-system-agent logs around this timeframe, we can see that it was indeed taking all these snapshots in a rapid loop at first. A snippet is shown below; this pattern repeats rapidly (see full logs attached below).

Aug 19 20:09:25 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:25Z" level=debug msg="[K8s] (kube-apiserver) writing probe status to map"
Aug 19 20:09:25 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:25Z" level=debug msg="[K8s] secret data/string-data did not change, not updating secret"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Processing secret dave-s3improper-bootstrap-template-kqddg-machine-plan in namespace fleet-default at generation 0 with resource version 8094"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Calculated checksum to be 9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Remote plan had an applied checksum value of 134aeacac334133ae2d2616b36215edfa2cff4e1d74cf8d83f4860bb5ad27f79"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="[K8s] Calling Applyinator to apply the plan"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="Applying plan with checksum 9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Writing applied calculated plan contents to historical plan directory /var/lib/rancher/agent/applied"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="reconciling file permissions for /var/lib/rancher/agent/applied/20210819-200930 to 0:0 384"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Writing file /etc/rancher/agent/registries.yaml"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="requested file permission for /etc/rancher/agent/registries.yaml was , defaulting to 384"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="file /etc/rancher/agent/registries.yaml does not need to be written"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="reconciling file permissions for /etc/rancher/agent/registries.yaml to 0:0 384"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Cleaning working directory before applying /var/lib/rancher/agent/work"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="Executing instruction 0 for plan 9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/agent/work/20210819-200930-applied.plan/9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd_0"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="requested file permission for /var/lib/rancher/agent/work/20210819-200930-applied.plan/9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd_0 was , defaulting to 493"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=debug msg="reconciling file permissions for /var/lib/rancher/agent/work/20210819-200930-applied.plan/9bbf9da0f14780d4c8e310656242120c83cea58208e1a4b5b6d4f7ead84d50dd_0 to 0:0 493"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="Running command: rke2 [etcd-snapshot --node-name=dave-s3improper-pool1-gzrql --s3 --s3-bucket=dave-fakebucket --s3-access-key=REDACTED_AWS_KEY --s3-region=us-east-2 --s3-folder=dave-fakefolder --s3-endpoint=s3.us-east-2.amazonaws.com --s3-skip-ssl-verify]"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Managed etcd cluster bootstrap already complete and initialized\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: W0819 20:09:30.620735   15143 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: W0819 20:09:30.646278   15143 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: W0819 20:09:30.670290   15143 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682065   15143 client.go:360] parsed scheme: \"endpoint\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682109   15143 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682211   15143 client.go:360] parsed scheme: \"passthrough\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682247   15143 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://127.0.0.1:2379  <nil> 0 <nil>}] <nil> <nil>}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682268   15143 clientconn.go:948] ClientConn switching balancer to \"pick_first\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.682314   15143 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-dave-s3improper-pool1-gzrql-1629403770\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.686770   15143 client.go:360] parsed scheme: \"endpoint\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: I0819 20:09:30.686792   15143 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"created temporary db file\",\"path\":\"/var/lib/rancher/rke2/server/db/snapshots/on-demand-dave-s3improper-pool1-gzrql-1629403770.part\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: {\"level\":\"info\",\"ts\":\"2021-08-19T20:09:30.689Z\",\"caller\":\"clientv3/maintenance.go:200\",\"msg\":\"opened snapshot stream; downloading\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"fetching snapshot\",\"endpoint\":\"https://127.0.0.1:2379\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: {\"level\":\"info\",\"ts\":\"2021-08-19T20:09:30.753Z\",\"caller\":\"clientv3/maintenance.go:208\",\"msg\":\"completed snapshot read; closing\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"fetched snapshot\",\"endpoint\":\"https://127.0.0.1:2379\",\"size\":\"9.6 MB\",\"took\":\"77.981212ms\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stdout]: {\"level\":\"info\",\"msg\":\"saved\",\"path\":\"/var/lib/rancher/rke2/server/db/snapshots/on-demand-dave-s3improper-pool1-gzrql-1629403770\"}"
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Saving etcd snapshot on-demand-dave-s3improper-pool1-gzrql-1629403770 to S3\""
Aug 19 20:09:30 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:30Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:30Z\" level=info msg=\"Checking if S3 bucket dave-fakebucket exists\""
Aug 19 20:09:31 dave-s3improper-pool1-gzrql rancher-system-agent[3167]: time="2021-08-19T20:09:31Z" level=info msg="[stderr]: time=\"2021-08-19T20:09:31Z\" level=fatal msg=\"bucket: dave-fakebucket does not exist\""

The logs loop quickly over entries like this while snapshots keep accumulating on disk.

FULL rancher-system-agent logs here:
rancher-system-agent.log

Desired Result:
In an ideal world, RKE2 (or perhaps the rancher-system-agent?) would validate that the provided credentials are valid and that the bucket actually exists before attempting the upload. If that validation fails, it should not attempt to upload at all, but instead snapshot locally and emit an error in the logs.
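
For illustration, here is a minimal sketch of such a pre-flight check using the minio-go client. As far as I can tell rke2 uses minio-go for its S3 uploads, but the exact integration point is an assumption on my part, and the endpoint, bucket, and credential values are just the ones from this repro:

package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("s3.us-east-2.amazonaws.com", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatalf("invalid S3 client config: %v", err)
	}

	// Fail fast: verify the credentials work and the bucket exists
	// before any snapshot/upload loop is started.
	exists, err := client.BucketExists(context.Background(), "dave-fakebucket")
	if err != nil {
		log.Fatalf("S3 configuration check failed: %v", err)
	}
	if !exists {
		log.Fatal("S3 bucket does not exist; keep the snapshot local and log an error instead of retrying the upload")
	}
	log.Println("S3 configuration looks valid; proceeding with upload")
}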
