
[BUG] When a large number of pods created with a JuiceFS StorageClass are deleted around the same time, juicefs-csi-controller-0 crashes #883

Open
huangguoqiang opened this issue Feb 29, 2024 · 4 comments
Labels
kind/bug Something isn't working

Comments

@huangguoqiang

What happened:

volume for claim "senseauto-prod/df-30276-batch-82015-gh22t-juicefs-pv"
I0229 02:12:42.613307 1 controller.go:562] CreateVolumeRequest {Name:pvc-b31a44e3-13e3-4738-bd85-bb66453cb7bf CapacityRange:required_bytes:322122547200 VolumeCapabilities:[mount:<fs_type:"ext4" mount_flags:"cache-size=0" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > ] Parameters:map[csi.storage.k8s.io/controller-expand-secret-name:juicefs-sc-secret csi.storage.k8s.io/controller-expand-secret-namespace:juicefs csi.storage.k8s.io/node-publish-secret-name:juicefs-sc-secret csi.storage.k8s.io/node-publish-secret-namespace:juicefs csi.storage.k8s.io/provisioner-secret-name:juicefs-sc-secret csi.storage.k8s.io/provisioner-secret-namespace:juicefs juicefs/mount-cpu-limit:5000m juicefs/mount-cpu-request:1000m juicefs/mount-image:juicedata/mount:ce-v1.1.1 juicefs/mount-memory-limit:32Gi juicefs/mount-memory-request:1Gi] Secrets:map[] VolumeContentSource: AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0229 02:12:42.811373 1 request.go:565] Throttling request took 12.391808769s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-51291945-deab-4050-9dc8-02885613300b
I0229 02:12:43.011378 1 request.go:565] Throttling request took 12.591789695s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-14ee2521-2d7e-4f5f-9560-53f890f0e924
I0229 02:12:43.211369 1 request.go:565] Throttling request took 12.79177088s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-cb8bad6b-496f-47c8-a080-5894b8ea9402
I0229 02:12:43.411376 1 request.go:565] Throttling request took 12.991757135s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-defd0b4d-c1bb-448f-a38f-6baa4ee01f37
I0229 02:12:43.611363 1 request.go:565] Throttling request took 13.191731181s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-dec9d20d-e2c9-4465-9dc4-3ac6c6c2dfd0
I0229 02:12:43.811379 1 request.go:565] Throttling request took 13.391737676s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-7056fccb-dc36-4652-831c-dce277d8a148
I0229 02:12:44.011365 1 request.go:565] Throttling request took 13.591723711s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-c52611fe-88e3-4176-9110-1c397a56f161
I0229 02:12:44.211372 1 request.go:565] Throttling request took 13.791720147s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-e0cb880e-580e-43c1-a48a-6ee50919aac2
I0229 02:12:44.411375 1 request.go:565] Throttling request took 13.991711372s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-fa22183b-7814-4d3d-8208-df85d66978bc
I0229 02:12:44.611376 1 request.go:565] Throttling request took 14.191697747s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-83e71675-9edc-4a2f-aee3-be1263f72ff8
I0229 02:12:44.811372 1 request.go:565] Throttling request took 14.38727893s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:44.814678 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:44.814689 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-cdf0b1f4-6292-4b4e-b215-b3730a535af8"}
I0229 02:12:45.011381 1 request.go:565] Throttling request took 14.587279976s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:45.013416 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:45.013432 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-9111cf4e-516d-41de-9475-afd7ec7e7372"}
I0229 02:12:45.211412 1 request.go:565] Throttling request took 14.786592611s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:45.214167 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:45.214185 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-dcb330d4-b39f-4a40-ab3f-4f25858c5a30"}
I0229 02:12:45.222050 1 leaderelection.go:288] failed to renew lease juicefs/csi-juicefs-com: failed to tryAcquireOrRenew context deadline exceeded
F0229 02:12:45.222073 1 leader_election.go:169] stopped leading
juicefs-csi-controller-0_csi-provisioner.log
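
For readability, the StorageClass behind the CreateVolumeRequest above would look roughly like the sketch below, reconstructed from the Parameters map and mount flags in the log; the StorageClass name and reclaimPolicy are assumptions, everything else is taken from the log.

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc                 # assumed name
provisioner: csi.juicefs.com
reclaimPolicy: Delete              # assumed; PV deletion is what spawns the delvol jobs
parameters:
  csi.storage.k8s.io/provisioner-secret-name: juicefs-sc-secret
  csi.storage.k8s.io/provisioner-secret-namespace: juicefs
  csi.storage.k8s.io/node-publish-secret-name: juicefs-sc-secret
  csi.storage.k8s.io/node-publish-secret-namespace: juicefs
  csi.storage.k8s.io/controller-expand-secret-name: juicefs-sc-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: juicefs
  juicefs/mount-image: juicedata/mount:ce-v1.1.1
  juicefs/mount-cpu-request: 1000m
  juicefs/mount-cpu-limit: 5000m
  juicefs/mount-memory-request: 1Gi
  juicefs/mount-memory-limit: 32Gi
mountOptions:
  - cache-size=0
EOF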

Environment:

JuiceFS CSI Driver version (which image tag did your CSI Driver use): v0.23.2
Kubernetes version (e.g. kubectl version): v1.21.11+rke2r1
Object storage (cloud provider and region): Ceph
Metadata engine info (version, cloud provider managed or self maintained): Redis Sentinel

@zxh326
Member

zxh326 commented Feb 29, 2024

It looks like an issue with csi-provisioner.

Would you be willing to try using the juicefs provisioner to see if there are any problems in this case?

@huangguoqiang
Author

csi-provisioner

Setting --provisioner=true, will it affect the running mount pods?

@zxh326
Member

zxh326 commented Feb 29, 2024

csi-provisioner

Setting --provisioner=true, will it affect the running mount pods?

The provisioner only works when creating dynamic PVs.
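
For reference, a rough sketch of the switch on a kubectl-based install follows; the namespace, StatefulSet name, and container name are assumptions from a default install and should be checked against the actual deployment.

# Rough sketch, not an exact recipe: enable the controller's built-in provisioner.
kubectl -n kube-system edit statefulset juicefs-csi-controller
#   1. add --provisioner=true to the start arguments of the juicefs-plugin container (assumed name)
#   2. remove the external csi-provisioner sidecar container from the pod spec
# A Helm-based install should expose an equivalent switch in the chart values
# (something like controller.provisioner); check the chart's values.yaml for the exact key.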

@huangguoqiang
Author

huangguoqiang commented Feb 29, 2024

There is another symptom behind these failures: PV deletion fails, and after enough failed deletions pile up, the csi-provisioner's client-side throttling kicks in (e.g. "Throttling request took 14.786592611s"). To recover, the PVs stuck in the Released state have to be deleted manually. Should the PV deletion failure be solved first? @zxh326

======================
W0229 09:15:17.988118 1 controller.go:965] Retrying syncing volume "pvc-531990c9-fc9f-4401-a6e6-a06c5941e299", failure 1
E0229 09:15:17.988143 1 controller.go:983] error syncing volume "pvc-531990c9-fc9f-4401-a6e6-a06c5941e299": rpc error: code = Internal desc = Could not delVol in juicefs: waitUtilJobCompleted: job juicefs-eb886745-delvol isn't completed: 2024/02/29 09:14:38.614049 juicefs[7] : Data use s3://sh1984_juicefs_storage/adjfs/ [format.go:471]

(screenshot: WX20240229-185840)
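
For context, the manual recovery mentioned above amounts to something like the following; the PV names are placeholders, and clearing finalizers bypasses normal cleanup, so use it with care.

kubectl get pv | grep Released                    # list PVs stuck in the Released state
kubectl delete pv <pv-name>                       # retry deleting a stuck PV
# if the delete hangs on a finalizer, it can be cleared manually:
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'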

delete pod logs:

[root@SH-IDC1-10-198-5-32 ~]# kubectl logs -n juicefs juicefs-8ff6b8e3-delvol-8s7n4 -f
2024/02/29 11:19:33.940436 juicefs[7] : Meta address: redis://:@mymaster,redis-node-0.redis-headless.juicefs.svc.cluster.local,redis-node-1.redis-headless.juicefs.svc.cluster.local,redis-node-2.redis-headless.juicefs.svc.cluster.local:26379/2 [interface.go:497]
redis: 2024/02/29 11:19:33 sentinel.go:685: sentinel: new master="mymaster" addr="redis-node-1.redis-headless.juicefs.svc.cluster.local:6379"
2024/02/29 11:19:33.953854 juicefs[7] : Ping redis latency: 913.101µs [redis.go:3593]
2024/02/29 11:19:33.954954 juicefs[7] : Data use s3://sh1984_juicefs_storage/adjfs/ [format.go:471]
2024/02/29 11:19:34.009501 juicefs[7] : Volume is formatted as {
"Name": "adjfs",
"UUID": "cd03947d-b64e-4af0-b5ee-bfdb4b41f097",
"Storage": "s3",
"Bucket": "xxxx",
"AccessKey": "xxxx",
"SecretKey": "removed",
"BlockSize": 4096,
"Compression": "none",
"KeyEncrypted": true,
"TrashDays": 0,
"MetaVersion": 1
} [format.go:508]
2024/02/29 11:19:34.464557 juicefs[32] : Meta address: redis://:@mymaster,redis-node-0.redis-headless.juicefs.svc.cluster.local,redis-node-1.redis-headless.juicefs.svc.cluster.local,redis-node-2.redis-headless.juicefs.svc.cluster.local:26379/2 [interface.go:497]
redis: 2024/02/29 11:19:34 sentinel.go:685: sentinel: new master="mymaster" addr="redis-node-1.redis-headless.juicefs.svc.cluster.local:6379"
2024/02/29 11:19:34.470655 juicefs[32] : Ping redis latency: 786.09µs [redis.go:3593]
2024/02/29 11:19:34.472244 juicefs[32] : Data use s3://sh1984_juicefs_storage/adjfs/ [mount.go:605]
2024/02/29 11:19:34.472303 juicefs[32] : cache-size is 0, writeback and prefetch will be disabled [cached_store.go:540]
2024/02/29 11:19:35.475704 juicefs[32] : OK, adjfs is ready at /mnt/jfs [mount_unix.go:48]
