Remove watch timeout to allow call staggering #1296

Open
wants to merge 2 commits into
base: master
from


@michaely-cb michaely-cb commented Jun 12, 2024

The watch calls from multus were reconnecting to the API server every minute because of a one-minute timeout set on the rest config. Reconnecting every minute puts unnecessary load on the API server, and watches with a fixed timeout all reconnect on the same schedule instead of being staggered in time, so the load is not spread evenly. For watch calls, we should delegate reconnection entirely to client-go. Watches from other components (kubelet, kube-scheduler, cilium) already do this.

Reference: https://github.com/kubernetes/client-go/blob/03443e7ede0e50d195b8669103ce082e735c6b94/tools/cache/reflector.go#L52-L56
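To illustrate the staggering, here is a minimal stdlib-only sketch of how I understand the referenced reflector code to pick the `timeoutSeconds` request parameter: a random value between `minWatchTimeout` and twice that. The function names `jitteredTimeoutSeconds` and `effectiveWatchSeconds` are hypothetical, and the "whichever timeout fires first" model is a simplifying assumption that matches the logs below, not client-go's actual code path.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// minWatchTimeout mirrors the 5-minute lower bound described in the
// referenced reflector.go (assumption: value copied here for illustration).
const minWatchTimeout = 5 * time.Minute

// jitteredTimeoutSeconds maps a random fraction in [0, 1) to a watch
// timeout in [minWatchTimeout, 2*minWatchTimeout), expressed in seconds.
// Hypothetical helper; the fraction is a parameter so it can be tested.
func jitteredTimeoutSeconds(fraction float64) int64 {
	return int64(minWatchTimeout.Seconds() * (fraction + 1.0))
}

// effectiveWatchSeconds is a simplified model: the connection ends at
// whichever fires first, the client-side timeout (0 means none) or the
// server-side timeoutSeconds.
func effectiveWatchSeconds(clientTimeoutSeconds, serverTimeoutSeconds int64) int64 {
	if clientTimeoutSeconds > 0 && clientTimeoutSeconds < serverTimeoutSeconds {
		return clientTimeoutSeconds
	}
	return serverTimeoutSeconds
}

func main() {
	for i := 0; i < 3; i++ {
		server := jitteredTimeoutSeconds(rand.Float64())
		fmt.Printf("timeoutSeconds=%d  with 60s client timeout: %ds  without: %ds\n",
			server, effectiveWatchSeconds(60, server), effectiveWatchSeconds(0, server))
	}
}
```

Under this model, a 60-second client timeout always wins over the 300 to 600 second server-side value, so every watch reconnects on the same one-minute cadence and the jitter never takes effect.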

Pod watch:

// prior to this change
2024-06-07T17:49:38.929150Z -> 2024-06-07T17:50:38.929483Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1717689938-worker&resourceVersion=312906&timeout=8m14s&timeoutSeconds=494&watch=true -> 200
2024-06-07T17:50:38.929684Z -> 2024-06-07T17:51:38.930434Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1717689938-worker&resourceVersion=312906&timeout=9m12s&timeoutSeconds=552&watch=true -> 200

// with this change
2024-06-12T03:44:13.024297Z -> 2024-06-12T03:53:26.025634Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1718094202-worker&resourceVersion=219877&timeout=9m13s&timeoutSeconds=553&watch=true -> 200
2024-06-12T03:53:26.026164Z -> 2024-06-12T03:58:38.028134Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1718094202-worker&resourceVersion=219883&timeout=5m12s&timeoutSeconds=312&watch=true -> 200

NAD (network-attachment-definition) watch:

// prior to this change
2024-06-07T17:47:38.871806Z -> 2024-06-07T17:48:38.871976Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=310731&timeout=8m50s&timeoutSeconds=530&watch=true -> 200
2024-06-07T17:48:38.872269Z -> 2024-06-07T17:49:38.873034Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=310731&timeout=7m32s&timeoutSeconds=452&watch=true -> 200

// with this change
2024-06-13T09:36:07.248638Z -> 2024-06-13T09:44:26.253022Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=550160&timeout=8m19s&timeoutSeconds=499&watch=true -> 200
2024-06-13T09:44:26.253582Z -> 2024-06-13T09:54:11.256301Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=552157&timeout=9m45s&timeoutSeconds=585&watch=true -> 200

@michaely-cb
Author

Hi @dougbtv @s1061123. Can I get a review on this PR please? Thanks!

@@ -19,7 +19,7 @@ require (
 gopkg.in/natefinch/lumberjack.v2 v2.0.0
 k8s.io/api v0.29.0
 k8s.io/apimachinery v0.29.0
-k8s.io/client-go v1.5.2
+k8s.io/client-go v0.29.0

This downgrade may be undesired.

Author

client-go changed its versioning scheme, so this is not a downgrade: v0.29.0 corresponds to Kubernetes 1.29. See https://github.com/kubernetes/client-go?tab=readme-ov-file#versioning

I noticed that the rest of the Kubernetes-related dependencies in the multus project had already been upgraded, but client-go had not.
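As a small illustration of the mapping, assuming the scheme in the linked README where a client-go `v0.x.y` tag tracks Kubernetes `v1.x.y`, the sketch below converts one to the other and rejects tags from the older scheme. `kubernetesRelease` is a hypothetical helper, not part of client-go.

```go
package main

import (
	"fmt"
	"strings"
)

// kubernetesRelease maps a client-go v0.x.y tag to the Kubernetes v1.x.y
// release it tracks. Tags outside the v0.x.y scheme (for example the
// older v1.5.2) are rejected rather than guessed at.
func kubernetesRelease(clientGoTag string) (string, error) {
	const prefix = "v0."
	if !strings.HasPrefix(clientGoTag, prefix) || len(clientGoTag) == len(prefix) {
		return "", fmt.Errorf("%q is not a v0.x.y client-go tag", clientGoTag)
	}
	return "v1." + strings.TrimPrefix(clientGoTag, prefix), nil
}

func main() {
	for _, tag := range []string{"v0.29.0", "v1.5.2"} {
		release, err := kubernetesRelease(tag)
		if err != nil {
			fmt.Println("skipped:", err)
			continue
		}
		fmt.Printf("client-go %s tracks Kubernetes %s\n", tag, release)
	}
}
```

This is also why the diff reads like a downgrade: compared as plain semantic versions, v1.5.2 sorts above v0.29.0 even though it is the older tag.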

@dougbtv
Member

dougbtv commented Jun 20, 2024

This sure sounds like an excellent fix, and overall I'm in favor of it -- is there any way that we can validate that it does indeed operate as expected by reducing the API calls? e.g. via end to end tests, or, even manually? thanks!

@michaely-cb
Author

> is there any way that we can validate that it does indeed operate as expected by reducing the API calls? e.g. via end to end tests, or, even manually?

I manually enabled the API server audit logs and compared the call patterns. The before and after patterns are captured in the PR description: prior to this change the watches reconnected every minute, and with this change they no longer do. In the later calls, the reconnection time also lines up with the random timeout that client-go specifies in the request parameters.
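For anyone who wants to repeat the check, here is a rough sketch that parses one summary line in the format used in the PR description (`start -> end -> user agent -> verb -> URI -> status`) and compares how long the watch lasted with the `timeoutSeconds` it requested. The ` -> ` layout is just how the lines above were summarized, not the raw audit log format, and `watchSpan` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// watchSpan returns how long a watch connection lasted, in whole seconds,
// and the timeoutSeconds value requested in its URI.
func watchSpan(line string) (int64, int64, error) {
	fields := strings.Split(line, " -> ")
	if len(fields) != 6 {
		return 0, 0, fmt.Errorf("expected 6 fields, got %d", len(fields))
	}
	start, err := time.Parse(time.RFC3339Nano, fields[0])
	if err != nil {
		return 0, 0, err
	}
	end, err := time.Parse(time.RFC3339Nano, fields[1])
	if err != nil {
		return 0, 0, err
	}
	uri, err := url.Parse(fields[4])
	if err != nil {
		return 0, 0, err
	}
	requested, err := strconv.ParseInt(uri.Query().Get("timeoutSeconds"), 10, 64)
	if err != nil {
		return 0, 0, err
	}
	return int64(end.Sub(start).Seconds()), requested, nil
}

func main() {
	lines := []string{
		"2024-06-07T17:49:38.929150Z -> 2024-06-07T17:50:38.929483Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1717689938-worker&resourceVersion=312906&timeout=8m14s&timeoutSeconds=494&watch=true -> 200",
		"2024-06-12T03:44:13.024297Z -> 2024-06-12T03:53:26.025634Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1718094202-worker&resourceVersion=219877&timeout=9m13s&timeoutSeconds=553&watch=true -> 200",
	}
	for _, line := range lines {
		lasted, requested, err := watchSpan(line)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("lasted %ds, requested %ds\n", lasted, requested)
	}
}
```

On the first pod watch line from before the change this gives 60 seconds lasted against 494 requested; on the first line with the change it gives 553 against 553.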

@michaely-cb
Author

@dougbtv Mind taking another look and rerunning CI, please?

4 participants