Image pull backoff due to race with image-pull-secret Secret creation #101

Closed
antoineco opened this issue Jul 29, 2021 · 7 comments
Labels
bug (Something isn't working), e2e (Issues related to End-to-End testing)

Comments

@antoineco
Contributor

antoineco commented Jul 29, 2021

Problem

It seems like the creation of the image-pull-secret Secret and/or its addition to new Service Accounts takes quite some time after the creation of a namespace or a ServiceAccount object. In my experiments, I observed delays of up to 90 seconds.

This is causing a fair amount of timeouts in tests (Error waiting for resource ...) due to the following error (example):

{
  "message": "Revision \"awstarget-sqs-00001\" failed with message: Unable to fetch image
    \"gcr.io/triggermesh-private/aws-target-adapter:v1.7.0\": failed to resolve image to digest:
    HEAD https://gcr.io/v2/triggermesh-private/aws-target-adapter/manifests/v1.7.0:
    unexpected status code 401 Unauthorized (HEAD responses have no body, use GET for details)."
}

(another example I've seen was the gcr.io/triggermesh-private/awseventbridgetarget:v1.7.0 image)

On top of that, Kubernetes doesn't immediately propagate changes to ServiceAccounts' imagePullSecrets, and keeps trying to pull private images without credentials (at least for some time which I couldn't determine) if those credentials weren't referenced when the first pull attempt occurred.

Solution

The difficulty here is that each integration may potentially be using its own Service Account (I'm thinking of multi-tenant sources, especially), so we can't make generic assumptions, such as always checking that the default Service Account has some imagePullSecrets.

Instead, it's probably best to plug some specific logic into each test that requires image pull secrets, e.g. "watch this specific Service Account and wait until it references some imagePullSecret".
Not super portable, but those tests were designed to run against our own infra anyway.
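
A minimal sketch of that per-test wait, in Python for brevity and not taken from the actual test suite. It assumes a hypothetical `get_service_account` callable that returns the Service Account as a dict (e.g. the parsed JSON of the object); the helper name, the 120 s default timeout and the 2 s polling interval are illustrative choices, picked only to sit above the 90 s delay observed above.

```python
import time
from typing import Any, Callable, List, Mapping


def wait_for_image_pull_secret(
    get_service_account: Callable[[], Mapping[str, Any]],  # hypothetical fetcher
    timeout_s: float = 120.0,   # assumption: a bit above the observed 90 s
    interval_s: float = 2.0,    # assumption: arbitrary polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until the Service Account references at least one imagePullSecret.

    Returns the names of the referenced secrets, or raises TimeoutError if
    none shows up before the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        sa = get_service_account()
        # imagePullSecrets is a top-level list of {"name": ...} references.
        refs = sa.get("imagePullSecrets") or []
        names = [ref["name"] for ref in refs if ref.get("name")]
        if names:
            return names
        if clock() >= deadline:
            raise TimeoutError("Service Account never referenced an imagePullSecret")
        sleep(interval_s)
```

A test would call this right after creating its namespace or Service Account, and only then create the resource that triggers the private image pull. Polling is used here to keep the sketch dependency-free; a real watch would react faster.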

We could also look into Knative's image cache. If I remember correctly, it is possible to make it work with private repos too, given the right credentials are passed to the serving controller.

antoineco added the bug label on Jul 29, 2021
@sameersbn
Contributor

imagepullsecret-patcher creates a secret named image-pull-secret in each namespace. Could we somehow mount the secret in the pods, which would effectively block the pods until the secret is ready?

@antoineco
Contributor Author

@sameersbn which component is responsible for patching Service Accounts with that secret? Is it also imagepullsecret-patcher?

could we somehow mount the secret in the pods [...] ?

No, because images are pulled by kubelets, not by Pods.

@sameersbn
Contributor

Yes, imagepullsecret-patcher is the one that patches the service accounts.

@antoineco
Contributor Author

antoineco commented Jul 29, 2021

I'm thinking it may be time to write our own imagepullsecret-patcher, because the way they iterate over namespaces on a schedule is slow, due to the number of GET requests that are constantly being sent: https://github.com/titansoft-pte-ltd/imagepullsecret-patcher/blob/bdf0891920920d3e789a5b5bbf0ea041ad385746/main.go#L94-L130

We would be much better off having an informer watching ServiceAccounts across the cluster, and reacting to changes in real time. Someone tried that at titansoft-pte-ltd/imagepullsecret-patcher#22 actually.
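
To illustrate the event-driven idea, here is a rough sketch in Python of the reaction logic only, not of an actual informer. It assumes watch events shaped like the Kubernetes watch API (a `type` such as ADDED or MODIFIED plus the `object`), and a hypothetical `apply_patch(namespace, name, patch)` callable standing in for whatever client sends the patch. The function names are made up for this example.

```python
from typing import Any, Callable, Dict, Mapping, Optional


def pull_secret_patch(
    service_account: Mapping[str, Any],
    secret_name: str = "image-pull-secret",
) -> Optional[Dict[str, Any]]:
    """Return a merge patch that adds secret_name, or None if already present."""
    existing = list(service_account.get("imagePullSecrets") or [])
    if any(ref.get("name") == secret_name for ref in existing):
        return None
    # A JSON merge patch replaces the whole list, so existing references
    # are carried over alongside the new one.
    return {"imagePullSecrets": existing + [{"name": secret_name}]}


def handle_event(
    event: Mapping[str, Any],
    apply_patch: Callable[[str, str, Dict[str, Any]], None],  # hypothetical client call
    secret_name: str = "image-pull-secret",
) -> bool:
    """React to a single Service Account watch event.

    Returns True if a patch was sent, False if nothing had to be done.
    """
    if event.get("type") not in ("ADDED", "MODIFIED"):
        return False
    sa = event["object"]
    patch = pull_secret_patch(sa, secret_name)
    if patch is None:
        return False
    meta = sa["metadata"]
    apply_patch(meta["namespace"], meta["name"], patch)
    return True
```

With this shape, each new Service Account costs one patch at the moment it appears, instead of being picked up by the next scheduled sweep over all namespaces.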

Anyway, it seems like waiting long enough™ does eventually trigger a pull, but that's usually well over 6 min.

antoineco added the e2e label on Jul 29, 2021
antoineco changed the title from "E2E: Image pull backoff due to race with image-pull-secret Secret creation" to "Image pull backoff due to race with image-pull-secret Secret creation" on Jul 29, 2021
@sameersbn
Contributor

Could we try setting CONFIG_LOOP_DURATION to less than the 10s default?

@antoineco
Contributor Author

That would probably be counterproductive, because the loop is already taking much longer than that to complete in prod.

I made a quick test: created a ServiceAccount in my namespace, started a watch on it, and imagePullSecrets was only populated after 90 sec.

It seems like the patcher is struggling with the number of namespaces we have in prod.
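
A back-of-the-envelope model of why a shorter loop interval would not help much, written as a small Python function. It assumes the patcher visits namespaces one after another and that every request takes roughly the same time. Both are simplifications, and every number in the usage note below is made up for illustration, not measured in prod.

```python
def worst_case_patch_delay_s(
    namespaces: int,
    requests_per_namespace: int,
    request_latency_s: float,
    loop_interval_s: float,
) -> float:
    """Rough upper bound on how long a new Service Account waits for its patch.

    Assumes a sequential sweep: a Service Account created right after its
    namespace was visited waits for the rest of the current sweep, then the
    pause between sweeps, then the start of the next sweep up to its
    namespace. That adds up to about one full sweep plus one interval.
    """
    sweep_s = namespaces * requests_per_namespace * request_latency_s
    return sweep_s + loop_interval_s
```

For example, with hypothetical values of 1000 namespaces, 2 requests per namespace and 40 ms per request, the sweep alone takes about 80 s. Lowering the interval from 10 s to 1 s would only move the bound from about 90 s to about 81 s, because the sweep term dominates.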

@antoineco
Contributor Author

Closing because images are now public, and the aforementioned patcher was removed from our standard installation.
