Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix celery worker profile for s3 access #333

Merged
merged 3 commits into from
Oct 19, 2023

Conversation

saiatmakuri
Copy link
Contributor

@saiatmakuri saiatmakuri commented Oct 19, 2023

Follow-up to #327

Pull Request Summary

The celery task queue is instantiated with aws_profile provided in this line. Tasks were completed but the final state was not written to s3 and silently failed since the profile used was not the one with s3 write permissions. The real issue is only with the Celery forwarder that is created with the wrong profile. Change only for its instantiation.

Testing Plan

Create a new deployment in our clusters and launched a batch job. The pods were built correctly and the job successfully completed.

@saiatmakuri saiatmakuri added the bug Something isn't working label Oct 19, 2023
@saiatmakuri saiatmakuri self-assigned this Oct 19, 2023
@@ -504,7 +504,7 @@ def _get_backend_url_and_conf(
elif backend_protocol == "s3":
backend_url = "s3://"
if aws_role is None:
aws_profile = os.getenv("AWS_PROFILE", infra_config().profile_ml_worker)
aws_profile = os.getenv("S3_WRITE_AWS_PROFILE", infra_config().profile_ml_worker)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have S3_WRITE_AWS_PROFILE inside the celery forwarder? just want to make sure that this works for (gateway, endpoint builder, celery forwarder, batch job orchestration pod)

Copy link
Member

@seanshi-scale seanshi-scale Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this s3_write_aws_profile means a few things, e.g. in the gateway I think it has to do with some s3 repository for storing llm fine tune jobs, in celery workers it means where completed tasks get written, I think this is fine though since it's all related to s3

guess the perms for this are actually "read, write, maybe list" as opposed to just "write" as well, which probably is fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline. reverting back to profile_ml_worker

Copy link
Member

@seanshi-scale seanshi-scale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you test getting async task results from the gateway as well? I think if we know that works then that + (endpoint builds successfully, batch job completes successfully with all the task results) should be sufficient for testing. LGTM once that's done

@saiatmakuri saiatmakuri merged commit fe24d63 into main Oct 19, 2023
5 checks passed
@saiatmakuri saiatmakuri deleted the saiatmakuri/fix-celery-app-profile-pt2 branch October 19, 2023 22:59
yixu34 added a commit that referenced this pull request Oct 23, 2023
yixu34 added a commit that referenced this pull request Oct 23, 2023
saiatmakuri added a commit that referenced this pull request Oct 23, 2023
saiatmakuri added a commit that referenced this pull request Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants