Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tell StorageProviders where to store data #3053

Merged
merged 7 commits into from
Dec 6, 2023

Conversation

rossjones
Copy link
Contributor

@rossjones rossjones commented Nov 28, 2023

Currently storage providers create a temporary folder underneath the system configured storage path, where the folder include the provider's name. This means all of the data is indexed on the provider, rather than the execution. Most of the StorageProvider implementations determine these paths when they are first created.

This PR switches to telling the provider where to store the data, so that it is keyed by job/execution instead (also within the configured data folder). This provides a single location for all inputs affecting an execution (or all local executions for a Job).

@rossjones rossjones marked this pull request as ready for review December 4, 2023 09:33
@rossjones rossjones changed the title Fix StorageProviders to store where they are told Tell StorageProviders where to store content Dec 4, 2023
@rossjones rossjones changed the title Tell StorageProviders where to store content Tell StorageProviders where to store data Dec 4, 2023
@@ -216,7 +221,13 @@ func (e *BaseExecutor) Start(ctx context.Context, execution *models.Execution) (
return
}

args, cleanup, err := PrepareRunArguments(ctx, e.Storages, execution, resultFolder)
executionStorage := filepath.Join(e.storageDirectory, fmt.Sprintf("%s/%s", execution.JobID, execution.ID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just filepath.Join(e.StorageDirectory, execution.JobID, execution.ID) ?

@@ -216,7 +221,13 @@ func (e *BaseExecutor) Start(ctx context.Context, execution *models.Execution) (
return
}

args, cleanup, err := PrepareRunArguments(ctx, e.Storages, execution, resultFolder)
executionStorage := filepath.Join(e.storageDirectory, fmt.Sprintf("%s/%s", execution.JobID, execution.ID))
if err := os.MkdirAll(executionStorage, os.ModePerm); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sets 777 permissions, is that what we want? Would we prefer 755 or 500?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have set to 0755 and a local constant. I've left the exec bit for now, meaning users can download a binary and run it if they wish (within the container theyre in). Is this something we want to allow?

Comment on lines -92 to -97
cm.RegisterCallback(func() error {
if err := os.RemoveAll(dir); err != nil {
return fmt.Errorf("unable to clean up S3 storage directory: %w", err)
}
return nil
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So where does cleanup happen now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than tests, PrepareStorage is only ever called from ParallelPrepareStorage, and ParallelPrepareStorage is only used by the base executor (and the wasm executor) and each time returns a function that will call ParallelCleanup. This is a step in the direction of us cleaning up the resources when they're no longer used so that storage is responsible for fetching the data, but the user is responsible for cleaning it up.

Currently storage providers create a temporary folder underneath the
system configured storage path, where the folder include the provider's
name.  This means all of the data is indexed on the provider, rather
than the execution.

This PR switches to telling the provider where to store the data, so
that it is keyed by job/execution instead (also within the configured
data folder).
The parallel tests were expecting to cleanup, and at the same time not
checking the existence of files it thinks it had downloaded
@rossjones rossjones merged commit dfb3721 into main Dec 6, 2023
13 checks passed
@rossjones rossjones deleted the give-storageproviders-input branch December 6, 2023 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants