This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix VertexAICustomTrainingJob failing to cancel #205

Closed
wants to merge 14 commits into from

Conversation

jeremy-thomas-roc
Contributor

Uses the existing status code check at the end of run to call kill, which appears to have been implemented but is never called.

I'm not exactly sure how I would go about writing a test for this. I'm happy to do it if given some direction on the best approach.

Closes
PrefectHQ/prefect#13056

Screenshots

No update to the docs

Checklist

  • References any related issue by including "Closes #" or "Closes ".
    • If no issue exists and your change is not a small fix, please create an issue first.
  • Includes tests or only affects documentation.
  • Passes pre-commit checks.
    • Run pre-commit install && pre-commit run --all locally for formatting and linting.
  • Includes screenshots of documentation updates.
    • Run mkdocs serve to view documentation locally.
  • Summarizes PR's changes in CHANGELOG.md

@jeremy-thomas-roc jeremy-thomas-roc requested a review from a team as a code owner August 3, 2023 13:42
@desertaxle
Member

Thanks for opening a PR @jeremy-thomas-roc! If this error occurs when canceling via the Prefect UI, then my hunch is that the bug is in the kill method of the VertexAICustomTrainingJob. Do you have any log output from a canceled flow run that you can share? That might help us determine if the call to Vertex to cancel the job is failing.

@jeremy-thomas-roc
Contributor Author

@desertaxle

So in going through the Vertex logs, this is what I see. Note the change in timestamps on the left.
[image: screenshot of the Vertex logs]

The exception doesn't occur until the job is canceled via the Vertex UI. Prior to that, it logs nothing.

I have attached the exception here as well. It seems that the Prefect cancellation is there, but again, not until the job is manually canceled in the Vertex UI.
downloaded-logs-20230808-120705.txt

Did I miss where the kill method gets called, if not where I added it?

@desertaxle
Member

Did I miss where the kill method gets called, if not where I added it?

Agents are responsible for calling the kill method on infrastructure blocks like VertexAICustomTrainingJob. However, kill is only called when a flow run is canceled via the Prefect UI. You can see where the agent code calls kill here.
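To illustrate the division of responsibility described above, here is a minimal sketch using toy stand-ins. `FakeTrainingJob`, `FakeAgent`, `handle_state_change`, and the `"CANCELLING"` state string are all hypothetical names made up for this example; they are not the Prefect agent or the VertexAICustomTrainingJob implementation. The only point is that `kill` is triggered by a cancellation request, not by `run` finishing.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrainingJob:
    """Hypothetical stand-in for an infrastructure block (not the real block)."""

    killed: list = field(default_factory=list)

    def run(self, job_id: str) -> int:
        # Pretend the remote job ran to completion and return a status code.
        # Note that run() itself never calls kill().
        return 0

    def kill(self, job_id: str) -> None:
        # Record the cancellation instead of calling a remote service.
        self.killed.append(job_id)


@dataclass
class FakeAgent:
    """Hypothetical stand-in for the agent that owns cancellation."""

    infrastructure: FakeTrainingJob

    def handle_state_change(self, job_id: str, state: str) -> None:
        # Assumption for illustration: only a cancellation request leads to kill().
        if state == "CANCELLING":
            self.infrastructure.kill(job_id)


job = FakeTrainingJob()
agent = FakeAgent(job)
agent.handle_state_change("job-1", "COMPLETED")   # no kill
agent.handle_state_change("job-2", "CANCELLING")  # kill is called
```

Under these assumptions, a job that simply finishes never reaches `kill`; only the cancellation path does, which matches the behavior described in this comment.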

We need more info on why these Vertex jobs are hanging. I'd expect Vertex to tear down the job once it has succeeded or failed. In the meantime, we can expose a timeout field on the VertexAICustomTrainingJob block so that jobs don't hang indefinitely. We can use the timeout that Vertex offers so that Vertex maintains control of the job lifecycle unless a cancellation comes in from the Prefect side.

@jeremy-thomas-roc
Contributor Author

jeremy-thomas-roc commented Aug 24, 2023

@desertaxle great, this looks like it will work perfectly. I will work on implementing this instead of the explicit kill call.

Actually, this already exists in the block: maximum_run_time can be set, so this change is unnecessary. Closing this.
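For readers following along, a rough sketch of the idea behind a run-time limit is shown here: a `timedelta` field is converted into a timeout that the job service enforces itself. Only the field name `maximum_run_time` comes from this thread. `TrainingJobConfig`, `scheduling_spec`, the seven-day default, and the `"<seconds>s"` string format are assumptions for illustration, not the block's actual code.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class TrainingJobConfig:
    """Hypothetical stand-in for the block; only the field name is from the thread."""

    # The default value here is an assumption for illustration.
    maximum_run_time: timedelta = timedelta(days=7)


def scheduling_spec(config: TrainingJobConfig) -> dict:
    """Build an illustrative scheduling section carrying a service-side timeout."""
    seconds = int(config.maximum_run_time.total_seconds())
    if seconds <= 0:
        raise ValueError("maximum_run_time must be positive")
    # The "<seconds>s" format is an assumed wire format for this sketch.
    return {"timeout": f"{seconds}s"}


spec = scheduling_spec(TrainingJobConfig(maximum_run_time=timedelta(hours=2)))
print(spec)
```

With a limit like this in place, the service stops a job that exceeds the configured duration, so no explicit kill call is needed for jobs that hang.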
