Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

feat: add kill_infrastructure implementation for vertex ai worker #213

Merged
merged 6 commits into from
Sep 22, 2023

Conversation

parkedwards
Copy link
Contributor

@parkedwards parkedwards commented Sep 21, 2023

Closes #212

Followup to #211

Example

Screenshots

image
image
image

Checklist

  • References any related issue by including "Closes #" or "Closes ".
    • If no issue exists and your change is not a small fix, please create an issue first.
  • Includes tests or only affects documentation.
  • Passes pre-commit checks.
    • Run pre-commit install && pre-commit run --all locally for formatting and linting.
  • Includes screenshots of documentation updates.
    • Run mkdocs serve view documentation locally.
  • Summarizes PR's changes in CHANGELOG.md

@parkedwards parkedwards requested a review from a team as a code owner September 21, 2023 22:12

# Vertex will include an error message upon valid
# flow cancellations, so we'll avoid raising an error in that case
if error_msg and "CANCELED" not in error_msg:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to exclude cancellation messages, as this will throw the runtime error

image

mock = job_config.credentials.job_service_client.create_custom_job
# the CancelCustomJobRequest class seems to reject a MagicMock value
# so here, we'll use a SimpleNamespace as the mocked return values
mock.return_value = SimpleNamespace(
Copy link
Contributor Author

@parkedwards parkedwards Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not fully understanding why, but returning a plain MagicMock here with the correct name and state keys would cause the test to fail, somewhere in the CancelCustomJobRequest class

self = <[AttributeError('Unknown field for CancelCustomJobRequest: _pb') raised in repr()] CancelCustomJobRequest object at 0x168bb4490>
mapping = {'name': <MagicMock name='mock_name.name' id='5749201248'>}, ignore_unknown_fields = False, kwargs = {'name': <MagicMock name='mock_name.name' id='5749201248'>}
params = {'name': <MagicMock name='mock_name.name' id='5749201248'>}, marshal = <proto.marshal.marshal.Marshal object at 0x106f4c790>, key = 'name'
value = <MagicMock name='mock_name.name' id='5749201248'>, pb_value = <MagicMock name='mock_name.name' id='5749201248'>

so, here i just use a SimpleNamespace to return the right key values that can be access via dot notation when it's being passed into CancelCustomJobRequest downstream

title="Accelerator Count",
description=(
"The number of accelerators to attach to the machine. "
"See https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec"
),
default=0,
example=1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unrelated to the kill infra path, but something i found while testing this further

both the accelerator_* params are optional, and they have an XOR relationship - both have to exist, or both have to be null. so, i swapped these to be optional + removed the keys from the job body validation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these to be optional in the UI, they'll need a default value of None.

Copy link
Contributor Author

@parkedwards parkedwards Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im cool to add the default None, but with the current configuration, the UI looks like this (the fields are empty btw, the screenshot shows the example text):

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok gotcha. Do you get errors in the worker when you don't fill in those fields? I'd expect Pydantic to complain that no value is provided if there's no default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's so weird, i would expect that as well. but i just created a new work pool and started the worker, and:

$ prefect work-pool create 'a-new-pool' --type vertex-ai
Created work pool 'a-new-pool'.

$ prefect worker start -p a-new-pool
Discovered type 'vertex-ai' for work pool 'a-new-pool'.
Worker 'VertexAIWorker b1b27bbe-1070-4f5a-9ccc-f63a6e7adad1' started!

here are the work pool variables in the UI
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You wouldn't see the error until a flow run is prepared to be run because that's when a configuration object is created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i didn't see it at all during my e2e testing, where i created a bunch of vertex jobs. does it make a difference that these properties are on the variables class, not the configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw i added the default None values in, this thread is more for my understanding / sanity

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️ I didn't realize this is the variables class. No objects are ever created from this class, so you wouldn't see that error. It all makes sense now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok whew, sanity restored

Copy link
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks great! I have one small comment and I'm going to push some changelog updates to this branch so that we can release once this is merged!

title="Accelerator Count",
description=(
"The number of accelerators to attach to the machine. "
"See https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec"
),
default=0,
example=1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these to be optional in the UI, they'll need a default value of None.

Copy link
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

@parkedwards parkedwards merged commit a21aca4 into main Sep 22, 2023
10 checks passed
@parkedwards parkedwards deleted the feat/vertex-worker-kill-infra branch September 22, 2023 16:36
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add kill_infrastructure support for Vertex AI Worker
2 participants