This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Expectation / Proposal
Using the VertexAICustomTrainingJob class to attach a GPU to a custom training job should work. However, the MachineSpec that gets submitted does not include an accelerator_count, so specifying an accelerator_type breaks this block.
Traceback / Example
"Submission failed. Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 72, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.INVALID_ARGUMENT
  details = "List of found errors: 1.Field: job_spec.worker_pool_specs[0].machine_spec.accelerator_type; Message: Both accelerator_type and accelerator_count should be specified or none. "
  debug_error_string = "UNKNOWN:Error received from peer ipv4:74.125.69.95:443 {created_time:"2023-04-24T13:37:52.582665137+00:00", grpc_status:3, grpc_message:"List of found errors:\t1.Field: job_spec.worker_pool_specs[0].machine_spec.accelerator_type; Message: Both accelerator_type and accelerator_count should be specified or none.\t"}"
>

The above exception was the direct cause of the following exception:

google.api_core.exceptions.InvalidArgument: 400 List of found errors: 1.Field: job_spec.worker_pool_specs[0].machine_spec.accelerator_type; Message: Both accelerator_type and accelerator_count should be specified or none. [field_violations { field: "job_spec.worker_pool_specs[0].machine_spec.accelerator_type" description: "Both accelerator_type and accelerator_count should be specified or none." } ]"
I would like to help contribute a pull request to resolve this!
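For illustration, here is a minimal sketch of the "both or none" rule the API error describes. The helper name `build_machine_spec` is hypothetical and is not part of this library. It builds a plain dict using the MachineSpec field names from the traceback (`accelerator_type`, `accelerator_count`) plus `machine_type`, and the machine and accelerator values in the usage line are only examples.

```python
from typing import Optional


def build_machine_spec(
    machine_type: str,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
) -> dict:
    """Hypothetical helper: build a MachineSpec-shaped dict.

    Mirrors the API rule from the traceback: accelerator_type and
    accelerator_count must both be specified, or neither.
    """
    spec = {"machine_type": machine_type}

    # CPU-only job: neither accelerator field is sent.
    if accelerator_type is None and accelerator_count is None:
        return spec

    # Exactly one field set is what triggers the INVALID_ARGUMENT error,
    # so fail early on the client side instead.
    if accelerator_type is None or accelerator_count is None:
        raise ValueError(
            "accelerator_type and accelerator_count must be set together"
        )
    if accelerator_count < 1:
        raise ValueError("accelerator_count must be at least 1")

    spec["accelerator_type"] = accelerator_type
    spec["accelerator_count"] = accelerator_count
    return spec


# Example usage with illustrative values.
gpu_spec = build_machine_spec("n1-standard-4", "NVIDIA_TESLA_T4", 1)
```

Validating on the client side like this would surface the problem before submission. Whether the actual fix should instead default `accelerator_count` when only a type is given is a design choice this sketch does not settle.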
I opened #174 for this fix.