Make k8s job backoff limit configurable for RayJob #2091
Conversation
Signed-off-by: Jiajun Yao <[email protected]>
@@ -437,7 +442,7 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
 // is attempted 3 times at the maximum, but still mitigates the case of unrecoverable
 // application-level errors, where the maximum number of retries is reached, and the job
 // completion time increases with no benefits, but wasted resource cycles.
-BackoffLimit: pointer.Int32(2),
+BackoffLimit: backoffLimit,
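For context, a minimal, hypothetical sketch of what this change enables — a configurable limit with a fallback to the previous hardcoded value of 2. The helper and field names below are illustrative only, not the PR's actual code:

```go
package main

import (
	"fmt"

	"k8s.io/utils/pointer"
)

// Trimmed-down, hypothetical shapes of the fields discussed in this thread;
// they illustrate the idea and are not the actual kuberay API.
type SubmitterConfig struct {
	BackoffLimit *int32
}

type RayJobSpec struct {
	SubmitterConfig *SubmitterConfig
}

// effectiveBackoffLimit returns the user-provided limit if one is set,
// otherwise the previous hardcoded default of 2.
func effectiveBackoffLimit(spec RayJobSpec) *int32 {
	if spec.SubmitterConfig != nil && spec.SubmitterConfig.BackoffLimit != nil {
		return spec.SubmitterConfig.BackoffLimit
	}
	return pointer.Int32(2)
}

func main() {
	unset := RayJobSpec{}
	set := RayJobSpec{SubmitterConfig: &SubmitterConfig{BackoffLimit: pointer.Int32(5)}}
	fmt.Println(*effectiveBackoffLimit(unset), *effectiveBackoffLimit(set)) // prints: 2 5
}
```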
Personally I think this backoff limit should be part of the RayJob spec, what do you think?
Some previous discussion on backoff limits here #1902 (comment)
cc @kevin85421
I agree this provides the most flexibility. Do we have a use case where people want to set different limits for different RayJobs?
Personally I think this backoff limit should be part of the RayJob spec, what do you think?
Currently, the retry mechanism is intended to handle network instability in the Kubernetes cluster. For example, if the submitter temporarily cannot connect to the Ray head due to network issues, the retry may alleviate the problem by resubmitting the request. If the Ray head already has a running or terminated Ray job, the retry will fail due to a conflict with the submission ID.
Therefore, it relates to the Kubernetes cluster that the RayJob CRs are running in rather than to different Ray applications. That's why this PR sets the value for all RayJob CRs in the same Kubernetes cluster.
#1902 (comment)
checks if there is job running with a submission id
- if yes, starts tailing the logs => Ray currently doesn't support this.
Currently, Ray does not support tailing logs for an existing submission ID using ray job submit. If Ray were to support this feature, the retry mechanism could handle issues that arise after the Ray job starts running. In that case, making the backoffLimit a part of the RayJob spec makes sense to me because different Ray applications may require varying lengths of time to finish, but currently, Ray doesn't support that.
Do we have a use case where people want to set different limits for different RayJobs?
Currently, this is not possible due to the ray job submit limitations I mentioned above.
My current thought is:
- Step 1: Support cluster-wide backoffLimit (i.e., this PR).
- Step 2: Enable ray job submit to tail logs if the submission ID exists.
- Step 3: Support backoffLimit as part of the RayJob spec, which can overwrite the cluster-wide configuration.
What do you think?
If (3) is the eventual goal, should we just do step 3 and then step 2 and skip step 1? Once we have the per-job configuration, do we still want the cluster-wide configuration?
Let me then change this PR to do step 3 directly, @andrewsykim?
If the Ray head has a running or terminated Ray job, the retry will fail due to a conflict with the submission ID.
I forgot about this problem. If we allow users to configure the backoff limit, we need to also let them configure whether a new submissionID is generated for each retry. That probably needs a new field, submissionRetryPolicy or something like that.
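For illustration only, one hypothetical shape such a field could take (neither the type nor the values below exist in kuberay; they just sketch the idea):

```go
package v1

// SubmissionRetryPolicy is a hypothetical knob controlling whether a retry of
// the submitter Job reuses the original submission ID or generates a new one.
type SubmissionRetryPolicy string

const (
	// Reuse the original submission ID on every retry (today's behavior).
	SameSubmissionID SubmissionRetryPolicy = "SameSubmissionID"
	// Generate a fresh submission ID for each retry attempt.
	NewSubmissionID SubmissionRetryPolicy = "NewSubmissionID"
)
```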
Let me then change this PR to do step 3 directly
LGTM since the retry policy can be done in a separate PR from this change anyways
If we allow users to configure the back off limit, we need to also let them configure whether a new submissionID is generated for each retry.
If the retry is caused by application-level logic, we should delete the RayCluster and create a new one for the retry. This is based on our internal experience.
ack -- let's continue discussing more advanced RayJob retry policy in #1902
This reverts commit 83885ec.
Signed-off-by: Jiajun Yao <[email protected]>
cc @andrewsykim @kevin85421 updated based on our discussion: could you take a look at the new configs? If it looks good, I'll polish the PR.
The new CRD looks good to me.
    submitterBackoffLimit:
      format: int32
      type: integer
  type: object
Should we default to 2 at the CRD level too?
@jjyao you should add a kubebuilder marker and then generate the CRD again.
// +kubebuilder:default:=0
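As a rough sketch, assuming the field ends up on a SubmitterConfig struct and uses the default of 2 discussed above (both are assumptions from this thread, not the final API), the marker would sit directly above the field:

```go
package v1

// SubmitterConfig is a hypothetical sketch showing where the kubebuilder
// default marker would go; the struct name and default value are assumptions.
type SubmitterConfig struct {
	// BackoffLimit of the submitter Kubernetes Job.
	// +kubebuilder:default:=2
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}
```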
I'm able to set a default for BackOffLimit but not for the outer SubmitterConfig struct, due to kubebuilder bug kubernetes-sigs/controller-tools#622.
@kevin85421 @jjyao what do you think about this API, which also addresses some feature requests in #1902?

spec:
  retryConfig:
    policy: RetryWithSameSubmissionID # future values: RetryWithNewSubmissionID and RetryWithNewCluster
    backOffLimit: 2

(or something like this)
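For reference, a hypothetical Go translation of that proposal (names and values are taken from the comment above; none of this is part of this PR):

```go
package v1

// RetryPolicy mirrors the proposed spec.retryConfig.policy values.
type RetryPolicy string

const (
	RetryWithSameSubmissionID RetryPolicy = "RetryWithSameSubmissionID"
	RetryWithNewSubmissionID  RetryPolicy = "RetryWithNewSubmissionID"
	RetryWithNewCluster       RetryPolicy = "RetryWithNewCluster"
)

// RetryConfig mirrors the proposed spec.retryConfig block.
type RetryConfig struct {
	Policy       RetryPolicy `json:"policy,omitempty"`
	BackOffLimit *int32      `json:"backOffLimit,omitempty"`
}
```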
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
LGTM. We manually tested this PR today. @jjyao will open a follow-up PR to test this behavior.
Why are these changes needed?
Allow users to specify the BackoffLimit of the submitter k8s job of a RayJob.
Related issue number
Closes #2058
Checks