
Make k8s job backoff limit configurable for RayJob #2091

Merged: 6 commits into ray-project:master from jjyao/second, May 1, 2024

Conversation

@jjyao (Contributor) commented Apr 20, 2024

Why are these changes needed?

Allow users to specify the BackoffLimit of the submitter Kubernetes Job of a RayJob.

Related issue number

Closes #2058

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-requested a review April 20, 2024 06:54
@kevin85421 kevin85421 self-assigned this Apr 20, 2024
@@ -437,7 +442,7 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
 // is attempted 3 times at the maximum, but still mitigates the case of unrecoverable
 // application-level errors, where the maximum number of retries is reached, and the job
 // completion time increases with no benefits, but wasted resource cycles.
-BackoffLimit: pointer.Int32(2),
+BackoffLimit: backoffLimit,
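
For context, here is a minimal, self-contained sketch of the fallback behavior this diff implies: use a configured backoff limit when one is provided, otherwise keep the previous hard-coded value of 2. The helper name and the assumption that the value arrives as an optional *int32 are illustrative only, not the merged reconciler code.

package main

import (
	"fmt"

	"k8s.io/utils/pointer"
)

// submitterBackoffLimitOrDefault is a hypothetical helper: it returns the
// configured backoff limit for the submitter Kubernetes Job when one is set,
// and otherwise falls back to the previous hard-coded default of 2.
func submitterBackoffLimitOrDefault(configured *int32) *int32 {
	if configured != nil {
		return configured
	}
	return pointer.Int32(2)
}

func main() {
	fmt.Println(*submitterBackoffLimitOrDefault(nil))              // prints 2
	fmt.Println(*submitterBackoffLimitOrDefault(pointer.Int32(5))) // prints 5
}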
Contributor:
Personally I think this backoff limit should be part of the RayJob spec, what do you think?

Some previous discussion on backoff limits here #1902 (comment)


Contributor Author:

I agree this provides the most flexibility. Do we have a use case where people want to set different limits for different RayJobs?

Member:

> Personally I think this backoff limit should be part of the RayJob spec, what do you think?

Currently, the purpose of the retry mechanism is to handle network instability in the Kubernetes cluster. For example, if the submitter temporarily cannot connect to the Ray head due to network issues, the retry mechanism may alleviate the problem by resubmitting the request. If the Ray head has a running or terminated Ray job, the retry will fail due to a conflict with the submission ID.

Therefore, it relates to the Kubernetes cluster that the RayJob CRs are running in rather than to different Ray applications. That's why this PR sets the value for all RayJob CRs in the same Kubernetes cluster.

#1902 (comment) checks if there is a job running with the submission ID:

  • if yes, starts tailing the logs => Ray currently doesn't support this.

Currently, Ray does not support tailing logs for an existing submission ID using ray job submit. If Ray were to support this feature, the retry mechanism could handle issues that arise after the Ray job starts running. In this case, making the backoffLimit a part of the RayJob spec makes sense to me because different Ray applications may require varying lengths of time to finish, but currently, Ray doesn't support that.

> Do we have a use case where people want to set different limits for different RayJobs?

Currently, this is not possible due to the ray job submit limitations I mentioned above.

My current thought is:

  • Step 1: Support cluster-wide backoffLimit (i.e., this PR).
  • Step 2: Enable ray job submit to tail logs if the submission ID exists.
  • Step 3: Support backoffLimit as part of the RayJob spec, which can override the cluster-wide configuration.

What do you think?

Contributor Author:

If (3) is the eventual goal, should we just do step 3 and then step 2, and skip step 1? Once we have the per-job configuration, do we still want the cluster-wide configuration?

Contributor Author:

Let me then change this PR to do step 3 directly, @andrewsykim?

Contributor:

> If the Ray head has a running or terminated Ray job, the retry will fail due to a conflict with the submission ID.

I forgot about this problem. If we allow users to configure the backoff limit, we also need to let them configure whether a new submissionID is generated for each retry. That probably needs a new field, submissionRetryPolicy or something like that.

Contributor:

> Let me then change this PR to do step 3 directly

LGTM, since the retry policy can be done in a separate PR from this change anyway.

@kevin85421 (Member) commented Apr 22, 2024

> If we allow users to configure the backoff limit, we also need to let them configure whether a new submissionID is generated for each retry.

If the retry is caused by application-level logic, we should delete the RayCluster and create a new one for the retry. This is based on our internal experience.

Contributor:

ack -- let's continue discussing more advanced RayJob retry policy in #1902

@jjyao (Contributor Author) commented Apr 23, 2024

cc @andrewsykim @kevin85421: updated based on our discussion. Could you take a look at the new configs? If they look good, I'll polish the PR.

@jjyao jjyao requested a review from andrewsykim April 23, 2024 20:58
@kevin85421 (Member):
The new CRD looks good to me.

ray-operator/apis/ray/v1/rayjob_types.go: 3 resolved review comments (outdated)
  submitterBackoffLimit:
    format: int32
    type: integer
type: object
Contributor:

Should we default to 2 at the CRD level too?

Member:

@jjyao you should add a kubebuilder marker and then regenerate the CRD.

// +kubebuilder:default:=0

Contributor Author:

I'm able to set a default for BackoffLimit but not for the outer SubmitterConfig struct, due to kubebuilder bug kubernetes-sigs/controller-tools#622.
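
For illustration, a hedged sketch of the kubebuilder markers being discussed; the type and field names (SubmitterConfig, BackoffLimit, submitterConfig) and the default value of 2 follow this thread and may not match the merged API exactly.

package v1

// SubmitterConfig is sketched from this discussion, not necessarily the merged type.
type SubmitterConfig struct {
	// BackoffLimit of the submitter Kubernetes Job. A field-level default
	// marker like this is picked up by controller-gen (the value 2 is illustrative).
	// +kubebuilder:default:=2
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}

// In the RayJob spec, the outer struct field itself cannot be defaulted (e.g. to {})
// because of kubernetes-sigs/controller-tools#622, so the nested backoffLimit default
// only applies when submitterConfig is present in the manifest:
//
//	// +optional
//	SubmitterConfig *SubmitterConfig `json:"submitterConfig,omitempty"`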

@andrewsykim (Contributor) commented Apr 24, 2024

@kevin85421 @jjyao what do you think about this API, which also addresses some feature requests in #1902?

spec:
  retryConfig:
    policy: RetryWithSameSubmissionID # future values: RetryWithNewSubmissionID and RetryWithNewCluster 
    backOffLimit: 2

(or something like this)
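
To make the shape of that proposal concrete, here is a rough sketch of corresponding Go API types. Every name and value mirrors the YAML suggestion above and is hypothetical; this is not an implemented KubeRay API.

package v1

// RetryPolicy mirrors the proposed spec.retryConfig.policy values.
type RetryPolicy string

const (
	RetryWithSameSubmissionID RetryPolicy = "RetryWithSameSubmissionID"
	// Possible future values mentioned in the comment above:
	RetryWithNewSubmissionID RetryPolicy = "RetryWithNewSubmissionID"
	RetryWithNewCluster      RetryPolicy = "RetryWithNewCluster"
)

// RetryConfig is a hypothetical struct for the proposed spec.retryConfig block.
type RetryConfig struct {
	// Policy controls how a failed submission is retried.
	// +optional
	Policy RetryPolicy `json:"policy,omitempty"`

	// BackOffLimit caps how many times the submitter Kubernetes Job is retried.
	// +optional
	BackOffLimit *int32 `json:"backOffLimit,omitempty"`
}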

@jjyao jjyao marked this pull request as ready for review April 30, 2024 23:16
@kevin85421 (Member) left a comment

LGTM. We manually tested this PR today. @jjyao will open a follow-up PR to add tests for this behavior.

@kevin85421 kevin85421 merged commit 9662bd9 into ray-project:master May 1, 2024
24 checks passed
@jjyao jjyao deleted the jjyao/second branch May 1, 2024 16:07

Successfully merging this pull request may close these issues.

[Feature] Support k8s job backoff limit configuration for KubeRay jobs
3 participants