
Make k8s job backoff limit configurable for RayJob #2091

Merged: 6 commits into ray-project:master from jjyao/second, May 1, 2024

Conversation

@jjyao (Contributor) commented Apr 20, 2024

Why are these changes needed?

Allow users to specify the BackoffLimit of the submitter Kubernetes Job of a RayJob.

Related issue number

Closes #2058

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-requested a review April 20, 2024 06:54
@kevin85421 kevin85421 self-assigned this Apr 20, 2024
@@ -437,7 +442,7 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
 // is attempted 3 times at the maximum, but still mitigates the case of unrecoverable
 // application-level errors, where the maximum number of retries is reached, and the job
 // completion time increases with no benefits, but wasted resource cycles.
-BackoffLimit: pointer.Int32(2),
+BackoffLimit: backoffLimit,
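
For context, here is a minimal, self-contained sketch of the fallback behavior this diff implies: use a configured backoff limit when one is provided, otherwise keep the previous hard-coded value of 2. The helper name and the assumption that the value arrives as an optional *int32 are illustrative only, not the merged reconciler code.

package main

import (
	"fmt"

	"k8s.io/utils/pointer"
)

// submitterBackoffLimitOrDefault is a hypothetical helper: it returns the
// configured backoff limit for the submitter Kubernetes Job when one is set,
// and otherwise falls back to the previous hard-coded default of 2.
func submitterBackoffLimitOrDefault(configured *int32) *int32 {
	if configured != nil {
		return configured
	}
	return pointer.Int32(2)
}

func main() {
	fmt.Println(*submitterBackoffLimitOrDefault(nil))              // prints 2
	fmt.Println(*submitterBackoffLimitOrDefault(pointer.Int32(5))) // prints 5
}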
Contributor:
Personally I think this backoff limit should be part of the RayJob spec, what do you think?

Some previous discussion on backoff limits here #1902 (comment)


Contributor Author:

I agree this provides the most flexibility. Do we have a use case where people want to set different limits for different RayJobs?

Member:

> Personally I think this backoff limit should be part of the RayJob spec, what do you think?

Currently, the purpose of the retry mechanism is to handle network instability in the Kubernetes cluster. For example, if the submitter temporarily cannot connect to the Ray head due to network issues, the retry mechanism may alleviate the problem by resubmitting the request. If the Ray head has a running or terminated Ray job, the retry will fail due to a conflict with the submission ID.

Therefore, it relates to the Kubernetes cluster that the RayJob CRs are running in rather than to different Ray applications. That's why this PR sets the value for all RayJob CRs in the same Kubernetes cluster.

#1902 (comment) checks if there is a job running with the submission ID:

  • if yes, starts tailing the logs => Ray currently doesn't support this.

Currently, Ray does not support tailing logs for an existing submission ID using ray job submit. If Ray were to support this feature, the retry mechanism could handle issues that arise after the Ray job starts running. In this case, making the backoffLimit a part of the RayJob spec makes sense to me because different Ray applications may require varying lengths of time to finish, but currently, Ray doesn't support that.

> Do we have a use case where people want to set different limits for different RayJobs?

Currently, this is not possible due to the ray job submit limitations I mentioned above.

My current thought is:

  • Step 1: Support cluster-wide backoffLimit (i.e., this PR).
  • Step 2: Enable ray job submit to tail logs if the submission ID exists.
  • Step 3: Support backoffLimit as part of the RayJob spec, which can override the cluster-wide configuration.

What do you think?

Contributor Author:

If (3) is the eventual goal, should we just do step 3 and then step 2, and skip step 1? Once we have the per-job configuration, do we still want the cluster-wide configuration?

Contributor Author:

Let me then change this PR to do step 3 directly, @andrewsykim?

Contributor:

> If the Ray head has a running or terminated Ray job, the retry will fail due to a conflict with the submission ID.

I forgot about this problem. If we allow users to configure the backoff limit, we also need to let them configure whether a new submissionID is generated for each retry. That probably needs a new field, submissionRetryPolicy or something like that.

Contributor:

> Let me then change this PR to do step 3 directly

LGTM, since the retry policy can be done in a separate PR from this change anyway.

@kevin85421 (Member) commented Apr 22, 2024

> If we allow users to configure the backoff limit, we also need to let them configure whether a new submissionID is generated for each retry.

If the retry is caused by application-level logic, we should delete the RayCluster and create a new one for the retry. This is based on our internal experience.

Contributor:

ack -- let's continue discussing more advanced RayJob retry policy in #1902

@jjyao (Contributor Author) commented Apr 23, 2024

cc @andrewsykim @kevin85421: updated based on our discussion. Could you take a look at the new configs? If they look good, I'll polish the PR.

@jjyao jjyao requested a review from andrewsykim April 23, 2024 20:58
@kevin85421 (Member):
The new CRD looks good to me.

ray-operator/apis/ray/v1/rayjob_types.go: 3 resolved review comments (outdated)
  submitterBackoffLimit:
    format: int32
    type: integer
type: object
Contributor:

Should we default to 2 at the CRD level too?

Member:

@jjyao you should add a kubebuilder marker and then regenerate the CRD.

// +kubebuilder:default:=0

Contributor Author:

I'm able to set a default for BackoffLimit but not for the outer SubmitterConfig struct, due to kubebuilder bug kubernetes-sigs/controller-tools#622.
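
For illustration, a hedged sketch of the kubebuilder markers being discussed; the type and field names (SubmitterConfig, BackoffLimit, submitterConfig) and the default value of 2 follow this thread and may not match the merged API exactly.

package v1

// SubmitterConfig is sketched from this discussion, not necessarily the merged type.
type SubmitterConfig struct {
	// BackoffLimit of the submitter Kubernetes Job. A field-level default
	// marker like this is picked up by controller-gen (the value 2 is illustrative).
	// +kubebuilder:default:=2
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}

// In the RayJob spec, the outer struct field itself cannot be defaulted (e.g. to {})
// because of kubernetes-sigs/controller-tools#622, so the nested backoffLimit default
// only applies when submitterConfig is present in the manifest:
//
//	// +optional
//	SubmitterConfig *SubmitterConfig `json:"submitterConfig,omitempty"`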

@andrewsykim (Contributor) commented Apr 24, 2024

@kevin85421 @jjyao what do you think about this API, which also addresses some feature requests in #1902?

spec:
  retryConfig:
    policy: RetryWithSameSubmissionID # future values: RetryWithNewSubmissionID and RetryWithNewCluster 
    backOffLimit: 2

(or something like this)
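
To make the shape of that proposal concrete, here is a rough sketch of corresponding Go API types. Every name and value mirrors the YAML suggestion above and is hypothetical; this is not an implemented KubeRay API.

package v1

// RetryPolicy mirrors the proposed spec.retryConfig.policy values.
type RetryPolicy string

const (
	RetryWithSameSubmissionID RetryPolicy = "RetryWithSameSubmissionID"
	// Possible future values mentioned in the comment above:
	RetryWithNewSubmissionID RetryPolicy = "RetryWithNewSubmissionID"
	RetryWithNewCluster      RetryPolicy = "RetryWithNewCluster"
)

// RetryConfig is a hypothetical struct for the proposed spec.retryConfig block.
type RetryConfig struct {
	// Policy controls how a failed submission is retried.
	// +optional
	Policy RetryPolicy `json:"policy,omitempty"`

	// BackOffLimit caps how many times the submitter Kubernetes Job is retried.
	// +optional
	BackOffLimit *int32 `json:"backOffLimit,omitempty"`
}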

@jjyao jjyao marked this pull request as ready for review April 30, 2024 23:16
@kevin85421 (Member) left a comment

LGTM. We manually tested this PR today. @jjyao will open a follow-up PR to add tests for this behavior.

@kevin85421 kevin85421 merged commit 9662bd9 into ray-project:master May 1, 2024
24 checks passed
@jjyao jjyao deleted the jjyao/second branch May 1, 2024 16:07

Successfully merging this pull request may close these issues.

[Feature] Support k8s job backoff limit configuration for KubeRay jobs
3 participants