
discovery: Add dynamic timeout for the orchestrator discovery #2309

Merged
merged 2 commits into livepeer:master on Mar 22, 2022

Conversation

@leszko (Contributor) commented Mar 14, 2022

What does this pull request do? Explain your changes. (required)

Add dynamic timeout for the orchestrator discovery process.

Specific updates (required)

How did you test each of these updates (required)

Tested with local geth. Introduced artificial delay in O and checked the logs in B.

Does this pull request close any open issues?

fix #2306


@leszko leszko requested a review from yondonfu March 14, 2022 13:42
@leszko leszko changed the title discovery: Add dynamic timeout for the orchstrator discovery discovery: Add dynamic timeout for the orchestrator discovery Mar 14, 2022
@yondonfu (Member) left a comment


IIUC the current implementation will re-send requests to all Os after each timeout until B reaches the max timeout. I'm not sure that B should re-send requests to all Os after each timeout, because if an O is taking a while to respond to the first request (and has not returned an error), re-sending the request doesn't seem like it would help. I was actually thinking that B would just extend the timeout for in-flight requests until the max timeout, such that there is only ever a single request per O during a discovery process. WDYT? Curious how you'd compare the benefits of re-sending the requests against the approach I mentioned.

If we implement this, I think the top level context could be used and a timer + timer.Reset() may come in handy. Something like this:

	currTimeout := getOrchestratorsCutoffTimeout
	timer := time.NewTimer(currTimeout)
	for i := 0; i < numAvailableOrchs && len(infos) < numOrchestrators && !timeout; i++ {
		select {
		case info := <-infoCh:
			if penalty := suspender.Suspended(info.Transcoder); penalty == 0 {
				infos = append(infos, info)
			} else {
				heap.Push(suspendedInfos, &suspension{info, penalty})
			}
			nbResp++
		case <-errCh:
			nbResp++
		case <-timer.C:
			// The top level context also timed out after maxGetOrchestratorCutoffTimeout
			// Set timeout to true and continue so we exit on the next loop iteration
			select {
			case <-ctx.Done():
				timeout = true
				continue
			default:
			}

			// The current timer timed out and we received at least 1 response
			// Set timeout to true and continue so we exit on the next loop iteration
			if nbResp > 0 {
				timeout = true
				continue
			}

			// The top level context did not time out yet and we did not receive any response
			// Double the timeout and reset the timer to extend the wait
			currTimeout = currTimeout * 2
			timer.Reset(currTimeout)
		}
	}
	cancel()

EDIT: I think the benefit of your approach is that if a lot of the Os return errors during discovery, re-sending the requests could give those Os an opportunity to return a valid response if they were encountering ephemeral issues, which can be helpful in the scenario where there are no other Os to work with. However, as-is, the approach would also involve re-sending requests to Os that B already received valid responses from previously.

@leszko (Contributor, Author) commented Mar 15, 2022

the approach would also involve re-sending requests to Os that B already received valid responses from previously.

That is actually not the case, because we only re-send requests if no orchestrator replied with a valid response.

@leszko (Contributor, Author) commented Mar 15, 2022

IIUC the current implementation will re-send requests to all Os after each timeout until B reaches the max timeout. I'm not sure that B should re-send requests to all Os after each timeout, because if an O is taking a while to respond to the first request (and has not returned an error), re-sending the request doesn't seem like it would help. I was actually thinking that B would just extend the timeout for in-flight requests until the max timeout, such that there is only ever a single request per O during a discovery process. WDYT? Curious how you'd compare the benefits of re-sending the requests against the approach I mentioned.

That is a very good point! And right, your proposed solution has the benefit that we don't start the requests again from scratch, but just extend the waiting time.

I see the following benefits of each solution.

| Benefits of re-sending requests | Benefits of extending the waiting time |
| --- | --- |
| gives a chance to orchestrators which returned errors | fewer requests received by the orchestrators, so we won't overload them with requests |
| gives a chance to orchestrators which didn't even receive the request (e.g. the request was lost in the network) | lower overall latency, because e.g. if an O replies in 3s, then we wait 4s and don't need to wait 7.5s = 0.5s + 1s + 2s + 4s |
| gives a chance to orchestrators which were temporarily down | |

Each solution will work for us, and each has its benefits: the first is more tolerant, the second optimizes for load and latency. Our context is that we won't actually see many retries in the real prod env, because the starting 500ms is already a high number for at least 1 O to return its response. So I see this feature for two use cases:

  • teststreams
  • some network anomaly (probably temporary and limited to a region)

Looking into these 2 cases, I'll probably not optimize for load or latency, but for a more "bulletproof" solution, so keeping it as it is. WDYT? @yondonfu
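For context, a rough sketch of the re-sending approach (not the actual diff; orchInfo, getOrchInfos, and the exact plumbing are placeholders assumed for illustration):

    package discovery

    import (
        "context"
        "time"
    )

    // orchInfo and getOrchInfos are placeholders for the real types and fan-out logic.
    type orchInfo struct{ Transcoder string }

    // getOrchInfos fans the discovery request out to all Os and collects the valid
    // responses received before ctx expires (placeholder implementation).
    func getOrchInfos(ctx context.Context, uris []string) []*orchInfo {
        return nil
    }

    // discoverWithResend re-sends the discovery request to all Os with a doubled timeout
    // whenever a round ends without a single valid response, until maxTimeout is reached
    // (e.g. 0.5s, 1s, 2s, 4s).
    func discoverWithResend(ctx context.Context, uris []string, initTimeout, maxTimeout time.Duration) []*orchInfo {
        timeout := initTimeout
        for {
            attemptCtx, cancel := context.WithTimeout(ctx, timeout)
            infos := getOrchInfos(attemptCtx, uris)
            cancel()

            // Stop as soon as any O returned a valid response, or once the cap is reached.
            if len(infos) > 0 || timeout >= maxTimeout {
                return infos
            }
            timeout *= 2
        }
    }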

@leszko leszko requested a review from yondonfu March 15, 2022 09:15
@yondonfu (Member) left a comment


Would it be possible to add a unit test for the new discovery behavior by stubbing serverGetOrchInfo() as is done here?

Review comment on discovery/discovery.go (resolved)
@yondonfu (Member) commented

@leszko

Looking into these 2 cases, I'll probably not optimize for load or latency, but for a more "bulletproof" solution, so keeping it as it is. WDYT?

Sounds reasonable to me.

@leszko (Contributor, Author) commented Mar 22, 2022

Would it be possible to add a unit test for the new discovery behavior by stubbing serverGetOrchInfo() as is done here?

Technically it's possible, but the approach of calling time.Sleep() in tests comes with drawbacks, and we need to weigh whether they exceed the benefits. The drawbacks are:

  • Every call to time.Sleep() effectively increases the test suite execution time
  • Every call to time.Sleep() makes the tests (more) flaky

So, personally, I'd reserve sleeping in tests for the cases where we really need it (e.g. push_test.go) and not test the retries here. In this case, a simple retry test adds 500ms, but if we want to test the whole exponential backoff spectrum, we add > 10s.

@leszko leszko requested a review from yondonfu March 22, 2022 12:50
@leszko (Contributor, Author) commented Mar 22, 2022

@yondonfu Addressed your comments. PTAL.

@yondonfu (Member) commented

@leszko

Agreed that your second point on making tests more flaky should be weighed as a cost relative to the benefit of introducing the suggested unit test. FWIW I think the first point could be addressed by setting a lower custom value for getOrchestratorsCutoffTimeout (that can be reset to its original value at the end of the test). With that being said, setting a lower custom value will probably exacerbate the second point, as the test could become flakier with lower timeouts and values passed to time.Sleep().

I'm comfortable moving forward without the unit test. Do you think the retry behavior could be a good candidate for an e2e test scenario once we're ready to add e2e tests?

@yondonfu (Member) left a comment


LGTM

@leszko (Contributor, Author) commented Mar 22, 2022

FWIW I think the first point could be addressed by setting a lower custom value for getOrchestratorsCutoffTimeout

Yeah, agree, we could mitigate it by overriding getOrchestratorsCutoffTimeout in tests.

I'm comfortable moving forward without the unit test. Do you think the retry behavior could be a good candidate for an e2e test scenario once we're ready to add e2e tests?

Yes. It's maybe not exactly a scenario for this timeout specifically, but some scenarios around discovery reliability in general would be a good fit.
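For illustration, the override-and-stub pattern could look roughly like this (a sketch only; the declarations and signatures of getOrchestratorsCutoffTimeout and serverGetOrchInfo shown here are assumptions, not the real ones in discovery.go):

    package discovery

    import (
        "context"
        "errors"
        "sync/atomic"
        "testing"
        "time"
    )

    // Hypothetical stand-ins for the real declarations in discovery.go.
    var getOrchestratorsCutoffTimeout = 500 * time.Millisecond
    var serverGetOrchInfo = func(ctx context.Context, uri string) (string, error) { return uri, nil }

    func TestDiscoveryRetriesOnError(t *testing.T) {
        // Lower the cutoff timeout so the test does not pay the full 500ms per retry round.
        origTimeout := getOrchestratorsCutoffTimeout
        getOrchestratorsCutoffTimeout = 10 * time.Millisecond
        defer func() { getOrchestratorsCutoffTimeout = origTimeout }()

        // Stub the orchestrator call: fail the first attempt, succeed afterwards,
        // so only a re-sent request can produce a valid response.
        var calls int32
        origGetOrchInfo := serverGetOrchInfo
        serverGetOrchInfo = func(ctx context.Context, uri string) (string, error) {
            if atomic.AddInt32(&calls, 1) == 1 {
                return "", errors.New("transient failure")
            }
            return uri, nil
        }
        defer func() { serverGetOrchInfo = origGetOrchInfo }()

        // The real test would now run discovery (e.g. the pool's GetOrchestrators) and
        // assert that the stub was called more than once and a valid response came back.
    }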

@leszko leszko merged commit b4f0520 into livepeer:master Mar 22, 2022
@leszko leszko deleted the 2306-dynamic-discovery-timeout branch March 22, 2022 14:37
@leszko leszko mentioned this pull request May 10, 2022
Successfully merging this pull request may close issue #2306: Dynamic timeout for Orchestrator discovery in Broadcaster.