
discovery: Add dynamic timeout for the orchestrator discovery #2309

Merged
merged 2 commits into livepeer:master on Mar 22, 2022

Conversation

@leszko (Contributor) commented Mar 14, 2022

What does this pull request do? Explain your changes. (required)

Add dynamic timeout for the orchestrator discovery process.

Specific updates (required)

How did you test each of these updates (required)

Tested with local geth. Introduced artificial delay in O and checked the logs in B.

Does this pull request close any open issues?

fix #2306


@leszko leszko requested a review from yondonfu March 14, 2022 13:42
@leszko leszko changed the title discovery: Add dynamic timeout for the orchstrator discovery discovery: Add dynamic timeout for the orchestrator discovery Mar 14, 2022
@yondonfu (Member) left a comment


IIUC the current implementation will re-send requests to all Os after each timeout until B reaches the max timeout. I'm not sure that B should re-send requests to all Os after each timeout, because if an O is taking a while to respond to the first request (and has not returned an error), re-sending the request doesn't seem like it would help. I was actually thinking that B would just extend the timeout for in-flight requests until the max timeout, such that there is only ever a single request per O during a discovery process. WDYT? Curious how you'd compare the benefits of re-sending the requests against the approach I mentioned.

If we implement this, I think the top level context could be used and a timer + timer.Reset() may come in handy. Something like this:

	currTimeout := getOrchestratorsCutoffTimeout
	timer := time.NewTimer(currTimeout)
	for i := 0; i < numAvailableOrchs && len(infos) < numOrchestrators && !timeout; i++ {
		select {
		case info := <-infoCh:
			if penalty := suspender.Suspended(info.Transcoder); penalty == 0 {
				infos = append(infos, info)
			} else {
				heap.Push(suspendedInfos, &suspension{info, penalty})
			}
			nbResp++
		case <-errCh:
			nbResp++
		case <-timer.C:
			// The top level context also timed out after maxGetOrchestratorCutoffTimeout
			// Set timeout to true and continue so we exit on the next loop iteration
			select {
			case <-ctx.Done():
				timeout = true
				continue
			default:
			}

			// The current timer timed out and we received at least 1 response
			// Set timeout to true and continue so we exit on the next loop iteration
			if nbResp > 0 {
				timeout = true
				continue
			}

			// The top level context did not time out yet and we did not receive any response
			// Double the timeout and reset the timer to extend the wait
			currTimeout = currTimeout * 2
			timer.Reset(currTimeout)
		}
	}
	cancel()

EDIT: I think the benefit of your approach is that if a lot of the Os return errors during discovery, re-sending the requests could give those Os an opportunity to return a valid response if they were encountering ephemeral issues, which can be helpful in the scenario where there are no other Os to work with. However, as-is, the approach would also involve re-sending requests to Os that B already received valid responses from previously.

@leszko (Contributor, Author) commented Mar 15, 2022

the approach would also involve re-sending requests to Os that B already received valid responses from previously.

That is actually not the case, because we only re-send requests if no orchestrator replied with a valid response.

@leszko (Contributor, Author) commented Mar 15, 2022

IIUC the current implementation will re-send requests to all Os after each timeout until B reaches the max timeout. I'm not sure that B should re-send requests to all Os after each timeout, because if an O is taking a while to respond to the first request (and has not returned an error), re-sending the request doesn't seem like it would help. I was actually thinking that B would just extend the timeout for in-flight requests until the max timeout, such that there is only ever a single request per O during a discovery process. WDYT? Curious how you'd compare the benefits of re-sending the requests against the approach I mentioned.

That is a very good point! And right, your proposed solution has the benefit that we don't start the requests again from scratch, but just extend the waiting time.

I see the following benefits of each solution.

| Benefits of re-sending requests | Benefits of extending the waiting time |
| --- | --- |
| gives a chance to orchestrators which returned errors | fewer requests received by the orchestrators, so we won't overload them with requests |
| gives a chance to orchestrators which didn't even receive the request (e.g. the request was lost in the network) | lower overall latency, because e.g. if an O replies in 3s, then we wait 4s and don't need to wait 7.5s = 0.5s + 1s + 2s + 4s |
| gives a chance to orchestrators which were temporarily down | |

Each solution will work for us, and each has its benefits: the first is more tolerant, the second optimizes for load and latency. Our context is that we won't actually see many retries in the real prod env, because the starting 500ms is already a high number for at least 1 O to return its response. So I see this feature for two use cases:

  • teststreams
  • some network anomaly (probably temporary and limited to a region)

Looking into these 2 cases, I'll probably not optimize for load or latency, but for a more "bulletproof" solution, so keeping it as it is. WDYT? @yondonfu
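For context, a rough sketch of the re-sending approach (not the actual diff; orchInfo, getOrchInfos, and the exact plumbing are placeholders assumed for illustration):

    package discovery

    import (
        "context"
        "time"
    )

    // orchInfo and getOrchInfos are placeholders for the real types and fan-out logic.
    type orchInfo struct{ Transcoder string }

    // getOrchInfos fans the discovery request out to all Os and collects the valid
    // responses received before ctx expires (placeholder implementation).
    func getOrchInfos(ctx context.Context, uris []string) []*orchInfo {
        return nil
    }

    // discoverWithResend re-sends the discovery request to all Os with a doubled timeout
    // whenever a round ends without a single valid response, until maxTimeout is reached
    // (e.g. 0.5s, 1s, 2s, 4s).
    func discoverWithResend(ctx context.Context, uris []string, initTimeout, maxTimeout time.Duration) []*orchInfo {
        timeout := initTimeout
        for {
            attemptCtx, cancel := context.WithTimeout(ctx, timeout)
            infos := getOrchInfos(attemptCtx, uris)
            cancel()

            // Stop as soon as any O returned a valid response, or once the cap is reached.
            if len(infos) > 0 || timeout >= maxTimeout {
                return infos
            }
            timeout *= 2
        }
    }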

@leszko leszko requested a review from yondonfu March 15, 2022 09:15
@yondonfu (Member) left a comment


Would it be possible to add a unit test for the new discovery behavior by stubbing serverGetOrchInfo() as is done here?

Review comment on discovery/discovery.go (resolved)
@yondonfu (Member) commented

@leszko

Looking into these 2 cases, I'll probably not optimize for load or latency, but for a more "bulletproof" solution, so keeping it as it is. WDYT?

Sounds reasonable to me.

@leszko (Contributor, Author) commented Mar 22, 2022

Would it be possible to add a unit test for the new discovery behavior by stubbing serverGetOrchInfo() as is done here?

Technically it's possible, but the approach of calling time.Sleep() in tests comes with drawbacks, and we need to weigh whether they exceed the benefits. The drawbacks are:

  • Every call to time.Sleep() effectively increases the test suite execution time
  • Every call to time.Sleep() makes the tests (more) flaky

So, personally, I'd reserve sleeping in tests for the cases where we really need it (e.g. push_test.go) and not test the retries here. In this case, a simple retry test adds 500ms, but if we want to test the whole exponential backoff spectrum, we add > 10s.

@leszko leszko requested a review from yondonfu March 22, 2022 12:50
@leszko (Contributor, Author) commented Mar 22, 2022

@yondonfu Addressed your comments. PTAL.

@yondonfu (Member) commented

@leszko

Agreed that your second point on making tests more flaky should be weighed as a cost relative to the benefit of introducing the suggested unit test. FWIW I think the first point could be addressed by setting a lower custom value for getOrchestratorsCutoffTimeout (that can be reset to its original value at the end of the test). With that being said, setting a lower custom value will probably exacerbate the second point, as the test could become flakier with lower timeouts and values passed to time.Sleep().

I'm comfortable moving forward without the unit test. Do you think the retry behavior could be a good candidate for an e2e test scenario once we're ready to add e2e tests?

@yondonfu (Member) left a comment


LGTM

@leszko (Contributor, Author) commented Mar 22, 2022

FWIW I think the first point could be addressed by setting a lower custom value for getOrchestratorsCutoffTimeout

Yeah, agree, we could mitigate it by overriding getOrchestratorsCutoffTimeout in tests.

I'm comfortable moving forward without the unit test. Do you think the retry behavior could be a good candidate for an e2e test scenario once we're ready to add e2e tests?

Yes. It's maybe not exactly a scenario for this timeout specifically, but some scenarios around discovery reliability in general would be a good fit.
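For illustration, the override-and-stub pattern could look roughly like this (a sketch only; the declarations and signatures of getOrchestratorsCutoffTimeout and serverGetOrchInfo shown here are assumptions, not the real ones in discovery.go):

    package discovery

    import (
        "context"
        "errors"
        "sync/atomic"
        "testing"
        "time"
    )

    // Hypothetical stand-ins for the real declarations in discovery.go.
    var getOrchestratorsCutoffTimeout = 500 * time.Millisecond
    var serverGetOrchInfo = func(ctx context.Context, uri string) (string, error) { return uri, nil }

    func TestDiscoveryRetriesOnError(t *testing.T) {
        // Lower the cutoff timeout so the test does not pay the full 500ms per retry round.
        origTimeout := getOrchestratorsCutoffTimeout
        getOrchestratorsCutoffTimeout = 10 * time.Millisecond
        defer func() { getOrchestratorsCutoffTimeout = origTimeout }()

        // Stub the orchestrator call: fail the first attempt, succeed afterwards,
        // so only a re-sent request can produce a valid response.
        var calls int32
        origGetOrchInfo := serverGetOrchInfo
        serverGetOrchInfo = func(ctx context.Context, uri string) (string, error) {
            if atomic.AddInt32(&calls, 1) == 1 {
                return "", errors.New("transient failure")
            }
            return uri, nil
        }
        defer func() { serverGetOrchInfo = origGetOrchInfo }()

        // The real test would now run discovery (e.g. the pool's GetOrchestrators) and
        // assert that the stub was called more than once and a valid response came back.
    }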

@leszko leszko merged commit b4f0520 into livepeer:master Mar 22, 2022
@leszko leszko deleted the 2306-dynamic-discovery-timeout branch March 22, 2022 14:37
@leszko leszko mentioned this pull request May 10, 2022
Successfully merging this pull request may close issue #2306: Dynamic timeout for Orchestrator discovery in Broadcaster.