discovery: Add dynamic timeout for the orchestrator discovery #2309
Conversation
IIUC the current implementation will re-send requests to all Os after each timeout until B reaches the max timeout. I'm not sure that B should re-send requests to all Os after each timeout, because if an O is taking a while to respond to the first request (and has not returned an error), re-sending the request doesn't seem like it would help. I was actually thinking that B would just extend the timeout for in-flight requests until the max timeout, such that there is only ever a single request per O during a discovery process. WDYT? Curious how you'd compare the benefits of re-sending the requests against the approach I mentioned.
If we implement this, I think the top level context could be used, and a timer + timer.Reset() may come in handy. Something like this:
currTimeout := getOrchestratorsCutoffTimeout
timer := time.NewTimer(currTimeout)
for i := 0; i < numAvailableOrchs && len(infos) < numOrchestrators && !timeout; i++ {
    select {
    case info := <-infoCh:
        if penalty := suspender.Suspended(info.Transcoder); penalty == 0 {
            infos = append(infos, info)
        } else {
            heap.Push(suspendedInfos, &suspension{info, penalty})
        }
        nbResp++
    case <-errCh:
        nbResp++
    case <-timer.C:
        // The top level context also timed out after maxGetOrchestratorCutoffTimeout.
        // Set timeout to true and continue so we exit on the next loop iteration.
        select {
        case <-ctx.Done():
            timeout = true
            continue
        default:
        }
        // The current timer timed out and we received at least 1 response.
        // Set timeout to true and continue so we exit on the next loop iteration.
        if nbResp > 0 {
            timeout = true
            continue
        }
        // The top level context has not timed out yet and we did not receive a
        // single response: double the timeout and reset the timer so we keep
        // waiting on the in-flight requests. Resetting is safe here because the
        // timer already fired and its channel was drained by this case.
        currTimeout = currTimeout * 2
        timer.Reset(currTimeout)
    }
}
cancel()
EDIT: I think the benefit of your approach is that if a lot of the Os return errors during discovery, re-sending the requests could give those Os an opportunity to return a valid response if they were encountering ephemeral issues, which can be helpful in the scenario where there are no other Os to work with. However, as-is, the approach would also involve re-sending requests to Os that B already received valid responses from previously.
That is actually not the case, because we only re-send requests if no orchestrator replied with a valid response.
That is a very good point! And right, your proposed solution has the benefit that we don't start the requests again from scratch, but just extend the waiting time. I see the following benefits of each solution.
Each solution will work for us, and each has its benefits: the first one is more fault tolerant, the second optimizes for load and latency. Our context is that we won't actually see many retries in the real prod env, because the starting …
Looking into these 2 cases, I'll probably not optimize for load or latency, but for a more "bulletproof" solution, so I'm keeping it as it is. WDYT? @yondonfu
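For comparison, a minimal sketch of the re-send approach being discussed. The function and helper names (discoverWithRetry, getOrchInfoAll, errNoOrchs) are illustrative assumptions for this sketch, not the PR's actual code; getOrchInfoAll is assumed to send GetOrchestratorInfo to every O and return whatever valid responses arrive before ctx expires.
// Sketch only: retry discovery with a doubled timeout while no O has
// returned a valid response, up to the max cutoff timeout.
func discoverWithRetry(orchs []*url.URL, startTimeout, maxTimeout time.Duration) ([]*net.OrchestratorInfo, error) {
    for timeout := startTimeout; timeout <= maxTimeout; timeout *= 2 {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        // Re-sends requests to all Os on every attempt
        infos := getOrchInfoAll(ctx, orchs)
        cancel()
        if len(infos) > 0 {
            // At least one valid response: no need to retry
            return infos, nil
        }
        // No valid response from any O: double the timeout and re-send
    }
    return nil, errNoOrchs
}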
Would it be possible to add a unit test for the new discovery behavior by stubbing serverGetOrchInfo(), as is done here?
Sounds reasonable to me.
Technically it's possible, but the approach with calling …
So, personally, I'd reserve using sleeping in tests for the cases where we really need it (e.g. …).
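For context, stubbing a package-level function variable in a Go test typically looks like the sketch below. The signature of serverGetOrchInfo shown here is an assumption for illustration and may not match the actual one in go-livepeer; imports are elided.
func TestDiscoveryRetry(t *testing.T) {
    // Swap the package-level function variable for a stub; restore it afterwards.
    origGetOrchInfo := serverGetOrchInfo
    defer func() { serverGetOrchInfo = origGetOrchInfo }()

    var calls int32
    serverGetOrchInfo = func(ctx context.Context, bcast common.Broadcaster, orchAddr *url.URL) (*net.OrchestratorInfo, error) {
        // Fail the first round of requests to force a retry with a doubled timeout
        if atomic.AddInt32(&calls, 1) == 1 {
            return nil, errors.New("transient error")
        }
        return &net.OrchestratorInfo{Transcoder: orchAddr.String()}, nil
    }

    // ... invoke the discovery code under test and assert that it retried
}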
@yondonfu Addressed your comments. PTAL.
Agreed that your second point on making tests more flaky should be weighed as a cost relative to the benefit of introducing the suggested unit test. FWIW I think the first point could be addressed by setting a lower custom value for … I'm comfortable moving forward without the unit test. Do you think the retry behavior could be a good candidate for an e2e test scenario once we're ready to add e2e tests?
LGTM
What does this pull request do? Explain your changes. (required)
Add dynamic timeout for the orchestrator discovery process.
Specific updates (required)
How did you test each of these updates (required)
Tested with local geth. Introduced artificial delay in O and checked the logs in B.
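A hypothetical way to introduce such an artificial delay on the O side for manual testing; the handler and field names here are illustrative, not the actual go-livepeer code.
// Sketch only: wrap O's discovery handler with an artificial delay so that
// B's dynamic timeout and retry behavior shows up in its logs.
func (s *orchServer) GetOrchestrator(ctx context.Context, req *net.OrchestratorRequest) (*net.OrchestratorInfo, error) {
    time.Sleep(2 * time.Second) // artificial delay for manual testing only
    return s.inner.GetOrchestrator(ctx, req)
}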
Does this pull request close any open issues?
fix #2306
Checklist:
- make runs successfully
- ./test.sh passes
- README and other documentation updated