Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully notify orchestrator in case of a panic in transcoder #2094

Merged
merged 6 commits into from
Nov 9, 2021

Conversation

leszko
Copy link
Contributor

@leszko leszko commented Nov 5, 2021

What does this pull request do? Explain your changes. (required)

Fix transcoder to send an error to orchestrator before crashing. Before, when the transcoder crashed after panic(), the orchestrator had to time out before trying another transcoder. The change is related only to the split O+T topology.

Specific updates (required)

  • Add recover() in to the transcoder functions (both for Nvidia and the standard CPU transcoding)
  • Add handling UnrecoverableError in ot_rpc.go to send the error message to orchestrator and only then to panic and crash

How did you test each of these updates (required)

  1. Add an artificial panic(errors.New("Some error")) inside transcoder.go#Transcode() function
  2. Build the project
  3. Start orchestrator + 2 transcoders
  4. Start broadcaster and a start streaming a video
  5. The first transcoder crashes, orchestrator is notified and selects the second transcoder without the timeout

The same test without this PR, makes orchestrator wait for the timeout before selecting the second transcoder.

Does this pull request close any open issues?

fix #2088

Checklist:

Copy link
Member

@yondonfu yondonfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a few minor nit comments.

Also, I believe the changes in LocalTranscoder and NvidiaTranscoder should be sufficient, but I wanted to just note that when the node does GPU transcoding, it actually uses the LoadBalancingTranscoder type which also implements the Transcoder interface. Under the hood, the LoadBalancingTranscoder uses NvidiaTranscoder [1]. When LoadBalancingTranscoder's Transcode() method is used, there is actually a nested NvidiaTranscoder.Transcode() call in a goroutine. I believe a panic in NvidiaTranscoder's Transcode call would be caught and returned as an error to LoadBalancingTranscoder and then it would be handled in the ot_rpc.go code the same way that it would be handled if the error was returned directly by NvidiaTranscoder (but feel free to correct me here if there is any disagreement on the expected behavior!).

[1] See this comment for more details #2070 (comment)

core/transcoder.go Show resolved Hide resolved
server/ot_rpc.go Show resolved Hide resolved
core/transcoder_test.go Show resolved Hide resolved
@leszko
Copy link
Contributor Author

leszko commented Nov 8, 2021

I believe a panic in NvidiaTranscoder's Transcode call would be caught and returned as an error to LoadBalancingTranscoder and then it would be handled in the ot_rpc.go code the same way that it would be handled if the error was returned directly by NvidiaTranscoder (but feel free to correct me here if there is any disagreement on the expected behavior!).

Yes, you're right, it'll work exactly as you expected. I just double-checked it. We could consider adding a unit test in lb_test.go, however, I think we already test returning a custom error in lb_test.go#TestLB_ConcurrentSessionErrors(). So, IMO the scenario you described is already covered.

Copy link
Contributor Author

@leszko leszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @yondonfu . I addressed your comments. PTAL.

core/transcoder.go Show resolved Hide resolved
core/transcoder_test.go Show resolved Hide resolved
server/ot_rpc.go Show resolved Hide resolved
@leszko leszko requested a review from yondonfu November 8, 2021 10:11
Copy link
Member

@yondonfu yondonfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Last thing before merging...

We typically like to prefix commit messages with the package that the commit updates (see this doc - we loosely follow the guidelines described) as its helpful when sifting through commit history later on. Would be great to update the commits accordingly and to also squash the commits updating CHANGELOG_PENDING into a single one [1]. Then we're good to go!

[1] We've typically saved this process for the end of the PR review process and at that time rebasing/modifying commit history is considered fair game (generally we try to avoid doing so in middle of the review process just for the reviewer's convenience although there are exceptions).

@leszko
Copy link
Contributor Author

leszko commented Nov 9, 2021

We typically like to prefix commit messages with the package that the commit updates (see this doc - we loosely follow the guidelines described) as its helpful when sifting through commit history later on. Would be great to update the commits accordingly and to also squash the commits updating CHANGELOG_PENDING into a single one [1]. Then we're good to go!

[1] We've typically saved this process for the end of the PR review process and at that time rebasing/modifying commit history is considered fair game (generally we try to avoid doing so in middle of the review process just for the reviewer's convenience although there are exceptions).

Ok, good to know. Then, I think that this PR should be a single commit with the message [core+server] Gracefully notify orchestrator in case of panic in transcoder. When the PR is approved (by both @yondonfu and @jailuthra), I'll just click "Squash and merge" and update the commit message there. Is that ok?

@leszko leszko requested a review from yondonfu November 9, 2021 09:00
Copy link
Contributor

@jailuthra jailuthra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚢 🥳

When the PR is approved [..], I'll just click "Squash and merge" and update the commit message there.

Yeah 👍 I think it's okay to squash and merge using github UI and edit the commit message there, given the changes in this PR don't touch many things.

For more complicated changes it's preferred to run a local rebase+squash to split independent things in separate commits, and force-push on the PR branch before merging.

I think that this PR should be a single commit with the message [core+server] Gracefully notify orchestrator in case of panic in transcoder

The message can probably be formatted as core,server: Gracefully notify.. or core+server: Gracefully notify.. to be closer to rest of the commit history. I don't think we've used the square brackets [] before.

Copy link
Member

@yondonfu yondonfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Return error from T to O before panic due to unrecoverable LPMS error
3 participants