core: Fix standalone orchestrator not crashing under UnrecoverableError #2352

leszko · 2022-04-05T13:38:18Z

What does this pull request do? Explain your changes. (required)

Fix the issue where a standalone O does not crash after an unrecoverable error from LPMS (ffmpeg).

The issue is described in this Discord thread. In short, when a GPU crashes, NVIDIA recommends restarting the whole process, since recovery is not possible. It worked that way until the "Gracefully notify orchestrator in case of a panic in transcoder" PR, which implemented the following logic:

1. Recover from LPMS (ffmpeg) panic
2. Send a message T->O
3. Panic

This logic is fine for the split O/T topology. The combined OT, however, goes through a separate code path, which also needs to implement step 3 (Panic).
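The recover, notify, re-panic sequence can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `UnrecoverableError` here is a local stand-in type, and `runTranscode` and the `notify` callback are hypothetical names chosen for the example.

```go
package main

import "fmt"

// UnrecoverableError is a stand-in for the error type that marks a failure
// the process cannot recover from (assumption: defined locally for this sketch).
type UnrecoverableError struct{ error }

// runTranscode (hypothetical helper) runs transcode and, if it panics:
//  1. recovers from the panic,
//  2. reports the failure through notify (the T->O message in a split setup),
//  3. panics again so that the whole process still goes down.
func runTranscode(transcode func() error, notify func(error)) error {
	defer func() {
		if r := recover(); r != nil {
			err := UnrecoverableError{fmt.Errorf("transcoder panic: %v", r)}
			notify(err) // step 2: report before crashing
			panic(err)  // step 3: re-panic; omitting this keeps a broken process alive
		}
	}()
	return transcode()
}

func main() {
	// Demo only: catch the re-panic here so the example exits cleanly.
	// A real node would let the panic terminate the process.
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("process would crash with: %v\n", r)
		}
	}()

	_ = runTranscode(
		func() error { panic("simulated GPU failure") },
		func(err error) { fmt.Println("notifying orchestrator:", err) },
	)
}
```

The point of the sketch is step 3: any code path that recovers from the transcoder panic, including the combined OT one, has to panic again after handling it, or the process survives in a broken state.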

Specific updates (required)

How did you test each of these updates (required)

Artificially introduce panic() and check that a standalone O crashes.

Does this pull request close any open issues?

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
~~README and other documentation updated~~
Pending changelog updated

victorges

Nice! Just a few questions/discussions on the crashing logic

core/orchestrator.go

victorges

LGTM

core: Fix standalone orchestrator not crashing under UnrecoverableError

e2a7882

leszko requested review from iameli, victorges and yondonfu April 5, 2022 13:38

Update CHANGELOG_PENDING.md

8dbd0de

victorges approved these changes Apr 5, 2022

victorges approved these changes Apr 6, 2022

leszko merged commit 77d9e0f into livepeer:master Apr 7, 2022

leszko deleted the fix-handling-unrecoverable-error branch April 7, 2022 08:26

ad-astra-video pushed a commit to ad-astra-video/go-livepeer that referenced this pull request May 25, 2022

core: Fix standalone orchestrator not crashing under UnrecoverableErr…

c2a0a5e

…or (livepeer#2352)

yondonfu mentioned this pull request Nov 14, 2022

Only mark CUDA_ERROR_ILLEGAL_ADDRESS errors as unrecoverable errors livepeer/lpms#356

Open
