
Fixed a transcoding bug that occurred when remote transcoder was removed #2747

Merged · 9 commits · Feb 14, 2023

Conversation

eliteprox
Contributor

What does this pull request do? Explain your changes. (required)
This change fixes a bug that caused an orchestrator to lose all of its transcoders when one of its transcoders disconnected unexpectedly.

Specific updates (required)

  • Simplified the logic in removeFromRemoteTranscoders so that the correct transcoder is removed from RemoteTranscoderManager
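The diff itself isn't shown on this page, so the following is only a sketch of the general shape of such a removal helper. The RemoteTranscoder struct and the linear-scan approach are assumptions modelled on the tests quoted later in the thread, not the PR's actual code:

```go
package main

import "fmt"

// RemoteTranscoder is a stand-in for the struct in core/orchestrator.go;
// only the addr field matters for this sketch.
type RemoteTranscoder struct {
	addr string
}

// removeFromRemoteTranscoders sketches the simplified removal: scan the
// slice for the matching transcoder and drop exactly that entry, leaving
// every other entry untouched. Removing an absent entry is a no-op.
// This is an assumed reconstruction, not the PR's exact diff.
func removeFromRemoteTranscoders(tr *RemoteTranscoder, list []*RemoteTranscoder) []*RemoteTranscoder {
	for i, t := range list {
		if t == tr {
			// The full slice expression (list[:i:i]) forces append to
			// copy rather than clobber the shared backing array.
			return append(list[:i:i], list[i+1:]...)
		}
	}
	return list // transcoder not found; return the list unchanged
}

func main() {
	a, b, c := &RemoteTranscoder{addr: "a"}, &RemoteTranscoder{addr: "b"}, &RemoteTranscoder{addr: "c"}
	list := []*RemoteTranscoder{a, b, c}
	list = removeFromRemoteTranscoders(b, list)
	fmt.Println(len(list), list[0].addr, list[1].addr) // prints: 2 a c
}
```

The key property is that only the matching entry is dropped; the bug being fixed manifested as other transcoders disappearing along with the disconnected one.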

How did you test each of these updates? (required)
We tested up to 4 streams going to an orchestrator with up to 4 remote transcoders. We stopped one of the transcoders in the following scenarios and observed streams move to another transcoder without disconnecting any other transcoders.

  • 1 stream on each of the 4 remote transcoders
  • 2 streams on each of 2 remote transcoders
  • 4 streams on 1 transcoder with 3 additional transcoders having 0 load.

In every case we observed the streams safely move to another available transcoder without causing other transcoders to break.

Does this pull request close any open issues?

#2605
#2706


@codecov

codecov bot commented Feb 7, 2023

Codecov Report

Merging #2747 (641026a) into master (f07d69d) will decrease coverage by 0.02053%.
The diff coverage is 100.00000%.

❗ Current head 641026a differs from pull request most recent head ea030cf. Consider uploading reports for the commit ea030cf to get more accurate results

Impacted file tree graph

@@                 Coverage Diff                 @@
##              master       #2747         +/-   ##
===================================================
- Coverage   56.34825%   56.32772%   -0.02053%     
===================================================
  Files             88          88                 
  Lines          19147       19138          -9     
===================================================
- Hits           10789       10780          -9     
  Misses          7767        7767                 
  Partials         591         591                 
Impacted Files Coverage Δ
core/orchestrator.go 77.93296% <100.00000%> (-0.27394%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ec1972...ea030cf. Read the comment docs.


@eliteprox eliteprox marked this pull request as ready for review February 7, 2023 06:34
@Titan-Node

Works great. Did many tests removing and adding multiple transcoders without any dropped streams.

@thomshutt thomshutt requested a review from leszko February 7, 2023 13:58
@thomshutt
Contributor

This looks like a great simplification and as far as I can tell, the original code is just convoluted because of micro-optimisations that we don't need. Could you also please add a unit test so that we can avoid breaking this again in the future?

@eliteprox
Contributor Author

eliteprox commented Feb 7, 2023

This looks like a great simplification and as far as I can tell, the original code is just convoluted because of micro-optimisations that we don't need. Could you also please add a unit test so that we can avoid breaking this again in the future?

Yep! I will write some tests, I was thinking the same. This is critical functionality. Thanks for looking it over 😎

Contributor

@leszko leszko left a comment

Nice, great simplification.

+1 for adding unit tests.

Also wondering, what's the functional difference between the old and the new code? Isn't it working exactly the same? How does it fix the issue you encountered?

CHANGELOG_PENDING.md (comment resolved)
@stronk-dev
Contributor

Already has a unit test BTW:

func TestRemoveFromRemoteTranscoders(t *testing.T) {
	remoteTranscoderList := []*RemoteTranscoder{}
	assert := assert.New(t)
	// Create 5 transcoders
	tr := make([]*RemoteTranscoder, 5)
	for i := 0; i < 5; i++ {
		tr[i] = &RemoteTranscoder{addr: "testAddress" + strconv.Itoa(i)}
	}
	// Add to list
	remoteTranscoderList = append(remoteTranscoderList, tr...)
	assert.Len(remoteTranscoderList, 5)
	// Remove transcoder from the head of the list
	remoteTranscoderList = removeFromRemoteTranscoders(tr[0], remoteTranscoderList)
	assert.Equal(remoteTranscoderList[0], tr[1])
	assert.Equal(remoteTranscoderList[1], tr[2])
	assert.Equal(remoteTranscoderList[2], tr[3])
	assert.Equal(remoteTranscoderList[3], tr[4])
	assert.Len(remoteTranscoderList, 4)
	// Remove transcoder from the middle of the list
	remoteTranscoderList = removeFromRemoteTranscoders(tr[3], remoteTranscoderList)
	assert.Equal(remoteTranscoderList[0], tr[1])
	assert.Equal(remoteTranscoderList[1], tr[2])
	assert.Equal(remoteTranscoderList[2], tr[4])
	assert.Len(remoteTranscoderList, 3)
	// Remove another transcoder from the middle of the list
	remoteTranscoderList = removeFromRemoteTranscoders(tr[2], remoteTranscoderList)
	assert.Equal(remoteTranscoderList[0], tr[1])
	assert.Equal(remoteTranscoderList[1], tr[4])
	assert.Len(remoteTranscoderList, 2)
	// Remove transcoder from the end of the list
	remoteTranscoderList = removeFromRemoteTranscoders(tr[4], remoteTranscoderList)
	assert.Equal(remoteTranscoderList[0], tr[1])
	assert.Len(remoteTranscoderList, 1)
	// Remove the last transcoder
	remoteTranscoderList = removeFromRemoteTranscoders(tr[1], remoteTranscoderList)
	assert.Len(remoteTranscoderList, 0)
	// Remove a transcoder when the list is empty
	remoteTranscoderList = removeFromRemoteTranscoders(tr[1], remoteTranscoderList)
	emptyTList := []*RemoteTranscoder{}
	assert.Equal(remoteTranscoderList, emptyTList)
}

@thomshutt
Contributor

thomshutt commented Feb 12, 2023

Good point @stronk-dev - I'd still want to update that test so that it would've caught the case that was breaking, though.

@eliteprox
Contributor Author

eliteprox commented Feb 13, 2023

Also wondering, what's the functional difference between the old and the new code? Isn't it working exactly the same? How does it fix the issue you encountered?

It appears some funky logic or a race condition led to it; I couldn't follow the original logic. I discovered this after monitoring my own node. We then tested with real test streams until it was a confirmed bug, and then worked through the solution. The issue occurred only when there were 2 or more streams on the transcoder that was disconnected.

I wrote a new test which reproduced the issue on the old version of removeFromRemoteTranscoders and passes on the new code.

func TestRemoveFromRemoteTranscoders_StreamFailover(t *testing.T) {
	rtm := NewRemoteTranscoderManager()
	assert := assert.New(t)
	capabilities := NewCapabilities(append(DefaultCapabilities(), OptionalCapabilities()...), []Capability{})
	var transcoder_1 = &StubTranscoderServer{manager: rtm}
	var transcoder_2 = &StubTranscoderServer{manager: rtm}
	assert.Nil(rtm.liveTranscoders[transcoder_1])
	assert.Empty(rtm.remoteTranscoders)
	// Register 2 transcoders, 8 stream capacity each
	go func() { rtm.Manage(transcoder_1, 8, capabilities.ToNetCapabilities()) }()
	time.Sleep(1 * time.Millisecond) // allow time for first transcoder to register
	go func() { rtm.Manage(transcoder_2, 8, capabilities.ToNetCapabilities()) }()
	time.Sleep(1 * time.Millisecond) // allow time for second transcoder to register
	// Validate that two transcoders exist
	assert.NotNil(rtm.liveTranscoders[transcoder_1])
	assert.NotNil(rtm.liveTranscoders[transcoder_2])
	assert.Len(rtm.remoteTranscoders, 2)
	// Load up streams on the transcoders
	for i := 0; i < 8; i++ {
		testSessionId := fmt.Sprintf("%s%d", "testID", i)
		rtm.selectTranscoder(testSessionId, nil)
		time.Sleep(1 * time.Millisecond) // allow time for the stream to register
	}
	t1 := rtm.liveTranscoders[transcoder_1]
	t2 := rtm.liveTranscoders[transcoder_2]
	// Validate 4 streams per transcoder
	assert.Equal(4, t1.load)
	assert.Equal(4, t2.load)
	// Break the connection to a transcoder by deleting its live transcoder map entry
	rtm.RTmutex.Lock()
	delete(rtm.liveTranscoders, t1.stream)
	rtm.RTmutex.Unlock()
	// Repeat selectTranscoder for the existing session ids - removes the dead
	// transcoder and shifts its streams to the other transcoder
	for i := 0; i < 8; i++ {
		testSessionId := fmt.Sprintf("%s%d", "testID", i)
		rtm.selectTranscoder(testSessionId, nil)
		time.Sleep(1 * time.Millisecond) // allow time for the stream to register
	}
	t1 = rtm.liveTranscoders[transcoder_1]
	t2 = rtm.liveTranscoders[transcoder_2]
	// Validate all 8 streams on the opposite transcoder
	assert.Empty(t1)
	assert.Equal(8, t2.load)
}

In the test, which uses 2 remote transcoders and 8 streams:
Old code: only 5 streams made it to the 2nd transcoder
New code: all 8 streams moved to the 2nd transcoder

@eliteprox
Contributor Author

eliteprox commented Feb 13, 2023

Added the new test TestRemoveFromRemoteTranscoders_StreamFailover. I left the previous test in place because the new one focuses on the high-level result of shifting streams to the opposite transcoder, and uses the selectTranscoder function to validate this.

@eliteprox
Contributor Author

@leszko @thomshutt Any chance this can make it into 0.5.38? #2753


// register 2 transcoders, 8 stream capacity each
go func() { rtm.Manage(transcoder_1, 8, capabilities.ToNetCapabilities()) }()
time.Sleep(1 * time.Millisecond) // allow time for first stream to register
Contributor

Won't this test be flaky? I think depending on timing (especially in ms) can make this test sometimes pass, sometimes fail. I'd avoid having time-dependent tests.

Contributor Author

I copied the wait commands from another test. I think they are not needed; I didn't see anything in my development that indicates they are required, so I can try removing them.

Contributor

Yeah, then I suggest removing them if they're not needed. Also, do you need to run these functions above in separate goroutines?

Contributor Author

Yes, apparently the separate goroutine is required in this case; without it, the test times out. I tried removing the time.Sleep between them, but it caused the test to move on without the transcoder connected at all.

I pushed a commit to clean up a few related lines.

Contributor

Hmm, TBH, in my opinion it's better not to have tests with time.Sleep(), because they will sometimes fail and sometimes pass in our CI. So, if it's not possible (or easy) to write the test without sleeps, then I suggest removing this test and just merging without it. @thomshutt wdyt?

The improvement to the code is already good, so should be safe to merge.
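For context, the usual alternative to sleep-based waits in Go tests is to block on a channel that the worker closes once it is actually registered, so the test proceeds exactly when the precondition holds instead of after a guessed delay. A minimal sketch of that pattern — the manage and waitLive helpers here are hypothetical illustrations, not functions from this codebase:

```go
package main

import (
	"fmt"
	"time"
)

// manage stands in for a long-running registration loop like rtm.Manage.
// Instead of the caller sleeping a fixed millisecond and hoping registration
// has happened, the worker closes the channel the moment it is live.
func manage(registered chan<- struct{}) {
	// ... real registration work would happen here ...
	close(registered) // signal: the transcoder is now registered
	select {}         // then block for the lifetime of the connection
}

// waitLive blocks until registration is signalled, with a timeout so a
// broken test fails fast instead of hanging CI.
func waitLive(registered <-chan struct{}) error {
	select {
	case <-registered:
		return nil
	case <-time.After(time.Second):
		return fmt.Errorf("transcoder never registered")
	}
}

func main() {
	reg := make(chan struct{})
	go manage(reg)
	if err := waitLive(reg); err != nil {
		panic(err)
	}
	fmt.Println("registered") // deterministic: no sleep-based race
}
```

The timeout only bounds failure; a passing run returns as soon as the signal arrives, so the test is both faster and deterministic compared to fixed sleeps.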

@leszko leszko mentioned this pull request Feb 13, 2023
@thomshutt
Contributor

@leszko @eliteprox I don't understand - if the problem is this method being called by multiple goroutines in an unsafe way then shouldn't the problem be fixed further up the stack?

If the problem is inside removeFromRemoteTranscoders then we should be able to write a test that just calls that method, without any concurrency code.
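A sequential test in that spirit — no goroutines, no sleeps — might look like the following sketch. The RemoteTranscoder shape and the removal function are re-sketched locally as assumptions so the example stands alone; the real implementation lives in core/orchestrator.go:

```go
package main

import "fmt"

type RemoteTranscoder struct{ addr string }

// Assumed local reconstruction of the removal helper so this example
// compiles on its own.
func removeFromRemoteTranscoders(tr *RemoteTranscoder, list []*RemoteTranscoder) []*RemoteTranscoder {
	for i, t := range list {
		if t == tr {
			return append(list[:i:i], list[i+1:]...)
		}
	}
	return list
}

func main() {
	// Purely sequential checks: remove from the head, the middle, and the
	// tail, verifying the survivors each time. Nothing here depends on
	// timing, so the test cannot be flaky.
	for _, removeAt := range []int{0, 1, 2} {
		list := []*RemoteTranscoder{{"a"}, {"b"}, {"c"}}
		victim := list[removeAt]
		list = removeFromRemoteTranscoders(victim, list)
		if len(list) != 2 {
			panic("wrong length after removal")
		}
		for _, t := range list {
			if t == victim {
				panic("victim still present after removal")
			}
		}
	}
	fmt.Println("sequential removal checks passed")
}
```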

@leszko
Contributor

leszko commented Feb 13, 2023

@leszko @eliteprox I don't understand - if the problem is this method being called by multiple goroutines in an unsafe way then shouldn't the problem be fixed further up the stack?

If the problem is inside removeFromRemoteTranscoders then we should be able to write a test that just calls that method, without any concurrency code.

From my current understanding, there are 2 issues:

  • this function is unnecessarily overcomplicated => this PR solves this issue
  • this function is used in an unsafe manner by multiple goroutines => this PR does not solve this issue (though might make it less visible)

Then, I suggest doing the following:

  1. For this PR, write a very simple sequential unit test just to test the function removeFromRemoteTranscoders(), nothing more
  2. Do not include the test TestRemoveFromRemoteTranscoders_StreamFailover, since I believe it's flaky
  3. Merge this PR
  4. Create a GH Issue to solve the concurrency issue (unless it's simple, then @eliteprox feel free to fix it in this PR)
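The concurrency issue in point 4 would typically be addressed by making sure every mutation of the shared transcoder slice happens under the manager's lock (the failover test above already uses an RTmutex for the live-transcoder map). A sketch of that locking discipline, with the manager shape and field names assumed rather than taken from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

type RemoteTranscoder struct{ addr string }

// manager sketches the discipline: every read or write of the shared
// remoteTranscoders slice goes through mu, so concurrent selectTranscoder
// and disconnect paths cannot interleave mid-removal. The field names are
// assumptions modelled on the RTmutex usage seen in the tests above.
type manager struct {
	mu                sync.Mutex
	remoteTranscoders []*RemoteTranscoder
}

// remove deletes tr from the slice while holding the lock.
func (m *manager) remove(tr *RemoteTranscoder) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i, t := range m.remoteTranscoders {
		if t == tr {
			m.remoteTranscoders = append(m.remoteTranscoders[:i], m.remoteTranscoders[i+1:]...)
			return
		}
	}
}

func main() {
	a, b := &RemoteTranscoder{"a"}, &RemoteTranscoder{"b"}
	m := &manager{remoteTranscoders: []*RemoteTranscoder{a, b}}
	// Two goroutines removing concurrently: with the lock, both removals
	// land and the slice never corrupts.
	var wg sync.WaitGroup
	for _, tr := range []*RemoteTranscoder{a, b} {
		tr := tr
		wg.Add(1)
		go func() { defer wg.Done(); m.remove(tr) }()
	}
	wg.Wait()
	fmt.Println(len(m.remoteTranscoders)) // prints: 0
}
```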

@eliteprox
Contributor Author

  • For this PR, write a very simple sequential unit test just to test the function removeFromRemoteTranscoders(), nothing more
  • Do not include the test TestRemoveFromRemoteTranscoders_StreamFailover, since I believe it's flaky
  • Merge this PR
  • Create a GH Issue to solve the concurrency issue (unless it's simple, then @eliteprox feel free to fix it in this PR)

Can we simply continue using the existing TestRemoveFromRemoteTranscoders test for now until the tests can be refactored, and merge this PR? This test properly covers removing transcoders within its own scope, imho.

I've removed TestRemoveFromRemoteTranscoders_StreamFailover. I created GH issue #2756 for the refactoring of tests.

@leszko
Contributor

leszko commented Feb 13, 2023

  • For this PR, write a very simple sequential unit test just to test the function removeFromRemoteTranscoders(), nothing more
  • Do not include the test TestRemoveFromRemoteTranscoders_StreamFailover, since I believe it's flaky
  • Merge this PR
  • Create a GH Issue to solve the concurrency issue (unless it's simple, then @eliteprox feel free to fix it in this PR)

Can we simply continue using the existing TestRemoveFromRemoteTranscoders test for now until the tests can be refactored, and merge this PR? This test properly covers removing transcoders within it's own scope, imho.

I've removed TestRemoveFromRemoteTranscoders_StreamFailover. I created GH issue #2756 for the refactoring of tests.

I'm ok with this. I think this change is safe as it is, and it simplifies the code a lot. @thomshutt wdyt?

Contributor

@leszko leszko left a comment

LGTM

Let's wait for @thomshutt to comment/approve and then we can merge.

@Titan-Node

Been running this upgrade on prod for 4 days with no errors to report

@thomshutt
Contributor

LGTM, thanks for the fix and for responding to all our questions @eliteprox!
