
GetOrchestrator slowness when PM Sender data is not cached #2849

Merged: 6 commits merged into master from md/adastra-fix on Nov 6, 2023

Conversation

@stronk-dev (Contributor) commented Aug 30, 2023

What does this pull request do?

Titan Node had some curious insights into the relationship between your Eth RPC provider and the response times to GetOrchestrator requests, and started investigating. @ad-astra-video assisted, and they found that during a GetOrchestrator request a blocking RPC call could be made to check the Broadcaster's float.

This commit is pulled from @ad-astra-video's fork of go-livepeer. It increases the timeout for how long Broadcaster info stays cached and adds explicit calls to refresh that data in the background every 10 minutes.

My questions for the person reviewing this:

  • Are these explicit update calls on the cleanupInterval required, or do the blockwatchers already keep all relevant B data up-to-date as long as the PM sender is cached? (A fuller sketch of this loop follows right after this list.)

        case <-ticker.C:
            sm.smgr.UpdateSenderInfos()
            sm.smgr.UpdateClaimedReserves(sm.cfg.Claimant)
            sm.cleanup()

  • If there is data we can keep cached this way, what would be a sensible cleanupInterval, and are there any other calls we might want to make?
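
For context, here is a minimal, self-contained sketch of what the background refresh loop looks like. The interface and names below are illustrative stand-ins, not the actual LocalSenderMonitor types; only the 10-minute cadence and the three calls shown above come from the actual change.

    package pmcache // illustrative package name

    import "time"

    // senderCache is a stand-in for the PM sender-data store; the real types
    // in pm/sendermonitor.go differ. The claimant is simplified to a string
    // here; the real code uses an Ethereum address.
    type senderCache interface {
        UpdateSenderInfos()                    // refresh cached SenderInfo via eth RPC
        UpdateClaimedReserves(claimant string) // refresh cached ClaimedReserve via eth RPC
        Cleanup()                              // evict entries past the cache timeout
    }

    // startCleanupLoop refreshes cached Broadcaster data in the background so
    // that GetOrchestrator never has to block on an eth RPC call.
    func startCleanupLoop(cache senderCache, claimant string, quit <-chan struct{}) {
        ticker := time.NewTicker(10 * time.Minute)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                cache.UpdateSenderInfos()
                cache.UpdateClaimedReserves(claimant)
                cache.Cleanup()
            case <-quit:
                return
            }
        }
    }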

@stronk-dev changed the title from "move SenderInfo and ClaimedReserve to background refreshes in cleanup…" to "GetOrchestrator slowness when PM Sender data is not cached" on Aug 30, 2023
@ad-astra-video (Contributor)

I want to add log lines pulled from the Titan Node orchestrator while testing this. These log lines have since been removed, but they show where in the code the OrchestratorInfo response generation slowed down. You can also see that every other step of generating the response takes only 4ms of the 484ms total.

When autoAdjustPrice is not set to false, the same SenderInfo and ClaimedReserve calls are the slowdown point.

log (2).txt

@Titan-Node

Been running this update for 48 hours, all streams and node stats look fine.

@thomshutt (Contributor)

Thanks @stronk-dev, I'm going to get someone to start looking at this today

@thomshutt (Contributor)

Looks like we have genuine test failures here, could you take a look please @stronk-dev?

@ad-astra-video (Contributor)

> Looks like we have genuine test failures here, could you take a look please @stronk-dev?

The tests have not been updated yet. We were looking for feedback on moving the eth RPC calls out of OrchestratorInfo response generation and into the background, on the change in cache time from 60 seconds to 1 week, and on the cleanup interval change from 1 minute to 10 minutes. There appears to be a blockWatcher that accounts for broadcaster reserve changes, but with the cache expiring after only 60 seconds it did not get to operate for long. In Titan's testing, at the 1-2 minute mark the OrchestratorInfo response would slow down because of the eth RPC calls.

The gist is that this change can significantly speed up the OrchestratorInfo responses that are required at the start of each stream and periodically during streams. The goal is to provide more reliable response times and, possibly, shorten the initial session pool building to speed up first-segment transcoding.
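
To make the numbers concrete, here is a rough sketch of the before/after; the constant names below are hypothetical, not the actual fields in go-livepeer.

    package pmcache // illustrative

    import "time"

    const (
        // Before this PR: sender data expired after 60 seconds and the
        // cleanup loop ran every minute, so GetOrchestrator often hit a cold
        // cache and had to make blocking eth RPC calls on the request path.
        oldSenderTTL       = 60 * time.Second
        oldCleanupInterval = 1 * time.Minute

        // With this PR: data stays cached for a week and is refreshed in the
        // background every 10 minutes, keeping the request path RPC-free.
        newSenderTTL       = 7 * 24 * time.Hour
        newCleanupInterval = 10 * time.Minute
    )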

@leszko (Contributor) commented Sep 6, 2023

I'm trying to catch up on this issue.

So, if I understand correctly, there is a bug where, when a GetOrchestrator call and an RPC request happen at the same time, the GetOrchestrator response is delayed for some reason? Is that correct?

I'm actually not sure how changing the timeouts solves this issue. Could you provide a more detailed description of that?

About your questions: I don't know, I would need to dig into the code, but first I'm trying to understand the actual issue.

@stronk-dev (Contributor, Author) commented Sep 6, 2023

There's no bug, but because the timeout is only 1 minute, an Orchestrator often has to make a blocking RPC call in order to respond to a GetOrchestrator request. In this case the Orchestrator needs up-to-date reserve info on the Broadcaster making the GetOrchestrator request before it can respond.

Depending on the speed, distance, and availability of the L2 RPC provider (which can be especially problematic for regions like Singapore), this causes large fluctuations in response times to GetOrchestrator requests. Sometimes the latency is high enough for the Orchestrator to exceed the threshold a B has set for populating its working set of Orchestrators. You can see here the moment Titan applied this patch and his response times immediately became excellent.

Doing blocking RPC calls during a request like that is inefficient anyway. It's a good idea to make sure any data an O needs to respond to B requests is always up-to-date and available. What we're trying to figure out here:

  • Is increasing the timeout enough: will the block watchers already keep any data we need up-to-date? Or do we require the explicit calls this PR adds to the cleanup-check loop? If so, is there other data we want to pre-fetch? (See the sketch below.)
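
To make the failure mode concrete, here is a minimal sketch of the cache-miss path; all names and types below are assumptions for illustration, and the real logic in pm/sendermonitor.go differs in detail.

    package pmcache // illustrative

    import (
        "math/big"
        "sync"
        "time"
    )

    type cachedSender struct {
        maxFloat  *big.Int
        updatedAt time.Time
    }

    type monitor struct {
        mu    sync.Mutex
        ttl   time.Duration
        cache map[string]cachedSender
        fetch func(sender string) (*big.Int, error) // blocking eth RPC call
    }

    // maxFloat serves from the cache when the entry is fresh; on a miss it
    // falls back to a blocking eth RPC call, which is what delays the
    // GetOrchestrator response. With a 60s TTL the slow path is hit often;
    // with a long TTL plus background refreshes it should almost never be.
    func (m *monitor) maxFloat(sender string) (*big.Int, error) {
        m.mu.Lock()
        defer m.mu.Unlock()
        if e, ok := m.cache[sender]; ok && time.Since(e.updatedAt) < m.ttl {
            return e.maxFloat, nil // fast path: no RPC on the request path
        }
        mf, err := m.fetch(sender) // slow path: pays the provider round trip
        if err != nil {
            return nil, err
        }
        if m.cache == nil {
            m.cache = make(map[string]cachedSender)
        }
        m.cache[sender] = cachedSender{maxFloat: mf, updatedAt: time.Now()}
        return mf, nil
    }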

@leszko (Contributor) commented Sep 7, 2023

Ok, I understand, thanks for the explanation Marco.

One thing I don't understand is why GetOrchestrator needs to know anything about the Broadcaster / Sender. Do we use any of this info in getOrchestrator()?

Other than that, my understanding is that, in theory, you should never need to refresh the cache because it's updated with the block feed here. Not sure actually why we're cleaning up this cache at all 🤔 That said, I think this PR is OK; to be on the safe side, we can refresh the cache from time to time.
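
(A tiny sketch of what I mean, assuming the block watcher exposes a channel of new-block events; the actual wiring in go-livepeer may differ.)

    // refreshOnBlocks re-fetches cached sender data whenever the chain
    // advances; in theory that makes TTL-based expiry unnecessary.
    func refreshOnBlocks(newBlocks <-chan struct{}, refresh func()) {
        for range newBlocks {
            refresh() // e.g. re-read reserve / sender info for cached senders
        }
    }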

I'll put some comments inline; other than that, I'd love an answer to my question above ☝️

@leszko (Contributor) left a comment

Added the comments. In general, I understand the issue and this PR, so I think we should get it merged.

Inline review comments (outdated, resolved): cmd/livepeer/starter/starter.go, pm/sendermonitor.go
@stronk-dev (Contributor, Author)

> One thing I don't understand is why GetOrchestrator needs to know anything about the Broadcaster / Sender. Do we use any of this info in getOrchestrator()?

It generates B-specific ticket parameters and has a special reject case if the Broadcaster has insufficient reserves to do a payout:

go-livepeer/server/rpc.go

Lines 315 to 326 in d8ad5c7

func getOrchestrator(orch Orchestrator, req *net.OrchestratorRequest) (*net.OrchestratorInfo, error) {
    addr := ethcommon.BytesToAddress(req.Address)
    if err := verifyOrchestratorReq(orch, addr, req.Sig); err != nil {
        return nil, fmt.Errorf("Invalid orchestrator request: %v", err)
    }
    if _, err := authenticateBroadcaster(addr.Hex()); err != nil {
        return nil, fmt.Errorf("authentication failed: %v", err)
    }
    // currently, orchestrator == transcoder
    return orchestratorInfo(orch, addr, orch.ServiceURI().String())

go-livepeer/server/rpc.go

Lines 348 to 354 in d8ad5c7

func orchestratorInfo(orch Orchestrator, addr ethcommon.Address, serviceURI string) (*net.OrchestratorInfo, error) {
    priceInfo, err := getPriceInfo(orch, addr)
    if err != nil {
        return nil, err
    }
    params, err := orch.TicketParams(addr, priceInfo)

func (orch *orchestrator) TicketParams(sender ethcommon.Address, priceInfo *net.PriceInfo) (*net.TicketParams, error) {
    if orch.node == nil || orch.node.Recipient == nil {
        return nil, nil
    }
    ratPriceInfo, err := common.RatPriceInfo(priceInfo)
    if err != nil {
        return nil, err
    }
    params, err := orch.node.Recipient.TicketParams(sender, ratPriceInfo)

go-livepeer/pm/recipient.go

Lines 192 to 201 in f350fba

func (r *recipient) TicketParams(sender ethcommon.Address, price *big.Rat) (*TicketParams, error) {
    randBytes := RandBytes(32)
    seed := new(big.Int).SetBytes(randBytes)
    faceValue := big.NewInt(0)
    // If price is 0 face value, win prob and EV are 0 because no payments are required
    if price.Num().Cmp(big.NewInt(0)) > 0 {
        var err error
        faceValue, err = r.faceValue(sender)

go-livepeer/pm/recipient.go

Lines 264 to 274 in f350fba

func (r *recipient) faceValue(sender ethcommon.Address) (*big.Int, error) {
    txCost := r.txCost()
    // faceValue = txCost * txCostMultiplier
    faceValue := new(big.Int).Mul(txCost, big.NewInt(int64(r.cfg.TxCostMultiplier)))
    if faceValue.Cmp(r.cfg.EV) < 0 {
        faceValue = new(big.Int).Mul(r.cfg.EV, evMultiplier)
    }
    // Fetch current max float for sender
    maxFloat, err := r.sm.MaxFloat(sender)

func (sm *LocalSenderMonitor) MaxFloat(addr ethcommon.Address) (*big.Int, error) {
    sm.mu.Lock()
    defer sm.mu.Unlock()
    sm.ensureCache(addr)
    return sm.maxFloat(addr)
}

Not sure how important these ticket parameters are for the B, but if the SenderMonitor keeps this data up-to-date anyway, increasing the timeout would be a nice first step.

@stronk-dev (Contributor, Author)

In order to get this PR ready for merge, I think there's just one minor question ATM:

Since handleRoundEvent already calls GetSenderInfo and ClaimedReserve on a new round, should the newly added code in this PR that calls these functions simply be dropped from startCleanupLoop? It seems like everything should be fine then; otherwise we could still add the extra timed loop to refresh that data specifically.
cc @leszko
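
To spell out the overlap being discussed (names below are assumptions for illustration, not the real handleRoundEvent signature): if a round-event handler already refreshes this data at each new round, the identical calls in startCleanupLoop would be redundant.

    // senderRefresher is an illustrative stand-in for the sender manager.
    type senderRefresher interface {
        UpdateSenderInfos()
        UpdateClaimedReserves(claimant string)
    }

    // handleNewRound refreshes sender data at the start of each round; if
    // this already happens, the identical calls added to startCleanupLoop in
    // an earlier revision of this PR can be dropped.
    func handleNewRound(r senderRefresher, claimant string) {
        r.UpdateSenderInfos()
        r.UpdateClaimedReserves(claimant)
    }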

@leszko (Contributor) commented Nov 2, 2023

> startCleanupLoop

Yes, in my opinion, we can remove it from startCleanupLoop.

@stronk-dev (Contributor, Author)

Modified. I guess the added log line for when a sender's info gets cleared could be nice to keep in there.

Inline review comment (outdated, resolved): pm/sendermonitor.go
@leszko (Contributor) commented Nov 6, 2023

Waiting for the CI build and then I'll merge! Thanks for the PR @stronk-dev

codecov bot commented Nov 6, 2023

Codecov Report

Merging #2849 (26be254) into master (063b1ea) will not change coverage.
The diff coverage is 100.00000%.

@@              Coverage Diff              @@
##              master       #2849   +/-   ##
=============================================
  Coverage   56.53269%   56.53269%           
=============================================
  Files             89          89           
  Lines          19456       19456           
=============================================
  Hits           10999       10999           
  Misses          7849        7849           
  Partials         608         608           
Files Coverage Δ
cmd/livepeer/starter/starter.go 4.73422% <ø> (ø)
pm/sendermonitor.go 88.81579% <100.00000%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@leszko merged commit 8053582 into master on Nov 6, 2023
18 checks passed
@leszko deleted the md/adastra-fix branch on November 6, 2023 at 12:23
eliteprox pushed a commit to eliteprox/go-livepeer that referenced this pull request Feb 21, 2024