
Backoff for websocket connection retry #338

Merged: ericxtang merged 8 commits into master from et/monitor-fix on Mar 24, 2018
Conversation

ericxtang (Member) commented:

We've been using Infura, and they recommend treating the connection as if it could be broken at any time, which is not great.

I added some reconnection and exponential backoff logic here so we can re-establish connections more reliably, but I think we need a bigger refactor of the eventMonitor code soon to address all of the issues.
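
A minimal, self-contained sketch of the retry wrapping described above, assuming the cenkalti/backoff package whose API (NewExponentialBackOff, Retry, WithMaxRetries) appears in the diff snippets below; the subscribe function, import paths, and log messages are illustrative stand-ins, not code from this PR:

package main

import (
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

// subscribe stands in for one of the eventMonitor subscription calls that
// dials the websocket endpoint (e.g. Infura).
func subscribe() error {
    // dial / resubscribe here
    return nil
}

// subscribeWithBackoff retries subscribe with exponentially growing waits,
// giving up after roughly 15 seconds of failed attempts.
func subscribeWithBackoff() error {
    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 15 * time.Second

    return backoff.Retry(func() error {
        if err := subscribe(); err != nil {
            glog.Errorf("Subscription error: %v. Retrying...", err)
            return err // a non-nil error tells backoff.Retry to try again
        }
        return nil
    }, bo)
}

func main() {
    if err := subscribeWithBackoff(); err != nil {
        glog.Errorf("Could not re-establish the subscription: %v", err)
    }
}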

ericxtang requested review from j0sh, yondonfu and dob, and removed the review request for yondonfu, on Mar 17, 2018.
dob (Member) left a comment:
The exponential backoff wrapping looks good to me. I'd feel better about merging if someone a little closer to the Go code these days also gave the thumbs up :)

j0sh (Contributor) left a comment:

About the max retries: for websocket connections to Infura, I think we really should retry indefinitely. We don't want to force user interaction if all the user needs to do is re-establish the connection; that is something our code can do on its own. The only thing I'm uncertain about at the moment is how indefinite retries would affect the code that consumes the event monitor interface.

However, we certainly should still notify the user if we lose the connection for some period of time, just so they are apprised of the websocket status. But if we can continue to retry, we should.
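
A sketch of the behavior j0sh is describing, assuming the same cenkalti/backoff package: backoff.RetryNotify retries indefinitely (a MaxElapsedTime of 0 disables the time limit) while a notify callback keeps the user informed after each failed attempt. The reconnect closure here is a simulated stand-in, not code from this PR:

package main

import (
    "errors"
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

func main() {
    attempts := 0
    // reconnect stands in for re-establishing the Infura websocket connection;
    // here it fails twice and then succeeds, to simulate a brief outage.
    reconnect := func() error {
        attempts++
        if attempts < 3 {
            return errors.New("websocket: connection refused")
        }
        return nil
    }

    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 0             // 0 means the backoff never gives up
    bo.MaxInterval = 30 * time.Second // cap the wait between attempts

    // notify runs after every failed attempt, keeping the user apprised of the
    // websocket status while the node keeps retrying on its own.
    notify := func(err error, wait time.Duration) {
        glog.Errorf("Websocket connection lost: %v. Retrying in %v...", err, wait)
    }

    if err := backoff.RetryNotify(reconnect, bo, notify); err != nil {
        glog.Errorf("Giving up on reconnect: %v", err)
        return
    }
    glog.Info("Websocket connection re-established")
}

With MaxElapsedTime left at the library's default instead of 0, the same loop would give up after roughly 15 minutes.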

circle.yml Outdated
@@ -15,7 +15,7 @@ dependencies:
- "$HOME/ffmpeg"
- "$HOME/compiled"
override:
-    - go get github.com/livepeer/go-livepeer/cmd/livepeer
+    # - go get github.com/livepeer/go-livepeer/cmd/livepeer
j0sh (Contributor) commented:

Replace with git clone?

ericxtang (Member, Author) replied:

Good idea!

}
bo := backoff.NewExponentialBackOff()
bo.MaxElapsedTime = time.Second * 15
if err := backoff.Retry(getBlock, backoff.WithMaxRetries(bo, SubscribeRetry)); err != nil {
j0sh (Contributor) commented:

In the case of websocket connections to Infura, we really should retry indefinitely. We don't want to force user interaction if all they need to do is re-establish the connection.

ericxtang (Member, Author) replied:

Fair enough. My concern was about spamming the Infura network, but their endpoints are there to be used, and we should always try to reconnect.
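
To illustrate why indefinite reconnects need not spam Infura, assuming the same backoff package: with the interval capped, a long outage settles into roughly one attempt per MaxInterval rather than a flood. A sketch that just prints the wait sequence:

package main

import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff"
)

func main() {
    // Unbounded reconnects don't have to mean hammering the endpoint: with
    // the interval capped, attempts space out to one per minute at most.
    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 0 // keep retrying for as long as the outage lasts
    bo.MaxInterval = time.Minute

    for i := 1; i <= 10; i++ {
        fmt.Printf("attempt %2d: wait %v before the next try\n", i, bo.NextBackOff())
    }
}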

// segs := make([]int64, 0)
// for k, _ := range c.unclaimedSegs {
// segs = append(segs, k)
// }
j0sh (Contributor) commented:

extraneous?

glog.Errorf("SubscribeNewRound error: %v. Retrying...", err)
return err
} else {
glog.Infof("SubscribeNewRound successful.")
j0sh (Contributor) commented:

Is this a one-time message or would it lead to additional logging after each round?

ericxtang (Member, Author) replied:

It's a one-time message per connection (it will re-print if there is a re-connection).

yondonfu (Member) left a comment:

I like the use of backoff strategies, but I feel like it might be better if the backoff retry logic were placed in the RPC client (https://github.com/livepeer/go-livepeer/blob/master/vendor/github.com/ethereum/go-ethereum/rpc/client.go), since all connection-based failures will originate from some operation in that client. This would also avoid duplicating the retry setup code everywhere we need to make an RPC request. The current retry code in the RPC client (https://github.com/livepeer/go-livepeer/blob/master/vendor/github.com/ethereum/go-ethereum/rpc/client.go#L261) could be modified to use some backoff strategy, and similar backoff retry logic could be placed in the subscription-based functions in the RPC client.
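
One hedged sketch of an alternative that gets the deduplication without patching the vendored go-ethereum client: a small shared helper that call sites hand their operation to. The withRetry helper and the SubscribeNewRound example below are hypothetical, not code from this PR or from the RPC client:

package main

import (
    "errors"
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

// withRetry centralizes the backoff configuration so individual RPC and
// subscription call sites don't each have to rebuild the same retry setup.
func withRetry(name string, op func() error) error {
    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 5 * time.Second
    return backoff.RetryNotify(op, bo, func(err error, wait time.Duration) {
        glog.Errorf("%v failed: %v. Retrying in %v...", name, err, wait)
    })
}

func main() {
    // Example call site: any RPC or subscription operation can be wrapped.
    err := withRetry("SubscribeNewRound", func() error {
        return errors.New("connection is shut down") // simulated broken websocket
    })
    if err != nil {
        glog.Errorf("SubscribeNewRound gave up: %v", err)
    }
}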


return nil
}
if err := backoff.Retry(getBlock, backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), SubscribeRetry)); err != nil {
yondonfu (Member) commented:

Why use a constant backoff here, but an exponential backoff strategy in CreateTranscodeJob in livepeernode.go?

ericxtang (Member, Author) replied:

I think I removed all the exponential backoffs.
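
For reference on the two strategies in question, assuming the library's documented defaults: a constant backoff waits the same interval before every attempt, while an exponential backoff starts around half a second and grows (with jitter) toward MaxInterval, stopping once MaxElapsedTime has passed unless that is set to zero. A short sketch that prints a few intervals from each:

package main

import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff"
)

func main() {
    // Constant: the same 1s pause before every attempt, with no overall limit.
    cbo := backoff.NewConstantBackOff(time.Second)

    // Exponential: randomized pauses that grow toward MaxInterval; the whole
    // loop gives up once MaxElapsedTime has passed (unless it is set to 0).
    ebo := backoff.NewExponentialBackOff()

    for i := 0; i < 3; i++ {
        fmt.Printf("constant: %v  exponential: %v\n", cbo.NextBackOff(), ebo.NextBackOff())
    }
}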

@@ -35,6 +40,7 @@ type eventMonitor struct {
backend *ethclient.Client
contractAddrMap map[string]common.Address
eventSubMap map[string]*EventSubscription
+ latestBlock *big.Int
yondonfu (Member) commented:
Is this used anywhere?

ericxtang (Member, Author) commented on Mar 21, 2018:

@yondonfu - the reason I implemented the backoff strategies in the callers of the RPC client is that we may want different backoff strategies for different types of connections. For example, @j0sh brought up the fact that the transcoder should keep retrying forever, but I'm not sure all RPC requests should have that behavior.
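
A sketch of the design being argued for here: keep the retry wrapper strategy-agnostic and let each caller pass in the backoff.BackOff that fits its connection type. The retrySubscribe helper and the simulated subscribe closure are illustrative, not code from this PR:

package main

import (
    "errors"
    "time"

    "github.com/cenkalti/backoff"
)

// retrySubscribe is strategy-agnostic: each caller passes in whichever
// backoff.BackOff fits the kind of connection it is maintaining.
func retrySubscribe(sub func() error, strategy backoff.BackOff) error {
    return backoff.Retry(sub, strategy)
}

func main() {
    attempts := 0
    subscribe := func() error {
        attempts++
        if attempts < 3 {
            return errors.New("dial error") // simulated transient failure
        }
        return nil
    }

    // Transcoder-style strategy: the connection is essential, so never give up.
    forever := backoff.NewExponentialBackOff()
    forever.MaxElapsedTime = 0

    // One-off request style: a few quick attempts, then report the error.
    bounded := backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), 3)

    _ = retrySubscribe(subscribe, forever) // a transcoder might use this one
    _ = retrySubscribe(subscribe, bounded) // a one-shot RPC call might use this
}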

ericxtang changed the title from "Exponential backoff for websocket connection retry" to "Backoff for websocket connection retry" on Mar 23, 2018.
yondonfu (Member) commented:
Ah, makes sense. I think we can change to the appropriate retry strategy for the transcoder in a separate PR.

var job *lpTypes.Job
getJob := func() error {
j, err := s.node.Eth.GetJob(jid)
if j.StreamId == "" {
yondonfu (Member) commented:

Hm, so when using Infura, an event can come back with empty fields? Weird. I think there is an edge case where someone actually creates a job with an empty streamID, but in that scenario, after a number of retries the transcoder will give up and just drop the new job event.

ericxtang (Member, Author) replied on Mar 24, 2018:

Yeah, it's really strange, but I've seen it a few times and it crashed the transcoder (#342).

job = j
return err
}
if err := backoff.Retry(getJob, backoff.NewConstantBackOff(time.Second*2)); err != nil {
glog.Errorf("Error getting job info: %v", err)
return false, err
yondonfu (Member) commented:

One additional thing I thought of: in the edge case described in the above comment, the transcoder would retry a number of times, then give up and actually stop watching for new job events, since it returns false, err.

ericxtang (Member, Author) replied:

I think the current logic here is to retry indefinitely every 2 seconds. I'm actually not sure when we would get an error here, but if we do end up getting one, something unexpected has happened.
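
For context on the retry semantics being discussed: backoff.NewConstantBackOff never signals the loop to stop, so backoff.Retry keeps going until the operation returns nil (unless it is wrapped in WithMaxRetries). A self-contained sketch of the getJob pattern, with hypothetical stand-ins (Job, fetchJob) for the node types used in the PR:

package main

import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

// Job is a stand-in for lpTypes.Job with only the field the check cares about.
type Job struct{ StreamId string }

// fetchJob is a stand-in for s.node.Eth.GetJob(jid).
func fetchJob(jid int64) (*Job, error) {
    return &Job{StreamId: "somestreamid"}, nil
}

func getJobWithRetry(jid int64) (*Job, error) {
    var job *Job
    getJob := func() error {
        j, err := fetchJob(jid)
        if err != nil {
            return err
        }
        if j.StreamId == "" {
            // Infura occasionally returns a job with empty fields (#342);
            // returning an error makes the backoff loop retry the lookup
            // instead of letting the empty job crash the transcoder.
            return fmt.Errorf("job %d has an empty stream ID", jid)
        }
        job = j
        return nil
    }
    // ConstantBackOff never signals Stop, so this keeps retrying every two
    // seconds until getJob succeeds.
    if err := backoff.Retry(getJob, backoff.NewConstantBackOff(2*time.Second)); err != nil {
        glog.Errorf("Error getting job info: %v", err)
        return nil, err
    }
    return job, nil
}

func main() {
    if job, err := getJobWithRetry(1); err == nil {
        glog.Infof("Got job with stream ID %v", job.StreamId)
    }
}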

ericxtang merged commit 967ea35 into master on Mar 24, 2018.
ericxtang deleted the et/monitor-fix branch on Mar 24, 2018 at 21:41.