
Backoff for websocket connection retry #338

Merged: ericxtang merged 8 commits into master from et/monitor-fix on Mar 24, 2018
Conversation

ericxtang (Member) commented:

We've been using Infura, and they recommend treating the connection as if it could be broken at any time, which is not great.

I added some reconnection and exponential backoff logic here so we can re-establish connections more reliably, but I think we need a bigger refactor of the eventMonitor code soon to address all of the issues.
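
A minimal, self-contained sketch of the retry wrapping described above, assuming the cenkalti/backoff package whose API (NewExponentialBackOff, Retry, WithMaxRetries) appears in the diff snippets below; the subscribe function, import paths, and log messages are illustrative stand-ins, not code from this PR:

package main

import (
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

// subscribe stands in for one of the eventMonitor subscription calls that
// dials the websocket endpoint (e.g. Infura).
func subscribe() error {
    // dial / resubscribe here
    return nil
}

// subscribeWithBackoff retries subscribe with exponentially growing waits,
// giving up after roughly 15 seconds of failed attempts.
func subscribeWithBackoff() error {
    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 15 * time.Second

    return backoff.Retry(func() error {
        if err := subscribe(); err != nil {
            glog.Errorf("Subscription error: %v. Retrying...", err)
            return err // a non-nil error tells backoff.Retry to try again
        }
        return nil
    }, bo)
}

func main() {
    if err := subscribeWithBackoff(); err != nil {
        glog.Errorf("Could not re-establish the subscription: %v", err)
    }
}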

ericxtang requested review from j0sh, yondonfu and dob, and removed the review request for yondonfu, on Mar 17, 2018.
dob (Member) left a comment:
The exponential backoff wrapping looks good to me. I'd feel better about merging if someone a little closer to the Go code these days also gave the thumbs up :)

j0sh (Contributor) left a comment:

About the max retries: for websocket connections to Infura, I think we really should retry indefinitely. We don't want to force user interaction if all the user needs to do is re-establish the connection; that is something our code can do on its own. The only thing I'm uncertain about at the moment is how indefinite retries would affect the code that consumes the event monitor interface.

However, we certainly should still notify the user if we lose the connection for some period of time, just so they are apprised of the websocket status. But if we can continue to retry, we should.
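
A sketch of the behavior j0sh is describing, assuming the same cenkalti/backoff package: backoff.RetryNotify retries indefinitely (a MaxElapsedTime of 0 disables the time limit) while a notify callback keeps the user informed after each failed attempt. The reconnect closure here is a simulated stand-in, not code from this PR:

package main

import (
    "errors"
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

func main() {
    attempts := 0
    // reconnect stands in for re-establishing the Infura websocket connection;
    // here it fails twice and then succeeds, to simulate a brief outage.
    reconnect := func() error {
        attempts++
        if attempts < 3 {
            return errors.New("websocket: connection refused")
        }
        return nil
    }

    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 0             // 0 means the backoff never gives up
    bo.MaxInterval = 30 * time.Second // cap the wait between attempts

    // notify runs after every failed attempt, keeping the user apprised of the
    // websocket status while the node keeps retrying on its own.
    notify := func(err error, wait time.Duration) {
        glog.Errorf("Websocket connection lost: %v. Retrying in %v...", err, wait)
    }

    if err := backoff.RetryNotify(reconnect, bo, notify); err != nil {
        glog.Errorf("Giving up on reconnect: %v", err)
        return
    }
    glog.Info("Websocket connection re-established")
}

With MaxElapsedTime left at the library's default instead of 0, the same loop would give up after roughly 15 minutes.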

circle.yml Outdated
@@ -15,7 +15,7 @@ dependencies:
- "$HOME/ffmpeg"
- "$HOME/compiled"
override:
-    - go get github.com/livepeer/go-livepeer/cmd/livepeer
+    # - go get github.com/livepeer/go-livepeer/cmd/livepeer
j0sh (Contributor) commented:

Replace with git clone?

ericxtang (Member, Author) replied:

Good idea!

}
bo := backoff.NewExponentialBackOff()
bo.MaxElapsedTime = time.Second * 15
if err := backoff.Retry(getBlock, backoff.WithMaxRetries(bo, SubscribeRetry)); err != nil {
j0sh (Contributor) commented:

In the case of websocket connections to Infura, we really should retry indefinitely. We don't want to force user interaction if all they need to do is re-establish the connection.

ericxtang (Member, Author) replied:

Fair enough. My concern was about spamming the Infura network, but their endpoints are there to be used, and we should always try to reconnect.
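
To illustrate why indefinite reconnects need not spam Infura, assuming the same backoff package: with the interval capped, a long outage settles into roughly one attempt per MaxInterval rather than a flood. A sketch that just prints the wait sequence:

package main

import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff"
)

func main() {
    // Unbounded reconnects don't have to mean hammering the endpoint: with
    // the interval capped, attempts space out to one per minute at most.
    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 0 // keep retrying for as long as the outage lasts
    bo.MaxInterval = time.Minute

    for i := 1; i <= 10; i++ {
        fmt.Printf("attempt %2d: wait %v before the next try\n", i, bo.NextBackOff())
    }
}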

// segs := make([]int64, 0)
// for k, _ := range c.unclaimedSegs {
// segs = append(segs, k)
// }
j0sh (Contributor) commented:

extraneous?

glog.Errorf("SubscribeNewRound error: %v. Retrying...", err)
return err
} else {
glog.Infof("SubscribeNewRound successful.")
j0sh (Contributor) commented:

Is this a one-time message or would it lead to additional logging after each round?

ericxtang (Member, Author) replied:

It's a one-time message per connection (it will re-print if there is a re-connection).

yondonfu (Member) left a comment:

I like the use of backoff strategies, but I feel like it might be better if the backoff retry logic were placed in the RPC client (https://github.com/livepeer/go-livepeer/blob/master/vendor/github.com/ethereum/go-ethereum/rpc/client.go), since all connection-based failures will originate from some operation in that client. This would also avoid duplicating the retry setup code everywhere we need to make an RPC request. The current retry code in the RPC client (https://github.com/livepeer/go-livepeer/blob/master/vendor/github.com/ethereum/go-ethereum/rpc/client.go#L261) could be modified to use some backoff strategy, and similar backoff retry logic could be placed in the subscription-based functions in the RPC client.
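
One hedged sketch of an alternative that gets the deduplication without patching the vendored go-ethereum client: a small shared helper that call sites hand their operation to. The withRetry helper and the SubscribeNewRound example below are hypothetical, not code from this PR or from the RPC client:

package main

import (
    "errors"
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

// withRetry centralizes the backoff configuration so individual RPC and
// subscription call sites don't each have to rebuild the same retry setup.
func withRetry(name string, op func() error) error {
    bo := backoff.NewExponentialBackOff()
    bo.MaxElapsedTime = 5 * time.Second
    return backoff.RetryNotify(op, bo, func(err error, wait time.Duration) {
        glog.Errorf("%v failed: %v. Retrying in %v...", name, err, wait)
    })
}

func main() {
    // Example call site: any RPC or subscription operation can be wrapped.
    err := withRetry("SubscribeNewRound", func() error {
        return errors.New("connection is shut down") // simulated broken websocket
    })
    if err != nil {
        glog.Errorf("SubscribeNewRound gave up: %v", err)
    }
}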


return nil
}
if err := backoff.Retry(getBlock, backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), SubscribeRetry)); err != nil {
yondonfu (Member) commented:

Why use a constant backoff here, but an exponential backoff strategy in CreateTranscodeJob in livepeernode.go?

ericxtang (Member, Author) replied:

I think I removed all the exponential backoffs.
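
For reference on the two strategies in question, assuming the library's documented defaults: a constant backoff waits the same interval before every attempt, while an exponential backoff starts around half a second and grows (with jitter) toward MaxInterval, stopping once MaxElapsedTime has passed unless that is set to zero. A short sketch that prints a few intervals from each:

package main

import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff"
)

func main() {
    // Constant: the same 1s pause before every attempt, with no overall limit.
    cbo := backoff.NewConstantBackOff(time.Second)

    // Exponential: randomized pauses that grow toward MaxInterval; the whole
    // loop gives up once MaxElapsedTime has passed (unless it is set to 0).
    ebo := backoff.NewExponentialBackOff()

    for i := 0; i < 3; i++ {
        fmt.Printf("constant: %v  exponential: %v\n", cbo.NextBackOff(), ebo.NextBackOff())
    }
}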

@@ -35,6 +40,7 @@ type eventMonitor struct {
backend *ethclient.Client
contractAddrMap map[string]common.Address
eventSubMap map[string]*EventSubscription
+ latestBlock *big.Int
yondonfu (Member) commented:
Is this used anywhere?

ericxtang (Member, Author) commented on Mar 21, 2018:

@yondonfu - the reason I implemented the backoff strategies in the callers of the RPC client is that we may want different backoff strategies for different types of connections. For example, @j0sh brought up the fact that the transcoder should keep retrying forever, but I'm not sure all RPC requests should have that behavior.
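
A sketch of the design being argued for here: keep the retry wrapper strategy-agnostic and let each caller pass in the backoff.BackOff that fits its connection type. The retrySubscribe helper and the simulated subscribe closure are illustrative, not code from this PR:

package main

import (
    "errors"
    "time"

    "github.com/cenkalti/backoff"
)

// retrySubscribe is strategy-agnostic: each caller passes in whichever
// backoff.BackOff fits the kind of connection it is maintaining.
func retrySubscribe(sub func() error, strategy backoff.BackOff) error {
    return backoff.Retry(sub, strategy)
}

func main() {
    attempts := 0
    subscribe := func() error {
        attempts++
        if attempts < 3 {
            return errors.New("dial error") // simulated transient failure
        }
        return nil
    }

    // Transcoder-style strategy: the connection is essential, so never give up.
    forever := backoff.NewExponentialBackOff()
    forever.MaxElapsedTime = 0

    // One-off request style: a few quick attempts, then report the error.
    bounded := backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), 3)

    _ = retrySubscribe(subscribe, forever) // a transcoder might use this one
    _ = retrySubscribe(subscribe, bounded) // a one-shot RPC call might use this
}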

ericxtang changed the title from "Exponential backoff for websocket connection retry" to "Backoff for websocket connection retry" on Mar 23, 2018.
yondonfu (Member) commented:
Ah, makes sense. I think we can change to the appropriate retry strategy for the transcoder in a separate PR.

var job *lpTypes.Job
getJob := func() error {
j, err := s.node.Eth.GetJob(jid)
if j.StreamId == "" {
yondonfu (Member) commented:

Hm, so when using Infura, an event can come back with empty fields? Weird. I think there is an edge case where someone actually creates a job with an empty streamID, but in that scenario, after a number of retries the transcoder will give up and just drop the new job event.

ericxtang (Member, Author) replied on Mar 24, 2018:

Yeah, it's really strange, but I've seen it a few times and it crashed the transcoder (#342).

job = j
return err
}
if err := backoff.Retry(getJob, backoff.NewConstantBackOff(time.Second*2)); err != nil {
glog.Errorf("Error getting job info: %v", err)
return false, err
yondonfu (Member) commented:

One additional thing I thought of: in the edge case described in the above comment, the transcoder would retry a number of times, then give up and actually stop watching for new job events, since it returns false, err.

ericxtang (Member, Author) replied:

I think the current logic here is to retry indefinitely every 2 seconds. I'm actually not sure when we would get an error here, but if we do end up getting one, something unexpected has happened.
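
For context on the retry semantics being discussed: backoff.NewConstantBackOff never signals the loop to stop, so backoff.Retry keeps going until the operation returns nil (unless it is wrapped in WithMaxRetries). A self-contained sketch of the getJob pattern, with hypothetical stand-ins (Job, fetchJob) for the node types used in the PR:

package main

import (
    "fmt"
    "time"

    "github.com/cenkalti/backoff"
    "github.com/golang/glog"
)

// Job is a stand-in for lpTypes.Job with only the field the check cares about.
type Job struct{ StreamId string }

// fetchJob is a stand-in for s.node.Eth.GetJob(jid).
func fetchJob(jid int64) (*Job, error) {
    return &Job{StreamId: "somestreamid"}, nil
}

func getJobWithRetry(jid int64) (*Job, error) {
    var job *Job
    getJob := func() error {
        j, err := fetchJob(jid)
        if err != nil {
            return err
        }
        if j.StreamId == "" {
            // Infura occasionally returns a job with empty fields (#342);
            // returning an error makes the backoff loop retry the lookup
            // instead of letting the empty job crash the transcoder.
            return fmt.Errorf("job %d has an empty stream ID", jid)
        }
        job = j
        return nil
    }
    // ConstantBackOff never signals Stop, so this keeps retrying every two
    // seconds until getJob succeeds.
    if err := backoff.Retry(getJob, backoff.NewConstantBackOff(2*time.Second)); err != nil {
        glog.Errorf("Error getting job info: %v", err)
        return nil, err
    }
    return job, nil
}

func main() {
    if job, err := getJobWithRetry(1); err == nil {
        glog.Infof("Got job with stream ID %v", job.StreamId)
    }
}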

ericxtang merged commit 967ea35 into master on Mar 24, 2018.
ericxtang deleted the et/monitor-fix branch on Mar 24, 2018 at 21:41.