
Endless buffering of outbound traffic when network is loaded #181

Closed
daonb opened this issue Feb 1, 2021 · 19 comments · Fixed by #303

@daonb
Contributor

daonb commented Feb 1, 2021

Your environment.

  • Version: 1.7.11
  • Browser: iPadOS 14.3

What did you do?

I developed a terminal server over pion/webrtc, using data channels to pipe input & output. Things work well on a clean network and on a mobile one. I even used netem to emulate a high-jitter scenario and all works well. Only when I use my home network - a small pipe with two Zoom sessions going on in parallel - does the terminal stop receiving data.

I'm attaching a log where I have three data channels opened on the same peer. From the logs it looks like the buffer keeps growing and growing and growing

frozen2.log

@enobufs
Member

enobufs commented Feb 14, 2021

Related topic: #77

@daonb
Contributor Author

daonb commented Feb 18, 2021

@enobufs thanks for the reply. I don't think that's the issue I'm seeing for two reasons:

  • I have a very low data sending rate - < 1Kbps
  • The channel completely stops sending data and doesn't resume even after the congestion is cleared

I believe there's a bug in congestion recovery but I have no tests to prove it :-( To test it I need a client that can be in "buffering mode" where outgoing datagrams are buffered for XXX ms rather than sent. This way we can write tests that simulate congestion and better test the congestion control code.
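
Something like this rough sketch is what I have in mind - a net.Conn wrapper that holds every outbound datagram for a configurable delay before actually sending it. The names are made up for illustration; nothing like this exists in pion today.

package testnet

import (
	"net"
	"time"
)

// bufferingConn wraps a net.Conn and delays every outbound datagram by a
// fixed duration, simulating a congested uplink in congestion-control tests.
type bufferingConn struct {
	net.Conn
	delay time.Duration
}

// Write copies the datagram and schedules the real send for later, so the
// caller sees an immediate "success" while the wire stays quiet.
func (c *bufferingConn) Write(b []byte) (int, error) {
	buf := make([]byte, len(b))
	copy(buf, b)
	time.AfterFunc(c.delay, func() {
		c.Conn.Write(buf) //nolint:errcheck // drops are acceptable in a lossy-network test
	})
	return len(b), nil
}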

@enobufs
Member

enobufs commented Feb 19, 2021

@daonb It's hard to debug without a reproducible case... What I saw:

Many data channels are used

% cat frozen2.log|grep 'data channel'
info	Deleting data channel 22
info	Ading data channel 27
info	Deleting data channel 24
info	Deleting data channel 27
info	Deleting data channel 25
info	Deleting data channel 26

SACK stops coming in from the peer after 23:23:33.054958

Last TSN sack'ed was 903304684
But DATA chunks are still coming in.
The log covers the last 23 seconds. During that time, a total of 37 DATA chunks were received

% cat frozen2.log |grep '] DATA' |wc -l
      37

One T3-rtx timeout is in the log but it was the 5th RTO

% cat frozen2.log |grep '] T3-rtx'
sctp TRACE: 23:23:33.054718 association.go:1523: [0xc000091380] T3-rtx timer start (pt3)
sctp TRACE: 23:23:33.054935 association.go:1523: [0xc000091380] T3-rtx timer start (pt3)
sctp TRACE: 23:23:33.054979 association.go:1523: [0xc000091380] T3-rtx timer start (pt3)
sctp DEBUG: 23:23:45.856419 association.go:2161: [0xc000091380] T3-rtx timed out: nRtos=5 cwnd=1228 ssthresh=4912

Two retransmissions occur right after the T3-rtx timeout, but there's no SACK

  • TSN=903304685
  • TSN=903304692

Two errors

% cat frozen2.log |grep 'ERROR'
sctp ERROR: 2021/01/30 23:23:45 [0xc000091380] stream 5 not found)
sctp ERROR: 2021/01/30 23:23:45 [0xc000091380] stream 5 not found)

My Thoughts

  • DATA chunks are coming in, so inbound traffic appears to be working
  • The retransmissions should cause a SACK to come back, but that is not happening. Is outbound traffic not reaching the peer?

@daonb I have no clue other than that the peer stopped sending SACKs (Selective ACKs). Can you tell me more about your environment?

  • How do you use data channels?
  • Do you use any "partial reliability" options? (maxRetransmits, maxPacketLifetime, etc)
  • What is the network configuration (any routers, NATs in between the two endpoints, etc)
  • How often would this happen?
  • Only with iPadOS? How about with other browsers or pion-to-pion?

Recently, I came across a bug in usrsctp (which most browsers use) that occurred when the partial reliability option was used over a lossy connection - see #138 - possibly related. I am not sure if iPadOS uses it, and I don't know whether the fix has already been applied to the browsers out there.
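
For reference, those partial-reliability options are set per data channel through DataChannelInit when the channel is created. A minimal sketch (the label and values are just examples, and the import's major version depends on your pion/webrtc release):

package example

import "github.com/pion/webrtc/v3"

// createLossyChannel creates an unordered channel that gives up on a message
// after 3 retransmission attempts - i.e. partial reliability is enabled.
// Leaving both MaxRetransmits and MaxPacketLifeTime unset keeps the channel
// fully reliable, which is the default.
func createLossyChannel(pc *webrtc.PeerConnection) (*webrtc.DataChannel, error) {
	ordered := false
	var maxRetransmits uint16 = 3
	return pc.CreateDataChannel("example", &webrtc.DataChannelInit{
		Ordered:        &ordered,
		MaxRetransmits: &maxRetransmits,
	})
}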

@daonb
Contributor Author

daonb commented Feb 19, 2021

Thanks for diving into this, I appreciate the help. Here are the answers to your questions, followed by one question I have.

  • How do you use data channels?

I'm using pion for a backend - webexec - used by terminal7, a hybrid tablet app. Each terminal pane has its own data channel used to pipe data in & out. In addition there's one data channel used for control. The 37 incoming messages are 37 keystrokes. Every time my server gets this kind of message it passes it to the shell, which causes the shell to print it on its pseudo terminal, and webexec sends it back to the client. I know my server calls sctp to send the data but it doesn't get to the client.

  • Do you use any "partial reliability" options? (maxRetransmits, maxPacketLifetime, etc)

I only set the ICETimeouts and the LoggerFactory

  • What is the network configuration (any routers, NATs in between the two endpoints, etc)

I use a home network to connect to a low-latency host server (latest speedtest: 24.42 Mbps download, 2.8 Mbps upload)

  • How often would this happen?

It happens only when the network is overloaded, which is quite often: I share the network with my two daughters and sometimes they are both on Zoom at the same time. I'm sure my uplink is badly congested, so the SACKs suffer huge delays and are sometimes dropped.

  • Only with iPadOS? How about with other browsers or pion-to-pion?

I should have it running on chrome-android soon so I'll test it once I have it

Two retransmissions occur right after the T3-rtx timeout, but there's no SACK

Why only two retransmissions? I suspect the client sends the SACKs but they don't get to the server, and the same happens with the SACKs for both retransmits. Shouldn't sctp keep retransmitting until it gets a SACK?

@enobufs
Member

enobufs commented Feb 19, 2021

Why only two retransmissions? I suspect the client sends the SACKs but they don't get to the server, and the same happens with the SACKs for both retransmits. Shouldn't sctp keep retransmitting until it gets a SACK?

Looking at this line:

sctp DEBUG: 23:23:45.856419 association.go:2161: [0xc000091380] T3-rtx timed out: nRtos=5 cwnd=1228 ssthresh=4912

The retransmission in the log was made after the 5th timeout (nRtos: the number of retransmission timeouts), and the log you gave me only covers the last 23 seconds. pion/sctp implements the backoff algorithm as the RFC specifies, doubling the interval every time T3-rtx expires, so at that point the interval was at least 32 seconds (depending on the measured RTT).
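
Roughly, the RFC 4960 rule looks like this (a sketch of the algorithm only, not the actual pion/sctp code):

package rto

import "time"

const rtoMax = 60 * time.Second // RTO.Max suggested by RFC 4960

// backoff returns the retransmission timeout to use after a T3-rtx expiration:
// the RTO doubles on every timeout and is capped at RTO.Max.
func backoff(rto time.Duration) time.Duration {
	rto *= 2
	if rto > rtoMax {
		rto = rtoMax
	}
	return rto
}

// Starting from a 1 second RTO, five consecutive timeouts give
// 2, 4, 8, 16 and then 32 seconds - the "at least 32 seconds" above.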

@daonb
Contributor Author

daonb commented Feb 20, 2021

Great! We've solved this.

IMO, the doubling of the retransmission timeout is wrong for real-time communications. I'm not saying you should change it - it's in the RFC:

The computation and management of RTO in SCTP follow closely how TCP manages its retransmission timer.

I think the authors were too lazy here. TCP was made for file transfer, not real-time communication. Luckily, in C7 the RFC defines RTO.Max, which lets me limit the retransmission timeout.

Reading the code I learned we do have an rtoMax but it's a constant with a value of one minute. For webexec, I want to set it to 3 seconds and I believe it will solve my problem.

I guess I need to add RTOMax to webrtc.DataChannelParameters and make sure it gets all the way down to rtxTimer. Right?

@Sean-Der
Member

@daonb we don't control those, so I don't know if we can add stuff. Here are all the values. Since we also compile to WASM, it would cause portability issues.

I think the best thing we could do is create an issue against webrtc-pc and see if there is any interest. I would call out that webrtc-nv matches up with your use case.

If they completely push back then I would say adding a SettingEngine option is the best thing we can do. Then it is explicit that it is non-portable. I just don't want to add proprietary extensions and fragment the WebRTC space :(

@enobufs
Member

enobufs commented Feb 20, 2021

It was the fifth retransmission that timed out, and the resulting retransmission also failed (I don't see a SACK for it). Could you tell me why you think reducing the max RTO would help? I think something else is going on.

We could make the max RTO configurable, but once a "pipe" is clogged, it usually takes a long time for the congestion to clear, and I cannot think of any better approach than exponential backoff, particularly when you cannot rule out that the timeout was caused by congestion. I think you should find out why the browser is not responding to all these retransmissions in the first place.

Maybe you could modify the value by hand (using replace in go.mod, etc) and see what happens first?
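
For example, something like this in webexec's go.mod would point the build at a locally patched copy of pion/sctp (the path is just an example):

// in webexec's go.mod
replace github.com/pion/sctp => ../sctp // local checkout with the rtoMax constant lowered by hand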

@daonb
Contributor Author

daonb commented Feb 24, 2021

In the old TCP days everyone was transferring files, so congestion at the edge router was common and the backoff made sense. It was a way to wait for other transfers to finish. Usually, large files were transferred in the background and the backoff timer worked great. We didn't really care when the transfer ended, as long as the magic happened and the files appeared on the target server.

Today's congestion is completely different. It's the result of rate-controlled live video streaming, and waiting for it to end does not make any sense. What's more, my application takes very little bandwidth compared to video streaming. With an RTOMax of 2-3 seconds I have a better chance of getting some packets through before the user gets frustrated.

I know that 60 seconds for RTOMax comes from the standard but it's wrong for real time applications. No one will wait 60 seconds to see the key they hit appear on the screen.

I wanted to set RTOMax per channel because I want to add file transfer, and in that case I need the exponential backoff. No matter, I'll cross that bridge when I get there. For now, I think I should open an issue in pion/webrtc to add RTOMax to SettingEngine as Sean suggested and close this one.

@Sean-Der
Member

@daonb you can skip right to the PR even (and link to this issue in the commit). I am happy to merge this right away so you can get back to building :)

I would love to bring it up in the W3C. I bet lots of other devs are hitting it, they just don't understand this stuff at the level you are. Even in the WebRTC working group only a small subset of people understand SCTP.

@daonb
Contributor Author

daonb commented Feb 24, 2021

I would love to take it up with the W3C. It's fun to be reminded of products I developed back in the '90s and what the net was like back in the last millennium.

@kylecarbs
Member

kylecarbs commented Jun 15, 2021

@enobufs I've found a case where this seems to consistently fail.

See coder/coder-v1-cli@5ff4902

If you copy the DataChannel to io.Discard on the server-side, the issue never happens.

https://github.com/cdr/coder-cli/pull/368/files#diff-62f8c8d3fb7e16200851008a80e52e4c000482ebe616712a6018431924c67866R333
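
Roughly, the workaround looks like this - a sketch assuming the channel is detached via SettingEngine.DetachDataChannels(); the helper name is illustrative:

package example

import (
	"io"

	"github.com/pion/webrtc/v3"
)

// drainToDiscard keeps reading from a detached data channel and throws the
// data away, so the inbound side can never back up.
func drainToDiscard(dc *webrtc.DataChannel) {
	dc.OnOpen(func() {
		raw, err := dc.Detach() // requires SettingEngine.DetachDataChannels()
		if err != nil {
			return
		}
		go io.Copy(io.Discard, raw) //nolint:errcheck
	})
}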

Running tests with

go test -v -bench=. cdr.dev/coder-cli/wsnet

should result in output similar to...

...
sctp DEBUG: 05:42:02.761286 association.go:756: [0xc00019e700] sending SACK: SACK cumTsnAck=3461170072 arwnd=1026976 dupTsn=[]
sctp DEBUG: 05:42:02.761330 association.go:756: [0xc00019e700] sending SACK: SACK cumTsnAck=3461170075 arwnd=1023376 dupTsn=[]
sctp DEBUG: 05:42:02.761404 association.go:756: [0xc00019e700] sending SACK: SACK cumTsnAck=3461170077 arwnd=1020976 dupTsn=[]
sctp DEBUG: 05:42:03.761639 association.go:2399: [0xc0003faa80] T3-rtx timed out: nRtos=1 cwnd=1228 ssthresh=531434
sctp DEBUG: 05:42:05.762754 association.go:2399: [0xc0003faa80] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
ice INFO: 2021/06/15 05:42:08 Setting new connection state: Disconnected
pc INFO: 2021/06/15 05:42:08 ICE connection state changed: disconnected
pc INFO: 2021/06/15 05:42:08 peer connection state changed: disconnected

@kylecarbs
Member

I've isolated it further in this commit:

coder/coder-v1-cli@0dc0e56

Uncomment the following line, and it works as expected:

coder/coder-v1-cli@0dc0e56#diff-34c91981f059bb5d86d78bb53990a009247009fcfc946b00884b03f5c1a63d55R95

My assumption is there's a race condition somewhere when we write a lot of individual messages at the same time... but I'm not exactly sure where to look.

@enobufs
Member

enobufs commented Feb 27, 2024

Looking into this now. The commit hash @kylecarbs reported does not seem to exist, but I was able to find a commit that is, I believe, close to the one Kyle referred to:

coder-v1-cli % git rev-parse HEAD
f85e0547ebb020eac38a1bfd748b97d465a573d6

It runs, as he explained, with the following command:

go test -v -bench=. cdr.dev/coder-cli/wsnet

It passes with no issue, but when I enabled the DEBUG logs, it reproduced what Kyle reported:

PION_LOG_DEBUG=sctp,dtls,ice go test -bench=. cdr.dev/coder-cli/wsnet

What I am seeing is that the sender-side SCTP is retransmitting packets (T3 retransmission timeouts), but the receiver side does not respond with a SACK at all. Adding a debug log at the ICE level revealed that the receiver side's packetio.Buffer was full, and all the retransmitted packets were discarded at the receiver (at the ICE level).

ice WARNING: 2024/02/26 22:50:14 failed to write packet: packetio.Buffer is full, discarding write

At the SCTP level, a.netConn.Read(buffer) is not receiving any packets at all. It seems like a deadlock situation at the DTLS level.

sctp DEBUG: 22:50:14.792820 association.go:521: [0x140000f0540] readLoop: recvTimer expired
sctp DEBUG: 22:50:14.793591 association.go:2426: [0x140000f0540] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
sctp DEBUG: 22:50:14.794174 association.go:560: [0x140000f0540] sending 1 packets
ice WARNING: 2024/02/26 22:50:14 failed to write packet: packetio.Buffer is full, discarding write
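
To see the discard in isolation, here is a small standalone sketch of a size-limited packetio.Buffer that nobody reads from (the pion/transport major version in the import path may differ from the one in use):

package main

import (
	"errors"
	"fmt"

	"github.com/pion/transport/v3/packetio"
)

func main() {
	buf := packetio.NewBuffer()
	buf.SetLimitSize(10) // tiny limit so the buffer fills right away

	// The first packet fits.
	if _, err := buf.Write(make([]byte, 8)); err != nil {
		fmt.Println("unexpected:", err)
	}
	// Nothing ever calls buf.Read, so the next write is dropped with ErrFull -
	// the same "packetio.Buffer is full, discarding write" seen in the ICE log.
	if _, err := buf.Write(make([]byte, 8)); errors.Is(err, packetio.ErrFull) {
		fmt.Println("dropped:", err)
	}
}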

Next, I will try to simplify the reproducible code...
(to be continued...)

@enobufs
Member

enobufs commented Feb 27, 2024

@daonb @Sean-Der Do you have any thoughts on what could cause ice WARNING: 2024/02/26 22:50:14 failed to write packet: packetio.Buffer is full, discarding write?

FYI: I made a slight modification to the log line like this:

ice % git diff
diff --git a/candidate_base.go b/candidate_base.go
index c4f6f1b..5a7420f 100644
--- a/candidate_base.go
+++ b/candidate_base.go
@@ -256,7 +256,7 @@ func handleInboundCandidateMsg(ctx context.Context, c Candidate, buffer []byte,
 
        // NOTE This will return packetio.ErrFull if the buffer ever manages to fill up.
        if _, err := c.agent().buffer.Write(buffer); err != nil {
-               log.Warnf("failed to write packet")
+               log.Warnf("failed to write packet: %v", err)
        }
 }

@enobufs enobufs self-assigned this Feb 27, 2024
@daonb
Contributor Author

daonb commented Feb 27, 2024

@enobufs I don't believe it's related. It could be that the buffer is too small or that your code doesn't read it and it fills up.

Thanks for the comment, it's the nudge I needed to start working on a PR and close this.
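
The change I have in mind is essentially a cap on the computed RTO - an illustrative sketch only; the exact field and function names will be whatever the PR ends up with:

package rto

import "time"

const defaultRTOMax = 60 * time.Second // the current hard-coded constant

// capRTO applies a user-supplied RTOMax from Config. Leaving RTOMax at zero
// keeps today's 60 second behavior, so existing users see no change.
func capRTO(rto, rtoMax time.Duration) time.Duration {
	if rtoMax == 0 {
		rtoMax = defaultRTOMax
	}
	if rto > rtoMax {
		rto = rtoMax
	}
	return rto
}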

daonb added a commit that referenced this issue Feb 27, 2024
This change adds RTOMax to Config letting the user cap the
retransmission timer without breaking compatibility.
If RTOMax is 0 the current value of 60 seconds is used.
daonb added a commit to daonb/sctp that referenced this issue Feb 27, 2024
This change adds RTOMax to Config letting the user cap the
retransmission timer without breaking compatibility.
If RTOMax is 0 the current value of 60 seconds is used.
daonb added a commit to daonb/sctp that referenced this issue Feb 27, 2024
This change adds RTOMax to Config letting the user cap the
retransmission timer without breaking compatibility.
If RTOMax is 0 the current value of 60 seconds is used.
@daonb daonb linked a pull request Feb 27, 2024 that will close this issue
@edaniels
Member

I've also put up pion/transport#278, which removes the limit on the buffer - code that cares already puts a limit on it. Avoiding dropping packets feels nice when there are already multiple layers of buffering. pion/datachannel#196 as well, since the dropping messes with ordering/unordering testing.

@enobufs
Member

enobufs commented Feb 27, 2024

It could be that the buffer is too small or that your code doesn't read it

@daonb Thanks for looking into this. Just to be clear... what I saw is that the receiver-side sctp.Association's readLoop is blocked on a.netConn.Read(), where netConn is the DTLS connection, when the receiver receives a retransmitted packet. The sender side, as I explained, kept sending the retransmission on every T3 timeout. This observation was based on @kylecarbs's benchmark test (in case we were looking at different things).
(As Kyle's code is 3 years old, I am going to create a simpler reproduction.)

edaniels pushed a commit that referenced this issue Feb 27, 2024
This change adds RTOMax to Config letting the user cap the
retransmission timer without breaking compatibility.
If RTOMax is 0 the current value of 60 seconds is used.