Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long delay before reconnect despite PING timeout if remote doesn't cooperate #148

Closed
davyhollevoet opened this issue Nov 28, 2019 · 0 comments
Assignees
Labels

Comments

@davyhollevoet
Copy link

This issue can be triggered by connecting to a server, and then dropping all incoming traffic from said server with a firewall.

The "ping" and "write" loop tasks exit cleanly, but the "read" loop tasks stays alive because the remote server isn't cooperating towards closing the TCP connection. Only after the local TCP stack abandons all hope, the connection times out, the socket closes, and the "read" loop exits.

Expect log

2019-11-28 15:57:05,011 INFO - _receive_packet: Received packet PONG data None
2019-11-28 15:57:09,983 INFO - _send_packet: Sending packet PING data None
2019-11-28 15:57:14,984 INFO - _ping_loop: PONG response has not been received, aborting
2019-11-28 15:57:14,985 INFO - _ping_loop: Exiting ping task
2019-11-28 15:57:14,985 WARNING - _read_loop_websocket: WebSocket connection was closed, aborting
2019-11-28 15:57:14,986 INFO - _read_loop_websocket: Waiting for write loop task to end
2019-11-28 15:57:14,986 INFO - _write_loop: Exiting write loop task
2019-11-28 15:57:14,986 INFO - _read_loop_websocket: Waiting for ping loop task to end
2019-11-28 15:57:14,987 INFO - _handle_eio_disconnect: Engine.IO connection dropped
2019-11-28 15:57:14,987 INFO - _handle_reconnect: Connection failed, new attempt in 1.03 seconds
2019-11-28 15:57:14,988 INFO - _read_loop_websocket: Exiting read loop task
2019-11-28 15:57:16,017 INFO - _connect_polling: Attempting polling connection to ...

Remark the short time span between "Exiting ping task" and "Attempting polling connection"

Actual log

2019-11-28 15:45:21,516 INFO - _receive_packet: Received packet PONG data None
2019-11-28 15:45:26,489 INFO - _send_packet: Sending packet PING data None
2019-11-28 15:45:31,490 INFO - _ping_loop: PONG response has not been received, aborting
2019-11-28 15:45:31,490 INFO - _ping_loop: Exiting ping task
2019-11-28 15:45:31,491 INFO - _write_loop: Exiting write loop task
2019-11-28 15:51:25,302 INFO - _read_loop_websocket: Unexpected error "[Errno 110] Connection timed out", aborting
2019-11-28 15:51:25,303 INFO - _read_loop_websocket: Waiting for write loop task to end
2019-11-28 15:51:25,303 INFO - _read_loop_websocket: Waiting for ping loop task to end
2019-11-28 15:51:25,303 INFO - _handle_eio_disconnect: Engine.IO connection dropped
2019-11-28 15:51:25,303 INFO - _handle_reconnect: Connection failed, new attempt in 0.59 seconds
2019-11-28 15:51:25,303 INFO - _read_loop_websocket: Exiting read loop task
2019-11-28 15:51:25,891 INFO - _connect_polling: Attempting polling connection to ...

Remark that nearly 6 minutes pass before the connection times out. This is after setting net.ipv4.tcp_retries2=10 (linux sysctl). With the default value of 15, the timeout is approx 15 minutes.

The following patch (modified revert) changes this behaviour. It shuts down the socket, causing the read loop to exit properly. A timeout of zero is given to skip the "close reason" dance, because we know that the connection is failing anyway.

diff --git a/engineio/client.py b/engineio/client.py
index f4ab222..44c0a59 100644
--- a/engineio/client.py
+++ b/engineio/client.py
@@ -522,7 +522,7 @@ class Client(object):
                 self.logger.info(
                     'PONG response has not been received, aborting')
                 if self.ws:
-                    self.ws.shutdown()
+                    self.ws.close(timeout=0)
                 self.queue.put(None)
                 break
             self.pong_received = False

Additonal context: We first noticed this issue on a cellular connection (Orange, Belgium). After a reconnect (on cellular level), the connection would take very long to restore. The reason for this turned out to be that RST packets are dropped (normal operation: new IP address, remote server doesn't recognize us, sends a RST which closes the connection on our end, we then reconnect). In absence of those, the connection needs to time out before the read loop exits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants