Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change graceful_shutdown function behavior. #3729

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ionut-slaveanu
Copy link

@ionut-slaveanu ionut-slaveanu commented Aug 8, 2024

@seanmonstar
Copy link
Member

Why is this needed? If handshaking is taking a long time, it should still get closed, in my opinion.

Regarding the greater motivation, I believe it should be possible to implement a max age timer without any change to hyper. If any changes are needed, it's best to open an issue to discuss and propose a design, before coding happens.

@ionut-slaveanu
Copy link
Author

ionut-slaveanu commented Aug 8, 2024

Why is this needed? If handshaking is taking a long time, it should still get closed, in my opinion.

Regarding the greater motivation, I believe it should be possible to implement a max age timer without any change to hyper. If any changes are needed, it's best to open an issue to discuss and propose a design, before coding happens.

The problem is that the requests sent during handshake are lost. This can be tested using ghz tool that I used here as a client:

  • Added support for grpc max_connection_age_ms tonic#1865
    And with a little change in examples/src/max_connection_age/server.rs. I changed:
    .max_connection_age_grace_ms(30000) to .max_connection_age_grace_ms(5)
    These are the results:
    ./ghz --insecure --proto /opt/work3/tonic/examples/proto/helloworld/helloworld.proto --call helloworld.Greeter.SayHello -d '{"name":"Tonic"}' [::1]:50051 -n 100000 --connections=500 --cpus=16 --concurrency=500

Summary:
Count: 100248
Total: 2.22 s
Slowest: 1.15 s
Fastest: 0.03 ms
Average: 3.63 ms
Requests/sec: 45106.59

Response time histogram:
0.032 [1] |
115.077 [23482] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
230.122 [148] |
345.167 [9] |
460.212 [0] |
575.257 [0] |
690.302 [0] |
805.348 [0] |
920.393 [1] |
1035.438 [76] |
1150.483 [19] |

Latency distribution:
10 % in 0.13 ms
25 % in 0.29 ms
50 % in 0.91 ms
75 % in 3.84 ms
90 % in 34.24 ms
95 % in 62.39 ms
99 % in 118.73 ms

Status code distribution:
[OK] 23736 responses
[Unavailable] 76512 responses

Error distribution:
[34] rpc error: code = Unavailable desc = write tcp [::1]:37928->[::1]:50051: write: broken pipe
[41] rpc error: code = Unavailable desc = write tcp [::1]:38566->[::1]:50051: write: broken pipe
[186] rpc error: code = Unavailable desc = write tcp [::1]:53256->[::1]:50051: write: broken pipe
[163] rpc error: code = Unavailable desc = write tcp [::1]:38008->[::1]:50051: write: broken pipe
[111] rpc error: code = Unavailable desc = write tcp [::1]:42796->[::1]:50051: write: broken pipe
[172] rpc error: code = Unavailable desc = write tcp [::1]:54204->[::1]:50051: write: broken pipe
[116] rpc error: code = Unavailable desc = write tcp [::1]:47742->[::1]:50051: write: broken pipe
[74] rpc error: code = Unavailable desc = write tcp [::1]:60426->[::1]:50051: write: broken pipe
[116] rpc error: code = Unavailable desc = write tcp [::1]:42470->[::1]:50051: write: broken pipe
[235] rpc error: code = Unavailable desc = write tcp [::1]:37904->[::1]:50051: write: broken pipe
[166] rpc error: code = Unavailable desc = write tcp [::1]:39546->[::1]:50051: write: broken pipe
[67] rpc error: code = Unavailable desc = write tcp [::1]:52276->[::1]:50051: write: broken pipe
[98] rpc error: code = Unavailable desc = write tcp [::1]:54342->[::1]:50051: write: broken pipe
[43] rpc error: code = Unavailable desc = write tcp [::1]:38538->[::1]:50051: write: broken pipe
[128] rpc error: code = Unavailable desc = write tcp [::1]:38110->[::1]:50051: write: broken pipe
[130] rpc error: code = Unavailable desc = write tcp [::1]:37894->[::1]:50051: write: broken pipe
[94] rpc error: code = Unavailable desc = write tcp [::1]:48438->[::1]:50051: write: broken pipe
[86] rpc error: code = Unavailable desc = write tcp [::1]:59682->[::1]:50051: write: broken pipe
[72] rpc error: code = Unavailable desc = write tcp [::1]:59316->[::1]:50051: write: broken pipe
[55] rpc error: code = Unavailable desc = write tcp [::1]:34356->[::1]:50051: write: broken pipe
[203] rpc error: code = Unavailable desc = write tcp [::1]:54196->[::1]:50051: write: broken pipe
[126] rpc error: code = Unavailable desc = write tcp [::1]:45892->[::1]:50051: write: broken pipe
[55] rpc error: code = Unavailable desc = write tcp [::1]:60098->[::1]:50051: write: broken pipe
[56] rpc error: code = Unavailable desc = write tcp [::1]:59884->[::1]:50051: write: broken pipe
[89] rpc error: code = Unavailable desc = write tcp [::1]:60014->[::1]:50051: write: broken pipe
[129] rpc error: code = Unavailable desc = write tcp [::1]:33936->[::1]:50051: write: broken pipe
[202] rpc error: code = Unavailable desc = write tcp [::1]:58490->[::1]:50051: write: broken pipe
[142] rpc error: code = Unavailable desc = write tcp [::1]:58424->[::1]:50051: write: broken pipe
[282] rpc error: code = Unavailable desc = write tcp [::1]:42666->[::1]:50051: write: broken pipe
[161] rpc error: code = Unavailable desc = write tcp [::1]:48188->[::1]:50051: write: broken pipe
[108] rpc error: code = Unavailable desc = write tcp [::1]:47336->[::1]:50051: write: broken pipe
[63] rpc error: code = Unavailable desc = write tcp [::1]:33534->[::1]:50051: write: broken pipe
[5] rpc error: code = Unavailable desc = write tcp [::1]:45998->[::1]:50051: write: broken pipe
[168] rpc error: code = Unavailable desc = write tcp [::1]:56898->[::1]:50051: write: broken pipe
[136] rpc error: code = Unavailable desc = write tcp [::1]:34526->[::1]:50051: write: broken pipe
[165] rpc error: code = Unavailable desc = write tcp [::1]:40292->[::1]:50051: write: broken pipe
[152] rpc error: code = Unavailable desc = write tcp [::1]:47172->[::1]:50051: write: broken pipe
[73] rpc error: code = Unavailable desc = write tcp [::1]:56186->[::1]:50051: write: broken pipe
[41] rpc error: code = Unavailable desc = write tcp [::1]:38942->[::1]:50051: write: broken pipe
[21] rpc error: code = Unavailable desc = write tcp [::1]:42604->[::1]:50051: write: broken pipe
[64] rpc error: code = Unavailable desc = write tcp [::1]:59666->[::1]:50051: write: broken pipe
[143] rpc error: code = Unavailable desc = write tcp [::1]:50442->[::1]:50051: write: broken pipe
[227] rpc error: code = Unavailable desc = write tcp [::1]:50780->[::1]:50051: write: broken pipe
[218] rpc error: code = Unavailable desc = write tcp [::1]:34108->[::1]:50051: write: broken pipe
So in this case we have:
Status code distribution:
[OK] 23736 responses

only 23736 OK responses, instead of having 100000 OK responses like here:
hyperium/tonic#1428
So, I tried not to miss any responses.

PS: i'm open to any design proposal discussion. Thank you!

@seanmonstar
Copy link
Member

The unavailable error is because of graceful shutdown was triggered during the handshake? I'm struggling to make the connection, but perhaps there's a detail I'm not noticing.

@ionut-slaveanu
Copy link
Author

ionut-slaveanu commented Aug 9, 2024

The unavailable error is because of graceful shutdown was triggered during the handshake? I'm struggling to make the connection, but perhaps there's a detail I'm not noticing.

The unavailable error is not a problem. So, If I have sent 100k requests (-n 100000), I expect to receive 100k OK responses like here:
hyperium/tonic#1865
Status code distribution:
[OK] 100000 responses
[Unavailable] 184 responses

Error distribution:
[91] rpc error: code = Unavailable desc = the connection is draining
[93] rpc error: code = Unavailable desc = transport is closing

We received all 100k responses but some of them (184) got Unavailable error.
In the failed test I sent the same 100k requests, but received only 24k OK responses:
Status code distribution:
[OK] 23736 responses
[Unavailable] 76512 responses

Error distribution:
[34] rpc error: code = Unavailable desc = write tcp [::1]:37928->[::1]:50051: write: broken pipe.

So, yes I think this is the problem the graceful shutdown was triggered during the handshake. In this case the client will receive a broken pipe error:

  • [34] rpc error: code = Unavailable desc = write tcp [::1]:37928->[::1]:50051: write: broken pipe.

when trying to write to that stream and both request and response will be lost.
This may never happen in production with with a proper max_connection_age_ms value. I set it to 5 ms, which is a little exaggerated, only for stress testing of this feature. So, yes we could use directly the current graceful_shutdown function without any changes in hyper and hyper-util as you said before, but in this case the client that use max_connection_age_ms configuration option in tonic needs to know to set it to a bigger value like 30000ms (30s) to make sure that the graceful_shutdown does not happen during http2 handshake.

Anyway, I think the best option is to fix somehow also the case when the client of tonic that use this library sets max_connection_age_ms to a small value like 5ms and in that case tonic should not miss any OK responses. Or maybe to not permit to set it below 30s or same minimum default value.

@seanmonstar
Copy link
Member

But this change fixes the problem? It seems this just cancels graceful shutdown if triggered while a connection is in the process of starting.

@ionut-slaveanu
Copy link
Author

ionut-slaveanu commented Aug 14, 2024

Yes, this is what I'm trying to do, to not shutdown the connections that are in handshaking state and to try the shutdown again later, after the handshake completes. That's because, in this case the responses will be lost. Is there any external method to know that the http2 connection is in handshaking state? I'm thinking to change my tonic PR to something like this:

if connection.state != HANDSHAKING {
connection .graceful_shutdown()
}

@seanmonstar
Copy link
Member

That doesn't seem to me like what most people would want from this method. Consider: as implemented in this PR, if graceful shutdown is triggered a couple milliseconds after the connection starts, and thus gets ignored because of handshaking, then this connection will stay in an "active" state indefinitely, and then you're much more likely to have active streams that get abruptly interrupted when the final shutdown step just kills everything that hasn't stopped. That can be even worse.

I could imagine a solution here: instead of just closing immediately, a flag could be set that as soon as the handshake is done, then it starts the HTTP/2 graceful shutdown. So, literally all requests will get rejected, but as least the client will know from the GOAWAY frame that none of the requests were handled.

@ionut-slaveanu
Copy link
Author

That doesn't seem to me like what most people would want from this method. Consider: as implemented in this PR, if graceful shutdown is triggered a couple milliseconds after the connection starts, and thus gets ignored because of handshaking, then this connection will stay in an "active" state indefinitely, and then you're much more likely to have active streams that get abruptly interrupted when the final shutdown step just kills everything that hasn't stopped. That can be even worse.

I could imagine a solution here: instead of just closing immediately, a flag could be set that as soon as the handshake is done, then it starts the HTTP/2 graceful shutdown. So, literally all requests will get rejected, but as least the client will know from the GOAWAY frame that none of the requests were handled.

I tried to implement your suggested solution.

@ionut-slaveanu ionut-slaveanu changed the title Added graceful_shutdown_v2 function. Change graceful_shutdown function behavior. Aug 16, 2024
@ionut-slaveanu
Copy link
Author

Fixed workflow:
CI

@jeromegn
Copy link

This would be useful to us as we noticed we're closing connections abruptly if the h2 handshake is in progress and we're trying to gracefully shut down the server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants