
Use the new async state machine pooling feature on websockets #50921

Closed
davidfowl opened this issue Apr 8, 2021 · 33 comments · Fixed by #56282
Assignees
Labels
area-System.Net, enhancement (Product code improvement that does NOT require public API changes/additions), tenet-performance (Performance related issue)
Milestone

Comments

@davidfowl
Member

davidfowl commented Apr 8, 2021

Description

Looking at an allocation profile for SignalR, the top allocations are coming from state machines allocated by Kestrel on websocket reads. We should be able to remove these by using the new pooling introduced here #50116.

Regression?

No

Data

[image: allocation profile]

Analysis

This is the code path:

private async ValueTask<TWebSocketReceiveResult> ReceiveAsyncPrivate<TWebSocketReceiveResultGetter, TWebSocketReceiveResult>(
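For context, the pooling feature from #50116 surfaced in .NET 6 / C# 10 as an async-method-builder override: annotating an async method swaps its builder for one that pools the boxed state machine. A minimal sketch of opting a method in (the method body here is a stand-in, not the WebSocket code):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

static class PoolingExample
{
    // Swap the default builder for the pooling one (ships in .NET 6);
    // the boxed state machine is then rented/returned instead of being
    // allocated on every await that yields.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public static async ValueTask<int> ReceiveCoreAsync()
    {
        await Task.Yield(); // stand-in for the real socket read
        return 42;
    }

    static async Task Main() =>
        Console.WriteLine(await ReceiveCoreAsync());
}
```

The attribute is per-method, so it could be applied to just a hot path such as ReceiveAsyncPrivate once profiling justifies it.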

cc @stephentoub @CarnaViire

@davidfowl davidfowl added the tenet-performance Performance related issue label Apr 8, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net untriaged New issue has not been triaged by the area owner labels Apr 8, 2021
@ghost

ghost commented Apr 8, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@stephentoub
Member

stephentoub commented Apr 8, 2021

The plan is to use the feature in a few places based on profiling to see that it not only reduces allocations but also improves throughput under load. This is certainly one of the places we'll experiment with.

We haven't done that yet because the feature hasn't been implemented in the compiler yet. There's no action that can be reasonably taken until that happens.

@davidfowl
Member Author

The plan is to use the feature in a few places based on profiling to see that it not only reduces allocations but also improves throughput under load.

This is the one that always gets me. CPU time spent in the GC vs CPU time spent in our own pools calling rent/return.

We haven't done that yet because the feature hasn't been implemented in the compiler yet. There's no action that can be reasonably taken until that happens.

Yea, I know, I'm just filing issues and storing profiles.

@stephentoub
Member

This is the one that always gets me. CPU time spent in the GC vs CPU time spent in our own pools calling rent/return.

"gets you" meaning you forget that there's a cost to pooling, or "gets you" meaning you think we shouldn't be concerned about such costs and should always pool everything?

@davidfowl
Member Author

davidfowl commented Apr 8, 2021

"gets you" meaning you forget that there's a cost to pooling, or "gets you" meaning you think we shouldn't be concerned about such costs and should always pool everything?

The latter, assuming it doesn't make things worse, because allocations are like paper cuts that add up (peanut butter). There are a couple of things at play:

  • IMO, not allocating for infrastructure things where it can easily be avoided makes sense
  • As libraries that run in various scenarios, our allocating means the end application needs to tune the GC to meet its performance needs. That means understanding the framework's and the runtime's allocations.

We don't have a good way to measure the CPU impact of all of these small pools, which makes it hard to tell whether they beat the GC when you don't see a throughput increase (or a working-set decrease).

The other very unsatisfying thing IMO is not seeing a working-set reduction when reducing allocations (gen0 ones). It feels like we're just delaying when a GC happens, not saving on the overall memory of the application.

@stephentoub
Member

stephentoub commented Apr 8, 2021

They're all "allocations"; it's just that only some of them show up in GC allocation profile traces. We could allocate and free native memory; that wouldn't show up on a managed profile, but it could be a worse issue. We could allocate and free by renting and returning from some pool, and that similarly wouldn't show up on a profile. But these things all have costs. In a sense, the GC itself is a pool: it grabs some memory from the OS and then effectively rents out pieces of it; it's just that, rather than those pieces being explicitly returned by code, the GC goes and returns them itself. There are many considerations that need to be factored in when pooling, and most pool implementations don't, e.g. using objects in a manner less likely to cause cache problems.
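To make those costs concrete, here is a deliberately minimal pool (a hypothetical sketch, not anything in the runtime): even this simple version pays a synchronized dequeue/enqueue on every rent/return, keeps objects alive longer than the GC would, and ignores the cache-locality concerns mentioned above.

```csharp
using System.Collections.Concurrent;

public sealed class SimplePool<T> where T : class, new()
{
    private readonly ConcurrentQueue<T> _items = new();
    private readonly int _maxRetained;

    public SimplePool(int maxRetained = 64) => _maxRetained = maxRetained;

    // Every rent is a synchronized dequeue; a miss falls back to allocation.
    public T Rent() => _items.TryDequeue(out var item) ? item : new T();

    // Racy count check, fine for an approximate cap; objects kept here
    // stay alive across generations instead of dying cheaply in gen0.
    public void Return(T item)
    {
        if (_items.Count < _maxRetained)
            _items.Enqueue(item);
    }
}
```

None of these costs appear in a managed allocation profile, which is exactly the measurement gap being discussed.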

We don't have a good way to measure the CPU impact of all of these small pools, which makes it hard to tell whether they beat the GC when you don't see a throughput increase (or a working-set decrease).

Right. So if we can't come up with a scenario where we see a meaningful improvement, we're better off not adding the complexity of a custom pool, just to hide it from a managed allocation profile, with no other benefits we're able to measure, especially if that pool doesn't play nicely with all of the GC's knobs, which close to none do.

The latter assuming it doesn't make things worse

This is where we differ. You say "do it as long as it doesn't make things worse". I say "do it if it makes things better".

cc: @Maoni0

@davidfowl
Member Author

We could allocate and free by renting and returning from some pool, and that similarly wouldn't show up on a profile. But these things all have costs. In a sense, the GC itself is a pool: it grabs some memory from the OS and then effectively rents out pieces of it; it's just that, rather than those pieces being explicitly returned by code, the GC goes and returns them itself. There are many considerations that need to be factored in when pooling, and most pool implementations don't, e.g. using objects in a manner less likely to cause cache problems.

This is exactly why we need to be able to measure: we measure scenarios and make decisions, but as a library author you don't have any control over those scenarios. The peanut-butter nature of gen0 allocations is what bugs me and why it "feels right" as a library author to err on the side of reuse as much as possible where it doesn't make the code crazy. Now I'm not advocating for global pools for all the things (I think they suck in the various ways you mention), but it's really the only control you have outside of telling the app developer to

Right. So if we can't come up with a scenario where we see a meaningful improvement, we're better off not adding the complexity of a custom pool, just to hide it from a managed allocation profile, with no other benefits we're able to measure, especially if that pool doesn't play nicely with all of the GC's knobs, which close to none do.

Complexity is in the eye of the beholder. I think we just need to be able to measure the cost of these custom pools better. It's the only knob we have as library authors.

This is where we differ. You say "do it as long as it doesn't make things worse". I say "do it if it makes things better".

This is where we disagree 😄, because I plan to pool where possible (and where it seems wasteful) and where it doesn't make things too complex (judgement call obviously). I see it as giving more headroom to the applications where libraries exist as a guest.

@geoffkizer
Contributor

using objects in a manner less likely to cause cache problems

I wonder about this specifically re pooling of state machines. The state machines hold the async "stack". Seems like GC allocations would do better at locality than pooling.

@davidfowl
Member Author

using objects in a manner less likely to cause cache problems

I'd like to be able to measure some of this as well, because I assume it is also very much peanut butter.

@davidfowl
Member Author

I guess this comes down to the fact that gen0 isn't better than pooling in lots of cases. We've seen evidence in Kestrel that pooling has cumulatively improved the overall throughput. You could write a pool that is really bad and ends up making performance worse, or one that is too complex to maintain, but generally well-isolated pools, or simple object reuse, do better.

@benaadams
Member

The other very unsatisfying thing IMO is not seeing a working-set reduction when reducing allocations (gen0 ones). It feels like we're just delaying when a GC happens, not saving on the overall memory of the application.

I see this; short of reducing allocs to 0 or adding the settings to cap the GC size, even if your allocations are only 1000 Tasks per second, eventually it will tend to 1GB of heap even if only 5MB is ever in use. I did think there was some heuristic the GC used to adjust its allocation budget so as to match the heap size to the allocation rate? https://devblogs.microsoft.com/dotnet/garbage-collection-at-food-courts/

This is the same as GC adjusts the amount of allocations before the next collection. It’s called the allocation budget. And this “adjusting” that GC does is part of its dynamic tuning. As the process runs, GC will modify this allocation budget according to the survival rate it observes to keep itself productive.

Do the long-surviving startup allocations throw it off, or maybe I'm misunderstanding the allocation budget? Perhaps it starts off "infinite" and reduces if allocations are high (large heap -> collected-faster large heap), rather than starting off low and increasing if allocations are high (small heap -> large heap) /cc @Maoni0

@benaadams
Member

eventually it will tend to 1GB of heap even if only 5MB is ever in use

Well; maybe not 1GB anymore with the segment size changes; but larger than expected as a slow allocation rate waits a very long time for a GC by which time the heap has grown significantly even if nothing is ever particularly in use (on a machine with significant RAM and no GC setting tweaking).

Saying that, I haven't experienced any performance issues from this large heap collection; it's more of a perception issue from looking at what the OS reports the app as using, so perhaps it's more of a feel-good factor? 😅

@Maoni0
Member

Maoni0 commented Apr 8, 2021

Do the long-surviving startup allocations throw it off, or maybe I'm misunderstanding the allocation budget? Perhaps it starts off "infinite" and reduces if allocations are high (large heap -> collected-faster large heap), rather than starting off low and increasing if allocations are high (small heap -> large heap) /cc @Maoni0

the initial gen0 budget is mostly based on LLC (last level cache) size. during startup when gen0 surv rate is very high the budget for gen0 will quickly get adjusted to be high. when the gen0 surv rate becomes much lower during the steady state the budget will be adjusted down to reflect that.

@benaadams
Member

the initial gen0 budget is mostly based on LLC size. during startup when gen0 surv rate is very high the budget for gen0 will quickly get adjusted to be high. when the gen0 surv rate becomes much lower during the steady state the budget will be adjusted down to reflect that.

Might be the extremely fast ramp up that exhibits this then? Pushing up the predicted budget, followed by no GC for a very long time while it fills the budget? I understand I'm not familiar enough, so I'm essentially completely guessing 😅; but if the budgets are re-evaluated only at the next GC, perhaps a timeout "pseudo" GC, triggered when the budget hasn't been hit for a while, could serve to adjust the budget down if the steady-state allocation rate flips to something massively lower?

e.g. from my various amortization issues (#30168, dotnet/aspnetcore#11940, etc.) the only allocations for our websocket server are the async Read boxes in the layering chain of reads; but the app will still end up using a lot of memory according to the OS (though the active managed heap is pretty minimal)

[image: allocation profile of the websocket server]

@stephentoub's "Make "async ValueTask/ValueTask" methods ammortized allocation-free" dotnet/coreclr#26310 (now removed 😢) reduced this to a single async-box allocation per read for the steady state (in WebSockets, where it does an .AsTask()); but the memory would still, very slowly, inch up to a high level before it triggered a GC (the startup ramp-up was very fast, large, and nearly 100% survived)

I haven't rerun this test on the latest runtime; so perhaps I should also do that and find out what it stops at.

@benaadams
Member

perhaps a timeout "pseudo" GC if ....

I could probably also do a GC.Collect() on a N minute timer to test this theory 🤔
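A sketch of that experiment (hypothetical test code; the names here are invented):

```csharp
using System;
using System.Threading;

static class PeriodicGc
{
    // Force a blocking collection every 'period' so a stale, oversized
    // budget gets re-evaluated even when allocations never hit it.
    public static Timer Start(TimeSpan period) =>
        new Timer(_ =>
        {
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: false);
            Console.WriteLine($"heap after forced GC: {GC.GetTotalMemory(false):N0} bytes");
        }, null, period, period);
}
```

e.g. hold `using var timer = PeriodicGc.Start(TimeSpan.FromMinutes(5));` for the duration of the run and watch whether the working set stops climbing.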

@davidfowl
Member Author

but the app will still end up using a lot of memory according to the OS (though the active managed heap is pretty minimal)

There's a new counter in .NET 6 dotnet/diagnostics#2139 that might help track down where the rest of the memory is going.
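In the meantime, part of the breakdown is visible in-process: GC.GetGCMemoryInfo (available since .NET 5) reports what the GC has committed, and the gap between that and the OS working set is native memory, code, stacks, etc. A small sketch (class and method names invented here):

```csharp
using System;

static class MemoryBreakdown
{
    // The three numbers being compared in this thread.
    public static (long GcCommitted, long GcHeap, long WorkingSet) Snapshot()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        return (info.TotalCommittedBytes,                      // memory the GC owns
                GC.GetTotalMemory(forceFullCollection: false), // objects on the heap
                Environment.WorkingSet);                       // what the OS reports
    }

    static void Main()
    {
        var (committed, heap, workingSet) = Snapshot();
        Console.WriteLine($"GC committed: {committed:N0} B, GC heap: {heap:N0} B, working set: {workingSet:N0} B");
    }
}
```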

@Maoni0
Member

Maoni0 commented Apr 9, 2021

Might be the extremely fast ramp up that exhibits this then? Pushing up the predicted budget; followed by no GC for a very long time where it fills the budget?

if the budget was adjusted very high, yes. it'll need to wait till the next GC to observe that your alloc behavior has changed. in workstation GC we do check more often (but then again WKS GC has a much smaller max budget). we should probably check this for Server GC as well but in general this hasn't been a problem (happy to discuss if it's a problem for you).

@Maoni0
Member

Maoni0 commented Apr 9, 2021

of course you can also collect a trace to verify this theory - GC informational events will tell you the budget.

@davidfowl
Member Author

davidfowl commented Apr 9, 2021

Saying that, I haven't experienced any performance issues from this large heap collection; it's more of a perception issue from looking at what the OS reports the app as using, so perhaps it's more of a feel-good factor? 😅

It's more than a feel good factor, it's also why you can't tell how much memory you need to run. This shows up when people try to set container limits.

I do think there's a balance to be struck here and that's basically what's being discussed. Easy allocations don't really need a benchmark to remove. We removed extra closure, string, and delegate allocations with small changes and new compiler features (like static lambdas). We also agree that pooling large objects is beneficial because they are actually expensive in the GC, so reusing them helps. So now we're trying to find the middle ground: how much do we contribute to gen0 as a framework/set of libraries? It's sometimes obvious to do this if there's a throughput benefit, but we rarely mention the memory benefit because it's somewhat out of the library's hands (you just run the GC more often, right?).
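The static-lambda point can be shown in a few lines: the capturing version allocates a closure object and a delegate per call, while a 'static' lambda (C# 9) forbids capture, so the compiler can cache a single delegate (illustrative sketch; names invented here):

```csharp
using System;

static class LambdaAllocs
{
    static int Sum(int[] values, Func<int, int> selector)
    {
        int total = 0;
        foreach (var v in values) total += selector(v);
        return total;
    }

    public static int Run(int factor)
    {
        var data = new[] { 1, 2, 3 };

        // Captures 'factor': a closure object plus a delegate are
        // allocated on every call to Run.
        int capturing = Sum(data, x => x * factor);

        // 'static' lambda: capturing is a compile error, and the
        // non-capturing delegate is cached, so no per-call allocation.
        int nonCapturing = Sum(data, static x => x * 2);

        return capturing + nonCapturing;
    }
}
```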

PS: I'm spending some time trying to better understand the overhead and impact of many object specialized pools in the system:

[image: list of specialized object pools]

@benaadams
Member

benaadams commented Apr 9, 2021

An example: in Kestrel/samples/PlaintextApp with the addition of TLS, there are essentially 3 allocations after start-up and the initial establishment of 256 connections: the async boxes for StreamPipeReader+ReadAsync, SslStream+ReadAsyncInternal, and DuplexPipeStream+ReadAsyncInternal (all of which should be able to be pooled?)

[image: allocation profile showing the three async boxes]

Steady state (while doing 72k req/s) is 8.497 MB after GC(?), but Task Manager is reporting a working set of 90MB and stays there (10 min run)

[image: Task Manager showing the ~90MB working set]

of course you can also collect a trace to verify this theory - GC informational events will tell you the budget.

Trace: Plaintext.etl

dotnet-counters shows

[Microsoft.AspNetCore.Hosting]
    Current Requests                                               0
    Failed Requests                                                0
    Request Rate (Count / 1 sec)                              72,490
    Total Requests                                         4,072,134
[System.Runtime]
    % Time in GC since last GC (%)                                 0
    Allocation Rate (B / 1 sec)                           44,066,696
    CPU Usage (%)                                                 45
    Exception Count (Count / 1 sec)                                0
    GC Committed Bytes (B)                                46,350,336
    GC Fragmentation (%)                                           0.442
    GC Heap Size (MB)                                             19
    Gen 0 GC Count (Count / 1 sec)                                 2
    Gen 0 Size (B)                                           323,432
    Gen 1 GC Count (Count / 1 sec)                                 0
    Gen 1 Size (B)                                         4,635,744
    Gen 2 GC Count (Count / 1 sec)                                 0
    Gen 2 Size (B)                                         3,735,864
    IL Bytes Jitted (B)                                      232,353
    LOH Size (B)                                             131,272
    Monitor Lock Contention Count (Count / 1 sec)                  2
    Number of Active Timers                                        0
    Number of Assemblies Loaded                                   69
    Number of Methods Jitted                                   3,291
    POH (Pinned Object Heap) Size (B)                      1,457,352
    ThreadPool Completed Work Item Count (Count / 1 sec)     136,295
    ThreadPool Queue Length                                       21
    ThreadPool Thread Count                                       15
    Working Set (MB)                                              91

So GC Committed Bytes is about the same as the allocation rate of ~45MB/sec; the working set is about double that, at ~91MB.

GC Heap Size is much lower at 19MB; it's doing 2 Gen0/sec (which, to be fair, seems a reasonable rate, so maybe what the GC is doing is fine); however this is all to clean up 3 async read objects, of which only a max of 768 can exist at any time (3 x 256); so a 91MB working set just presents as unexpectedly high memory use when there's nothing particularly active at any given time?

Pooling these 3 async boxes in this example might be a drop in the ocean for another app's allocations, as it will be something else; however that app will still have these allocations.

However it currently presents as: the simplest Plaintext app, with only 256 connections, needs 90MB to run, which seems high as the baseline for the most minimal ASP.NET Core app with TLS?

@benaadams
Member

which to be fair seems a reasonable rate so maybe what the GC is doing is fine

i.e. 2 GCs/sec seems nice; and perhaps there is a tweak so its budget doesn't go so high at the very start:

[image: GC budget spike at startup]

However, mostly, it is actually allocating 45MB/sec, and I'd prefer to not allocate that 45MB, since it's an option, rather than increase the GC rate (if that makes sense?)

@Maoni0
Member

Maoni0 commented Apr 9, 2021

@benaadams it was high because the survival rate was very high, and soon after the budget got adjusted down, so that seems exactly the right thing to do. I'm curious to know why you wouldn't want it to go so high? you get fewer GCs that way when GC knows it's not going to be productive to do a collection.

GC Heap Size is much lower at 19MB;

I think we should seriously rename all these counters to add an "after GC" at the end of their names... because this does not include the gen0 you are about to allocate. would that help with clarifying why there's a discrepancy?

@benaadams
Member

benaadams commented Apr 10, 2021

it was high because the survival rate was very high, and soon after the budget got adjusted down, so that seems exactly the right thing to do. I'm curious to know why you wouldn't want it to go so high?

I don't particularly have an issue with what the GC is doing; the app is allocating at a rate and the GC is running twice a second for that allocation rate; I'm not saying I want the GC to run more frequently 😅

It's the async scenario that bothers me. If these were changed to sync reads rather than async reads, the steady-state allocation rate would become zero (and likely the working set would be smaller).

Moving to async, the allocation rate moves to 45MB/s purely for these 3 async boxes; and, as an app developer, since they are embedded in the framework, they are allocations nothing can be done about; but it's very clear what they are:

[image: allocation profile of the three async read boxes]

To me it just seems that if there is a clear, demonstrable, and straightforward opportunity to bring the allocation rate and working set down, assuming there aren't significant downsides, that would be a better direction to favour?

@davidfowl
Member Author

I have a different take on this. In the above case, the GC can keep up with the allocation rate, things are great, and CPU time spent in the GC is low. The problem is, there's a rate at which the GC can't keep up; CPU time inevitably rises, and then we look at how to reduce allocations to reduce time spent in the GC. If the libraries you use allocate into gen0, they end up setting the floor for your allocation rate. At this stage you have 2 options:

  • Tune the GC budgets to run the GC less, resulting in using more memory
  • Start pooling (or avoiding allocations) to avoid contributing to the allocation rate

Does this sound correct or am I off base?

@benaadams
Member

Tune the GC budgets to run the GC less, resulting in using more memory

So for less total memory use, the GC can run more frequently (and there are constraint settings available for that), but that means allocations will prematurely age up, making a Gen2 collection more likely; or, as you say, it can use much more memory with less frequent GCs, meaning the lifetime of the objects has likely passed, making a Gen2 collection less likely.

However it's better to avoid allocations where we can: to give the GC more headroom to deal with the allocations that we can't easily avoid, and to give user code a bigger goldilocks zone where they aren't worrying about how much they allocate.

Start pooling (or avoiding allocations) to avoid contributing to the allocation rate

This is the crux of the issue for user code. For example, if it's LINQ use, they can likely tweak it to not use LINQ without much of an issue; and if it's something infrequent and not in the hot path, it doesn't matter too much.

However, the above examples, which happen for every read of data, are both very much in the hot path; and the cost for user code to avoid them is extremely high (reimplementing their own TLS layer as per Drawaes/Leto, or reimplementing WebSockets as per StackExchange/NetGain), essentially ditching the framework stack; these are major investments with extra security concerns.

Moving away from the simple Plaintext app; this is our "In Space" WebSocket server showing allocations between two GCs on current .NET 6.0:

[image: allocations between two GCs]

99.9% of the allocations are the main-path .NET/ASP.NET Core APIs and not our app's; as far as I'm aware there is nothing we can do, short of ditching the framework, to reduce these allocations? It's essentially the baseline allocation rate we cannot go under.

@benaadams
Member

Added change to fix the byte[] allocations in the above #51060

@ManickaP ManickaP added this to the Future milestone Apr 22, 2021
@ManickaP ManickaP removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021
@karelz karelz added the enhancement Product code improvement that does NOT require public API changes/additions label Apr 22, 2021
@ManickaP
Member

Triage: part of larger effort, needs measurable wins.

@davidfowl
Member Author

davidfowl commented Apr 22, 2021

Networking is a high-performance part of the stack that is used in many scenarios. I'd expect us to do this in WebSockets and SslStream, as they are in the hot path of apps that use them.

If it doesn't fit there I'd have a hard time imagining where it does fit.

@davidfowl
Member Author

@stephentoub is the feature (the pooling) in yet for C# 10?

@stephentoub
Member

is the feature (the pooling) in yet for C# 10?

The compiler support for async method builder overrides is not in yet.
cc: @jcouv

@stephentoub stephentoub self-assigned this Jul 23, 2021
@stephentoub stephentoub modified the milestones: Future, 6.0.0 Jul 23, 2021
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 11, 2021
@davidfowl
Member Author

SslStream next! 😁

@stephentoub
Member

Not in 6.

@davidfowl
Member Author

davidfowl commented Aug 11, 2021

Yea, I didn't expect any of it in 6, so I'm thankful for what we got so far

@ghost ghost locked as resolved and limited conversation to collaborators Sep 10, 2021
7 participants