
Use the new async state machine pooling feature on websockets #50921

Closed
davidfowl opened this issue Apr 8, 2021 · 33 comments · Fixed by #56282
Assignees
Labels
area-System.Net, enhancement (Product code improvement that does NOT require public API changes/additions), tenet-performance (Performance related issue)
Milestone

Comments

@davidfowl
Member

davidfowl commented Apr 8, 2021

Description

Looking at an allocation profile for SignalR, the top allocations are coming from state machines allocated by Kestrel on websocket reads. We should be able to remove these by using the new pooling introduced here #50116.

Regression?

No

Data

[image: allocation profile]

Analysis

This is the code path:

private async ValueTask<TWebSocketReceiveResult> ReceiveAsyncPrivate<TWebSocketReceiveResultGetter, TWebSocketReceiveResult>(
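For context, the pooling feature from #50116 surfaced in .NET 6 / C# 10 as an async-method-builder override: annotating an async method swaps its builder for one that pools the boxed state machine. A minimal sketch of opting a method in (the method body here is a stand-in, not the WebSocket code):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

static class PoolingExample
{
    // Swap the default builder for the pooling one (ships in .NET 6);
    // the boxed state machine is then rented/returned instead of being
    // allocated on every await that yields.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public static async ValueTask<int> ReceiveCoreAsync()
    {
        await Task.Yield(); // stand-in for the real socket read
        return 42;
    }

    static async Task Main() =>
        Console.WriteLine(await ReceiveCoreAsync());
}
```

The attribute is per-method, so it could be applied to just a hot path such as ReceiveAsyncPrivate once profiling justifies it.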

cc @stephentoub @CarnaViire

@davidfowl davidfowl added the tenet-performance Performance related issue label Apr 8, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net untriaged New issue has not been triaged by the area owner labels Apr 8, 2021
@ghost

ghost commented Apr 8, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@stephentoub
Member

stephentoub commented Apr 8, 2021

The plan is to use the feature in a few places based on profiling to see that it not only reduces allocations but also improves throughput under load. This is certainly one of the places we'll experiment with.

We haven't done that yet because the feature hasn't been implemented in the compiler yet. There's no action that can be reasonably taken until that happens.

@davidfowl
Member Author

The plan is to use the feature in a few places based on profiling to see that it not only reduces allocations but also improves throughput under load.

This is the one that always gets me. CPU time spent in the GC vs CPU time spent in our own pools calling rent/return.

We haven't done that yet because the feature hasn't been implemented in the compiler yet. There's no action that can be reasonably taken until that happens.

Yea, I know, I'm just filing issues and storing profiles.

@stephentoub
Member

This is the one that always gets me. CPU time spent in the GC vs CPU time spent in our own pools calling rent/return.

"gets you" meaning you forget that there's a cost to pooling, or "gets you" meaning you think we shouldn't be concerned about such costs and should always pool everything?

@davidfowl
Member Author

davidfowl commented Apr 8, 2021

"gets you" meaning you forget that there's a cost to pooling, or "gets you" meaning you think we shouldn't be concerned about such costs and should always pool everything?

The latter, assuming it doesn't make things worse, because allocations are like paper cuts that add up (peanut butter). There are a couple of things at play:

  • IMO, not allocating for infrastructure things where it can easily be avoided makes sense
  • As libraries that run in various scenarios, our allocating means the end application needs to tune the GC to meet its performance needs. That means understanding the framework's and the runtime's allocations.

We don't have a good way to measure the CPU impact of all of these small pools, which makes it hard to tell whether they beat the GC when you don't see a throughput increase (or a working-set decrease).

The other very unsatisfying thing IMO is not seeing a working-set reduction when reducing allocations (gen0 ones). It feels like we're just delaying when a GC happens, not saving on the overall memory of the application.

@stephentoub
Member

stephentoub commented Apr 8, 2021

They're all "allocations"; it's just that only some of them show up in GC allocation profile traces. We could allocate and free native memory; that wouldn't show up on a managed profile, but it could be a worse issue. We could allocate and free by renting and returning from some pool, and that similarly wouldn't show up on a profile. But these things all have costs. In a sense, the GC itself is a pool: it grabs some memory from the OS and then effectively rents out pieces of it; it's just that, rather than those pieces being explicitly returned by code, the GC goes and returns them itself. There are many considerations that need to be factored in when pooling, and most pool implementations don't, e.g. using objects in a manner less likely to cause cache problems.
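To make those costs concrete, here is a deliberately minimal pool (a hypothetical sketch, not anything in the runtime): even this simple version pays a synchronized dequeue/enqueue on every rent/return, keeps objects alive longer than the GC would, and ignores the cache-locality concerns mentioned above.

```csharp
using System.Collections.Concurrent;

public sealed class SimplePool<T> where T : class, new()
{
    private readonly ConcurrentQueue<T> _items = new();
    private readonly int _maxRetained;

    public SimplePool(int maxRetained = 64) => _maxRetained = maxRetained;

    // Every rent is a synchronized dequeue; a miss falls back to allocation.
    public T Rent() => _items.TryDequeue(out var item) ? item : new T();

    // Racy count check, fine for an approximate cap; objects kept here
    // stay alive across generations instead of dying cheaply in gen0.
    public void Return(T item)
    {
        if (_items.Count < _maxRetained)
            _items.Enqueue(item);
    }
}
```

None of these costs appear in a managed allocation profile, which is exactly the measurement gap being discussed.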

We don't have a good way to measure the CPU impact of all of these small pools, which makes it hard to tell whether they beat the GC when you don't see a throughput increase (or a working-set decrease).

Right. So if we can't come up with a scenario where we see a meaningful improvement, we're better off not adding the complexity of a custom pool, just to hide it from a managed allocation profile, with no other benefits we're able to measure, especially if that pool doesn't play nicely with all of the GC's knobs, which close to none do.

The latter assuming it doesn't make things worse

This is where we differ. You say "do it as long as it doesn't make things worse". I say "do it if it makes things better".

cc: @Maoni0

@davidfowl
Member Author

We could allocate and free by renting and returning from some pool, and that similarly wouldn't show up on a profile. But these things all have costs. In a sense, the GC itself is a pool: it grabs some memory from the OS and then effectively rents out pieces of it; it's just that, rather than those pieces being explicitly returned by code, the GC goes and returns them itself. There are many considerations that need to be factored in when pooling, and most pool implementations don't, e.g. using objects in a manner less likely to cause cache problems.

This is exactly why we need to be able to measure: we measure scenarios and make decisions, but as a library author you don't have any control over those scenarios. The peanut-butter nature of gen0 allocations is what bugs me and why it "feels right" as a library author to err on the side of reuse as much as possible where it doesn't make the code crazy. Now I'm not advocating for global pools for all the things (I think they suck in the various ways you mention), but it's really the only control you have outside of telling the app developer to

Right. So if we can't come up with a scenario where we see a meaningful improvement, we're better off not adding the complexity of a custom pool, just to hide it from a managed allocation profile, with no other benefits we're able to measure, especially if that pool doesn't play nicely with all of the GC's knobs, which close to none do.

Complexity is in the eye of the beholder. I think we just need to be able to measure the cost of these custom pools better. It's the only knob we have as library authors.

This is where we differ. You say "do it as long as it doesn't make things worse". I say "do it if it makes things better".

This is where we disagree 😄, because I plan to pool where possible (and where it seems wasteful) and where it doesn't make things too complex (judgement call obviously). I see it as giving more headroom to the applications where libraries exist as a guest.

@geoffkizer
Contributor

using objects in a manner less likely to cause cache problems

I wonder about this specifically re pooling of state machines. The state machines hold the async "stack". Seems like GC allocations would do better at locality than pooling.

@davidfowl
Member Author

using objects in a manner less likely to cause cache problems

I'd like to be able to measure some of this as well, because I assume it is also very much peanut butter.

@davidfowl
Member Author

I guess this comes down to the fact that gen0 isn't better than pooling in lots of cases. We've seen evidence in Kestrel that pooling has cumulatively improved the overall throughput. You could write a pool that is really bad and ends up making performance worse, or one that is too complex to maintain, but generally well-isolated pools, or simple object reuse, do better.

@benaadams
Member

The other very unsatisfying thing IMO is not seeing a working-set reduction when reducing allocations (gen0 ones). It feels like we're just delaying when a GC happens, not saving on the overall memory of the application.

I see this; short of reducing allocs to 0 or adding the settings to cap the GC size, even if your allocations are only 1000 Tasks per second, eventually it will tend to 1GB of heap even if only 5MB is ever in use. I did think there was some heuristic the GC used to adjust its allocation budget so as to match the heap size to the allocation rate? https://devblogs.microsoft.com/dotnet/garbage-collection-at-food-courts/

This is the same as GC adjusts the amount of allocations before the next collection. It’s called the allocation budget. And this “adjusting” that GC does is part of its dynamic tuning. As the process runs, GC will modify this allocation budget according to the survival rate it observes to keep itself productive.

Do the long-surviving startup allocations throw it off, or maybe I'm misunderstanding the allocation budget? Perhaps it starts off "infinite" and reduces if allocations are high (large heap -> collected-faster large heap), rather than starting off low and increasing if allocations are high (small heap -> large heap) /cc @Maoni0

@benaadams
Member

eventually it will tend to 1GB of heap even if only 5MB is ever in use

Well; maybe not 1GB anymore with the segment size changes; but larger than expected as a slow allocation rate waits a very long time for a GC by which time the heap has grown significantly even if nothing is ever particularly in use (on a machine with significant RAM and no GC setting tweaking).

Saying that, I haven't experienced any performance issues from this large heap collection; it's more of a perception issue from looking at what the OS reports the app as using, so perhaps it's more of a feel-good factor? 😅

@Maoni0
Member

Maoni0 commented Apr 8, 2021

Do the long-surviving startup allocations throw it off, or maybe I'm misunderstanding the allocation budget? Perhaps it starts off "infinite" and reduces if allocations are high (large heap -> collected-faster large heap), rather than starting off low and increasing if allocations are high (small heap -> large heap) /cc @Maoni0

the initial gen0 budget is mostly based on LLC (last level cache) size. during startup when gen0 surv rate is very high the budget for gen0 will quickly get adjusted to be high. when the gen0 surv rate becomes much lower during the steady state the budget will be adjusted down to reflect that.

@benaadams
Member

the initial gen0 budget is mostly based on LLC size. during startup when gen0 surv rate is very high the budget for gen0 will quickly get adjusted to be high. when the gen0 surv rate becomes much lower during the steady state the budget will be adjusted down to reflect that.

Might be the extremely fast ramp up that exhibits this then? Pushing up the predicted budget, followed by no GC for a very long time while it fills the budget? I understand I'm not familiar enough, so I'm essentially completely guessing 😅; but if the budgets are re-evaluated only at the next GC, perhaps a timeout "pseudo" GC, triggered when the budget hasn't been hit for a while, could serve to adjust the budget down if the steady-state allocation rate flips to something massively lower?

e.g. from my various amortization issues (#30168, dotnet/aspnetcore#11940, etc.) the only allocations for our websocket server are the async Read boxes in the layering chain of reads; but the app will still end up using a lot of memory according to the OS (though the active managed heap is pretty minimal)

[image: allocation profile of the websocket server]

@stephentoub's "Make "async ValueTask/ValueTask" methods ammortized allocation-free" dotnet/coreclr#26310 (now removed 😢) reduced this to a single async-box allocation per read for the steady state (in WebSockets, where it does an .AsTask()); but the memory would still, very slowly, inch up to a high level before it triggered a GC (the startup ramp-up was very fast, large, and nearly 100% survived)

I haven't rerun this test on the latest runtime; so perhaps I should also do that and find out what it stops at.

@benaadams
Member

perhaps a timeout "pseudo" GC if ....

I could probably also do a GC.Collect() on a N minute timer to test this theory 🤔
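A sketch of that experiment (hypothetical test code; the names here are invented):

```csharp
using System;
using System.Threading;

static class PeriodicGc
{
    // Force a blocking collection every 'period' so a stale, oversized
    // budget gets re-evaluated even when allocations never hit it.
    public static Timer Start(TimeSpan period) =>
        new Timer(_ =>
        {
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: false);
            Console.WriteLine($"heap after forced GC: {GC.GetTotalMemory(false):N0} bytes");
        }, null, period, period);
}
```

e.g. hold `using var timer = PeriodicGc.Start(TimeSpan.FromMinutes(5));` for the duration of the run and watch whether the working set stops climbing.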

@davidfowl
Member Author

but the app will still end up using a lot of memory according to the OS (though the active managed heap is pretty minimal)

There's a new counter in .NET 6 dotnet/diagnostics#2139 that might help track down where the rest of the memory is going.
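In the meantime, part of the breakdown is visible in-process: GC.GetGCMemoryInfo (available since .NET 5) reports what the GC has committed, and the gap between that and the OS working set is native memory, code, stacks, etc. A small sketch (class and method names invented here):

```csharp
using System;

static class MemoryBreakdown
{
    // The three numbers being compared in this thread.
    public static (long GcCommitted, long GcHeap, long WorkingSet) Snapshot()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        return (info.TotalCommittedBytes,                      // memory the GC owns
                GC.GetTotalMemory(forceFullCollection: false), // objects on the heap
                Environment.WorkingSet);                       // what the OS reports
    }

    static void Main()
    {
        var (committed, heap, workingSet) = Snapshot();
        Console.WriteLine($"GC committed: {committed:N0} B, GC heap: {heap:N0} B, working set: {workingSet:N0} B");
    }
}
```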

@Maoni0
Member

Maoni0 commented Apr 9, 2021

Might be the extremely fast ramp up that exhibits this then? Pushing up the predicted budget; followed by no GC for a very long time where it fills the budget?

if the budget was adjusted very high, yes. it'll need to wait till the next GC to observe that your alloc behavior has changed. in workstation GC we do check more often (but then again WKS GC has a much smaller max budget). we should probably check this for Server GC as well but in general this hasn't been a problem (happy to discuss if it's a problem for you).

@Maoni0
Member

Maoni0 commented Apr 9, 2021

of course you can also collect a trace to verify this theory - GC informational events will tell you the budget.

@davidfowl
Member Author

davidfowl commented Apr 9, 2021

Saying that, I haven't experienced any performance issues from this large heap collection; it's more of a perception issue from looking at what the OS reports the app as using, so perhaps it's more of a feel-good factor? 😅

It's more than a feel good factor, it's also why you can't tell how much memory you need to run. This shows up when people try to set container limits.

I do think there's a balance to be struck here and that's basically what's being discussed. Easy allocations don't really need a benchmark to remove. We removed extra closure, string, and delegate allocations with small changes and new compiler features (like static lambdas). We also agree that pooling large objects is beneficial because they are actually expensive in the GC, so reusing them helps. So now we're trying to find the middle ground: how much do we contribute to gen0 as a framework/set of libraries? It's sometimes obvious to do this if there's a throughput benefit, but we rarely mention the memory benefit because it's somewhat out of the library's hands (you just run the GC more often, right?).
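The static-lambda point can be shown in a few lines: the capturing version allocates a closure object and a delegate per call, while a 'static' lambda (C# 9) forbids capture, so the compiler can cache a single delegate (illustrative sketch; names invented here):

```csharp
using System;

static class LambdaAllocs
{
    static int Sum(int[] values, Func<int, int> selector)
    {
        int total = 0;
        foreach (var v in values) total += selector(v);
        return total;
    }

    public static int Run(int factor)
    {
        var data = new[] { 1, 2, 3 };

        // Captures 'factor': a closure object plus a delegate are
        // allocated on every call to Run.
        int capturing = Sum(data, x => x * factor);

        // 'static' lambda: capturing is a compile error, and the
        // non-capturing delegate is cached, so no per-call allocation.
        int nonCapturing = Sum(data, static x => x * 2);

        return capturing + nonCapturing;
    }
}
```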

PS: I'm spending some time trying to better understand the overhead and impact of many object specialized pools in the system:

[image: list of specialized object pools]

@benaadams
Member

benaadams commented Apr 9, 2021

An example: in Kestrel/samples/PlaintextApp with the addition of TLS, there are essentially 3 allocations after start-up and the initial establishment of 256 connections: the async boxes for StreamPipeReader+ReadAsync, SslStream+ReadAsyncInternal, and DuplexPipeStream+ReadAsyncInternal (all of which should be able to be pooled?)

[image: allocation profile showing the three async boxes]

Steady state (while doing 72k req/s) is 8.497 MB after GC(?), but Task Manager is reporting a working set of 90MB and stays there (10 min run)

[image: Task Manager showing the ~90MB working set]

of course you can also collect a trace to verify this theory - GC informational events will tell you the budget.

Trace: Plaintext.etl

dotnet-counters shows

[Microsoft.AspNetCore.Hosting]
    Current Requests                                               0
    Failed Requests                                                0
    Request Rate (Count / 1 sec)                              72,490
    Total Requests                                         4,072,134
[System.Runtime]
    % Time in GC since last GC (%)                                 0
    Allocation Rate (B / 1 sec)                           44,066,696
    CPU Usage (%)                                                 45
    Exception Count (Count / 1 sec)                                0
    GC Committed Bytes (B)                                46,350,336
    GC Fragmentation (%)                                           0.442
    GC Heap Size (MB)                                             19
    Gen 0 GC Count (Count / 1 sec)                                 2
    Gen 0 Size (B)                                           323,432
    Gen 1 GC Count (Count / 1 sec)                                 0
    Gen 1 Size (B)                                         4,635,744
    Gen 2 GC Count (Count / 1 sec)                                 0
    Gen 2 Size (B)                                         3,735,864
    IL Bytes Jitted (B)                                      232,353
    LOH Size (B)                                             131,272
    Monitor Lock Contention Count (Count / 1 sec)                  2
    Number of Active Timers                                        0
    Number of Assemblies Loaded                                   69
    Number of Methods Jitted                                   3,291
    POH (Pinned Object Heap) Size (B)                      1,457,352
    ThreadPool Completed Work Item Count (Count / 1 sec)     136,295
    ThreadPool Queue Length                                       21
    ThreadPool Thread Count                                       15
    Working Set (MB)                                              91

So GC Committed Bytes is about the same as the allocation rate of ~45MB/sec; the working set is about double that, at ~91MB.

GC Heap Size is much lower at 19MB; it's doing 2 Gen0/sec (which, to be fair, seems a reasonable rate, so maybe what the GC is doing is fine); however this is all to clean up 3 async read objects, of which only a max of 768 can exist at any time (3 x 256); so a 91MB working set just presents as unexpectedly high memory use when there's nothing particularly active at any given time?

Pooling these 3 async boxes in this example might be a drop in the ocean for another app's allocations, as it will be something else; however that app will still have these allocations.

However it currently presents as: the simplest Plaintext app, with only 256 connections, needs 90MB to run, which seems high as the baseline for the most minimal ASP.NET Core app with TLS?

@benaadams
Member

which to be fair seems a reasonable rate so maybe what the GC is doing is fine

i.e. 2 GCs/sec seems nice; and perhaps there is a tweak so its budget doesn't go so high at the very start:

[image: GC budget spike at startup]

However, mostly, it is actually allocating 45MB/sec, and I'd prefer to not allocate that 45MB, since it's an option, rather than increase the GC rate (if that makes sense?)

@Maoni0
Member

Maoni0 commented Apr 9, 2021

@benaadams it was high because the survival rate was very high, and soon after the budget got adjusted down, so that seems exactly the right thing to do. I'm curious to know why you wouldn't want it to go so high? you get fewer GCs that way when GC knows it's not going to be productive to do a collection.

GC Heap Size is much lower at 19MB;

I think we should seriously rename all these counters to add an "after GC" at the end of their names... because this does not include the gen0 you are about to allocate. would that help with clarifying why there's a discrepancy?

@benaadams
Member

benaadams commented Apr 10, 2021

it was high because the survival rate was very high, and soon after the budget got adjusted down, so that seems exactly the right thing to do. I'm curious to know why you wouldn't want it to go so high?

I don't particularly have an issue with what the GC is doing; the app is allocating at a rate and the GC is running twice a second for that allocation rate; I'm not saying I want the GC to run more frequently 😅

It's the async scenario that bothers me. If these were changed to sync reads rather than async reads, the steady-state allocation rate would become zero (and likely the working set would be smaller).

Moving to async, the allocation rate moves to 45MB/s purely for these 3 async boxes; and, as an app developer, since they are embedded in the framework, they are allocations nothing can be done about; but it's very clear what they are:

[image: allocation profile of the three async read boxes]

To me it just seems that if there is a clear, demonstrable, and straightforward opportunity to bring the allocation rate and working set down, assuming there aren't significant downsides, that would be a better direction to favour?

@davidfowl
Member Author

I have a different take on this. In the above case, the GC can keep up with the allocation rate, things are great, and CPU time spent in the GC is low. The problem is, there's a rate at which the GC can't keep up; CPU time inevitably rises, and then we look at how to reduce allocations to reduce time spent in the GC. If the libraries you use allocate into gen0, they end up setting the floor for your allocation rate. At this stage you have 2 options:

  • Tune the GC budgets to run the GC less, resulting in using more memory
  • Start pooling (or avoiding allocations) to avoid contributing to the allocation rate

Does this sound correct or am I off base?

@benaadams
Member

Tune the GC budgets to run the GC less, resulting in using more memory

So for less total memory use, the GC can run more frequently (and there are constraint settings available for that), but that means allocations will prematurely age up, making a Gen2 collection more likely; or, as you say, it can use much more memory with less frequent GCs, meaning the lifetime of the objects has likely passed, making a Gen2 collection less likely.

However it's better to avoid allocations where we can: to give the GC more headroom to deal with the allocations that we can't easily avoid, and to give user code a bigger goldilocks zone where they aren't worrying about how much they allocate.

Start pooling (or avoiding allocations) to avoid contributing to the allocation rate

This is the crux of the issue for user code. For example, if it's LINQ use, they can likely tweak it to not use LINQ without much of an issue; and if it's something infrequent and not in the hot path, it doesn't matter too much.

However, the above examples, which happen for every read of data, are both very much in the hot path; and the cost for user code to avoid them is extremely high (reimplementing their own TLS layer as per Drawaes/Leto, or reimplementing WebSockets as per StackExchange/NetGain), essentially ditching the framework stack; these are major investments with extra security concerns.

Moving away from the simple Plaintext app; this is our "In Space" WebSocket server showing allocations between two GCs on current .NET 6.0:

[image: allocations between two GCs]

99.9% of the allocations are the main-path .NET/ASP.NET Core APIs and not our app's; as far as I'm aware there is nothing we can do, short of ditching the framework, to reduce these allocations? It's essentially the baseline allocation rate we cannot go under.

@benaadams
Member

Added change to fix the byte[] allocations in the above #51060

@ManickaP ManickaP added this to the Future milestone Apr 22, 2021
@ManickaP ManickaP removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021
@karelz karelz added the enhancement Product code improvement that does NOT require public API changes/additions label Apr 22, 2021
@ManickaP
Member

Triage: part of larger effort, needs measurable wins.

@davidfowl
Member Author

davidfowl commented Apr 22, 2021

Networking is a high-performance part of the stack that is used in many scenarios. I'd expect us to do this in WebSockets and SslStream, as they are in the hot path of apps that use them.

If it doesn't fit there I'd have a hard time imagining where it does fit.

@davidfowl
Member Author

@stephentoub is the feature (the pooling) in yet for C# 10?

@stephentoub
Member

is the feature (the pooling) in yet for C# 10?

The compiler support for async method builder overrides is not in yet.
cc: @jcouv

@stephentoub stephentoub self-assigned this Jul 23, 2021
@stephentoub stephentoub modified the milestones: Future, 6.0.0 Jul 23, 2021
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 11, 2021
@davidfowl
Member Author

SslStream next! 😁

@stephentoub
Member

Not in 6.

@davidfowl
Member Author

davidfowl commented Aug 11, 2021

Yea, I didn't expect any of it in 6, so I'm thankful for what we got so far

@ghost ghost locked as resolved and limited conversation to collaborators Sep 10, 2021
7 participants