This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Make "async ValueTask/ValueTask<T>" methods ammortized allocation-free #26310

Merged (1 commit) on Oct 24, 2019

Conversation

stephentoub
Member

@stephentoub stephentoub commented Aug 22, 2019

Today async ValueTask/ValueTask<T> methods use builders that special-case the synchronously completing case (to just return a default(ValueTask) or new ValueTask<T>(result)) but defer to the equivalent of async Task/Task<T> for when the operation completes asynchronously. This, however, doesn't take advantage of the fact that value tasks can wrap arbitrary IValueTaskSource/IValueTaskSource<T> implementations.

I've had this sitting on the shelf for a while, but finally cleaned it up. The first three commits here are just moving around existing code. The last two commits are the meat of this change. This changes AsyncValueTaskMethodBuilder and AsyncValueTaskMethodBuilder<T> to use pooled IValueTaskSource/IValueTaskSource<T> instances, such that calls to an async ValueTask/ValueTask<T> method incur 0 allocations as long as there's a cached object available.
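For readers less familiar with that interface, the rough shape of a reusable IValueTaskSource<T> can be sketched with the public ManualResetValueTaskSourceCore<T> helper; this is an illustrative sketch with invented names, not the PR's actual implementation:

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Illustrative only: a reusable IValueTaskSource<T> built on the public
// ManualResetValueTaskSourceCore<T> helper, roughly the shape of the pooled
// "state machine box" this PR adds (names invented; not the PR's code).
sealed class ReusableValueTaskSource<T> : IValueTaskSource<T>
{
    private ManualResetValueTaskSourceCore<T> _core; // mutable struct, so not readonly

    // Hand out a ValueTask tied to the current version token; a stale token
    // from an earlier use will be rejected by the core.
    public ValueTask<T> Task => new ValueTask<T>(this, _core.Version);

    public void SetResult(T result) => _core.SetResult(result);
    public void SetException(Exception error) => _core.SetException(error);

    T IValueTaskSource<T>.GetResult(short token)
    {
        try
        {
            return _core.GetResult(token);
        }
        finally
        {
            _core.Reset(); // bump the version; the instance can now be pooled and reused
        }
    }

    ValueTaskSourceStatus IValueTaskSource<T>.GetStatus(short token) => _core.GetStatus(token);

    void IValueTaskSource<T>.OnCompleted(Action<object?> continuation, object? state,
        short token, ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);
}
```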

I've marked this as a draft and work-in-progress for a few reasons:

  1. There's a breaking change here, in that while we say/document that ValueTask/ValueTask<T>s are more limited in what they can be used for, nothing in the implementation actually stops a ValueTask that was wrapping a Task from being used as permissively as Task, e.g. if you were to await such a ValueTask multiple times, it would happen to work, even though we say "never do that" (see the sketch after this list). This change means that anyone who was relying on such undocumented behaviors will now be broken. I think this is a reasonable thing to do in a major release, but I also want feedback and a lot of runway on it. There's the potential to make it opt-in (or opt-out) as well, but that will also non-trivially complicate the implementation.
  2. Policy around pooling. This is always a difficult thing to tune. Right now I've chosen a policy that limits the number of pooled objects per state machine type to an arbitrary multiple of the processor count, and that tries to strike a balance between contention and garbage by using a spin lock and if there's any contention while trying to get or return a pooled object, the cache is ignored. We will need to think hard about what policy to use here. It's also possible it could be tuned per state machine, e.g. by having an attribute that's looked up via reflection when creating the cache for a state machine, but that'll add a lot of first-access overhead to any async ValueTask/ValueTask<T> method.
  3. Perf validation.
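
To make the breaking change in (1) concrete, here is a hypothetical consumer (names invented) showing the unsupported double-await pattern that happened to work when the ValueTask wrapped a Task but can fail once the backing object is a pooled IValueTaskSource:

```csharp
using System;
using System.Threading.Tasks;

static class DoubleAwaitExample
{
    // Hypothetical consumer showing the unsupported pattern this change breaks:
    // awaiting the same ValueTask more than once.
    static async Task ConsumeTwiceAsync(Func<ValueTask<int>> produceAsync)
    {
        ValueTask<int> vt = produceAsync();
        int first = await vt;   // OK: a ValueTask may be consumed exactly once
        int second = await vt;  // Unsupported: happened to work when backed by a Task,
                                // but may now observe a recycled pooled source and throw.
        Console.WriteLine(first + second);
    }
}
```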

cc: @kouvel, @tarekgh, @benaadams

@jkotas
Member

jkotas commented Aug 22, 2019

What do we expect the performance gains from this to be - for primary perf metrics like requests per second for something real looking? (I understand that this will reduce the number of GCs by moving the cost elsewhere.)

@stephentoub
Member Author

What do we expect the performance gains from this to be - for primary perf metrics like requests per second for something real looking?

Don't know yet. I'm hoping by putting this up as a draft folks will try it out :) I'm also going to try it on a few workloads. I won't be merging it until it proves valuable.

@benaadams
Member

Will give it a whirl :)

@benaadams
Member

Isn't very happy

Unhandled exception. System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.OnCompleted(Action`1 continuation, Object state, Int16 token, ValueTaskSourceOnCompletedFlags flags) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\Tasks\Sources\ManualResetValueTaskSourceCore.cs:line 163
   at System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1.ValueTaskStateMachineBox.OnCompleted(Action`1 continuation, Object state, Int16 token, ValueTaskSourceOnCompletedFlags flags) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\AsyncValueTaskMethodBuilderT.cs:line 309
   at System.Runtime.CompilerServices.ValueTaskAwaiter`1.System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(IAsyncStateMachineBox box) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\ValueTaskAwaiter.cs:line 188
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\AsyncTaskMethodBuilderT.cs:line 88
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\Tasks\Task.cs:line 1910
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute() in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\ThreadPool.cs:line 880
   at System.Threading.ThreadPoolWorkQueue.Dispatch() in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\ThreadPool.cs:line 700

@stephentoub
Member Author

Isn't very happy

What is this running?

@benaadams
Member

Isn't very happy

What is this running?

Just trying to debug into it as the stack trace clearly isn't very helpful :)

@halter73
Member

e.g. if you were to await such a ValueTask multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken.

I say it's better to make this change sooner rather than later. I like that this change might teach people the hard way not to await the same ValueTask multiple times, since it will no longer be just Pipes and other IValueTaskSource-based ValueTasks that break.

Do you expect there will be any common workloads where pooling the boxes hurts perf?

@stephentoub
Member Author

Do you expect there will be any common workloads where pooling the boxes hurts perf?

It's hard to say.

Using pooling is based on the assumption that the overheads associated with pooling (e.g. synchronization to access the pool, any cache effects that come from using the pool, etc.) are less than the overheads associated with allocation+GC. That is not always the case, but it's also hard to predict.

There's also the fact that this is changing the underlying implementation, so it's not just pooling vs non-pooling, but ValueTask wrapping Task vs wrapping IValueTaskSource, different code paths being taken internally, different synchronization used to hook up a continuation, etc. Some of that is just tuning, but some of it will be trade-offs as well. In theory this approach could have less overhead even if pooling weren't used at all (just using a custom IValueTaskSource) because it doesn't need to accommodate things like multiple continuations, but we're also talking minutiae at that point.

In the end, I'll want us to vet this on real workloads that demonstrate real value before moving ahead with it.
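
For illustration, the pooling policy described in the opening post (a per-state-machine-type cache capped at a multiple of the processor count, with any lock contention causing the cache to be bypassed) might look roughly like this sketch; all names here are invented:

```csharp
using System;
using System.Threading;

// Rough sketch (invented names) of the policy described in the opening post:
// a small per-state-machine-type cache, capped at a multiple of the processor
// count, guarded by a spin lock that is never waited on; any contention
// simply bypasses the cache.
static class BoxCache<TBox> where TBox : class, new()
{
    private static readonly TBox[] s_cache = new TBox[Environment.ProcessorCount * 4];
    private static int s_lock; // 0 = free, 1 = held

    public static TBox Rent()
    {
        if (Interlocked.CompareExchange(ref s_lock, 1, 0) == 0) // try once; skip on contention
        {
            for (int i = 0; i < s_cache.Length; i++)
            {
                TBox box = s_cache[i];
                if (box != null)
                {
                    s_cache[i] = null;
                    Volatile.Write(ref s_lock, 0);
                    return box;
                }
            }
            Volatile.Write(ref s_lock, 0);
        }
        return new TBox(); // cache empty or contended: fall back to allocating
    }

    public static void Return(TBox box)
    {
        if (Interlocked.CompareExchange(ref s_lock, 1, 0) == 0) // try once; drop on contention
        {
            for (int i = 0; i < s_cache.Length; i++)
            {
                if (s_cache[i] == null)
                {
                    s_cache[i] = box;
                    break;
                }
            }
            Volatile.Write(ref s_lock, 0);
        }
        // If the lock was contended or the cache was full, the box is simply
        // dropped and reclaimed by the GC.
    }
}
```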

@mburbea

mburbea commented Aug 22, 2019

@mgravell has a pretty interesting library that does something very similar to this
https://github.com/mgravell/PooledAwait

@benaadams
Member

Revisiting https://github.com/dotnet/coreclr/issues/21612 again 😉

Since ValueTaskStateMachineBox is a subtype would just calling it StateMachineBox work (as is done for many Enumerators)?

So rather than
AsyncValueTaskMethodBuilder`1.ValueTaskStateMachineBox.OnCompleted
it would be
AsyncValueTaskMethodBuilder`1.StateMachineBox.OnCompleted

However that's not too important 😄

@benaadams
Member

benaadams commented Aug 22, 2019

Issue I'm having I think is in websockets; it's specifically commented about there: https://github.com/dotnet/corefx/blob/ba9f27788bd99a29e5d832531f0ee86d8eae4959/src/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.netcoreapp.cs#L30-L31

// WARNING: This code is only valid because ReceiveAsyncPrivate returns a ValueTask that wraps a T or a Task.
// If that ever changes where ReceiveAsyncPrivate could wrap an IValueTaskSource, this must also change.

None of the other ValueTasks seem to be having issues

@stephentoub
Member Author

issue I'm having I think is in websockets

Yup, I wrote both that code and comment :) Shame on me. Will change it.

@mgravell
Member

mgravell commented Aug 22, 2019

@mburbea I would be genuinely delighted if I could nuke that lib in its entirety because the feature has been implemented in the core lib by folks who know better than me what they're doing.

@benaadams
Member

256 websocket connections ~30sec

Pre
image

Post
image

Might still want to do "Amortize WebSocket.EnsureBufferContainsAsync" dotnet/corefx#39455; not sure why it's not really affected by this change. However, it does address the harder-to-amortize async methods.

Will do some benchmarks

@stephentoub
Member Author

not sure why it's not really affected by this change

Oh, simpler explanation:

private async Task EnsureBufferContainsAsync

;-)
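
In other words, the pooling only applies to methods declared as async ValueTask/ValueTask<T>; a minimal sketch of the difference (class, method, and parameter names invented, not the actual ManagedWebSocket code):

```csharp
using System.Threading.Tasks;

class BufferedReceiver
{
    // Declared as async Task: stays on the Task-based builder, so this PR's
    // pooling never applies to it (matching what was observed above).
    private async Task EnsureBufferContainsTaskAsync(int minimumBytes)
    {
        await Task.Yield(); // stand-in for the real fill-the-buffer loop
    }

    // The same body declared as async ValueTask uses AsyncValueTaskMethodBuilder,
    // which is what the pooled boxes in this PR amortize.
    private async ValueTask EnsureBufferContainsValueTaskAsync(int minimumBytes)
    {
        await Task.Yield(); // stand-in for the real fill-the-buffer loop
    }
}
```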

@benaadams
Member

Yep, changing it entirely fixes it
image

@benaadams
Member

benaadams commented Aug 22, 2019

I'm liking this very much. Even with its low(ish) pool limit the allocation growth is low

200 websocket connections ~30sec

image

1000 websocket connections ~30sec

image

And the cost for renting and returning boxes seems fairly minimal

@stephentoub
Member Author

I'm liking this very much

Thanks. Does it have a measurable impact on your throughput?

@benaadams
Member

benaadams commented Aug 22, 2019

The change in before and after allocations is huge 😄

1000 websocket connections ~30sec

Pre: 1.36M objects allocated (245MB)

image

Post: 209K objects allocated (20MB)

image

@benaadams
Member

Does it have a measurable impact on your throughput?

Yes(ish); unfortunately (or fortunately), it moves it into a realm beyond the maximums I can load-test accurately with my current setup, as I can't generate enough load. It's also highlighted some other hot spots in our own code.

While I can't currently quantify that (though I will look into it more), more interestingly it also means that for the same number of GCs and RAM usage I can pack the containers at least 4x tighter when they aren't under maximum load.

@benaadams
Member

benaadams commented Aug 23, 2019

The overhead of GetOrCapture is less than capturing the ExecutionContext

image

Well, some are more than others, but it's a fairly insignificant cost

image

ReturnBox is a little more; but fairly small

image

@benaadams
Member

@mgravell are you in a position to give this a try for your workloads? Note it only applies to async ValueTask and async ValueTask<T> (not async Task)


@jkotas
Member

jkotas commented Oct 21, 2019

For microbenchmarks, the numbers look like what I'd hope for

Are these single-threaded microbenchmarks a good proxy for what happens in the real workloads?

I have whipped up a quick multi-threaded one: https://gist.github.com/jkotas/776a7609f8beaaa639230caa76c73813. The results I am getting:

- ~5.5% throughput regression when the cache is off
- ~2% throughput regression and more volatile performance when the cache is on
Full Results

```
D:\coreclr\bin\Product>set COMPlus_gcServer=1

D:\coreclr\bin\Product>baseline\CoreRun.exe stress.exe
12844 Average: 12844
12093 Average: 12468
11969 Average: 12302
11969 Average: 12218
12031 Average: 12181
11922 Average: 12138
12016 Average: 12120
12062 Average: 12113
12016 Average: 12102
12203 Average: 12112
^C
D:\coreclr\bin\Product>new\CoreRun.exe stress.exe
12453 Average: 12453
13172 Average: 12812
13250 Average: 12958
13250 Average: 13031
12875 Average: 13000
12594 Average: 12932
12640 Average: 12890
12610 Average: 12855
12437 Average: 12809
12578 Average: 12785
12625 Average: 12771
^C
D:\coreclr\bin\Product>set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=1

D:\coreclr\bin\Product>new\CoreRun.exe stress.exe
13063 Average: 13063
12843 Average: 12953
12563 Average: 12823
11906 Average: 12593
11735 Average: 12422
11625 Average: 12289
12593 Average: 12332
12657 Average: 12373
^C

```

@jkotas
Member

jkotas commented Oct 21, 2019

Similar caches were a source of trouble in the unmanaged threadpool in the past. We ended up with per-processor caches, but they are still sub-optimal. The small memory blocks lose their affinity to the right core over time, and so one ends up with false sharing or cross-NUMA-node memory access. GC-allocated memory avoids these issues.

Having said this, I am fine with experimenting with this to see whether somebody can demonstrate measurable throughput improvement thanks to the caching.

@jkotas
Member

jkotas commented Oct 21, 2019

whether somebody can demonstrate measurable throughput improvement thanks to the caching

If there is a workload that shows measurable throughput improvement, I would like the GC team to take a look at the GC traces to validate that it is not hitting a GC tuning bug.

@jkotas
Member

jkotas commented Oct 21, 2019

Another interesting metric to look at is code size. Benchmark: https://gist.githubusercontent.com/jkotas/c3913ac6a98b3474b888926aa9066dd6/raw/95c548546d3e7d183e028348a96a50181c1546ed/manytasks.cs. Native code size bytes from perfview JIT stats:

baseline: 2,603,033 bytes
new / off: 2,659,316 bytes (2% regression)
new / on: 2,810,236 bytes (8% regression)

@stephentoub
Member Author

stephentoub commented Oct 21, 2019

~5.5% throughput regression when the cache is off

That case should be unaffected by parallelism. If you see opportunities in this PR to address that, I'd love to hear about it.

I also put your example into BenchmarkDotNet form:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System;
using System.Threading.Tasks;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    [Benchmark]
    public void Parallel()
    {
        var tasks = new Task[Environment.ProcessorCount * 4];
        for (int i = 0; i < tasks.Length; i++) tasks[i] = Work();
        Task.WaitAll(tasks);
    }

    static async ValueTask<int> Step(int i)
    {
        if (i >= 999) await Task.Delay(1);
        else await Task.Yield();
        return i + 1;
    }

    static async Task Work()
    {
        var r = new Random();
        for (int i = 0; i < 100000; i++)
        {
            await Step(r.Next(1000));
        }
    }
}
```

and I get:

| Method   | Toolchain        | Mean    | Error    | StdDev   | Ratio | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|---------|----------|----------|-------|------------|-----------|-------|-----------|
| Parallel | \new\corerun.exe | 1.685 s | 0.0161 s | 0.0150 s | 1.00  | 99000.0000 | 2000.0000 | -     | 586.7 MB  |
| Parallel | \old\corerun.exe | 1.692 s | 0.0218 s | 0.0204 s | 1.00  | 99000.0000 | 2000.0000 | -     | 586.7 MB  |

Maybe there's a machine-specific aspect to it? (There's also both randomness and "saturate the machine" parallelism here, so there's going to be variability.)

Note that the goal is for there to be 0 perf impact on an example like yours when the cache is off: it's meant to use exactly the same infrastructure as is used today with as little ceremony as we can muster to enable switching on the cache optionally, though there are a few tweaks to try to improve ValueTask perf even in the case where the cache is off.

Also note that a large portion of the time spent in your example is in the Task.Delay(1), which on Windows is going to incur an ~15ms delay. If I delete that part, I get:

| Method   | Toolchain        | Mean     | Error   | StdDev  | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|----------|---------|---------|-------|---------|------------|-----------|-------|-----------|
| Parallel | \new\corerun.exe | 541.3 ms | 8.09 ms | 7.57 ms | 0.96  | 0.02    | 92000.0000 | 2000.0000 | -     | 549.34 MB |
| Parallel | \old\corerun.exe | 562.8 ms | 4.99 ms | 4.67 ms | 1.00  | 0.00    | 92000.0000 | 2000.0000 | -     | 549.34 MB |

~2% throughput regression and more volatile performance when the cache is on

My machine isn't as quiet as I'd like for a parallelism test, but I just ran it three times (without the Task.Delay, which dominates) and got:

| Method   | Toolchain        | Mean     | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|----------|----------|----------|-------|---------|------------|-----------|-------|-----------|
| Parallel | \new\corerun.exe | 538.8 ms | 10.68 ms | 14.26 ms | 0.97  | 0.03    | 46000.0000 | 1000.0000 | -     | 275.13 MB |
| Parallel | \old\corerun.exe | 556.5 ms | 8.63 ms  | 8.08 ms  | 1.00  | 0.00    | 92000.0000 | 2000.0000 | -     | 549.34 MB |

| Method   | Toolchain        | Mean     | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|----------|----------|----------|-------|---------|------------|-----------|-------|-----------|
| Parallel | \new\corerun.exe | 547.6 ms | 10.70 ms | 16.97 ms | 0.95  | 0.02    | 47000.0000 | 1000.0000 | -     | 280.52 MB |
| Parallel | \old\corerun.exe | 571.8 ms | 2.94 ms  | 2.75 ms  | 1.00  | 0.00    | 92000.0000 | 2000.0000 | -     | 549.34 MB |

| Method   | Toolchain        | Mean     | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|----------|----------|----------|-------|---------|------------|-----------|-------|-----------|
| Parallel | \new\corerun.exe | 547.6 ms | 10.83 ms | 22.13 ms | 0.98  | 0.03    | 49000.0000 | 1000.0000 | -     | 296.21 MB |
| Parallel | \old\corerun.exe | 556.9 ms | 5.75 ms  | 5.38 ms  | 1.00  | 0.00    | 92000.0000 | 2000.0000 | -     | 549.34 MB |

Putting the Task.Delay back in I again saw basically no difference (in throughput... obviously the allocation is much different):

| Method   | Toolchain        | Mean    | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|---------|----------|----------|-------|---------|------------|-----------|-------|-----------|
| Parallel | \new\corerun.exe | 1.922 s | 0.0383 s | 0.0790 s | 1.00  | 0.05    | 43000.0000 | 2000.0000 | -     | 257.68 MB |
| Parallel | \old\corerun.exe | 1.924 s | 0.0379 s | 0.0729 s | 1.00  | 0.00    | 99000.0000 | 3000.0000 | -     | 586.73 MB |

new / off: 2,659,316 bytes (2% regression)

There's a layer of helpers this is going through in order to enable the opt-in switch. If we were to switch to an always-on model, I'd expect this to go away.

new / on: 2,810,236 bytes (8% regression)

Every async ValueTask or async ValueTask<T> method is now referencing two different sets of builders, and every async ValueTask{<T>}'s builder now has the cache stuff as well. If we were to switch to an always-on model, I'd hope a good chunk of this would go away.

There's another option as well. Right now the switch flips between doing exactly what we do today (async completions are backed by Task instances) and using the cache (async completions are backed by pooled IValueTaskSource instances). I could also make it so that the switch only controls whether the cache is used, i.e. always use the IValueTaskSource instances, but if the switch is off, behave as if no instances could ever be retrieved from the cache (and don't try to return anything to it). The upside of that is async ValueTask{<T>} methods wouldn't pull in as much as they do from Task<T>, since they don't need to utilize it. The downside is the visible there-but-not-supported behaviors that Task ends up surfacing through ValueTask (e.g. .GetAwaiter().GetResult() working on incomplete ValueTasks even though we say that's not allowed) wouldn't be configurable with the switch.

Having said this, I am fine with experimenting with this to see whether somebody can demonstrate measurable throughput improvement thanks to the caching.

I'm having trouble rationalizing that with your stats and comments about GC tuning, as it sounds like your opinion is that a) this shouldn't be necessary, and b) if it is it's because of a GC perf bug that should be fixed... and then see (a). 😉

Is your preference that I just close this PR?

@jkotas
Member

jkotas commented Oct 21, 2019

The occasional Task.Delay is important in my benchmarks because it triggers the thread movement from one core to a different core that happens in real workloads. There are other ways to achieve the same thing. I have not seen significant time spent in the Task.Delay on my dev box (32 hyperthreading / 16 physical core Xeon E5). Most of the time is spent in the threadpool enqueue. Commenting out the Task.Delay changes the overall performance by a few percent at most on my machine.

I know there is a lot of interest in building caches like these because of a single-minded focus on GC allocations. I would like to see more data on whether they are worth it, which is why I have said that I am fine with experimenting with this. The interest in building caches like these won't go away until we have that data.

Personally, I believe that this cache should not be needed as long as the GC is working properly (and the cache likely hurts performance on average in real workloads). But I do not have data to prove it.

@benaadams
Member

In netcoreapp3.0 AsyncStateMachineBoxes are the only real source of allocations in our app:

image

For a very light load (0% CPU), they allocate 127MB every 1.5 minutes and cause a GC

image

With the app stabilising around 170MB working set, 420MB Commit size (though still creeping up slowly)

image

Will report back changes with this PR

@benaadams
Member

Distracted by trying to find out what all the native allocations are

@benaadams
Member

benaadams commented Oct 22, 2019

Raised a question on the 12MB of JIT allocations hanging around: https://github.com/dotnet/coreclr/issues/27357

Later in the execution it looks like syncblocks and threads (and thread statics for those threads)

image

image

```csharp
private static int GetPoolAsyncValueTasksLimitValue() =>
    int.TryParse(Environment.GetEnvironmentVariable("DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT"), out int result) && result > 0 ?
        result :
        Environment.ProcessorCount * 4; // arbitrary default value
```
Member

@benaadams Oct 22, 2019


Seems a bit low.
~~Suggestion: Environment.ProcessorCount * 32; // arbitrary default value~~

Member


Nvm, have same allocation profile with 4 and 32.

@benaadams
Member

Roughly same period with same load

image

image

97MB working set, 355MB Commit size, though it keeps going up (likely due to the ValueTaskSourceAsTask in WebSocket).

image

@benaadams
Member

benaadams commented Oct 22, 2019

High load (~1.5 mins 4 cores 8 HT)

set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=0

image

image

set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=1
set DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT=256

image

image

@benaadams
Member

Assuming the ValueTask+ValueTaskSourceAsTask<ValueWebSocketReceiveResult>* in WebSocket can be avoided, perhaps by inverting the dependency of Receive and Close (and making Close a Tcs or similar)... (@stephentoub, could that work?)

With this change off, running under load I quickly hit 1GB working set (less than 30 secs):

image

after which it doesn't increase but will start cycling through GCs. Can cap it lower with the new options in .NET 3.0+ but that will mean it GCs faster.

With DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS=1 I'm somewhere very different:

image

Both have around ~275MB in actual use

image

But the original has 1GB allocated to the program and is incurring continuous GCs; the other (aside from the point at the top*) only has 291MB allocated to the program and has no GCs; and yet both are doing heavy async network work.

The big win here is the ability not just to reduce GC allocations but to move to a zero-alloc, pauseless, stable state while using the standard framework methods, where the GC never needs to run and the working set remains small. One aspect of this is that it allows tighter packing of processes/containers etc.

Another aspect: like intrinsics, it opens up a new set of problem spaces for .NET Core to be applicable to (not disrespecting the amazing work the GC does), as there are areas where people reject the idea of a GC needing to periodically run.

@jkotas
Member

jkotas commented Oct 22, 2019

Are you able to measure the peak RPS that your server can handle? If yes, do the extra GCs make a measurable impact on the peak RPS (GC limited to 300MB vs. this thing)?

Environment.ProcessorCount * 32

This looks like a very large limit. It means you can easily have 50kB - 100kB in caches for each ValueTask used in your program.

people reject the idea of a GC needing to periodically run.

These people should use Rust. .NET is not going to be a success story for them.

@GSPP

GSPP commented Oct 22, 2019

@benaadams what kind of pause targets do your workloads have? How rare do pauses have to be? Essentially zero?

Games have a lot of trouble with GC pauses as well. Attaining no pauses requires an awkward coding style. A 60Hz framerate means a 16ms budget per frame. I understand you need a small heap to survive a single G0 collection with this.

My main point: Maybe we could best solve this pause issue with low-pause GC features instead of running after every possible source of garbage. G2 pauses are (in my experience) very bearable these days because it's all background.

I have a production web site with a 1GB heap. I see 10ms G2 pauses but I see 100ms G0 and G1 pauses. So those have become more of a problem now. @Maoni0 said something similar in her recent talk.

@benaadams would you be satisfied if the GC was able to cap all pauses at say 10ms? Would there still be a need to limit their frequency?

@benaadams
Member

Environment.ProcessorCount * 32

This looks like a very large limit. It means you can easily have 50kB - 100kB in caches for each ValueTask used in your program.

Retested under high load; I end up with the same allocation profile with 4 or 32, so I retract my increase. For lower loads I end up with more allocations (maybe double), which is likely due to more async methods being simultaneously in a wait state, and which is why I originally suggested the increase. Sticking with 4 and allowing it to be configurable (as now) is probably the best of both worlds.

Games have a lot of trouble with GC pauses as well. Attaining no pauses requires an awkward coding style.

It's not very awkward anymore with .NET Core 3.0 and C# 8; certainly much less awkward than with Rust's borrow checker. The aim is not the purity of no allocations, which is indeed more problem than it's worth, but long sustained periods of no allocations; this is also the approach many games take, where when and where allocations happen is constrained and controlled.

This works quite well in .NET Core 3.0, except for async calls, which are intensely awkward; e.g. see the code changes in dotnet/corefx#39455 and dotnet/aspnetcore#11942, which are fairly straightforward async methods. All the methods in the BCL would need to be changed in this way for it to work, or would require re-implementation externally to make non-allocating versions.

Async is also problematic as the state machines are sat on for a while: they generally take some time and form long chains of state machines, so when one propagates to a higher GC generation, many do.

The advantage of this change is no special code is required by the implementer and it works for their methods too.

would you be satisfied if the GC was able to cap all pauses at say 10ms? Would there still be a need to limit their frequency?

That would definitely be a useful tool, much like TryStartNoGCRegion, but the GC work still needs to be done; and with I/O becoming heavily async-focused, it would be very useful if these otherwise unavoidable allocations could be avoided so the GC wasn't even involved for the most part (e.g. trying to maintain a 16ms/11ms tick rate while network I/O is ongoing).
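
For reference, a minimal sketch of the GC.TryStartNoGCRegion approach mentioned above (the budget and structure are illustrative assumptions):

```csharp
using System;
using System.Runtime;

static class LatencyCriticalWindow
{
    // Sketch: pre-commit an allocation budget for a latency-critical window
    // (e.g. one tick of a game loop or network pump), then end the region at a
    // convenient point. The 16 MB budget is an arbitrary illustration.
    public static void Run(Action tick)
    {
        bool noGc = GC.TryStartNoGCRegion(16 * 1024 * 1024);
        try
        {
            tick(); // no GC pauses occur here as long as the budget isn't exceeded
        }
        finally
        {
            // The region can end early (e.g. budget exceeded), so check before ending.
            if (noGc && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }
    }
}
```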

@benaadams
Member

If yes, do the extra GCs make measurable impact on the peak RPS (GC limited to 300MB vs. this thing)?

Will investigate further

This looks like a very large limit

Current limit looks ok

in caches for each ValueTask used in your program.

Could the caches be dropped on a GC, as per Gen2GcCallback, so infrequent async state machines don't get cached forever?
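
For illustration, the Gen2GcCallback-style pattern referred to here can be sketched as a self-resurrecting finalizable object; the names below are invented and the runtime's internal helper differs in detail:

```csharp
using System;
using System.Runtime.ConstrainedExecution;

// Sketch: a finalizable object that invokes a callback when its finalizer runs
// and then resurrects itself, so a cache could drop its pooled boxes
// periodically as GCs occur.
sealed class GcNotifier : CriticalFinalizerObject
{
    private readonly Func<bool> _callback; // return false to stop re-registering

    private GcNotifier(Func<bool> callback) => _callback = callback;

    public static void Register(Func<bool> callback) => _ = new GcNotifier(callback);

    ~GcNotifier()
    {
        if (Environment.HasShutdownStarted)
            return; // don't run user code during shutdown

        if (_callback())
            GC.ReRegisterForFinalize(this); // survive and fire again on a later GC
    }
}

// Usage sketch: drop cached boxes whenever the notifier's finalizer runs.
// GcNotifier.Register(() => { /* clear the hypothetical box cache here */ return true; });
```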

@benaadams
Member

Another interesting metric to look at is code size.

baseline: 2,603,033 bytes
new / off: 2,659,316 bytes (2% regression)
new / on: 2,810,236 bytes (8% regression)

Could we guard it with something that's dropped from R2R, like Avx2.IsSupported (though something better formalised as not being included in R2R would be better)? Then the caches would only come into play at JIT tier-1 time.

Today `async ValueTask/ValueTask<T>` methods use builders that special-case the synchronously completing case (to just return a `default(ValueTask)` or `new ValueTask<T>(result)`) but defer to the equivalent of `async Task/Task<T>` for when the operation completes asynchronously. This, however, doesn't take advantage of the fact that value tasks can wrap arbitrary `IValueTaskSource/IValueTaskSource<T>` implementations.

This commit gives `AsyncValueTaskMethodBuilder` and `AsyncValueTaskMethodBuilder<T>` the ability to use pooled `IValueTaskSource/IValueTaskSource<T>` instances, such that calls to an `async ValueTask/ValueTask<T>` method incur 0 allocations (amortized) as long as there's a cached object available.

Currently the pooling is behind a feature flag, requiring opt-in via the DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS environment variable (setting it to "true" or "1").  This is done for a few reasons:
- There's a breaking change here, in that while we say/document that `ValueTask/ValueTask<T>`s are more limited in what they can be used for, nothing in the implementation actually stops a `ValueTask` that was wrapping a `Task` from being used as permissively as `Task`, e.g. if you were to `await` such a `ValueTask` multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken. I think this is a reasonable thing to do in a major release, but I also want feedback and a lot of runway on it.
- Policy around pooling. Pooling is always a difficult thing to tune. Right now I've chosen a policy that limits the number of pooled objects per state machine type to an arbitrary multiple of the processor count, and that tries to strike a balance between contention and garbage by using a spin lock and if there's any contention while trying to get or return a pooled object, the cache is ignored. We will need to think hard about what policy to use here. It's also possible it could be tuned per state machine, e.g. by having an attribute that's looked up via reflection when creating the cache for a state machine, but that'll add a lot of first-access overhead to any `async ValueTask/ValueTask<T>` method.  For now, it's tunable via the `DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKSLIMIT` environment variable, which may be set to the maximum number of pooled objects desired per state machine.
- Performance validation.  Preliminary numbers show that this accomplishes its goal, having on-par throughput with the current implementation but with significantly less allocation.  That needs to be validated at scale and across a variety of workloads.
- Diagnostics.  There are several diagnostics-related abilities available for `async Task/Task<T>` methods that are not included here when using `async ValueTask/ValueTask<T>` when pooling.  We need to evaluate these (e.g. tracing) and determine which pieces need to be enabled and which we're fine omitting.

Before shipping .NET 5, we could choose to flip the polarity of the switch (making it opt-out rather than opt-in), remove the fallback altogether and just have it be always on, or revert this change, all based on experimentation and data we receive between now and then.
@stephentoub
Member Author

stephentoub commented Oct 24, 2019

Could guard it with something that's dropped from R2R

Maybe, but that would make certain aspects of this harder and potentially more expensive. Right now I get away with things like Unsafe.As because I know everything will be either type A or type B, per the readonly static that never changes. A fallback like I think you're suggesting would mean I could no longer trust that and would need to come up with an alternative.

It might be reasonable to still do something like you suggest, but use it as a mechanism to impact cache size, e.g. only start caching boxes after N calls to the method. That would help avoid heap object bloat for rarely used methods, but it wouldn't help with the aforementioned code size issues (and could be addressed in other ways, like dropping all boxes on gen2 gcs). The best thing we can do for code size I think is decide to either delete this change or delete the fallback, so that we're not paying double the code for every such method.
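
For context, the opt-in switch and the "readonly static that never changes" mentioned above typically look something like this sketch (type and member names invented, not the PR's exact code):

```csharp
using System;

// Illustrative sketch of the opt-in mechanism described in the commit message:
// the environment variable is read once into a readonly static, so after tiering
// the JIT can treat the flag as a constant and drop the unused code path.
internal static class ValueTaskPoolingSwitch
{
    internal static readonly bool Enabled = GetEnabled();

    private static bool GetEnabled()
    {
        string value = Environment.GetEnvironmentVariable("DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS");
        return value == "1" || string.Equals(value, "true", StringComparison.OrdinalIgnoreCase);
    }
}
```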

@stephentoub
Member Author

Given the feedback, the desire for more data, and the agreement that this is a reasonable way to get it and experiment more, I'm going to go ahead and merge this. I'll open an issue to track follow-ups for .NET 5. Thanks.
