Bump downwards #37

Merged
merged 2 commits into master from bump-downwards
Nov 1, 2019

Conversation

fitzgen
Owner

@fitzgen fitzgen commented Nov 1, 2019

This changes bumpalo's implementation from

  • initializing the bump pointer at the start of the chunk, and
  • incrementing the bump pointer to allocate an object

to

  • initializing the bump pointer at the end of the chunk, and
  • decrementing the bump pointer to allocate an object

This means that we now round down to align the pointer, which is just masking off the bottom bits. Rounding up, as the old implementation had to, required an addition that could overflow, which meant an extra conditional branch in the generated code.
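The difference between the two alignment strategies can be sketched as follows (illustrative standalone functions, not bumpalo's actual code):

```rust
// Rounding a pointer value down to an alignment is a single mask; rounding up
// needs an addition that can overflow, forcing a checked add (an extra branch).
fn align_down(ptr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    ptr & !(align - 1)
}

fn align_up(ptr: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    // The addition `ptr + (align - 1)` can overflow, so a checked add is needed.
    ptr.checked_add(align - 1).map(|p| p & !(align - 1))
}

fn main() {
    assert_eq!(align_down(0x1007, 8), 0x1000); // mask only, cannot fail
    assert_eq!(align_up(0x1001, 8), Some(0x1008));
    assert_eq!(align_up(usize::MAX, 8), None); // the overflow case rounding up must handle
}
```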

Furthermore, once the bump pointer is decremented, it points directly at the allocated space. Previously, we had to save a copy of the original pointer in a temporary, update the bump pointer, and then return the temporary. That requires an extra register, so the new approach should lower register pressure at call sites and produce slightly better code.
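A minimal sketch of the downward-bumping scheme (the `Chunk` type and its fields here are illustrative, not bumpalo's actual API):

```rust
use core::cell::Cell;

// Hypothetical, simplified chunk to illustrate downward bumping.
struct Chunk {
    start: usize,     // lowest address of the chunk
    ptr: Cell<usize>, // bump pointer, initialized to the chunk's END
}

impl Chunk {
    fn alloc(&self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Decrement, then round down to the alignment with a single mask.
        let new_ptr = self.ptr.get().checked_sub(size)? & !(align - 1);
        if new_ptr < self.start {
            return None; // chunk exhausted
        }
        self.ptr.set(new_ptr);
        // `new_ptr` already points at the allocation: no temporary copy of the
        // old bump pointer is needed, unlike the upward-bumping version.
        Some(new_ptr)
    }
}

fn main() {
    let chunk = Chunk { start: 0x1000, ptr: Cell::new(0x2000) };
    let a = chunk.alloc(16, 8).unwrap();
    assert_eq!(a, 0x1FF0); // 0x2000 - 16, already 8-aligned
    let b = chunk.alloc(10, 8).unwrap();
    assert_eq!(b, 0x1FE0); // 0x1FF0 - 10 = 0x1FE6, masked down to 0x1FE0
    assert!(b < a); // successive allocations move toward the chunk's start
}
```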

The decrement also requires fewer instructions to implement, which is better for code size and, all else being equal, should be a speedup in its own right as well.

Put all this together and it looks like allocation speeds up 3-19% depending on the workload! See the benchmark results below.

Note that there is a ~4% regression in realloc performance. This is because the new, decrementing-the-bump-pointer implementation cannot grow the last allocation in place by only updating the bump pointer. It has to do a copy, since the beginning of the allocation moves even when we get to reuse the original allocation's space. I think this is worth the trade-off for the speedup to allocation, however.
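The regression follows directly from the pointer arithmetic. A sketch, using a simplified model (ignoring alignment) rather than bumpalo's real code:

```rust
// Growing the most recent allocation under each scheme.
//
// Upward bumping: the allocation occupies [p, p + old_size) and the bump
// pointer sits at its END (p + old_size). Growing just advances the bump
// pointer; the data stays at p, so no copy is needed.
//
// Downward bumping: the allocation's START is the bump pointer itself.
// Growing moves the start DOWN, so the allocation's address changes and the
// old bytes must be copied to the new, lower address.
fn main() {
    let old_size = 8usize;
    let new_size = 12usize;
    let p = 0x1FF0usize; // bump pointer == start of the last allocation

    // Downward: the new start is below the old one.
    let new_start = p - (new_size - old_size);
    assert_eq!(new_start, 0x1FEC);
    assert_ne!(new_start, p); // the start moved, so the data must be copied
}
```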

Benchmark Results
alloc/small             time:   [26.129 us 26.168 us 26.208 us]
                        thrpt:  [381.56 Melem/s 382.15 Melem/s 382.71 Melem/s]
                 change:
                        time:   [-9.2069% -8.7900% -8.3936%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1627% +9.6372% +10.141%]
                        Performance has improved.
Found 123 outliers among 1000 measurements (12.30%)
  51 (5.10%) high mild
  72 (7.20%) high severe

alloc/big               time:   [348.03 us 348.21 us 348.41 us]
                        thrpt:  [28.702 Melem/s 28.718 Melem/s 28.733 Melem/s]
                 change:
                        time:   [-3.1144% -3.0057% -2.8915%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9776% +3.0989% +3.2145%]
                        Performance has improved.
Found 150 outliers among 1000 measurements (15.00%)
  58 (5.80%) low mild
  46 (4.60%) high mild
  46 (4.60%) high severe

alloc-with/small        time:   [26.446 us 26.477 us 26.508 us]
                        thrpt:  [377.25 Melem/s 377.69 Melem/s 378.12 Melem/s]
                 change:
                        time:   [-16.499% -16.191% -15.898%] (p = 0.00 < 0.05)
                        thrpt:  [+18.904% +19.318% +19.759%]
                        Performance has improved.
Found 57 outliers among 1000 measurements (5.70%)
  43 (4.30%) high mild
  14 (1.40%) high severe

alloc-with/big          time:   [313.26 us 313.75 us 314.35 us]
                        thrpt:  [31.811 Melem/s 31.872 Melem/s 31.922 Melem/s]
                 change:
                        time:   [-6.5853% -6.2957% -6.0163%] (p = 0.00 < 0.05)
                        thrpt:  [+6.4014% +6.7187% +7.0495%]
                        Performance has improved.
Found 166 outliers among 1000 measurements (16.60%)
  70 (7.00%) low mild
  44 (4.40%) high mild
  52 (5.20%) high severe

format-realloc/format-realloc/10
                        time:   [84.850 ns 85.002 ns 85.162 ns]
                        thrpt:  [117.42 Melem/s 117.64 Melem/s 117.86 Melem/s]
                 change:
                        time:   [+4.8825% +5.4527% +6.2553%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8870% -5.1707% -4.6552%]
                        Performance has regressed.
Found 299 outliers among 1000 measurements (29.90%)
  1 (0.10%) low severe
  78 (7.80%) low mild
  22 (2.20%) high mild
  198 (19.80%) high severe

format-realloc/format-realloc/80
                        time:   [85.144 ns 85.353 ns 85.571 ns]
                        thrpt:  [934.89 Melem/s 937.29 Melem/s 939.58 Melem/s]
                 change:
                        time:   [+4.6040% +5.5085% +6.1615%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8039% -5.2209% -4.4014%]
                        Performance has regressed.
Found 168 outliers among 1000 measurements (16.80%)
  40 (4.00%) high mild
  128 (12.80%) high severe

format-realloc/format-realloc/270
                        time:   [84.940 ns 85.080 ns 85.225 ns]
                        thrpt:  [3.1681 Gelem/s 3.1735 Gelem/s 3.1787 Gelem/s]
                 change:
                        time:   [+3.7967% +4.2268% +4.6452%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4390% -4.0554% -3.6579%]
                        Performance has regressed.
Found 229 outliers among 1000 measurements (22.90%)
  8 (0.80%) low severe
  2 (0.20%) low mild
  11 (1.10%) high mild
  208 (20.80%) high severe

format-realloc/format-realloc/640
                        time:   [85.917 ns 86.199 ns 86.497 ns]
                        thrpt:  [7.3991 Gelem/s 7.4247 Gelem/s 7.4490 Gelem/s]
                 change:
                        time:   [+2.2676% +3.1780% +3.8626%] (p = 0.00 < 0.05)
                        thrpt:  [-3.7190% -3.0801% -2.2173%]
                        Performance has regressed.
Found 169 outliers among 1000 measurements (16.90%)
  62 (6.20%) high mild
  107 (10.70%) high severe

@fitzgen
Owner Author

fitzgen commented Nov 1, 2019

cc @TethysSvensson

@fitzgen fitzgen merged commit 38054c7 into master Nov 1, 2019
@fitzgen fitzgen deleted the bump-downwards branch November 1, 2019 21:17
@TethysSvensson
Contributor

@fitzgen Wow, those are some nice numbers! It might also mean that I can use this for flatbuffers!

@fitzgen
Owner Author

fitzgen commented Nov 1, 2019

Great! :)

let new_ptr = footer.ptr.get();
// NB: we know it is non-overlapping because of the size check
// in the `if` condition.
ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_ptr(), new_size);
Contributor

@TethysSvensson TethysSvensson Nov 1, 2019


Have you tested how much we lose by using `ptr::copy` instead and not having the `new_size <= old_size / 2` check?

Contributor

Ah, I see. If we are shrinking but not by a lot, we can just use the same pointer. We could reclaim it, but it is a lot of trouble for very few bytes saved. I agree with this implementation! 👍

Owner Author

Exactly, and since we are already doing this calculus, we might as well choose the threshold where we also get to do a faster copy. That said, if you want to experiment with other implementations and benchmark them, I'm happy to accept results-driven PRs! :)
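The threshold ties the two points in this thread together: only shrinking to at most half the old size guarantees that the copy's source and destination ranges are disjoint, which is what makes `ptr::copy_nonoverlapping` sound. A sketch of that invariant (illustrative helper functions, not bumpalo's code; alignment is ignored):

```rust
// Under downward bumping, shrinking the last allocation moves its start UP by
// (old_size - new_size). Source [old_start, old_start + new_size) and
// destination [new_start, new_start + new_size) are disjoint exactly when
// new_size <= old_size / 2.
fn ranges_overlap(a: (usize, usize), b: (usize, usize)) -> bool {
    a.0 < b.0 + b.1 && b.0 < a.0 + a.1
}

fn shrink_is_nonoverlapping(old_start: usize, old_size: usize, new_size: usize) -> bool {
    let new_start = old_start + (old_size - new_size); // start moves up
    !ranges_overlap((old_start, new_size), (new_start, new_size))
}

fn main() {
    // Shrink to half or less: disjoint, so the faster non-overlapping copy is sound.
    assert!(shrink_is_nonoverlapping(0x1000, 64, 32));
    assert!(shrink_is_nonoverlapping(0x1000, 64, 16));
    // Shrink by less than half: the ranges overlap; reusing the same pointer
    // (no copy at all) is the cheaper choice, as concluded above.
    assert!(!shrink_is_nonoverlapping(0x1000, 64, 48));
}
```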

@Kamayuq

Kamayuq commented Jan 15, 2022

So I have tried bumping downwards in a linear allocator, and even though it saves a couple of instructions, it was always slower on Jaguar CPUs.
