Bump downwards #37

Merged
merged 2 commits into master from bump-downwards
Nov 1, 2019

Conversation

fitzgen
Owner

@fitzgen fitzgen commented Nov 1, 2019

This changes bumpalo's implementation from

  • initializing the bump pointer at the start of the chunk, and
  • incrementing the bump pointer to allocate an object

to

  • initializing the bump pointer at the end of the chunk, and
  • decrementing the bump pointer to allocate an object

This means that we now round down to align the pointer, which is just masking off the bottom bits. Rounding up, as the old implementation had to, required an addition that could overflow, which meant an extra conditional branch in the generated code.
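The difference between the two alignment strategies can be sketched as follows (illustrative standalone functions, not bumpalo's actual code):

```rust
// Rounding a pointer value down to an alignment is a single mask; rounding up
// needs an addition that can overflow, forcing a checked add (an extra branch).
fn align_down(ptr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    ptr & !(align - 1)
}

fn align_up(ptr: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    // The addition `ptr + (align - 1)` can overflow, so a checked add is needed.
    ptr.checked_add(align - 1).map(|p| p & !(align - 1))
}

fn main() {
    assert_eq!(align_down(0x1007, 8), 0x1000); // mask only, cannot fail
    assert_eq!(align_up(0x1001, 8), Some(0x1008));
    assert_eq!(align_up(usize::MAX, 8), None); // the overflow case rounding up must handle
}
```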

Furthermore, once the bump pointer is decremented, it points directly at the allocated space. Previously, we had to save a copy of the original pointer in a temporary, update the bump pointer, and then return the temporary. That requires an extra register, so the new approach should lower register pressure at call sites and produce slightly better code.
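A minimal sketch of the downward-bumping scheme (the `Chunk` type and its fields here are illustrative, not bumpalo's actual API):

```rust
use core::cell::Cell;

// Hypothetical, simplified chunk to illustrate downward bumping.
struct Chunk {
    start: usize,     // lowest address of the chunk
    ptr: Cell<usize>, // bump pointer, initialized to the chunk's END
}

impl Chunk {
    fn alloc(&self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Decrement, then round down to the alignment with a single mask.
        let new_ptr = self.ptr.get().checked_sub(size)? & !(align - 1);
        if new_ptr < self.start {
            return None; // chunk exhausted
        }
        self.ptr.set(new_ptr);
        // `new_ptr` already points at the allocation: no temporary copy of the
        // old bump pointer is needed, unlike the upward-bumping version.
        Some(new_ptr)
    }
}

fn main() {
    let chunk = Chunk { start: 0x1000, ptr: Cell::new(0x2000) };
    let a = chunk.alloc(16, 8).unwrap();
    assert_eq!(a, 0x1FF0); // 0x2000 - 16, already 8-aligned
    let b = chunk.alloc(10, 8).unwrap();
    assert_eq!(b, 0x1FE0); // 0x1FF0 - 10 = 0x1FE6, masked down to 0x1FE0
    assert!(b < a); // successive allocations move toward the chunk's start
}
```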

The decrement also requires fewer instructions to implement, which is better for code size and, all else being equal, should be a speedup in its own right as well.

Put all this together and it looks like allocation speeds up 3-19% depending on the workload! See the benchmark results below.

Note that there is a ~4% regression in realloc performance. This is because the new, decrementing-the-bump-pointer implementation cannot grow the last allocation in place by only updating the bump pointer. It has to do a copy, since the beginning of the allocation moves even when we get to reuse the original allocation's space. I think this is worth the trade-off for the speedup to allocation, however.
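The regression follows directly from the pointer arithmetic. A sketch, using a simplified model (ignoring alignment) rather than bumpalo's real code:

```rust
// Growing the most recent allocation under each scheme.
//
// Upward bumping: the allocation occupies [p, p + old_size) and the bump
// pointer sits at its END (p + old_size). Growing just advances the bump
// pointer; the data stays at p, so no copy is needed.
//
// Downward bumping: the allocation's START is the bump pointer itself.
// Growing moves the start DOWN, so the allocation's address changes and the
// old bytes must be copied to the new, lower address.
fn main() {
    let old_size = 8usize;
    let new_size = 12usize;
    let p = 0x1FF0usize; // bump pointer == start of the last allocation

    // Downward: the new start is below the old one.
    let new_start = p - (new_size - old_size);
    assert_eq!(new_start, 0x1FEC);
    assert_ne!(new_start, p); // the start moved, so the data must be copied
}
```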

Benchmark Results
alloc/small             time:   [26.129 us 26.168 us 26.208 us]
                        thrpt:  [381.56 Melem/s 382.15 Melem/s 382.71 Melem/s]
                 change:
                        time:   [-9.2069% -8.7900% -8.3936%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1627% +9.6372% +10.141%]
                        Performance has improved.
Found 123 outliers among 1000 measurements (12.30%)
  51 (5.10%) high mild
  72 (7.20%) high severe

alloc/big               time:   [348.03 us 348.21 us 348.41 us]
                        thrpt:  [28.702 Melem/s 28.718 Melem/s 28.733 Melem/s]
                 change:
                        time:   [-3.1144% -3.0057% -2.8915%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9776% +3.0989% +3.2145%]
                        Performance has improved.
Found 150 outliers among 1000 measurements (15.00%)
  58 (5.80%) low mild
  46 (4.60%) high mild
  46 (4.60%) high severe

alloc-with/small        time:   [26.446 us 26.477 us 26.508 us]
                        thrpt:  [377.25 Melem/s 377.69 Melem/s 378.12 Melem/s]
                 change:
                        time:   [-16.499% -16.191% -15.898%] (p = 0.00 < 0.05)
                        thrpt:  [+18.904% +19.318% +19.759%]
                        Performance has improved.
Found 57 outliers among 1000 measurements (5.70%)
  43 (4.30%) high mild
  14 (1.40%) high severe

alloc-with/big          time:   [313.26 us 313.75 us 314.35 us]
                        thrpt:  [31.811 Melem/s 31.872 Melem/s 31.922 Melem/s]
                 change:
                        time:   [-6.5853% -6.2957% -6.0163%] (p = 0.00 < 0.05)
                        thrpt:  [+6.4014% +6.7187% +7.0495%]
                        Performance has improved.
Found 166 outliers among 1000 measurements (16.60%)
  70 (7.00%) low mild
  44 (4.40%) high mild
  52 (5.20%) high severe

format-realloc/format-realloc/10
                        time:   [84.850 ns 85.002 ns 85.162 ns]
                        thrpt:  [117.42 Melem/s 117.64 Melem/s 117.86 Melem/s]
                 change:
                        time:   [+4.8825% +5.4527% +6.2553%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8870% -5.1707% -4.6552%]
                        Performance has regressed.
Found 299 outliers among 1000 measurements (29.90%)
  1 (0.10%) low severe
  78 (7.80%) low mild
  22 (2.20%) high mild
  198 (19.80%) high severe

format-realloc/format-realloc/80
                        time:   [85.144 ns 85.353 ns 85.571 ns]
                        thrpt:  [934.89 Melem/s 937.29 Melem/s 939.58 Melem/s]
                 change:
                        time:   [+4.6040% +5.5085% +6.1615%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8039% -5.2209% -4.4014%]
                        Performance has regressed.
Found 168 outliers among 1000 measurements (16.80%)
  40 (4.00%) high mild
  128 (12.80%) high severe

format-realloc/format-realloc/270
                        time:   [84.940 ns 85.080 ns 85.225 ns]
                        thrpt:  [3.1681 Gelem/s 3.1735 Gelem/s 3.1787 Gelem/s]
                 change:
                        time:   [+3.7967% +4.2268% +4.6452%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4390% -4.0554% -3.6579%]
                        Performance has regressed.
Found 229 outliers among 1000 measurements (22.90%)
  8 (0.80%) low severe
  2 (0.20%) low mild
  11 (1.10%) high mild
  208 (20.80%) high severe

format-realloc/format-realloc/640
                        time:   [85.917 ns 86.199 ns 86.497 ns]
                        thrpt:  [7.3991 Gelem/s 7.4247 Gelem/s 7.4490 Gelem/s]
                 change:
                        time:   [+2.2676% +3.1780% +3.8626%] (p = 0.00 < 0.05)
                        thrpt:  [-3.7190% -3.0801% -2.2173%]
                        Performance has regressed.
Found 169 outliers among 1000 measurements (16.90%)
  62 (6.20%) high mild
  107 (10.70%) high severe

@fitzgen
Owner Author

fitzgen commented Nov 1, 2019

cc @TethysSvensson

@fitzgen fitzgen merged commit 38054c7 into master Nov 1, 2019
@fitzgen fitzgen deleted the bump-downwards branch November 1, 2019 21:17
@TethysSvensson
Contributor

@fitzgen Wow, those are some nice numbers! It might also mean that I can use this for flatbuffers!

@fitzgen
Owner Author

fitzgen commented Nov 1, 2019

Great! :)

let new_ptr = footer.ptr.get();
// NB: we know it is non-overlapping because of the size check
// in the `if` condition.
ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_ptr(), new_size);
Contributor

@TethysSvensson TethysSvensson Nov 1, 2019


Have you tested how much we lose by using `ptr::copy` instead and not having the `new_size <= old_size / 2` check?

Contributor

Ah, I see. If we are shrinking but not by a lot, we can just use the same pointer. We could reclaim it, but it is a lot of trouble for very few bytes saved. I agree with this implementation! 👍

Owner Author

Exactly, and since we are already doing this calculus, we might as well choose the threshold where we also get to do a faster copy. That said, if you want to experiment with other implementations and benchmark them, I'm happy to accept results-driven PRs! :)
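The threshold ties the two points in this thread together: only shrinking to at most half the old size guarantees that the copy's source and destination ranges are disjoint, which is what makes `ptr::copy_nonoverlapping` sound. A sketch of that invariant (illustrative helper functions, not bumpalo's code; alignment is ignored):

```rust
// Under downward bumping, shrinking the last allocation moves its start UP by
// (old_size - new_size). Source [old_start, old_start + new_size) and
// destination [new_start, new_start + new_size) are disjoint exactly when
// new_size <= old_size / 2.
fn ranges_overlap(a: (usize, usize), b: (usize, usize)) -> bool {
    a.0 < b.0 + b.1 && b.0 < a.0 + a.1
}

fn shrink_is_nonoverlapping(old_start: usize, old_size: usize, new_size: usize) -> bool {
    let new_start = old_start + (old_size - new_size); // start moves up
    !ranges_overlap((old_start, new_size), (new_start, new_size))
}

fn main() {
    // Shrink to half or less: disjoint, so the faster non-overlapping copy is sound.
    assert!(shrink_is_nonoverlapping(0x1000, 64, 32));
    assert!(shrink_is_nonoverlapping(0x1000, 64, 16));
    // Shrink by less than half: the ranges overlap; reusing the same pointer
    // (no copy at all) is the cheaper choice, as concluded above.
    assert!(!shrink_is_nonoverlapping(0x1000, 64, 48));
}
```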

@Kamayuq

Kamayuq commented Jan 15, 2022

So I have tried bumping downwards in a linear allocator, and even though it saves a couple of instructions, it was always slower on Jaguar CPUs.
