
Optimizations for Armv8-A #50

Open: volyrique wants to merge 1 commit into master
Conversation

@volyrique

This commit adapts the past SIMD optimizations to the Neon extension that is part of the Armv8-A architecture. The changes apply only to the AArch64 execution state (32-bit code would require more work due to the smaller general-purpose registers, so I didn't bother).

In order to gather some performance data, I compiled the included benchmark bench.c using:
gcc -Wall -Wextra -Ofast -flto -march=native -g -o bench bench.c picohttpparser.c
And then ran it with:
taskset -c 1 time -f "%e" ./bench

I ran the benchmark on Ubuntu 18.04 using 3 different cloud instances:

  • a1.large on Amazon Web Services - uses AWS Graviton processors that are apparently based on Arm Cortex-A72
  • c1.large.arm on Packet - uses Cavium ThunderX processors (note that it is the first version)
  • c2.large.arm also on Packet - uses Ampere eMAG processors

Here are the median results from 20 runs - all times are in seconds:

Instance Base time New time Change
a1.large 7.62 4.63 -39.24%
c1.large.arm 16.23 13.36 -17.68%
c2.large.arm 6.29 6.88 +9.38%

Standard errors are 0.1% or less in all cases.

I don't have a good explanation for the regression on the Ampere eMAG yet. Compiling with Clang produced slightly better times (though still slower than the baseline), so part of the explanation is probably that compiler support for this microarchitecture (the most recent of the three) still has room to improve, or that the improvements have not yet reached the OS images that can actually be deployed. Unfortunately, I couldn't find an optimization guide for the processor, and support for the hardware performance counters seemed flaky, so a deeper analysis was difficult.

It should also be possible to optimize the parse_headers() function using the TBL instruction, but that would require transforming token_char_map into a bit array (or something similar), so that it fits into at most 4 vector registers.

I also have an initial implementation (not tested much and certainly not benchmarked) using the Scalable Vector Extension (SVE) in a branch in my fork of the repository.

@kazuho
Member

kazuho commented May 17, 2019

Thank you for the PR.

The results are interesting. I'll check the numbers on the A52 that I happen to have.

@volyrique
Author

volyrique commented Jun 2, 2019

I figured out that header values had a higher chance of being large, so I decided to unroll the vector loop in get_token_to_eol() a bit. Here are the new results:

Instance Base time New time Change
a1.large 7.62 5.13 -32.68%
m6g.large 5.19 2.97 -42.77%
c1.large.arm 16.23 12.54 -22.74%
c2.large.arm 6.29 6.20 -1.43%

The result on the Amazon instance is a bit worse (but still significantly faster than the scalar version), while all the other values have improved; in particular, there is no longer a performance regression on the Ampere eMAG. Standard errors are 0.26% or less.

P.S. I added results for a m6g.large instance on Amazon Web Services, which uses an AWS Graviton2 processor (based on Arm Neoverse N1) and which has recently become available. It ran Ubuntu 18.04 as in the other cases.

@volyrique
Author

volyrique commented Oct 31, 2019

Now that Travis CI supports testing in an Arm64 environment, I have enabled it for this project.

I think I also have a pretty good idea about why the performance on the Ampere eMAG is not that good. After some experiments, I have determined that the vector instruction throughput on that machine is 0.50 (instructions per cycle), while on Arm Cortex-A72 it is 1.49 (probably 1.5 - there is some measurement noise). Those values are for vector bitwise operations and comparisons, which are the main operations executed by my optimization. For comparison, the scalar addition throughput is 1.99 in both cases (again, probably 2.00). As a result, it is worth vectorizing on the Ampere machine mainly if there is a significant amount of data to process, so it is not surprising that the second version of my changes, which has raised the threshold for switching from scalar to vector code, behaves better.

As for the hardware performance counters being problematic on the Ampere eMAG - it turns out that there are no problems if the counters are specified explicitly on the perf command line using the IDs from the Ampere documentation.

@volyrique
Author

Optimized implementations of the parse_token() function using the TBL instruction or SVE gather loads are available in this and that branch respectively, but neither of them is as convincing as the changes I have proposed here.

The regular SVE optimizations have been merged into this PR.

@enghitalo

up
