Optimize the concurrent performance of Cpp target by more than 10 times #4237
Conversation
Signed-off-by: wangtao9 <[email protected]>
and/or parser (default OFF) Signed-off-by: wangtao9 <[email protected]>
This optimization looks great. |
I have done a similar build config for go. It does make a difference. I didn’t get 10x from go, but there could be all sorts of reasons for that. Is your input example all you tried? Your mileage may vary on different input. I’ll try the same thing with go as well. It’s 22:45 where I live, so I’ll try tomorrow |
I've also tried simpler inputs such as "RETURN 1" and more complex examples of over 800 characters, both with significant improvements. But more importantly, this optimization achieves a huge performance improvement for mechanical reasons: the C++ runtime handles locks differently from the JVM, and it falls into kernel calls more frequently, which is one of the main reasons the concurrency performance of the C++ target is much slower than that of the Java target. |
Understood. Go mostly uses atomics; it is actually pretty good at this stuff.
|
@parrt Can this PR be merged? |
Perhaps @hzeller has an opinion here, but frankly I'm terrified of a multi-threaded version of the parsing strategy... It is incredibly tricky to get right, and many people rely on the C++ runtime. |
OK, I just looked at the code. Rather than adding threads, you are simply removing a lock, am I correct? I guess the question is: how does it work without the lock in a multithreaded environment? |
I had a brief look - it replaces one global variable with multiple thread-local copies.

Personally, I would anyway avoid doing something like

```cpp
static Foo *foo = nullptr;
call_once(initialize_foo);
```

but rather something like

```cpp
static Foo *foo = new Foo();
```

Then the memory model takes care of initializing that exactly once. So I would create functions that create the static object and return the pointer, and then call them in such a pattern:

```cpp
static StaticData *LexerStaticData = CreateLexerStaticData();  // or whatever that template expands to
```

I'd make that unconditional; don't add a define.

I know @jcking was looking at multi-threaded performance; maybe he has come across this part of the code and has some recollection if/why it is done this way. |
@jcking changed the once implementation to be either a local one or an absl one in this change, but it was not changing the call_once() need per se. I suspect he did that because things are faster with absl's implementation. With the suggestions in my previous comment we can eliminate the pre-C++11 need for call_once() entirely. |
@parrt Not exactly. It does not simply remove a lock; it changes the static data shared by multiple threads into one copy per thread. Since the data is owned by each thread, locks are no longer relied upon to keep the data safe (although the locks still exist). |
@hzeller The idea of this optimization is to turn the static data shared by multiple threads into thread-owned data, so as to avoid competition for locks.
|
Is the data structure modified in each thread? If not, we don't need locks and can make the data structure const. But if so, then it being static sounds like a bad idea. Thread-local will fix that particular situation to not require locks, but it also means that there is something else going on, and changing it to thread-local will change the semantics, as now every thread sees different content. |
@hzeller I also verified this with experiments, as shown in the figure below: the left side is the log of building DFA states with a single thread, and the right side is 4 threads doing the same thing. Except for the different thread ids, the constructed DFA states are exactly the same. |
So basically you're sacrificing reuse to avoid locks? I suspect this might be counterproductive in terms of performance with complex grammars, because each thread needs to rebuild the complete DFA instead of it being built just once. Is it possible that the locks that currently protect concurrent updates of the DFA are protecting too much code? |
Signed-off-by: wangtao9 <[email protected]>
Signed-off-by: wangtao9 <[email protected]>
@ericvergnaud I think what you say makes sense, it could happen. |
So if the state is constructed once and then never modified, then the object can be const. |
I'm supportive of this evolution, where the behavior is selectable via a macro when compiling the runtime rather than when generating the parser.
|
I thought about it once again and now I don't think introducing the new option is a good idea. Why can't this runtime option be activated at runtime? That is a more flexible solution since it doesn't require regeneration. At least in C++ it's possible to use preprocessor directives:

```
#if USE_THREAD_LOCAL_CACHE
static thread_local
#endif
<lexer.name; format = "cap">StaticData *<lexer.grammarName; format = "lower">LexerStaticData = nullptr;
```
In Go, I use build configurations that allow the runtime to be built in single-threaded mode. Because the runtime is built at the same time as the generated parser etc., this works on a per-project basis. C++ needs a different mechanism, I guess, as the library is pre-built. So a -D build configuration and building different versions of the library seems reasonable.

But in Go, my purpose is to elide locks altogether when the user knows that each instance will be independent - the default build is with mutexes. Things that are just statically initialized are done via a do.Once() and are not locked after that. I would change the static to just declare inline, but it needs to deserialize the lexer etc. and is small beer time-wise. The locks are used for the prediction cache etc. They are needed if multiple goroutines are calling the same lexer/parser. I did not feel any need to bother changing code gen, as the do.Once() mechanism is extremely fast anyway, and in single-threaded mode it is basically just an atomic read once it has happened once.

I have not found it overly useful to reuse the parser in multiple goroutines (equivalent to threads), as there tends to be a lot of other work going on around it, which would mean I have to start mutexing all of that. It is likely the same for other targets - is anyone really using multiple threads on the same recognizer? If the grammar is well formed, then the overhead of one instance of my toolchain per goroutine is essentially irrelevant. Poor grammars are basically just that, and I am not going to do much more work improving things for poor grammars when the answer is to fix the grammar.

Go is a bit different though, as channels make it easy to spark up N workers and distribute the compile work in parallel. The parsing is generally trivial compared to everything else that needs doing, which needs multiple tree walks. So one goroutine per translation unit in a single-threaded build configuration works well.

Locking contention with many goroutines using one parser is measurable. However, I did not achieve the same speed increase that this PR suggests for the C++ runtime. Perhaps the locking in the C++ runtime needs some examination?
|
@hzeller In C++, these data cannot simply be declared const, because we have no way to determine their values at declaration time; they are assigned during the parse process. But I think there is a way (with major changes) to completely remove the use of locks. That would be a big upgrade, and a similar optimization effect could be achieved. |
@KvanTTT I also support this modification, changing the generate-time option to compile-time. updated: |
@jimidle What multithreading speedup did you measure in Go? I suppose the scalability should be much better than the Cpp target's (whose speedup ratio of 32 threads compared to a single thread is only 1.34). |
I have not done any formal testing of this, as I prefer to think of it as experimental for this release. For one parser I did not get a huge improvement, but I know that when I use something like 32 routines (usually there is not much point using more than the number of cores, though) there is an overall performance benefit because there is no lock contention. I will do some 'reality' testing at some point down the line. The other thing with so many threads is of course the memory cache, which takes exploring as to whether the effects are large. They sometimes can be very large. Hence I don't require the users to pick one or the other - they can change per project. That's easier in Go than in most languages.
|
Signed-off-by: wangtao9 <[email protected]>
Signed-off-by: wangtao9 <[email protected]>
@sharwell, do you refer to the fact that sometimes references are replaced (e.g. during optimization/merge runs in |
@wangtao9 Are you measuring DFA warm-up time here, or are you measuring throughput after the parser is warmed up? Or both together? I would bet that after warm-up the existing system gets higher throughput with multiple threads. |
Most of what you say is correct, but I think this key point is not quite right. The locks are not "only used while building up the DFA": after the build is done, the read lock is taken for EVERY read, resulting in the poor concurrent performance of the C++ target (the speedup ratio of 32 threads is less than 2). |
@parrt |
@hzeller Totally agree, exactly what I meant. |
Signed-off-by: wangtao9 <[email protected]>
Signed-off-by: wangtao9 <[email protected]>
@KvanTTT Done. |
@wangtao9 OK, thanks for taking care to create such a patch, after analysing the code thoroughly. I also believe that the C++ runtime could benefit very much from removing locks (including shared_ptr). So I'm fine with your PR. |
Just fix that little typo.
Signed-off-by: wangtao9 <[email protected]>
@mike-lischke Thanks for your review. Like you, I also think that removing locks is worth doing; maybe it can be put on the agenda in the near future? Until that bigger change is done, C++ runtime users can get similar benefits from this optimization :D |
Now it's OK for me, but please also fix the minor issues with cpp-target.md.
Co-authored-by: Ivan Kochurkin <[email protected]> Signed-off-by: Tao Wang <[email protected]>
Thanks a lot!
Glad to contribute to antlr4 project! :D |
Thanks, everyone, especially @wangtao9 ! |
Does it lead to a memory leak? Who is responsible for releasing the memory when a thread is destroyed? |
@taodongl I solved the problem. No memory leaks are detected now; you can verify it with this link: https://github.com/wangtao9/antlr4-perfopt-test/tree/sanitizer_check |
Usage:
1. Add the `-lock-free-cpp-target` option when generating the parser, e.g. `java -jar ${ANTLR_JAR} -Dlanguage=Cpp -lock-free-cpp-target Cypher.g4`
2. Add the compile option `-DANTLR4_USE_THREAD_LOCAL_CACHE=1` when compiling the Cpp lexer & parser.

Related issues:
#2454
#2584
#3938
Why the C++ target is 6X slower than the Java target
Optimization result:
Test configuration:
Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz ; Cores: 16 ; Logical processors: 32
256GB memory
grammar file: https://s3.amazonaws.com/artifacts.opencypher.org/M21/Cypher.g4
test query: