Feature Request: Parse numbers as strings #872

Open
cryptochassis opened this issue Mar 20, 2023 · 16 comments

@cryptochassis

When working with various counterparties dealing with monetary systems, we found that, more often than not, we'd receive JSON strings like [1.2345] instead of ["1.2345"]. If we parse that as a double, we might lose precision in some cases. To preserve precision, we have to parse that number as a string. rapidjson offers a solution by providing kParseNumbersAsStringsFlag: https://rapidjson.org/namespacerapidjson.html#a81379eb4e94a0386d71d15fda882ebc9a13981c0b803803f59d7a01aef3dfc987. Interestingly enough, the Python standard json library also offers the capability to parse numbers as strings:
https://docs.python.org/3/library/json.html#json.load (see the parse_float and parse_int parameters).
We are looking into migrating to the Boost.JSON library. Parsing numbers as strings is key for us to preserve monetary precision. Thank you.
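
To make the precision concern concrete, here is a minimal sketch that uses only the C++ standard library. The helper name round_trip_through_double is made up for illustration. It parses decimal text as a double and prints it back with a fixed number of fractional digits, so any difference from the input shows digits that were lost.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Parse decimal text as a double, then print it back with `decimals`
// fractional digits. If the result differs from the input, the double
// could not hold every digit of the original text.
std::string round_trip_through_double(const std::string& text, int decimals) {
    const double value = std::strtod(text.c_str(), nullptr);
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.*f", decimals, value);
    return buffer;
}
```

With an IEEE 754 double, round_trip_through_double("12345678901234567.89", 2) does not reproduce its input, because the value needs more significant digits than a 53-bit mantissa can carry. A short value such as "1.25" survives unchanged, which is why the loss only shows up in some cases.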

@vinniefalco
Member

Possible in theory, if we add it to the parse options: numbers would then come in as strings. However, we could consider adding a flag to json::string somewhere (if we can find a spare bit) that indicates the string contains a valid number. This should not affect performance if the option is not set.

@grisumbras
Member

We technically already support this for parsing. Just use basic_parser with a custom handler. The caveat is that this is way more complicated than it should be. We could make detail::handler public, and document how to override its functions to achieve custom handling of only a subset of parsing events.

The more complicated part of the equation is serialisation. We don't have a customisable serialiser. On the other hand, custom serialisation is very easy to implement with iostreams.
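
As a rough illustration of the iostreams route, here is a sketch that writes a flat object and emits number text without quotes. The field type and the helpers write_quoted, write_object and to_json are hypothetical names and not part of Boost.JSON. The sketch also assumes that text already holds a valid JSON number whenever is_number is set; it does not validate it.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: `text` holds either an ordinary string or the
// exact characters of a number as they appeared in the input.
struct field {
    std::string key;
    std::string text;
    bool is_number;
};

// Write a JSON string literal, escaping quotes, backslashes and control characters.
void write_quoted(std::ostream& os, const std::string& s) {
    static const char hex[] = "0123456789abcdef";
    os << '"';
    for (unsigned char c : s) {
        if (c == '"' || c == '\\') os << '\\' << c;
        else if (c < 0x20) os << "\\u00" << hex[c >> 4] << hex[c & 15];
        else os << c;
    }
    os << '"';
}

// Write the fields as one JSON object; number text is copied verbatim.
void write_object(std::ostream& os, const std::vector<field>& fields) {
    os << '{';
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i != 0) os << ',';
        write_quoted(os, fields[i].key);
        os << ':';
        if (fields[i].is_number) os << fields[i].text;  // no quotes, no conversion
        else write_quoted(os, fields[i].text);
    }
    os << '}';
}

// Convenience wrapper that returns the serialised object as a string.
std::string to_json(const std::vector<field>& fields) {
    std::ostringstream out;
    write_object(out, fields);
    return out.str();
}
```

Because the number is never converted, the digits that go out are the digits that came in.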

So, no special bit for "this is actually a number" is required. BTW, I am sceptical that such a change would not affect performance, even if only in a minor way.

@cryptochassis do you only need this special handling for parsing? Is using basic_parser with a custom handler enough for you?

@grisumbras
Member

Here's an example of what I meant: https://godbolt.org/z/KE7YK7h97

@cryptochassis
Author

@grisumbras Very sorry for the late reply. I completely missed your previous messages. Yes, we only need this special handling for parsing. Using basic_parser with a custom handler seems to be sufficient. Thanks a lot for providing a concrete example. One question: in the example, when the parser encounters a number, say, a double, will it still call std::stod behind the scenes? We are a high-frequency-trading code provider, so performance is of utmost importance to us. Without the call to std::stod, I'd guess it would save a lot of CPU time.

@grisumbras
Member

The number will still be parsed. But our parser doesn't call std::stod (or any other standard number parsing utility for that matter). We use custom number parsing functions, so maybe it will be fast enough for you.

Also, this made me think we might want a parser option that disables number parsing outright.

@cryptochassis
Author

We parse millions of JSON messages per second, so skipping the string-to-number conversion would probably have a visible impact on our system's performance. We'd appreciate a parser option that disables number parsing. Many thanks!

@vinniefalco
Member

If you want the highest performance, why don't you use simdjson? Do you need the ability to modify the JSON values?

@cryptochassis
Author

We don't need the ability to modify the JSON values. When we started developing our library in 2019 and published its first version, simdjson wasn't available. Based on our best judgement at that time, we picked rapidjson. We ourselves are a library rather than an end-user application. The reason we are now aiming to migrate to Boost.JSON instead of simdjson is that a sizable part of our current users (and those who are thinking about using our library) come from a Python background and are therefore at a beginner to intermediate level in C++. They need a simple way to get started building their applications with our library. The simplest way is to rely only on the header-only components of Boost and nothing else. And we are getting closer to that: currently we depend only on Boost, websocketpp, and rapidjson. We are close to moving away from websocketpp by using your Beast websocket. So the only thing left to trim is rapidjson, after which our only dependencies will be the header-only components of Boost. To sum up, the goal is a good balance between performance and usability for a wide audience with vastly different levels of C++ proficiency.

@vinniefalco
Member

Wow... that rationale is actually rather perfect :)

@cryptochassis
Author

> The number will still be parsed. But our parser doesn't call std::stod (or any other standard number parsing utility for that matter). We use custom number parsing functions, so maybe it will be fast enough for you.
>
> Also, this made me think we might want a parser option that disables number parsing outright.

Let me know whether we can have such a parser option. Thanks a lot.

@grisumbras
Member

An option to disable number parsing outright? I have a PR for that (#901). IIRC, my benchmarking shows that for number-heavy inputs the speed of parsing increases by 80% (but don't quote me on that). @vinniefalco should I pursue it?

@grisumbras
Member

To be clear, it still sort of does number validation (we need it to know when the number ends and the parser should start parsing another value), it just doesn't convert the characters into a number.
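
To illustrate what validating without converting can mean, here is a standalone sketch that follows the number grammar of RFC 8259. It is not Boost.JSON's implementation, and the function name scan_number is made up. It only finds where a number ends so that the raw characters can be kept.

```cpp
#include <cstddef>
#include <string>

namespace {
bool is_digit(char c) { return c >= '0' && c <= '9'; }
}

// Returns the length of the JSON number starting at text[pos], or 0 if no
// valid number starts there. No numeric conversion is performed.
std::size_t scan_number(const std::string& text, std::size_t pos) {
    const std::size_t n = text.size();
    std::size_t i = pos;
    if (i < n && text[i] == '-') ++i;                   // optional minus sign
    if (i >= n || !is_digit(text[i])) return 0;         // at least one digit is required
    if (text[i] == '0') ++i;                            // a leading zero stands alone
    else while (i < n && is_digit(text[i])) ++i;
    if (i < n && text[i] == '.') {                      // optional fraction
        std::size_t j = i + 1;
        if (j >= n || !is_digit(text[j])) return 0;     // '.' must be followed by a digit
        while (j < n && is_digit(text[j])) ++j;
        i = j;
    }
    if (i < n && (text[i] == 'e' || text[i] == 'E')) {  // optional exponent
        std::size_t j = i + 1;
        if (j < n && (text[j] == '+' || text[j] == '-')) ++j;
        if (j >= n || !is_digit(text[j])) return 0;     // exponent needs a digit
        while (j < n && is_digit(text[j])) ++j;
        i = j;
    }
    return i - pos;
}
```

A caller would keep text.substr(pos, length) as the number and resume parsing at pos + length. The cost is a single pass over the characters with no arithmetic on the digits.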

@vinniefalco
Member

It's an interesting mode.

@cryptochassis
Author

> An option to disable number parsing outright? I have a PR for that (#901). IIRC, my benchmarking shows that for number-heavy inputs the speed of parsing increases by 80% (but don't quote me on that). @vinniefalco should I pursue it?

Perfect. Looking forward to the finalization. Thanks a lot.

@grisumbras
Member

#901 has been merged into develop

@grisumbras
Member

Local benchmarking results:

Benchmark                       | imprecise | precise       | none
Parse gcc   apache_builds.json  |       754 |  753   -0.13% |  753   -0.13%
Parse gcc   canada.json         |       587 |  400  -31.86% | 1064   81.26%
Parse gcc   citm_catalog.json   |      1231 | 1232    0.08% | 1344    9.18%
Parse gcc   github_events.json  |       837 |  845    0.96% |  850    1.55%
Parse gcc   gsoc-2018.json      |       975 |  977    0.21% |  974   -0.10%
Parse gcc   instruments.json    |       630 |  640    1.59% |  659    4.60%
Parse gcc   marine_ik.json      |       531 |  404  -23.92% |  654   23.16%
Parse gcc   mesh.json           |       532 |  402  -24.44% |  690   29.70%
Parse gcc   mesh.pretty.json    |       996 |  758  -23.90% | 1370   37.55%
Parse gcc   numbers.json        |       818 |  494  -39.61% | 1814  121.76%
Parse gcc   random.json         |       383 |  384    0.26% |  385    0.52%
Parse gcc   twitter.json        |       521 |  524    0.58% |  530    1.73%
Parse gcc   twitterescaped.json |       478 |  474   -0.84% |  488    2.09%
Parse gcc   update-center.json  |       660 |  664    0.61% |  663    0.45%
Parse clang apache_builds.json  |       757 |  750   -0.92% |  751   -0.79%
Parse clang canada.json         |       613 |  378  -38.34% |  905   47.63%
Parse clang citm_catalog.json   |      1225 | 1196   -2.37% | 1234    0.73%
Parse clang github_events.json  |       800 |  793   -0.88% |  807    0.88%
Parse clang gsoc-2018.json      |       721 |  721    0.00% |  717   -0.55%
Parse clang instruments.json    |       674 |  653   -3.12% |  664   -1.48%
Parse clang marine_ik.json      |       532 |  400  -24.81% |  607   14.10%
Parse clang mesh.json           |       557 |  418  -24.96% |  708   27.11%
Parse clang mesh.pretty.json    |      1086 |  771  -29.01% | 1373   26.43%
Parse clang numbers.json        |       854 |  524  -38.64% | 1742  103.98%
Parse clang random.json         |       377 |  371   -1.59% |  372   -1.33%
Parse clang twitter.json        |       556 |  558    0.36% |  557    0.18%
Parse clang twitterescaped.json |       463 |  470    1.51% |  468    1.08%
Parse clang update-center.json  |       594 |  597    0.51% |  594    0.00%

With number parsing disabled (the none column), canada.json is +81% on GCC and +48% on clang, and numbers.json is +122% on GCC and +104% on clang.
