Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use locale independ way to convert string to double #51

Closed
wants to merge 4 commits into from

Conversation

Dushistov
Copy link
Contributor

@Dushistov Dushistov commented Sep 20, 2019

Related to couchbase/couchbase-lite-core#849 , should fix issue.
Plus this change should speedup json parsing, because of according to dropbox/json11#38 (comment) when I replace in dropbox json library call strtod with google/double-conversation I see speedup in benchmark from 51ms to 40ms.

Though there is isdigit in vendor/jsonsl/jsonsl.c . isdigit locale dependent function also.
But this is related to other repository. I should note that I expect that replacing isidigit should also speedup things, because of most compilers can not inline isdigit, so check if (ch >= '0' && ch <= '9') is faster.

@borrrden
Copy link
Member

I hesitate to pull in an entire new dependency just for this one line. If JSON requires the dot to be used, then stringstreams could be used with an imbued locale in order to ignore the global one. What do you think of that approach?

@Dushistov
Copy link
Contributor Author

Dushistov commented Sep 20, 2019

@borrrden

then stringstreams
What do you think of that approach?

I heard that it 3-4 times slower in compare with strtod.
And looks like this parsing code is used a lot via c4db_encodeJSON,
which as I understand may be invoked every time of document saving.
So this why I choose one of the most faster string to double library around.

@Dushistov
Copy link
Contributor Author

Dushistov commented Sep 20, 2019

And I find out that this is not final fix,
I got "17,42342" in json after document has been retrieved from database.
I suspect that this is because of:

            case kFloatTag: {
                if (_byte[0] & 0x8)
                    sprintf(str, "%.16g", asDouble());
                else
                    sprintf(str, "%.6g", asFloat());
                break;
            }

yeah, sprintf depend on locale also.

This prevents invalid JSON output if locale define different then "."
for float/double conversation
setlocale used in similart way as Qt/QApplication does
@Dushistov
Copy link
Contributor Author

Dushistov commented Sep 20, 2019

@borrrden

I fix also sprintf case.

Not that after reading code I can say that I did not only fix locale issue,
but potentially also fix information loss.

The code before this patch used %.6g and %.16g, that works for most cases,
but not for all.
For example:

char buf[32];
float origin_f = 1.0f / 3.0f;
sprintf(s, "%.6g", x);
float new_f = strtod(buf, nullptr);

after this new_f != origin_f, so during save to database, load from database you can
loose bits of information in your number, for this particular case you need %.8g.
Google's ToShortest function take care about this case. So after that patch store/load from database should not cause information lost.

@borrrden
Copy link
Member

borrrden commented Sep 21, 2019

c4db_encodeJson is actually not used anywhere in the library, so no need to optimize it (it is a convenience function that Couchbase Lite does not need, but may be useful to others and we use it in our tests too). I have been doing some experimentation and I found a locale specific sprintf and strtod and even used it before I realized that it's only available on certain platforms (BSD and Windows....). I'm leaning toward an implementation that can somehow push and pop the locale setting though, rather than a full on 3rd party library. The loss of information from float to string is an acceptable one and our advice has been "if you need that level of precision then use strings instead of numbers."

@borrrden
Copy link
Member

Let me know what you think of #53

@Dushistov
Copy link
Contributor Author

c4db_encodeJson is actually not used anywhere in the library,

I find out about this function from reading something like native_c4database.cc from couchbase-lite-java , and I used it a lot in in my Rust wrapper around couchbase-lite-core. Though have no idea that this is not frequently used function, but I don't see alternative for saving JSON directly.

The loss of information from float to string is an acceptable one and our advice has been "if you need > that level of precision then use strings instead of numbers."

But why? According to google developers 17 characters enough to encode any number (see kBase10MaximalLength constant in double-conversation code).
Why loose any information, if it is possible to not loose anything?

And it is really inconvenient, you see JSON encoding/decoding function is generated from language types, obviously I can mark any field with float/double to use special function that encode/decode them to stirng, but it is a lot of manual work, and what if I forget to mark field as such?
The more robust solution will be patch fleece every time and do not care about such issue in the main application.

Plus I want to share this data with somebody who use other language (frontend), it will be inconvenient to document every field like - it is really number, but I save it as string,
it is really, really strange.

@Dushistov
Copy link
Contributor Author

@borrrden

Let me know what you think of #53

Access to thread-local variables in shared-library is not cheap.
And you call it every sprintf/strtod call (I mean uselocale). It may potentially slow down all program.

@borrrden
Copy link
Member

Well there is no alternative for saving JSON directly because that's not what we do in Couchbase Lite. We turn domain objects (C# classes, etc) into Fleece by using FLEncoder. Thus we now have hundreds of lines of 3rd party code added for something that doesn't help the Couchbase Lite distributed product. We based the 6 and 16 values off of information about how many accurate bytes there are in each type (their precision).

I don't remember hearing about the Rust wrapper before, that's pretty neat!

@snej What are your thoughts on this?

@snej
Copy link
Contributor

snej commented Sep 23, 2019

there is no alternative for saving JSON directly because that's not what we do in Couchbase Lite.

The replicator encodes documents into JSON when pushing. So JSON-generation issues are definitely important to Couchbase Lite.

@snej
Copy link
Contributor

snej commented Sep 23, 2019

[multiply edited]

What about using strtod_l and sprintf_l, which explicitly take a locale_t parameter?

The (macOS) man page for newlocale says:

In order to create a C locale_t value, use newlocale(LC_ALL_MASK, NULL, NULL).

We can just put that in a static variable for use in Fleece.

@snej
Copy link
Contributor

snej commented Sep 23, 2019

Found this in a post on the lua mailing list:

  • strtod_l is available on linux (at least both glibc and musl) if _GNU_SOURCE is defined
  • _strtod_l is available in MSVC since VS2005
  • strtod_l is available on OSX since Darwin 8.
  • strtod_l is available on FreeBSD since 9.1

I'm working on a fix using that.

@snej
Copy link
Contributor

snej commented Sep 23, 2019

after this new_f != origin_f, so during save to database, load from database you can
loose bits of information in your number

Comparing floating point numbers for equality before and after some transformation is always a problem, because of roundoff. I've had to explain to testers many times that they can't write assertions that compare floats for equality.

This is especially bad with 32-bit floats. I don't think anyone who cares about accuracy is going to choose those over 64-bit, when you get fewer than 7 significant [decimal] digits.

I came up with the %.6g and %.16g formats after some experimentation, to get the best accuracy without getting a lot of badly-rounded strings for short decimal values. For example, 0.9f formatted to 8 figures is 0.89999998. We got several bug reports about this before I tweaked the formatting! I just did some tests, formatting all multiples of 0.001f from 0 to 1, and 7 figures seems to work well; I'm not sure why I went down to 6 originally, but I must have found some cases that formatted badly.

@Dushistov
Copy link
Contributor Author

Dushistov commented Sep 23, 2019

@snej

Comparing floating point numbers for equality before and after some transformation is always a problem,

In this particular case I suppose that comparing is the only and right way to test equality.

You see one thing that you calculate something, obviously float and double give you some round error. Another thing is serialization / deserialization.

For example you have JSON document with two properties A and B,
A has float point number type.
You change B, save and reload JSON document from database.
Repeat it several times. What if after several such transformations the bits representation of float/double was changed in the way that error would be accumulated?

If you store and load the same bits representation you can be sure that you have no problem at all. I am not sure that after several cycles for save/load .6g and .16g the error will be not accumulated for some specific numbers.

@borrrden
Copy link
Member

Closing in favor of #56 (Thanks for getting us to address the issue though!)

@borrrden borrrden closed this Sep 25, 2019
@snej
Copy link
Contributor

snej commented Sep 25, 2019

Though there is isdigit in vendor/jsonsl/jsonsl.c . isdigit locale dependent function also.

Actually it isn't; I just checked the C reference docs: "isdigit and isxdigit are the only standard narrow character classification functions that are not affected by the currently installed C locale."

Although you may be referring to this: "...although some implementations (e.g. Microsoft in 1252 codepage) may classify additional single-byte characters as digits."

I checked <ctype.h> in the macOS/iOS SDKs, and the implementation isn't locale-dependent but it looks less efficient than it could be — it's using table lookup.

@snej
Copy link
Contributor

snej commented Sep 26, 2019

I expect that replacing isdigit [in jsonsl] should also speedup things

I tried this yesterday [on macOS] and the speedup was very small. I guess Clang's (Apple's?) isdigit is inlined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants