Partial parsing #88

canatella · 2016-12-29T20:09:42Z

Here is an attempt at providing partial parsing support. It's orthogonal to the event parsing branch I proposed before so those changes are not included here.

There are still problems for me with this branch:

7ed92d1: I'm coming from a C programming background, so I'm still not used to the C++ way of doing things. I'd like to modify the stacked values in place, but for some reason that isn't working, so I pop them and push them back again. I suppose there is a better way.
Testing: I ran some tests locally but deliberately did not modify test.cpp. It's a bit of a mess and I wanted to know how and what you'd want me to add as tests.

I also suppose you'll have other remarks, and they'd be welcome.

smarx · 2016-12-29T20:09:44Z

Automated message from Dropbox CLA bot

@canatella, thanks for the pull request! It looks like you haven't yet signed the Dropbox CLA. Please sign it here.

artwyman

There's a lot of code here without much context. I could really use a high-level summary of the design, and intended usage before diving in. Preferably that should include some documentation in the header file, as well as some examples in test.cpp. What does partial parsing do? What's the advantage vs. buffering up the string externally to this library and parsing it when it's complete?

Don't worry about keeping test.cpp highly structured. It's kinda a mess, but intended as a set of examples plus a sanity-check that they all compile and run without crashing. It's not really a thorough unit test suite.

I'm not sure where in the code is the point where you wanted to modify the stack, but you should be able to change the top element of a stack like this:
stack.top() = <new_value>;
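
A minimal sketch of the in-place edit, assuming the values live in a std::stack. The helper names are hypothetical, and the second function shows one possible reason an in-place edit appears not to work (the code in 7ed92d1 is not shown in this thread, so that is only a guess):

```cpp
#include <cassert>
#include <stack>
#include <string>

// Hypothetical helper: append a suffix to whatever is on top of the stack,
// without popping it first.
void append_to_top(std::stack<std::string> &values, const std::string &suffix) {
    assert(!values.empty());  // top() on an empty stack is undefined behaviour
    values.top() += suffix;   // top() returns a reference, so this edits in place
}

// A common pitfall: copying the top element and editing the copy.
void append_to_copy(std::stack<std::string> &values, const std::string &suffix) {
    assert(!values.empty());
    std::string copy = values.top();  // 'copy' is a separate object
    copy += suffix;                   // the element on the stack is unchanged
}
```

Binding the result of top() to a reference (`std::string &s = values.top();`) behaves like the first function, while binding it to a plain value behaves like the second.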

canatella · 2016-12-30T09:05:46Z

This PR is to go along with PR #83.

We develop mobile applications. We use json11 and djinni to connect to REST APIs. We have a lot of data to transfer at first application startup, possibly over a bad connection (mobile data), and we want to be able to display the data ASAP. This PR allows parsing the JSON body without having to buffer the entire HTTP response body. As soon as there is data, you can feed it to the parser by calling consume, and the parser will continue where it left off. Used in conjunction with PR #83, it will trigger events as soon as any JSON value is available. With these two PRs, we could then also make the parser drop already parsed input to limit memory usage when parsing large JSON strings, which is part of my use case. They could also allow parsing an I/O stream directly without having to buffer the entire string in memory.

From an architectural point of view, we need to be able to restart parsing from where we left off at any point. For that:

we need to keep the full parser state up to date and self-contained;
we need to make the parsing loop use that state information.

At the API level, this implies that we expose a parser object holding the aforementioned state information, which users can manage themselves, and a function to feed data to the parser.

The first commit deals with this API change. It creates a public parser class which holds the current parser implementation as a private member and has a consume method.

The next commits are mostly changes needed to store the full parsing state, besides commit eb177d0 which uses the introduced state information to provide the depth limit check.

Finally, the last commit is the one making the parsing loop explicit.

I'll add some tests that feed the parser one character at a time, and some documentation and examples in the header file.

artwyman · 2017-01-04T02:38:38Z

Okay, I understand your intent, and how this fits together with your other PR. I'm a bit iffy on merging this feature on its own since it feels somewhat half-baked without the other pieces you describe to make a useful system. I'd be okay with it as a half-way step to a final good state, though, barring any concerns about safety or performance I might have on deeper review.

I'll wait for your next update with docs and examples before doing the full review.

canatella · 2017-01-04T10:15:33Z

I can continue the patch series with freeing the already parsed string, the event parsing, and the docs and examples for everything, if you prefer.

artwyman · 2017-01-05T00:36:33Z

I could go either way. I see the value of bite-sized pieces for me to review, and to be built/tested separately. But I also prefer the final public result to be something worth using. Possible compromises might be to do multiple stages of review in the same PR (I'm not sure how well GitHub supports this flow) or to merge multiple PRs onto a public branch, then merge into master when all the planned work is done. The branch might be the easiest option for us to collaborate, if you want to make this a multi-part project.

My overall philosophy on PR merges/reviews (and on the internal code-reviews I do at Dropbox): Strictly, each public merge should not be a step backwards. No merging code which is broken, messy, unperformant, etc. with promises to fix it later. Less strictly, I prefer if each public merge is a step forward, meaning no half-baked features which aren't really ready for use yet, but there can be exceptions to this. The point is to ensure that users who pull for the first time don't get a broken/confusing intermediate state, and that the code isn't left in a bad state if an author gets busy and doesn't finish the next step.

For partial input parsing, we need to know when failure is caused by missing data so that we can retry parsing later when we have more data. This commit adds a need_data boolean flag alongside the failed one. It adds a stop function to set that flag and propagate the failure. It replaces checks for failure with a check on both the failed and need_data flags. It also adds an eos function to check for end of input, and replaces the fail calls due to end of input with stop calls.
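
The distinction between "failed" and "needs more data" can be sketched with a toy literal matcher. This is not the code from the commit: the `Status` enum, `expect_literal`, and the `eof` parameter are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical three-way outcome, mirroring the failed / need_data flags.
enum class Status { Ok, Failed, NeedData };

// Try to match `literal` (e.g. "true") at position `pos` of `input`.
// `eof` says whether the caller has promised that no more input will arrive.
Status expect_literal(const std::string &input, std::size_t pos,
                      const std::string &literal, bool eof) {
    assert(pos <= input.size());
    for (std::size_t i = 0; i < literal.size(); ++i) {
        if (pos + i >= input.size()) {
            // Ran out of input mid-literal: only an error at end of stream.
            return eof ? Status::Failed : Status::NeedData;
        }
        if (input[pos + i] != literal[i]) {
            return Status::Failed;  // genuine syntax error; more data cannot help
        }
    }
    return Status::Ok;
}
```

With this shape, a truncated `"tr"` is retried later when more data arrives, while `"trux"` is rejected immediately.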

As of now, the program stack is implicitly used to build the JSON object tree. For partial parsing support, we cannot rely on the program stack, as parsing can be called at any time. This commit adds a values stack and uses it to pass values back up the object tree. All parsing functions now return void and instead push their result on the stack. The functions needing the result of other parsing functions (mostly for arrays and objects) read values from the stack and pop them.

As we now have an explicit stack, we don't need the depth variable argument to count the recursion levels. We can simply use the value stack size instead.
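
A minimal sketch of the idea, using a toy grammar rather than JSON: nested arrays of single digits are summed with an explicit values stack, and the depth limit is checked against the stack size. The names `sum_nested` and `kMaxDepth` are hypothetical, and commas are accepted without being validated.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <string>

// Hypothetical limit, standing in for the parser's maximum nesting depth.
constexpr std::size_t kMaxDepth = 200;

// Sum the digits in input such as "[1,[2,3],4]" without recursion.
// Returns true and stores the total in `out` on success.
bool sum_nested(const std::string &input, long &out,
                std::size_t max_depth = kMaxDepth) {
    std::stack<long> values;  // one partial sum per array currently open
    bool done = false;
    for (char c : input) {
        if (done) return false;  // trailing characters after the top-level array
        if (c == '[') {
            if (values.size() >= max_depth) return false;  // depth == stack size
            values.push(0);
        } else if (c >= '0' && c <= '9') {
            if (values.empty()) return false;
            values.top() += c - '0';
        } else if (c == ',') {
            if (values.empty()) return false;
        } else if (c == ']') {
            if (values.empty()) return false;
            long finished = values.top();  // result of the array being closed
            values.pop();
            if (values.empty()) {
                out = finished;
                done = true;
            } else {
                values.top() += finished;  // hand the result to the parent
            }
        } else {
            return false;  // character outside the toy grammar
        }
    }
    return done;
}
```

No separate depth counter is threaded through the calls; the number of open arrays is always `values.size()`.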

For partial support, when continuing parsing, we need to restart in the same state as we left. For that, we need to skip some actions if they were already done. Left as is, the code would be littered with ifs, so this commit refactors the parse_json function into multiple functions, one per type of object.

For partial parsing support, we need to remember what we were doing so that we can continue parsing. This commit adds an enum with all our parsing states and a states stack to store the current states.
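
As an illustration only, here is a toy resumable parser built around a state enum and a states stack. The grammar is deliberately tiny (a value is either `n` or a bracketed, comma-separated list of values, with no whitespace), and `ToyParser` and its state names are hypothetical, not the ones used in the commit.

```cpp
#include <cassert>
#include <stack>
#include <string>

class ToyParser {
public:
    ToyParser() { m_states.push(State::Value); }

    // Feed the next chunk; returns false once a syntax error has been seen.
    bool consume(const std::string &chunk) {
        for (char c : chunk) {
            if (m_failed) break;
            step(c);
        }
        return !m_failed;
    }

    // True once exactly one complete value has been read.
    bool complete() const { return !m_failed && m_states.empty(); }

private:
    enum class State { Value, ArrayFirst, ArrayNext };

    void step(char c) {
        assert(!m_failed);
        if (m_states.empty()) { m_failed = true; return; }  // input after the value
        switch (m_states.top()) {
        case State::Value:
            m_states.pop();
            if (c == '[') m_states.push(State::ArrayFirst);
            else if (c != 'n') m_failed = true;
            break;
        case State::ArrayFirst:
            if (c == ']') { m_states.pop(); break; }
            m_states.top() = State::ArrayNext;  // edit the top state in place
            m_states.push(State::Value);
            step(c);  // re-examine c as the start of the first element
            break;
        case State::ArrayNext:
            if (c == ',') m_states.push(State::Value);
            else if (c == ']') m_states.pop();
            else m_failed = true;
            break;
        }
    }

    std::stack<State> m_states;  // what remains to be parsed, innermost on top
    bool m_failed = false;
};
```

Because everything the parser needs to remember lives in `m_states`, a call such as `consume("[n,[")` can be followed later by `consume("n]]")` and parsing picks up exactly where it stopped.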

For partial building, we need to temporarily store the object being built, as we may be interrupted. This commit changes object and array parsing to use the values stack as a temporary place for storing the object being built.

For parsing partial json, we need to be able to restart parsing from a valid position in the input stream. This commit stores the current position when switching state and restores it when more data is needed so that when parsing again, the parser restarts at the right position.
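
The save-and-restore idea can be sketched on a toy reader of comma-terminated integers. `NumberReader` and its members are hypothetical; the point is only that the restart position advances when a token is complete and is left untouched when the data runs out mid-token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy incremental reader for comma-terminated integers, e.g. "12,345,".
class NumberReader {
public:
    // Append a chunk and parse as many complete numbers as possible.
    void consume(const std::string &chunk) {
        if (m_failed) return;
        m_buffer += chunk;
        for (;;) {
            assert(m_committed <= m_buffer.size());
            std::size_t i = m_committed;  // tentative position for this attempt
            long value = 0;
            while (i < m_buffer.size() &&
                   std::isdigit(static_cast<unsigned char>(m_buffer[i]))) {
                value = value * 10 + (m_buffer[i] - '0');
                ++i;
            }
            if (i == m_buffer.size()) {
                // Out of data mid-token: m_committed is left untouched, so the
                // next call restarts from the beginning of this number.
                return;
            }
            if (m_buffer[i] != ',' || i == m_committed) {
                m_failed = true;  // unexpected character or empty number
                return;
            }
            m_numbers.push_back(value);
            m_committed = i + 1;  // token finished: advance the restart point
        }
    }

    const std::vector<long> &numbers() const { return m_numbers; }
    bool failed() const { return m_failed; }

private:
    std::string m_buffer;
    std::size_t m_committed = 0;  // first byte not yet part of a finished token
    std::vector<long> m_numbers;
    bool m_failed = false;
};
```

Feeding `"12,3"` and then `"45,"` yields 12 and 345: the lone `3` is re-read from the saved position once the rest of the number arrives.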

This commit implements a consume method that takes a chunk of JSON, appends it to the current data and parses it.

canatella · 2017-01-06T15:07:25Z

I added some documentation in the README.md

I added two tests that take the existing test results and compare them to the result of parsing the stream one char at a time. I also fed it a big corpus (+/- 300) of various valid JSON objects from https://github.com/jdorfman/awesome-json-datasets without problems.

I added some other fixes for problems detected while testing JSON parsing with comments enabled.

This patch set can be applied on its own, as it does not change anything for current users: API and behaviour are the same except for the chunked parsing addition. So you could either apply it to master, where it would get a bit of exposure, or apply it to another branch. I should be able to provide fixes in case of trouble in the coming weeks, but I don't think I'll have the time to update the event parsing PR and the freeing of already parsed data quickly, as I have more urgent stuff now.

Speaking of which, what would be the best way to free up the already parsed strings, and at which point would it be best to do so? I could see the use for a circular buffer here, but there is no implementation in std, and it might add too much complexity. Another solution I see is using a list of strings, so that we can easily drop parsed strings, or replacing the string at some point with a substring, but that would entail a lot of memory operations, I think.
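
One way to picture the substring option while keeping the copying bounded is to erase the parsed prefix only once it dominates the buffer. This is a sketch under that assumption, not something proposed in the PR; `compact` and the half-buffer threshold are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Erase the already-parsed prefix once it makes up at least half of the
// buffer, so the cost of moving the remaining bytes is amortised.
// `pos` is the index of the first unparsed byte. Returns true if compacted.
bool compact(std::string &buffer, std::size_t &pos) {
    assert(pos <= buffer.size());
    if (pos == 0 || pos < buffer.size() / 2) return false;
    buffer.erase(0, pos);  // shifts the unparsed tail to the front
    pos = 0;               // positions are now relative to the new start
    return true;
}
```

Any other saved offsets into the buffer (for example a restart position) would need the same rebasing whenever the prefix is dropped.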

artwyman

I went deeper on this review, but still haven't reviewed everything. I've suggested enough changes that I want to give you a chance to respond to the structural and stylistic ideas here before going further. I'll follow up with some logistical discussion on the conversation thread too.

artwyman · 2017-01-13T23:53:53Z

json11.hpp

@@ -229,4 +229,26 @@ class JsonValue {
    virtual ~JsonValue() {}
 };

+struct JsonParserPriv;


This can be a private inner class inside JsonParser.

artwyman · 2017-01-13T23:54:32Z

json11.hpp

@@ -229,4 +229,26 @@ class JsonValue {
    virtual ~JsonValue() {}
 };

+struct JsonParserPriv;
+
+class JsonParser {


This class and its methods need some documentation in comments.

artwyman · 2017-01-13T23:54:56Z

json11.hpp

+    ~JsonParser();
+
+    Json json();
+    void consume(const std::string &in);


Why doesn't this return some sort of error status?

artwyman · 2017-01-13T23:55:57Z

json11.hpp

+    }
+
+private:
+    std::string error;


m_ prefix for members for consistency with the rest of this library.

artwyman · 2017-01-13T23:56:46Z

test.cpp

+
+    my_json = parser.json();
+    JSON11_TEST_ASSERT(parser.last_error().empty());
+    JSON11_TEST_ASSERT(my_json == json_comment);
 }


I'd like to see a negative test, which handles an error in the new parser.

artwyman · 2017-01-14T00:56:04Z

json11.cpp

        // Check for another object
        parser.consume_garbage();
-        if (!parser.failed)
+        if (!parser.failed && !parser.need_data) {


It might be clearer to reverse this test and use a break, rather than having to test these two clauses twice (once in the while, once here).
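
The restructuring can be illustrated on a toy loop. The `Flags` struct and `collect` function below are hypothetical stand-ins, not the parser's code; an empty item plays the role of consume_garbage() hitting the end of the input.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flags, standing in for parser.failed and parser.need_data.
struct Flags {
    bool failed = false;
    bool need_data = false;
    bool should_stop() const { return failed || need_data; }
};

// Toy loop: collect items until one is empty (treated as "need more data").
std::vector<std::string> collect(const std::vector<std::string> &items,
                                 Flags &flags) {
    std::vector<std::string> out;
    for (const std::string &item : items) {
        if (item.empty()) flags.need_data = true;  // stand-in for running out of input
        if (flags.should_stop()) break;            // reversed test: bail out once, early
        out.push_back(item);                       // the normal path is no longer nested
    }
    return out;
}
```

The combined condition is written in one place, and the body of the loop reads top to bottom without an extra level of indentation.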

artwyman · 2017-01-14T00:57:50Z

json11.cpp

-    Json result = parser.parse_json(0);
+    JsonParserPriv parser { in, err, strategy };
+    assert(parser.states.size() == 1);
+    parser.eof = true;


It would be better to add more functions to the JsonParserPriv object and make the members private, rather than mixing direct manipulation of members with calling methods.

artwyman · 2017-01-14T00:58:59Z

json11.cpp


-    return result;
+#ifndef NDEBUG


Here and elsewhere, I think you can just merge the if into the assert(s) so as not to need the #ifdefs.
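
A sketch of what merging the condition into the assert could look like. The actual condition being asserted is not visible in this excerpt, so the `ParserFlags` struct and the `open_states == 0` check are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the parser flags and the size of its state stack.
struct ParserFlags {
    bool failed;
    bool need_data;
    std::size_t open_states;
};

void check_finished(const ParserFlags &p) {
    // "if (!failed && !need_data) assert(B)" is the implication A => B,
    // which can be written as assert(!A || B):
    assert(p.failed || p.need_data || p.open_states == 0);
    (void)p;  // keeps release builds (NDEBUG) free of unused-parameter warnings
}
```

Since assert already expands to nothing when NDEBUG is defined, the surrounding #ifndef block is no longer needed.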

artwyman · 2017-01-14T01:00:23Z

json11.cpp

+    parser->eof = true;
+    parser->consume();
+    if (!parser->failed && !parser->need_data)
+        error.clear();


Why does "consume" not do this as part of updating the state?

artwyman · 2017-01-14T01:14:54Z

json11.cpp

+            case VALUE_NULL:
+                parse_null();
+                break;
+            case VALUE_COMMENT:


A comment isn't really a value, which makes this state name confusing. It's hard for me to tell from the code if you actually have limitations on where comments can appear (e.g. only where a JSON value would go) or if it's just the name which is odd.

artwyman · 2017-01-14T01:25:32Z

Some high-level feedback after the deeper (but still not complete) code-review. It seems like your long-term goal is a good one, but the usability of this current step on its own isn't clear yet. It also seems that you've already diverged quite far from the original json11 code (it's not a simple 1:1 translation which is obviously correct, for instance), and will have gone far further down that path by the time you're done. In particular you're moving away from the "tiny library" descriptor which the README starts with.

By the time you're done I feel the library you'll be left with will be a newly-written library inspired by json11, so I wonder what the value is of calling it json11 at that point, vs. forking off and publishing your own thing under another name. I don't feel like I'll give much value as gatekeeper on your efforts, given I/Dropbox isn't motivated to use them. I want to encourage you to think about just going your own way and building the advanced parsing engine you want without worrying about what fits in with our needs.

One possible more limited integration would be if you might want to provide a more advanced parser which produces json11 objects as output, rather than replacing the original parser outright. In that case your parser could be a separate add-on library of its own, or maybe eventually a separate optional class/file in the json11 repo. If there are aspects of the json11 data which hold you back from doing so (like a missing interface for building the inner value types) I would be open to PRs which make json11 more open to replaceable parsers. I.e. make small changes to json11 to enable you and others to build on top of it freely, rather than making large changes to json11 to do exactly what you want.

canatella · 2017-02-01T11:05:57Z

Thank you for the review. I have a bit more time now, but before going further, I want to say that if you don't want to pursue this, that's no problem for me: the code is there and it's working for my use case. I wanted to share the code back, but if you feel it's a burden for you, I can just keep it in my repository. I just don't want to end up going through the whole review process only to have the PR rejected in the end.

So to make things clear, I'm willing to go forward with the review process if it's going to be accepted and from your previous comment it seems that it's more for you to say that.

Oh, and I do understand your point of view of course, I agree it's a lot of changes you mostly don't need, so no problem for me if you say we stop here ;)

artwyman · 2017-02-02T00:10:35Z

I'd be interested to hear from the original author @j4cbo on his interest in such a submission. From my point of view, Dropbox doesn't have a need for this, so I'd have to take on the CR and validation of this as a side project, and if this is going beyond Dropbox territory into community/side-project territory I feel like @j4cbo has more ownership here than I do.

artwyman added the enhancement label Dec 30, 2016

artwyman suggested changes Dec 30, 2016


canatella added 6 commits January 6, 2017 15:21

Add partial parsing api.

166371b

Use value stack for checking maximum depth.

1771606

As we now have an explicit stack, we don't need the depth variable argument to count the recursion levels. We can simply use the value stack size instead.

Add explicit state management.

4e0e31c

For partial parsing support, we need to remember what we were doing so that we can continue parsing. This commit adds an enum with all our parsing states and a states stack to store the current states.

canatella force-pushed the partial-parsing branch from 8711fc1 to 4e4eeca Compare January 6, 2017 14:35

canatella added 6 commits January 6, 2017 15:50

Store builded object on the stack.

c9d9436

For partial building, we need to temporarily store the object being built, as we may be interrupted. This commit changes object and array parsing to use the values stack as a temporary place for storing the object being built.

Implement partial parsing support.

ab45c17

This commit implements a consume method that takes a chunk of JSON, appends it to the current data and parses it.

Add reset method to be able to reuse parser.

e6b1df6

Add chunk parsing tests.

5ea3854

Add documentation for chunk parsing.

7d97cfc

canatella force-pushed the partial-parsing branch from 4e4eeca to 7d97cfc Compare January 6, 2017 14:51

artwyman suggested changes Jan 14, 2017

