Skip to content
This repository has been archived by the owner on Mar 26, 2020. It is now read-only.

Partial parsing #88

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

canatella
Copy link

Here is an attempt at providing partial parsing support. It's orthogonal to the event parsing branch I proposed before so those changes are not included here.

There are still problems for me with this branch:

  • 7ed92d1: I'm coming from a C programming background, so I'm still not used to the C++ way of doing things. I'd like to modify the stacked value in place but for some reason it's not working so I pop them and push them back again, I suppose there is a better way.

  • Testing: I did some test locally but I did not modify the test.cpp, on purpose. It's a bit of a mess and I wanted to know how and what you'd want me to add as tests.

I also suppose you'll have other remarks and they'd be welcomed.

@smarx
Copy link

smarx commented Dec 29, 2016

Automated message from Dropbox CLA bot

@canatella, thanks for the pull request! It looks like you haven't yet signed the Dropbox CLA. Please sign it here.

Copy link
Contributor

@artwyman artwyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of code here without much context. I could really use a high-level summary of the design, and intended usage before diving in. Preferably that should include some documentation in the header file, as well as some examples in test.cpp. What does partial parsing do? What's the advantage vs. buffering up the string externally to this library and parsing it when it's complete?

Don't worry about keeping test.cpp highly structured. It's kinda a mess, but intended as a set of examples plus a sanity-check that they all compile and run without crashing. It's not really a thorough unit test suite.

I'm not sure where in the code is the point where you wanted to modify the stack, but you should be able to change the top element of a stack like this:
stack.top() = <new_value>;

@canatella
Copy link
Author

This PR is to go along with PR #83.

We develop mobile applications. We use json11 and djinni to connect to REST apis. We have a lot of data to transfert at first application startup on possibly bad connection (mobile data), and we want to be able to display the data ASAP. This PR allows parsing the json body without having to buffer the entire HTTP response body. As soon as there is data, you can feed the parser with it by calling consume and the parser will continue where it left. Used in conjunction with PR PR #83, it will trigger event as soon as any JSON value is available. With these two PR, we could then also make the parser drop already parsed code to limit memory usage when parsing large json strings, which is part of my use case. They could also allow parsing an io directly without having to buffer the entire string in memory.

From an architectural point of view, we need to be able to restart parsing from where we left at any point. For that,

  • we need to keep the full parser state up to date and self contained
  • we need to make the parsing loop use the state information.

At the api level, this implies that we expose a parser object with the aforementioned state information that the user can manage by itself, and a function to feed data to the parser.

The first commit deals with this api change, it creates a public parser class which hold the current parser implementation as private member and has a consume method.

The next commits are mostly changes needed to store the full parsing state, besides commit eb177d0 which uses the introduced state information to provide the depth limit check.

Finally, the last commit is the one making the parsing loop explicit.

I'll add some test that feed the parser with one character at a time and some documentation and examples in the header file.

@artwyman
Copy link
Contributor

artwyman commented Jan 4, 2017

Okay, I understand your intent, and how this fits together with your other PR. I'm a bit iffy on merging this feature on it's own since it feels somewhat half-baked without the other pieces you describe to make a useful system. I'd be okay with it as a half-way step to a final good state, though, barring any concerns about safety or performance I might have on deeper review.

I'll wait for your next update with docs and examples before doing the full review.

@canatella
Copy link
Author

I can continue the patch series with the freeing of already parsed string and the event parsing and the doc and example for everything if you prefer.

@artwyman
Copy link
Contributor

artwyman commented Jan 5, 2017

I could go either way. I see the value of bite-sized pieces for me to review, and to built/tested separately. But I also prefer the final public result is something worth using. Possible compromises might be to do multiple stages of review in the same PR (I'm not sure how well GItHub supports this flow) or to merge multiple PRs onto a public branch then merge into master when all the planned work is done. The branch might be the easiest option for us to collaborate, if you want to make this a multi-part project.

My overall philosophy on PR merges/reviews (and on the internal code-reviews I do at Dropbox): Strictly, each public merge should not be a step backwards. No merging code which is broken, messy, unperformant, etc. with promises to fix it later. Less strictly, I prefer if each public merge is a step forward, meaning no half-baked features which aren't really ready for use yet, but there can be exceptions to this. The point so ensure that users who pull for the first time don't get a broken/confusing intermediate state, and that the code isn't left in the bad state if an author gets busy and doesn't finish the next step.

For partial input parsing, we need to know when failure is caused by missing
data so that we can retry parsing later when we have more data.

This commit adds a need_data boolean flag alongs the failed one. It adds a stop
function to set that flag and propagate the failure. It replaces checks for
failure with a check on both failed and need_data flags. It also adds a eos
function to check for end of input. It also replaces the fail calls due to
end of input by stop calls.
As of now, the program stack is implicitly used to build the json object
tree. For partial parsing support, we cannot use the program stack as parsing
can be called at anytime. This commit adds a values stack and uses it to
transmit value back along the object tree. All parsing functions now return void
and instead pushes their result on the stack. The functions needing the result
of other parsing functions (mostly for arrays and objects) read values from the
stack and pop them.
As we now have an explicit stack, we don't need the depth variable argument to
count the recursion levels. We can simply use the value stack size instead.
For partial support, when continuing parsing, we need to restart in the same
state as we left. For that, we need to skip some actions if they were already
done. Left as is, the code would be littered with ifs so refactor the parse_json
function into multiple function, one per type of objects.
For partial parsing support, we need to remember what we were doing so that we
can continue parsing. This commit adds an enum with all our parsing states and a
states stack to store the current states.
For partial building, we need to temporary store the object being build as we
maybe interrupted. This commits change object and array parsing to use the
values stack has a temporary place for storing the object being build.
For parsing partial json, we need to be able to restart parsing from a valid
position in the input stream. This commit stores the current position when
switching state and restores it when more data is needed so that when parsing
again, the parser restarts at the right position.
This commit implements a consume method that takes a chunk of json, append it to
the current data and parses it.
@canatella
Copy link
Author

I added some documentation in the README.md

I added two tests where I used the already existing test results and compare them to the result of parsing the stream one char at a time. I also feeded a big (+/- 300) corpus of various valid json object to it from https://github.com/jdorfman/awesome-json-datasets without problems.

I added some other fixes detected by testing parsing json with comments enabled.

This patch set can be applied on his own has it does not change anything for current users: api and behaviour are the same except for the chunked parsing addition. So you could either apply it to master where it would get a bit of exposure, or apply it to another branch. I should be able to provide fixes in cases of trouble in the next weeks but I think I won't have the time to update the event parsing PR and the freeing of already parsed data quickly as I have more urgent stuff now.

Speaking of which, what would be the best way to free up the already parsed strings, and at which point would it be best to execute ? I could see the use for a circular buffer here but there are no implementation in std, and it might add too much complexity. Other solution I see is using a list of string, that way we can easily drop parsed strings, or replace the string at some point with a substring, but that would entail a lot of memory operations I think.

Copy link
Contributor

@artwyman artwyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went deeper on this review, but still haven't reviewed everything. Enough changes suggested for me to give you a chance to respond to the structural and stylistic ideas here before going further. I'll follow-up with some logistical discussion on the conversation thread too.

@@ -229,4 +229,26 @@ class JsonValue {
virtual ~JsonValue() {}
};

struct JsonParserPriv;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a private inner class inside JsonParser.

@@ -229,4 +229,26 @@ class JsonValue {
virtual ~JsonValue() {}
};

struct JsonParserPriv;

class JsonParser {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class and its methods need some documentation in comments.

~JsonParser();

Json json();
void consume(const std::string &in);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this return some sort of error status?

}

private:
std::string error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_ prefix for members for consistency with the rest of this library.


my_json = parser.json();
JSON11_TEST_ASSERT(parser.last_error().empty());
JSON11_TEST_ASSERT(my_json == json_comment);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see a negative test, which handles an error in the new parser.

// Check for another object
parser.consume_garbage();
if (!parser.failed)
if (!parser.failed && !parser.need_data) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be clearer to reverse this test and use a break, rather than having to test these two clauses twice (once in the while, once here).

Json result = parser.parse_json(0);
JsonParserPriv parser { in, err, strategy };
assert(parser.states.size() == 1);
parser.eof = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to add more functions to the JsonParserPriv object and make the members private, rather than than mixing direct manipulation of members with calling methods.


return result;
#ifndef NDEBUG
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere, I think you can just merge the if into the assert(s) so as to not to need the #ifdefs.

parser->eof = true;
parser->consume();
if (!parser->failed && !parser->need_data)
error.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does "consume" not do this as part of updating the state.

case VALUE_NULL:
parse_null();
break;
case VALUE_COMMENT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment isn't really a value, which makes this state name confusing. It's hard for me to tell from the code if you actually have limitations on where comments can appear (e.g. only where a JSON value would go) or if it's just the name which is odd.

@artwyman
Copy link
Contributor

Some high-level feedback after the deeper (but still not complete) code-review. It seems like your long-term goal is a good one, but the usability of this current step on its own isn't clear yet. It also seems that you've already diverged quite far from the original json11 code (it's not a simple 1:1 translation which is obviously correct, for instance), and will have gone far further down that path by the time you're done. In particular you're moving away from the "tiny library" descriptor which the README starts with.

By the time you're done I feel the library you'll be left with will be a newly-written library inspired by json11, so I wonder what the value is of calling it json11 at that point, vs. forking off and publishing your own thing under another name. I don't feel like I'll give much value as gatekeeper on your efforts, given I/Dropbox isn't motivated to use them. I want to encourage you to think about just going your own way and building the advanced parsing engine you want without worrying about what fits in with our needs.

One possible more limited integration would be if you might want to provide a more advanced parser which produces json11 objects as output, rather than replacing the original parser outright. In that case your parser could be a separate add-on library of its own, or maybe eventually a separate optional class/file in the json11 repo. If there are aspects of the json11 data which hold you back from doing so (like a missing interface for building the inner value types) I would be open to PRs which make json11 more open to replaceable parsers. I.e. make small changes to json11 to enable you and others to build on top of it freely, rather than making large changes to json11 to do exactly what you want.

@canatella
Copy link
Author

canatella commented Feb 1, 2017

Thank you for the review. I have a bit more time now but before going further, I want to say that if you don't want to pursue on this, that's no problem for me, the code is there and it's working for my use case. I wanted to share back the code but if you feel it's a burden for you, I can just keep it in my repository. I just don't want to end up going through all the review process to finally have the PR rejected.

So to make things clear, I'm willing to go forward with the review process if it's going to be accepted and from your previous comment it seems that it's more for you to say that.

Oh, and I do understand your point of view of course, I agree it's a lot of changes you mostly don't need, so no problem for me if you say we stop here ;)

@artwyman
Copy link
Contributor

artwyman commented Feb 2, 2017

I'd be interested to hear from the original author @j4cbo on his interest in such a submission. From my point of view, Dropbox doesn't have a need for this, so I'd have to take on the CR and validation of this as a side project, and if this is going beyond Dropbox territory into community/side-project territory I feel like @j4cbo has more ownership here than I do.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants