Switch to goyacc-generated parser from participle library #37

Merged
merged 70 commits into from
Jul 1, 2020

itchyny
Owner

@itchyny itchyny commented Jun 28, 2020

I'm working on rewriting the query parser. I have been using the participle library, but I will switch to a goyacc-based parser.

Motivation

With the participle library, the struct tree becomes too deep, because we have to create a struct for each precedence level. In the current gojq implementation, I embed the builtin functions to avoid parsing them on every command execution. The deep structure increases the executable size and the initial allocation time. By switching to a goyacc-based parser, we can reduce the nesting depth of the structs for binary operators. A smaller executable would be nice for gojq library users too.
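To make the struct-depth point concrete, here is a minimal sketch. All type and function names are hypothetical; they are not the actual gojq or participle definitions. Layout A mimics a one-struct-per-precedence-level tree, and layout B uses a single node type for every binary operator, which is the kind of value a yacc action can build directly.

```go
package main

import "fmt"

// Layout A (hypothetical): one struct per precedence level.
type PipeExpr struct {
	Left  *AddExpr
	Right *PipeExpr // set only for "a | b"
}

type AddExpr struct {
	Left  *MulExpr
	Op    string
	Right *AddExpr // set only for "a + b"
}

type MulExpr struct {
	Left  *Term
	Op    string
	Right *MulExpr // set only for "a * b"
}

type Term struct{ Number float64 }

// Layout B (hypothetical): one node type for every binary operator.
type Expr struct {
	Op          string // empty for a leaf
	Left, Right *Expr
	Number      float64
}

// The nodes methods count how many structs a tree allocates.
func (e *PipeExpr) nodes() int {
	if e == nil {
		return 0
	}
	return 1 + e.Left.nodes() + e.Right.nodes()
}

func (e *AddExpr) nodes() int {
	if e == nil {
		return 0
	}
	return 1 + e.Left.nodes() + e.Right.nodes()
}

func (e *MulExpr) nodes() int {
	if e == nil {
		return 0
	}
	return 1 + e.Left.nodes() + e.Right.nodes()
}

func (e *Term) nodes() int {
	if e == nil {
		return 0
	}
	return 1
}

func (e *Expr) nodes() int {
	if e == nil {
		return 0
	}
	return 1 + e.Left.nodes() + e.Right.nodes()
}

// nestedOnePlusTwo builds "1 + 2" in layout A: every operand is wrapped
// once per precedence level even though only "+" is used.
func nestedOnePlusTwo() *PipeExpr {
	return &PipeExpr{Left: &AddExpr{
		Left:  &MulExpr{Left: &Term{Number: 1}},
		Op:    "+",
		Right: &AddExpr{Left: &MulExpr{Left: &Term{Number: 2}}},
	}}
}

// flatOnePlusTwo builds "1 + 2" in layout B.
func flatOnePlusTwo() *Expr {
	return &Expr{Op: "+", Left: &Expr{Number: 1}, Right: &Expr{Number: 2}}
}

func main() {
	fmt.Println("layout A nodes:", nestedOnePlusTwo().nodes())
	fmt.Println("layout B nodes:", flatOnePlusTwo().nodes())
}
```

In this toy grammar with three precedence levels, `1 + 2` takes 7 structs in layout A and 3 in layout B. The gap widens with every additional precedence level, which matters when a large embedded query such as the builtins is allocated at startup.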

| | participle-based parser (29b689a) | goyacc-based parser |
|---|---|---|
| executable size | 6762616 bytes | 5129736 bytes (-24.14%) |
| execution time for gojq -n . | 10.6 ms ±0.9 ms | 6.1 ms ±1.1 ms (1.75 times faster) |

The participle library makes use of reflection. The author seems to take great care over performance, but it can't be faster than a goyacc-based parser. I want to make gojq as fast as jq.

| | participle-based parser (29b689a) | goyacc-based parser |
|---|---|---|
| gojq -n -L . 'include "builtin"; .' | 44.4 ms ±1.9 ms | 10.3 ms ±1.5 ms (4.32 times faster) |

The jq command uses a yacc-based parser. Sometimes it is difficult to fix corner cases of parsing problems. I want to improve parser compatibility by switching to goyacc.

Also, it is too hard to parse string interpolation using the participle library. I worked around this problem by generating a regular expression for string literals up to a fixed nesting depth. But I want to fix it fundamentally by using the same method as jq.
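As an illustration of why a hand-written scanner handles this better than a fixed-depth regular expression, here is a minimal sketch. It is neither jq's nor gojq's lexer; the function names (`splitInterpolation`, `matchParen`, `skipString`) are hypothetical and the handling of escapes is simplified. It splits the body of a string literal into literal parts and interpolated expression sources, using a parenthesis depth counter and recursion into nested string literals, so the nesting depth is unbounded.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitInterpolation takes the body of a string literal (without the
// surrounding quotes) and returns the literal parts and the source text
// of each \( ... ) interpolation. Other escapes are kept as written.
func splitInterpolation(s string) ([]string, []string, error) {
	var parts, exprs []string
	var lit strings.Builder
	for i := 0; i < len(s); i++ {
		switch {
		case s[i] == '\\' && i+1 < len(s) && s[i+1] == '(':
			end, err := matchParen(s, i+2)
			if err != nil {
				return nil, nil, err
			}
			parts = append(parts, lit.String())
			lit.Reset()
			exprs = append(exprs, s[i+2:end])
			i = end // continue after the closing ')'
		case s[i] == '\\' && i+1 < len(s):
			lit.WriteByte(s[i])
			lit.WriteByte(s[i+1])
			i++ // the escaped character is consumed too
		default:
			lit.WriteByte(s[i])
		}
	}
	parts = append(parts, lit.String())
	return parts, exprs, nil
}

// matchParen returns the index of the ')' closing the parenthesis that
// was opened just before s[start].
func matchParen(s string, start int) (int, error) {
	depth := 1
	for i := start; i < len(s); i++ {
		switch s[i] {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				return i, nil
			}
		case '"':
			end, err := skipString(s, i+1)
			if err != nil {
				return 0, err
			}
			i = end
		}
	}
	return 0, errors.New("unterminated interpolation")
}

// skipString returns the index of the '"' closing a nested string
// literal whose body starts at s[start].
func skipString(s string, start int) (int, error) {
	for i := start; i < len(s); i++ {
		switch s[i] {
		case '"':
			return i, nil
		case '\\':
			if i+1 < len(s) && s[i+1] == '(' {
				end, err := matchParen(s, i+2)
				if err != nil {
					return 0, err
				}
				i = end
			} else {
				i++ // skip the escaped character
			}
		}
	}
	return 0, errors.New("unterminated string")
}

func main() {
	parts, exprs, err := splitInterpolation(`x=\(1 + (2 * 3)) and \("n: \(.n)")!`)
	fmt.Printf("%q %q %v\n", parts, exprs, err)
}
```

For the input in `main`, the literal parts are `x=`, ` and `, and `!`, and the expressions are `1 + (2 * 3)` and `"n: \(.n)"`. The second expression contains its own interpolation, which a regular expression generated for a fixed depth would eventually fail to match.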

Breaking changes

This parser switch breaks compatibility for library users.

  • Query struct has changed a lot.
  • Module struct is now removed.
    • The type of the methods of ModuleLoader is changed.
    • ParseModule is removed in favor of Parse.

Drawbacks

The error messages of participle are great: they report the expected token on error. This is hard (though not impossible) to implement in a yacc-based parser, but maybe that's ok.

The problem with using goyacc is that it does not seem to be actively developed. I'm not sure, but the tool may be deprecated in the future. But I believe someone would create a yacc-style parser generator then.

Thanks to participle library

I would like to express my gratitude to the author of the participle library. The idea of this library is great, and it has brought me great productivity so far. It makes it very easy to change the parsing rules or add optional tokens. It was always fun to resolve a syntax problem by just moving struct fields (like #9). I still think the library is great, and I recommend it to beginners implementing interpreters.

@itchyny itchyny force-pushed the goyacc-parser branch 10 times, most recently from 59c21fb to 4c33630 Compare June 30, 2020 11:10
@itchyny itchyny force-pushed the goyacc-parser branch 2 times, most recently from 9311958 to eee30c9 Compare June 30, 2020 16:52
@itchyny itchyny force-pushed the goyacc-parser branch 3 times, most recently from cf837f2 to 1c470cd Compare July 1, 2020 11:41
@itchyny
Owner Author

itchyny commented Jul 1, 2020

With this change, include "x" has become around 8.5 to 9.0 times faster. In this comparison, the duration includes the time for both the parsing and code generation phases. Switching to a goyacc-based parser not only improves the parsing phase, but also makes the Query struct much smaller (a shallower syntax tree), and this improves code generation performance.
I found that jq seems to have some code whose cost grows much faster than the code size. It seems (with more profiling data and fitting) that the cost grows cubically with the code size. I don't think it's fair to compare performance at this unrealistic code size (I think in most situations jq is run with a query smaller than 1KiB), but I believe this is an important finding and we might be able to improve the performance of jq (EDIT: I noticed that block_bind_self is too slow).

I also noticed that jq's implementation for parsing string interpolation is slightly inefficient. It has a stack that records the parenthesis state in order to find the closing parenthesis of an interpolation. But I noticed (at 4e21bd9) that we don't need to distinguish an expression's closing parenthesis from a string interpolation's closing parenthesis. Maybe I will submit a patch to jq soon.
Anyway, I've learned a lot from this project. Writing a lexer by hand is so much fun. When starting to implement a parser, we should check whether the grammar has string interpolation. We should design structs that best represent the data structure, not ones shaped for parsing. The executable size is reduced by about 24.2%. Now gojq begins its new phase!

@itchyny itchyny merged commit a7b5f11 into master Jul 1, 2020
@itchyny itchyny deleted the goyacc-parser branch July 1, 2020 15:54