Switch to goyacc-generated parser from participle library #37

itchyny · 2020-06-28T12:42:18Z

I'm working on rewriting the query parser. I have been using the participle library, but I will switch to a goyacc-based parser.

Motivation

Using the participle library makes the struct tree too deep, because we have to create a struct for each precedence level. In the current gojq implementation, I embed the builtin functions to skip parsing them on every command execution. The deep structure increases the executable size and the initial allocation time. By switching to a goyacc-based parser, we can reduce the nesting depth of the structs for binary operators. Reducing the executable size would be nice for gojq library users too.

	participle-based parser (`29b689a`)	goyacc-based parser
executable size	6762616 bytes	5129736 bytes (-24.14%)
execution time for `gojq -n .`	10.6 ms ±0.9 ms	6.1 ms ±1.1 ms (1.75 times faster)

The participle library makes use of reflection. The author seems to take great care over performance, but it can't be faster than a goyacc-based parser. I want to make gojq as fast as jq.

	participle-based parser (`29b689a`)	goyacc-based parser
`gojq -n -L . 'include "builtin"; .'`	44.4 ms ±1.9 ms	10.3 ms ±1.5 ms (4.32 times faster)

The jq command uses a yacc-based parser. Sometimes it is difficult to fix corner cases in parsing. I want to improve parser compatibility by switching to goyacc.

Also, it is too hard to parse string interpolation using the participle library. I worked around this by generating a regular expression for string literals up to a fixed nesting depth, but I want to fix it fundamentally by using the same method as jq.

Breaking changes

Switching the parser breaks compatibility for library users.

- The Query struct is changed a lot.
- The Module struct is now removed.
  - The types of the ModuleLoader methods are changed.
  - ParseModule is removed in favor of Parse.

Drawbacks

The error messages of participle are great: they report the expected token on error. This is hard (though not impossible) to implement in a yacc-based parser, but maybe that's OK.

The problem with using goyacc is that it does not seem to be actively developed. I'm not sure, but the tool may be deprecated in the future. In that case, I believe someone would create a yacc-style parser generator.

Thanks to participle library

I would like to express my gratitude to the author of the participle library. The idea of this library is great. It has brought me great productivity so far. It makes it very easy to change the parsing rules or add optional tokens. It was always fun to resolve a syntax problem by just moving struct fields (like #9). I still think the library is great and recommend it to beginners implementing interpreters.

itchyny · 2020-07-01T15:54:26Z

With this improvement, the performance of include "x" has improved by around 8.5 to 9.0 times. In this comparison, the duration includes the time for both the parsing and code generation phases. Switching to the goyacc-based parser not only speeds up the parsing phase but also makes the Query struct much smaller (a shallower syntax tree), which improves code generation performance.
I found that jq seems to have some code whose cost grows much faster than the code size. It seems (with more profiling data and curve fitting) that it grows cubically with the code size. I don't think it's fair to compare performance with such an unrealistic size of code (I think in most situations jq is run with a query of less than 1 KiB), but I believe this is an important finding and we might be able to improve the performance of jq (EDIT: I noticed that block_bind_self is too slow).

I also noticed that jq's implementation for parsing string interpolation is slightly inefficient. It keeps a stack that records the parenthesis state in order to find the closing parenthesis of an interpolation. But I noticed (at 4e21bd9) that we don't need to distinguish an expression's closing parenthesis from a string interpolation's closing parenthesis. Maybe I will submit a patch to jq soon.
Anyway, I've learned a lot from this project. Writing a lexer by hand is so much fun. When starting to implement a parser, we should check whether the grammar has string interpolation. We should design structs that best represent the data structure, not ones that suit parsing. The executable size is reduced by about 24.2%. Now gojq begins its new phase!

itchyny force-pushed the goyacc-parser branch 10 times, most recently from 59c21fb to 4c33630 Compare June 30, 2020 11:10

itchyny added 18 commits June 30, 2020 20:23

start to implement goyacc based parser

e17e6b8

parse recurse query

3b1be64

parse null, true, false and function call without arguments

4c18681

parse query pipe

445964e

skip white characters

0b5a8cd

report parse error with the byte offset and the last token

6130f2c

parse number

ef9c930

parse query in parenthesis

9084381

parse array

7233007

parse object indexing

af5b66c

parse unary operators

bc69ffb

parse function call arguments

bb1d440

parse object

02691b4

parse if then elif else expression

c466e70

parse try catch

5944f9e

parse variable

d6302c2

parse array indexing

1f14eba

report error token in UTF-8 rune

bf8323b

itchyny force-pushed the goyacc-parser branch 2 times, most recently from 9311958 to eee30c9 Compare June 30, 2020 16:52

itchyny added 19 commits July 1, 2020 12:51

flatten structs to Query

0c8bb4c

remove numberPatternStr constant

f664c2f

add a test for comment only query

e1f16d1

deprecate Module struct in favor of Query

a2408bc

remove struct tags for participle library

4caa433

move Label struct to Term

b9f2055

fix number pattern for tonumber function

5d62cb3

use pointer for parsing PatternObject

51c4fa2

use interface{} for yacc symbol type value

49152eb

change parser output file

d88d51a

use tokIdentVariable for function arguments

c0d7bf2

use pointer slices instead of struct slices

9a620f1

remove SuffixIndex in favor of Index

c783106

resolve shift/reduce conflicts

57ff6a5

resolve reduce/reduce conflicts

73e8001

implement prepend functions for performance improvement

48bc6f9

reimplement string parsing and string interpolation in elegant fashion

e2674ef

drop leading dot from Name of Index

5eb286a

stop calculating the parenthesis depth for string interpolation

4e21bd9

itchyny force-pushed the goyacc-parser branch 3 times, most recently from cf837f2 to 1c470cd Compare July 1, 2020 11:41

unquote string in the lexer

22ac346

itchyny force-pushed the goyacc-parser branch from 1c470cd to 22ac346 Compare July 1, 2020 12:47

improve Stringer for the query structs

d1e93e3

itchyny force-pushed the goyacc-parser branch from d2ce507 to d1e93e3 Compare July 1, 2020 13:28

include the generated parser code

dd56c1d

itchyny merged commit a7b5f11 into master Jul 1, 2020

itchyny deleted the goyacc-parser branch July 1, 2020 15:54
