Incremental Parsing support #414

dberlin · 2019-04-05T22:43:41Z

Hey folks,
This pull request adds incremental parsing support to antlr4ts.
i'm explicitly not seeking to get merged in this exact form, i'm more curious whether y'all think it is worth the time/energy to try to get it merged into ANTLR (either the reference or optimized runtime).

I wrote the code for the typescript runtime first because i'm using it with vscode. I am happy to back port it to java.

As you can see, it is deliberately structured to be simple and small, with no extra dependencies outside of the existing runtime. It does not require modifying any existing part of the runtime.

I've tested it on fairly complex grammars (the included tests test it on the JavaLR grammar among others).

I added a doc (IncrementalParser.md) that explains how it works at a high level, as well as the outstanding issues. The tests currently test basic add/remove/etc on a simple grammar as well as the JavaLR grammar, and verify the right parts of the parse tree did what they were supposed to.

My use case is a little weird - i am parsing GCode files, which can get quite large (hundreds of megabytes). While ANTLR is quite fast, the parse time on a 6.5 meg gcode file is already 3-6 seconds (depending on whether parse trees are built). With the incremental parser, adding a new line is O(10ms).

We don't do incremental lexing but it's not difficult for a lot of languages to do by hand (particularly if you only care about text being right).

As mentioned, happy to do the work for TS/Java, and happy to push on it, just trying to understand if i'm the only one in the world who cares :)

Thanks for any thoughts/feedback!

dberlin · 2019-04-21T22:51:33Z

I have actually back ported this to ANTLR4 java and submitted it to the main repo.
I will keep this up to date with that (and if it gets turned down, i'll close this)

BurtHarris · 2019-07-02T03:20:43Z

@dberlin this sounds interesting, I'll have a look.

dberlin · 2019-07-02T03:25:33Z

SGTM. Incremental parsing is in good shape.

I got slammed at work and then had another child, so i haven't had time to finish the incremental lexing.

dberlin · 2019-07-02T03:32:22Z

(the incremental lexing info can be found here:
antlr/antlr4#2534
The TL;DR is that it works but i did not finish changed token list generation.
So incremental parsing and lexing work fine individually, they just don't automatically integrate)

AlexGustafsson · 2020-04-08T11:32:08Z

A bit late to the party, but this is super cool. Are there any plans on finishing this up and merging it?

BurtHarris · 2020-04-10T18:23:44Z

@dberlin, can you help me understand if incremental parsing can help with a fundamental mismatch between the Java stream model (which uses blocking I/O) and the JavaScript stream model, where no blocking for I/O is permitted?

In JavaScript, rather than pulling data from a stream, data arrives in chunks, which are delivered by a callback (continuation passing style), or more recently using Promises. Promises have led to language extensions such as async/await, where the code can look very much as if it supported blocking I/O, but it's an illusion.
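To make the push model described above concrete, here is a minimal TypeScript sketch, assuming nothing about antlr4ts itself. The names `LineSplitter` and `splitChunks` are hypothetical and exist only for illustration: the consumer never asks for data, it is handed chunks of arbitrary size and has to buffer partial lines itself.

```typescript
// Hypothetical push-style consumer (not part of antlr4ts or this PR).
// The data source calls push() whenever a chunk arrives; nothing blocks.
type LineHandler = (line: string) => void;

class LineSplitter {
  // Text received so far that does not yet end in a newline.
  private pending = "";

  constructor(private readonly onLine: LineHandler) {}

  // Called by the source for every chunk, whatever its boundaries are.
  push(chunk: string): void {
    this.pending += chunk;
    let newline = this.pending.indexOf("\n");
    while (newline !== -1) {
      this.onLine(this.pending.slice(0, newline));
      this.pending = this.pending.slice(newline + 1);
      newline = this.pending.indexOf("\n");
    }
  }

  // Called once when the source signals end of input.
  end(): void {
    if (this.pending.length > 0) {
      this.onLine(this.pending);
      this.pending = "";
    }
  }
}

// Convenience wrapper that feeds a fixed list of chunks through the splitter.
function splitChunks(chunks: string[]): string[] {
  const lines: string[] = [];
  const splitter = new LineSplitter((line) => {
    lines.push(line);
  });
  for (const chunk of chunks) {
    splitter.push(chunk);
  }
  splitter.end();
  return lines;
}
```

A parser that expects a complete token stream cannot be driven this way directly, which is the mismatch the question is about.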

sharwell · 2020-04-10T19:03:55Z

@dberlin can you help me understand the parts of this feature which currently require changes to the core library? I'm hoping to find a way that it can be used without needing to change the core code generator or runtime.

dberlin · 2020-04-10T21:45:58Z

Let me do my best to go in order.

@AlexGustafsson I won't have a chance to finish this up anymore (had another kid since then, and moving across the country)
@sharwell It is not possible to do it without adding support to the core runtime in various places (i.e., adding Incremental* classes), or at least, no way to do it occurs to me. I can express what it's doing, and did my best to do that already in IncrementalParsing.md, happy to help understand any specifics.

I'm not as familiar with the ANTLR core runtime as y'all, i expect.

It may be possible to do this without modifying the code generator, but i'm not sure how to do it.
What would have to happen would be to move the guard check out of the code generator, and into the core runtime.
This would require further modifying IncrementalParser to try to make that work.
I tried it at the start and it was non-obvious to figure out all the changes and places, so i gave up in favor of small obvious changes to the code generator.
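For readers who have not opened the PR, the kind of guard check being discussed could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from the PR: `Interval`, `OldContext`, and `guardRule` are hypothetical names, and the lookup key (rule index plus starting token index) is an assumed simplification.

```typescript
// Hypothetical shapes for illustration; none of these come from the PR.
interface Interval {
  start: number; // first token index, inclusive
  stop: number;  // last token index, inclusive
}

interface OldContext {
  ruleIndex: number;
  tokens: Interval; // tokens the rule covered in the previous parse
}

function overlaps(a: Interval, b: Interval): boolean {
  return a.start <= b.stop && b.start <= a.stop;
}

// Runs at the start of a rule. Returns the context from the previous parse
// when it can be reused, or undefined when the rule has to be parsed again.
// Assumption: a real implementation would also have to account for lookahead
// tokens examined beyond the end of the rule; this sketch ignores that.
function guardRule(
  previous: Record<string, OldContext>,
  changed: Interval[],
  ruleIndex: number,
  startToken: number,
): OldContext | undefined {
  const candidate: OldContext | undefined = previous[`${ruleIndex}:${startToken}`];
  if (candidate === undefined) {
    return undefined;
  }
  const touched = changed.some((c) => overlaps(c, candidate.tokens));
  return touched ? undefined : candidate;
}
```

Where this check lives (emitted into each generated rule method, or inside the runtime) is exactly the design question raised above.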

Beyond that, a lot of the complexity is because of the intermediate use of a token stream.
You could simplify all of this a lot more if you required incremental lexing, and drove the incremental lexing from the incremental parser as it walked.
You could then, for example, get rid of IncrementalTokenStream and put most of that into the IncrementalParser. It would also be a lot faster.
What happens right now is that ANTLR parsers expect token streams to be complete, and seeks/skips by calling nextToken, which also blocks.

We waste a bunch of time and energy returning/tracking and processing unchanged tokens.

If instead the incrementalparser required an incrementallexer, it could be made to only ever request tokens for changed areas.

This would also cleanup the whole interface you see here.
It would also let the incremental parser tell the lexer what next tokens would be acceptable, so that it knows whether it needs to relex further or not.

(All of this unfortunately requires random access to the underlying thing being lexed, but, on the plus side, can be made non-blocking as a result)
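As a rough illustration of "only request tokens for changed areas", the sketch below splits an old token list around a single text edit. It is a simplified sketch under assumptions, not the algorithm from the PR or from tree-sitter: `Tok`, `TextEdit`, `ReusePlan`, and `planReuse` are hypothetical names, and instead of tracking per-token lookahead it conservatively relexes any token that touches the edit.

```typescript
// Hypothetical types for illustration only.
interface Tok {
  type: string;
  start: number; // first character offset, inclusive
  stop: number;  // last character offset, inclusive
}

interface TextEdit {
  offset: number;   // where the edit begins in the old text
  removed: number;  // number of characters deleted
  inserted: number; // number of characters inserted
}

interface ReusePlan {
  before: Tok[];               // reusable tokens that end before the edit
  after: Tok[];                // reusable tokens after the edit, offsets shifted
  relexFrom: number;           // offset in the new text where relexing starts
  relexTo: number | undefined; // offset in the new text where old tokens resume
}

function planReuse(oldTokens: Tok[], edit: TextEdit): ReusePlan {
  const editEnd = edit.offset + edit.removed; // exclusive, in old coordinates
  const delta = edit.inserted - edit.removed;

  // Conservative assumption: a token adjacent to the edit may merge with the
  // new text, so only tokens strictly separated from the edit are reused.
  const before = oldTokens.filter((t) => t.stop + 1 < edit.offset);
  const after = oldTokens
    .filter((t) => t.start > editEnd)
    .map((t) => ({ type: t.type, start: t.start + delta, stop: t.stop + delta }));

  const relexFrom = before.length > 0 ? before[before.length - 1].stop + 1 : 0;
  const relexTo = after.length > 0 ? after[0].start : undefined;
  return { before, after, relexFrom, relexTo };
}
```

Only the characters between `relexFrom` and `relexTo` need to be read again, which is why this approach needs random access to the underlying text.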

The incremental lexer changes i posted have an IncrementalLexer.md file that goes into this a bit.

If you want to see another implementation that uses a similar strategy to this, to see if you can figure out a way, take a look at tree sitter. I find it somewhat impenetrable, even knowing exactly how it works, but ...

It's also for LR, but both this and that are based on the same paper. The incremental lexing is driven by the parser in tree-sitter, but is otherwise the identical algorithm to the incremental lexer i posted.

BurtHarris · 2020-04-10T23:51:09Z

@dberlin, Is this designed to deal with incremental parsing as in a portion of the text might have changed (like in an IDE doing syntax checking), or as in the input stream continues to deliver characters beyond those previously available?

dberlin · 2020-04-10T23:55:39Z

The former.

BurtHarris · 2020-04-11T00:02:55Z

@sharwell, is there a pre-existing method in the code generator to alter what base class the generated class(es) are derived from?

dberlin · 2020-04-11T00:04:04Z

Let me differentiate capability from speed.
It is capable of handling your case now.

Your case is equivalent to the case where all the text is added at the end, but it would be slow for the reasons the .md files cover (antlr's current way of doing this forces dealing with unchanged tokens).

It could be made to deal with this case very well if you did the "incremental parser drives incremental lexer" way described in those files.

In fact, it would be optimal and should be no slower than doing it all at once.
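The equivalence between streaming input and "text added at the end" can be written down in a few lines. This is a sketch only; `GrowingDocument` and `AppendEdit` are hypothetical names and not part of the PR.

```typescript
// Hypothetical edit record: a pure insertion at the current end of the text.
interface AppendEdit {
  offset: number;   // always the previous length of the document
  removed: number;  // always 0 for streamed input
  inserted: string; // the chunk that just arrived
}

class GrowingDocument {
  private text = "";

  // Records an arriving chunk as an edit at the end of the document, which is
  // the form an edit-driven incremental parser would consume.
  append(chunk: string): AppendEdit {
    const edit: AppendEdit = { offset: this.text.length, removed: 0, inserted: chunk };
    this.text += chunk;
    return edit;
  }

  get length(): number {
    return this.text.length;
  }
}
```

Each chunk then looks, to the incremental machinery, like one more edit whose changed region is the tail of the file.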

dberlin · 2020-04-11T00:05:37Z

(and for example, tree sitter, with the same algorithms, is used to parse character at a time)

dberlin · 2020-04-11T00:20:45Z

Yes there is. I started with that approach, and if I remember correctly the context creation in each rule is side effecting or something that made it very hard to move the guard rule check from where it is. I probably still have the code around to do it that way somewhere if you want to see if you can make it work

BurtHarris · 2020-04-11T02:13:06Z

@dberlin: Returning to a higher level discussion, language support in IDEs like vscode often has two different lexers for the same language:

1. An incremental syntax highlighting lexer which operates on a line-by-line basis, and whose only job is to classify tokens for color display purposes. These often function tightly integrated with the editor's buffer functionality.
2. A full lexer and parser with semantic checking, which runs in a separate process (at least in vscode) to analyze for "problems" and generate the red squiggles. This process is a language service for the language.

Sub-second responsiveness in syntax highlighting is important, but highlighting only calls for an incremental lexer, no parsing, at least as I've seen it (in Visual Studio and Visual Studio Code.) In fact, the current trend seems to be to use TextMate grammars for highlighting purposes, as vscode supports.

Thus it's use-case two (where the lexer and parser are integrated) where this sort of incremental parsing begins to become interesting. If the typical parse time you get for a multi-megabyte g code file is measured in seconds, that seems pretty acceptable for use-case 2. So I think while this sort of incremental lexing and parsing may be interesting, there doesn't seem to be a pressing need for it.

I see this as different from streaming lexing and parsing, which I thought could be related to incremental parsing. The goal in streaming is managing back-pressure so that the memory requirements for a simple command-line tool don't expand excessively.

BurtHarris · 2020-04-11T02:22:33Z

P.S. because of the clever way ANTLR's ALL(*) works, the recognizer may perform much better after it's been warmed up on similar input. Did the 6.5 meg g code file in 3-6 seconds include the warm-up overhead?

dberlin · 2020-04-11T05:02:51Z

Hey Burt,

(The ANTLR timings are for second+ parse. First parse is a lot slower due to the need to build out the state cache, etc, as you say.)

I'm fairly familiar with this style, i've worked on a lot of IDEs in the past 25 years. My day job is actually owning IDEs, programming languages, and a few other things for Google :)

A few things:

1. Lexing is actually not good enough anymore for a lot of languages, and in fact, there is a large and growing chorus of complaints against vscode to move to tree-sitter as a result (which does sub-second responsiveness with parsing). See the long discussion starting here: microsoft/vscode#50140, for example (there are other issues here). See https://marketplace.visualstudio.com/items?itemName=georgewfraser.vscode-tree-sitter for an example comparing syntax highlighting. Note that the tree-sitter version is not just better, it's actually *faster* than the textmate grammar engine in vscode by far (even though vscode is doing delayed update and tree-sitter is doing reparse on every keystroke). Truthfully, at least in the IDEs i worked on, lexing never really *was* good enough, it's just that outside of Pan/Ensemble/Harmonia research projects, nobody bothered to make parsing fast enough (I actually worked in some incremental parsing IDEs in the past). Other IDEs than vscode, such as Atom, already use tree-sitter as an incremental parsing system. See https://github.blog/2018-10-31-atoms-new-parsing-system/
2. Atom and other IDEs that use tree-sitter (again, same incremental algorithms here) do reparsing on a per-keystroke basis, and have no issue even with multi-megabyte files. I have a tree-sitter grammar for the same file, and the parsing/lexing time on changes is zero.

So overall i'd say "I believe what you say has been true in the past, i definitely would not agree it is true anymore".

In fact, i'd put a significant amount of money that tree-sitter will take over the IDE space (This is an easy bet to make because it already is outside of a few mainstays :P)

But, in the end, what y'all choose to do is up to you. Feel free to close this if you aren't interested. I wrote it mostly because i've been using ANTLR since the PCCTS days, and it was fun. I already have tree-sitter grammars for my work that I can use in IDEs (including vscode), etc.

BurtHarris · 2020-04-14T23:33:58Z

Thanks @dberlin, you make a good case. I certainly am a bit out-of-date, I retired from MSFT a number of years back, and IDEs were never my focus.

I think introducing incremental parsing to ANTLR thru Java antlr/antlr4#2527 is the way to go. It looks like that PR needs some minor rebasing, and response to a threading concern.

@sharwell (and of course @parrt) are really the ANTLR4 / ALL(*) experts. I just tackled antlr4ts to get some hands-on experience with Typescript, and I didn't care for the existing JavaScript target last time I tried it. Anyway, take my input with a grain of salt, I am no expert.

I am a little surprised Google's interested in a GCode IDE, or is that a side project? Is there anything you can point me at about that?

dberlin added 8 commits March 31, 2019 21:09

Initial tool work

e48dd26

Introduce basic incremental parser

d17eeb5

Make recursion context work and greatly simplify by using ParseListenerInterface

52bdbd4

ANTLR looks at the same tokens a lot, don't do useless interval work when the position/lookahead has not changed

2a38dc3

Forgot the package.json changes

144cf99

Update with a lot more comments

1465a87

Disable recursive context use since the contexts can't meaningfully be distinguished

c690d09

Re-enable recursion contexts now that we can rely on depth being a distinguisher

6e5215f

dberlin force-pushed the incremental branch from f11396a to bacd450 Compare April 7, 2019 04:37

Cleanups made while backporting to Java

de17e6b

dberlin force-pushed the incremental branch from bacd450 to f4025ed Compare April 7, 2019 20:09

Clean up tslint errors

566ecec

dberlin force-pushed the incremental branch from f4025ed to 566ecec Compare April 7, 2019 20:14

dberlin mentioned this pull request Apr 7, 2019

Add incremental parsing support antlr/antlr4#2527

Open

Fix changed rule detection during IncrementalParserData walk

a202d10

BurtHarris marked this pull request as ready for review April 10, 2020 18:04

BurtHarris marked this pull request as draft April 10, 2020 20:59

BurtHarris mentioned this pull request Apr 11, 2020

continuous-integration/travis-ci appears stuck #448

Closed

BurtHarris added this to To do in New Features Apr 18, 2020

BurtHarris added enhancement pull request labels Apr 18, 2020
