Incremental Parsing support #414

Draft
wants to merge 11 commits into master
65 changes: 65 additions & 0 deletions IncrementalParsing.md
@@ -0,0 +1,65 @@
### Incremental parsing

#### Basics

We'll start by explaining how incremental parsing works for LL, then how we store that data. We are not going to talk about incremental _lexing_.

Let's start with LL(1), ignoring semantic predicates and other complications. Fundamentally, the problem of incremental parsing is one of knowing what can change about how a given parser rule processes the tokens (and the resulting parse tree) given a set of new/deleted/changed tokens. For LL(1), this turns out to be very easy. Since LL(1) can only look ahead one token, the only token changes that can matter to a given parser rule (and its output in the parse tree) are changes to the tokens the rule looked at last time, plus one token forward. If no tokens have changed in that [startToken, stopToken+1] bound, the rule cannot be affected (assuming it gets run). The referenced paper explains this in detail and shows how to make it work for LR parsers. Terence also explains variants of the above in a few GitHub issues where people have asked about incremental parsing.

ANTLR already tracks the token bounds of each rule in the parse tree (startIndex/stopIndex). Thus, for LL(1) you don't even need extra information to do incremental parsing; you could simply use what already exists.
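The [startToken, stopToken+1] check described above can be sketched as a pure function over a rule's previous token bounds (the names here are illustrative, not the PR's API):

```typescript
interface RuleBounds {
  startIndex: number; // index of the first token the rule consumed last parse
  stopIndex: number;  // index of the last token the rule consumed last parse
}

// For LL(1), a rule can only be affected if some changed token index falls
// inside [startIndex, stopIndex + 1]: the tokens it consumed plus one token
// of lookahead.
function ruleAffected(bounds: RuleBounds, changedTokenIndices: number[]): boolean {
  return changedTokenIndices.some(
    (i) => i >= bounds.startIndex && i <= bounds.stopIndex + 1,
  );
}
```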

#### Making it work for LL(1)

So how do we effect this incremental parsing for LL(1) in practice?

For our purposes, we need the list of token changes and the previous parse tree. We guard each parser rule with a check as to whether any of the changed tokens fell within the bounds of the rule (including possible lookahead) during the last parse. If so, we re-run the rule and take its output. If not, we reuse the context the parse tree has from last time (later we'll cover fixing up the token data) and seek the token stream to the stopIndex the rule had last time. This happens all the way down the rule tree as we parse top-down.
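The guard-and-seek flow above can be sketched as follows. `reuseOrRerun`, `rerun`, and `seekTo` are illustrative stand-ins, not the PR's API (the real work happens in `guardRule` in `IncrementalParser`):

```typescript
interface PrevCtx {
  startIndex: number;
  stopIndex: number;
}

// If any changed token falls in the rule's previous bounds (plus one token
// of LL(1) lookahead), rerun the rule; otherwise reuse the old context and
// seek the stream past the tokens the rule consumed last time.
function reuseOrRerun(
  prev: PrevCtx | undefined,
  changedTokens: number[],
  rerun: () => PrevCtx,
  seekTo: (index: number) => void,
): PrevCtx {
  const affected =
    !prev ||
    changedTokens.some((i) => i >= prev.startIndex && i <= prev.stopIndex + 1);
  if (affected) {
    return rerun();
  }
  seekTo(prev.stopIndex);
  return prev;
}
```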

#### Making it work for LL(k)

Making the above work for LL(k) only requires changing the constant we add to the bounds: since k is still a fixed number, add k instead of 1.

#### Making it work for LL(\*)

LL(\*) unfortunately adds a little trickiness, because the lookahead can be unbounded. To make this work correctly, we need to know how far the parser _actually did_ look the last time we ran it. To account for this, we adjust how the token stream works a little. Thankfully, ANTLR is well modularized, and all lookahead/lookbehind goes through the token stream via a well-defined interface. So we create an IncrementalTokenStream class and keep the information we need there: a stack of min/max token bounds[1]. When the parser enters a rule, it pushes the current token index as the min/max onto the min/max stack. The token stream updates the min/max bounds of the top of the stack whenever lookahead/lookbehind is called. When the parser exits a rule, it pops the min/max stack and sets the min/max information on the rule context. If there is a parent rule context, it unions the child interval into the parent (so that the parent ends up with a token range spanning the entire set of children). This accounts for the _actual_ lookahead or lookbehind performed during a parse.

[1] You can track this more exactly, but it is highly unlikely to ever be worth it. The main thing this affects is changes to hidden tokens, which will cause reparsing even though the parser can't see them.
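The min/max stack mechanics can be sketched as a small standalone class. This is a simplification: the PR's `IncrementalTokenStream` exposes `pushMinMax`/`popMinMax`, and the parent union actually happens on rule contexts rather than inside the stream, but the bookkeeping is the same idea.

```typescript
type Interval = { min: number; max: number };

// Simplified stand-in for the min/max tracking described above.
class MinMaxStack {
  private stack: Interval[] = [];

  // On rule entry: the current token index is both min and max.
  push(tokenIndex: number): void {
    this.stack.push({ min: tokenIndex, max: tokenIndex });
  }

  // Called whenever the parser looks at a token (LT/LA); widens the
  // interval on top of the stack to cover the index actually examined.
  sawToken(tokenIndex: number): void {
    const top = this.stack[this.stack.length - 1];
    if (top) {
      top.min = Math.min(top.min, tokenIndex);
      top.max = Math.max(top.max, tokenIndex);
    }
  }

  // On rule exit: pop the interval and union it into the parent's, so the
  // parent ends up spanning everything its children looked at.
  pop(): Interval {
    const interval = this.stack.pop();
    if (!interval) {
      throw new Error("pop on empty min/max stack");
    }
    const parent = this.stack[this.stack.length - 1];
    if (parent) {
      parent.min = Math.min(parent.min, interval.min);
      parent.max = Math.max(parent.max, interval.max);
    }
    return interval;
  }
}
```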

#### Adaptive parsing, SLL, etc

None of these change anything because they also go through the proper lookahead/lookbehind interfaces. At worst, they look at too much context and we reparse a little more than we should.

#### Predicates

Predicates that do lookahead or lookbehind are covered by the LL(\*) method with no additional work.
Beyond that, the bounds are hopefully fairly obvious: predicates that are idempotent and don't depend on context forward of a given rule's lookahead work fine.
Others cannot be supported (and their failure can't easily be detected).

#### Actions, parse listeners, etc

Parse listeners attached directly to Parsers will not see rules that are skipped. This is fixable (but unclear whether it is worth it). Actions that occur during skipped rules will not occur.
Once the tree is generated, it is no different from any other tree.

#### Tree fixup

ANTLR tracks start/end position, line info, and source stream info in tokens, so when parse tree pieces are reused, all of that may be wrong because they point at old tokens. The text in the parse tree will still be right (by definition; otherwise the incremental parser is broken). Currently, we pass over the tree and replace old tokens with the new ones, because updating the old tokens' offsets/source/input stream turns out to be quite difficult (ANTLR is designed for tokens to be immutable). The downside is that we have to retrieve the new tokens from the new lexer.
Tree fixup is actually the most expensive part of producing the incremental parser data right now, and for those who only care about the text being correct, it is a waste of time.
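The fixup pass can be sketched as a recursive walk that repoints each terminal at the corresponding token from the new lexer run. This is a deliberate simplification with illustrative types (`Tok`, `TreeNode`): the real fixup also has to account for how the edit shifted token indices between the old and new streams.

```typescript
interface Tok {
  tokenIndex: number;
  text: string;
}

interface TreeNode {
  token?: Tok; // present on terminal nodes
  children: TreeNode[];
}

// Walk the reused tree and replace each old token with the new lexer's
// token, assuming tokenIndex already reflects the node's position in the
// new stream.
function fixupTokens(node: TreeNode, newTokens: Tok[]): void {
  if (node.token) {
    node.token = newTokens[node.token.tokenIndex];
  }
  for (const child of node.children) {
    fixupTokens(child, newTokens);
  }
}
```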

#### Outstanding issues

- The (rule, startIndex) map stuff can be avoided if we really want (though it's
tricky and involves trying to walk the old parse tree as we build the new one).
- The way the incremental grammar option is parsed/used in the stg file should
obviously be moved to antlr4 core.
- There is code that could be cleaned up if we included an IntervalMap data structure
(or at least a NonOverlappingIntervalList; IntervalSet does not do what we need). To avoid adding dependencies, I didn't do this, but it will likely be worth it in the future.
- We currently eagerly fix up the old parse tree in IncrementalParserData, etc. We
may want to be lazier and just do it in the parser when the context gets reused instead.
- We use the parse listener interface as an easy way to ensure we see entry/exit
events at the right time. This turned out to be easier than handling
recursion/left factoring by overriding the relevant parser interface pieces.
- Top-level recursion contexts are now reused, but we won't reuse individual recursion contexts yet.


#### References

"Efficient and Flexible Incremental Parsing" by Tim Wagner and Susan Graham
7 changes: 4 additions & 3 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "antlr4ts-root",
- "version": "0.5.0-dev",
+ "version": "0.5.1-dev",
"description": "Root project for ANTLR 4 runtime for Typescript",
"private": true,
"main": "index.js",
@@ -11,9 +11,10 @@
"buildtool": "cd tool && npm link",
"unlinktool": "cd tool && npm unlink",
"clean": "npm run unlink && git clean -idx",
- "antlr4ts": "npm run antlr4ts-runtime-xpath && npm run antlr4ts-test-runtime && npm run antlr4ts-test-labels && npm run antlr4ts-test-pattern && npm run antlr4ts-test-rewriter && npm run antlr4ts-test-xpath && npm run antlr4ts-benchmark",
+ "antlr4ts": "npm run antlr4ts-runtime-xpath && npm run antlr4ts-test-runtime && npm run antlr4ts-test-labels && npm run antlr4ts-test-pattern && npm run antlr4ts-test-rewriter && npm run antlr4ts-test-xpath && npm run antlr4ts-test-incremental && npm run antlr4ts-benchmark",
"antlr4ts-runtime-xpath": "cd src/tree/xpath && antlr4ts XPathLexer.g4 -DbaseImportPath=../..",
"antlr4ts-test-runtime": "cd test/runtime && antlr4ts TestGrammar.g4 -DbaseImportPath=../../../../src -o gen/typescript_only",
+ "antlr4ts-test-incremental": "cd test/tool && antlr4ts TestIncremental1.g4 TestIncrementalJava.g4 -DbaseImportPath=../../../../src -o gen/incremental",
"antlr4ts-test-labels": "cd test/runtime/TestReferenceToListLabels && antlr4ts T.g4 -no-listener -DbaseImportPath=antlr4ts -o gen",
"antlr4ts-test-pattern": "cd test/tool && antlr4ts ParseTreeMatcherX1.g4 ParseTreeMatcherX2.g4 ParseTreeMatcherX3.g4 ParseTreeMatcherX4.g4 ParseTreeMatcherX5.g4 ParseTreeMatcherX6.g4 ParseTreeMatcherX7.g4 ParseTreeMatcherX8.g4 -no-listener -DbaseImportPath=../../../../src -o gen/matcher",
"antlr4ts-test-rewriter": "cd test/tool && antlr4ts RewriterLexer1.g4 RewriterLexer2.g4 RewriterLexer3.g4 -DbaseImportPath=../../../../src -o gen/rewriter",
@@ -77,7 +78,7 @@
"istanbul": "^0.4.5",
"mocha": "^5.2.0",
"mocha-typescript": "^1.1.14",
- "nyc": "^13.1.0",
+ "nyc": "^13.3.0",
"source-map-support": "^0.5.6",
"std-mocks": "^1.0.1",
"tslint": "^5.11.0",
173 changes: 173 additions & 0 deletions src/IncrementalParser.ts
@@ -0,0 +1,173 @@
/*!
* Copyright 2019 The ANTLR Project. All rights reserved.
* Licensed under the BSD-3-Clause license. See LICENSE file in the project root for license information.
*/

import { IncrementalParserRuleContext } from "./IncrementalParserRuleContext";
import { IncrementalTokenStream } from "./IncrementalTokenStream";
import { Parser } from "./Parser";
import { ParserRuleContext } from "./ParserRuleContext";
import { IncrementalParserData } from "./IncrementalParserData";
import { ParseTreeListener } from "./tree/ParseTreeListener";

/**
* Incremental parser implementation
*
* There are only two differences between this parser and the underlying regular
* Parser - guard rules and min/max tracking
*
* The guard rule API is used in incremental mode to know when a rule context
* can be reused. It looks for token changes in the bounds of the rule.
*
* The min/max tracking is used to track how far ahead/behind the parser looked
 * to correctly detect whether a token change can affect a parser rule in the future (i.e., when
* handed to the guard rule of the next parse)
*
* @notes See IncrementalParsing.md for more details on the theory behind this.
* In order to make this easier in code generation, we use the parse listener
* interface to do most of our work.
*
*/
export abstract class IncrementalParser extends Parser
implements ParseTreeListener {
// Current parser epoch. Incremented every time a new incremental parser is created.
private static _GLOBAL_PARSER_EPOCH: number = 0;
public static get GLOBAL_PARSER_EPOCH() {
return this._GLOBAL_PARSER_EPOCH;
}
protected incrementParserEpoch() {
return ++IncrementalParser._GLOBAL_PARSER_EPOCH;
}
public parserEpoch = -1;

private parseData: IncrementalParserData | undefined;
constructor(
input: IncrementalTokenStream,
parseData?: IncrementalParserData,
) {
super(input);
this.parseData = parseData;
this.parserEpoch = this.incrementParserEpoch();
// Register ourselves as our own parse listener. Life is weird.
this.addParseListener(this);
}

// Push the current token data onto the min max stack for the stream.
private pushCurrentTokenToMinMax() {
let incStream = this.inputStream as IncrementalTokenStream;
let token = this._input.LT(1);
incStream.pushMinMax(token.tokenIndex, token.tokenIndex);
}

// Pop the min max stack the stream is using and return the interval.
private popCurrentMinMax(ctx: IncrementalParserRuleContext) {
let incStream = this.inputStream as IncrementalTokenStream;
return incStream.popMinMax();
}

/**
* Guard a rule's previous context from being reused.
*
* This routine will check whether a given parser rule needs to be rerun, or if we already have context that can be
* reused for this parse.
*/
public guardRule(
parentCtx: IncrementalParserRuleContext,
state: number,
ruleIndex: number,
): IncrementalParserRuleContext | undefined {
// If we have no previous parse data, the rule needs to be run.
if (!this.parseData) {
return undefined;
}
// See if we have seen this state before at this starting point.
let existingCtx = this.parseData.tryGetContext(
parentCtx ? parentCtx.depth() + 1 : 1,
state,
ruleIndex,
this._input.LT(1).tokenIndex,
);
// We haven't seen it, so we need to rerun this rule.
if (!existingCtx) {
return undefined;
}
// We have seen it; check whether it was affected by the token changes.
if (this.parseData.ruleAffectedByTokenChanges(existingCtx)) {
return undefined;
}
// Everything checked out; reuse the rule context. We add it to the
// parent context as enterRule would have.
let parent = this._ctx as IncrementalParserRuleContext | undefined;
// add current context to parent if we have a parent
if (parent != null) {
parent.addChild(existingCtx);
}
return existingCtx;
}

/**
* Pop the min max stack the stream is using and union the interval
 * into the passed-in context. Returns the interval for the context.
*
* @param ctx Context to union interval into.
*/
private popAndHandleMinMax(ctx: IncrementalParserRuleContext) {
let interval = this.popCurrentMinMax(ctx);
ctx.minMaxTokenIndex = ctx.minMaxTokenIndex.union(interval);
// Returning interval is wrong because there may have been child
// intervals already merged into this ctx.
return ctx.minMaxTokenIndex;
}
/*
This is part of the regular Parser API.
The super method must be called.
*/

/**
* The new recursion context is an unfortunate edge case for us.
* It reparents the relationship between the contexts,
* so we need to merge intervals here.
*/
public pushNewRecursionContext(
localctx: ParserRuleContext,
state: number,
ruleIndex: number,
): void {
// This context becomes the child
let previous = this._ctx as IncrementalParserRuleContext;
// The incoming context becomes the parent
let incLocalCtx = localctx as IncrementalParserRuleContext;
incLocalCtx.minMaxTokenIndex = incLocalCtx.minMaxTokenIndex.union(
previous.minMaxTokenIndex,
);
super.pushNewRecursionContext(localctx, state, ruleIndex);
}

/*
These two functions are part of the ParseTreeListener API.
We do not need to call the super methods.
*/

public enterEveryRule(ctx: ParserRuleContext) {
// During rule entry, we push a new min/max token state.
this.pushCurrentTokenToMinMax();
let incCtx = ctx as IncrementalParserRuleContext;
incCtx.epoch = this.parserEpoch;
}
public exitEveryRule(ctx: ParserRuleContext) {
// On exit, we need to merge the min max into the current context,
// and then merge the current context interval into our parent.

// First merge with the interval on the top of the stack.
let incCtx = ctx as IncrementalParserRuleContext;
let interval = this.popAndHandleMinMax(incCtx);

// Now merge with our parent interval.
if (incCtx._parent) {
let parentIncCtx = incCtx._parent as IncrementalParserRuleContext;
parentIncCtx.minMaxTokenIndex = parentIncCtx.minMaxTokenIndex.union(
interval,
);
}
}
}