Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error recovery does not work correctly for grammars that contain fully-optional rule bodies. #84

Open
br0nstein opened this issue Oct 26, 2023 · 0 comments · May be fixed by #85
Open

Comments

@br0nstein
Copy link

Root cause: LL1Analyzer returns incorrect follow tokens when exploring a rule with epsilon transitions from start to rule stop state, by exploring the rule stop target transition.

Background: During parsing, when an error is encountered, before exiting the rule the parser attempts error recovery. DefaultErrorStrategy's recover implementation works by resynchronizing the parser by consuming tokens until it finds one in the resynchronization set. If it successful, the input stream is then in a state where the next token is known to be able to be consumed by some ancestor rule being recursively parsed, and parsing can continue. The key is to correctly compute the resynchronization set--the set of tokens that may follow the immediate rule or any ancestor rule we are parsing. That is done using the ATN: each rule transition object stores a follow state (the state the rule resumes to after the sub-rule invocation returns); in broad strokes, starting from the follow state of the invoking state of each ancestor rule context, LL1 Analyzer walks the epsilon, predicate, and rule transitions of the ATN, adding tokens (defined by atom and wildcard transitions) it discovers. Importantly, it doesn't transition past these token transitions to ensure it operates with LL(1), and it doesn't continue past the stop state of each invoking rule.

Why does it matter that it not look past the stop state of some rule X? Because the stop state has a transition to the follow state for all possible transitions to rule X. These may not be possible states that can produce the next token as defined by the current rule context (including its ancestor rule contexts). For example, in the following grammar, rule2's stop state has an epsilon transition to the follow state of the rule transition to rule2 of both the start rule and rule1, but only ABC may follow HELLO WORLD when parsed from the start rule. If LL1 Analyzer continued past the stop state of rule2, it would transition to a state in rule1 and ZZZ would be erroneously added to the resynchronization set:

start : rule2 'ABC' EOF ;
rule1 : rule2 'ZZZ' ;
rule2 : rule3 ;
rule3 : 'HELLO' 'WORLD' ;

The bug in this issue comes into play when writing rules that may consume no tokens (i.e. a rule in which the stop state may be reached from the start state purely through epsilon/predicate transitions). For example, rule4 in the following grammar:

start : rule1 'ABC' EOF ;
rule1 : rule2 rule4 ;
rule2 : 'HELLO' 'WORLD' ;
rule3 : rule4 'ZZZ' ;
rule4 : 'DEF'* ;

Here, ZZZ should not be in the resynchronization set of the parser state after HELLO in rule2, but it is because given the rule contexts (start) -> (rule1) -> (rule2 HELLO) it can be reached by a rule transition from rule1 -> rule4 and epsilon transitions from rule4 to rule3 and an atom transition to ZZZ, before a depth-first arrival at rule1's stop state. This means parsing "HELLO ZZZ ZZZ ABC" would recover from a MismatchedTokenException in rule2 by consuming no tokens, rule1 completes normally, then the start rule also hits a MismatchedTokenException since the input stream is currently at the first ZZZ token and it is expecting ABC.

The fix is to ensure that LL1 Analyzer correctly computes the result set by never transitioning from a rule stop state, outside of special scenarios where the stop state is not set such as SLL(1) lookaheads. Then, in our example, given the correct resynchronization set {DEF, ABC} recover will correctly resynchronize by consuming the two ZZZ tokens and leaving the input stream at token ABC for the start rule to successfully parse. Note: The original ANTLR4 repository does not have this defect, as it already implements correct stopping.

As a side note, what if rule1 had a 'JKL'? optional token after the call to rule4? LL1 Analyzer already correctly handles this by storing the follow state of each rule transition that it walks, and when reaching a rule stop state it continues from that follow state if there is one. This doesn't conflict with what was described above, as we are not transitioning to the rule stop state target states but rather to the follow state of the invoking rule transition (in this example, that means when we encounter the rule4 stop state we transition to the state in rule1 after the rule4 call, in order to add token 'JKL' to the result set).

br0nstein added a commit to br0nstein/antlr4-optimized that referenced this issue Oct 26, 2023
daniellansun pushed a commit to daniellansun/antlr4 that referenced this issue Nov 18, 2023
daniellansun pushed a commit to daniellansun/antlr4 that referenced this issue Nov 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant