Lexing Issue in ANTLR4 Grammar for Fortran 2018: Token Misclassification #4640

AkhilAkkapelli · 2024-06-08T16:30:55Z

I am developing a Fortran 2018 grammar in ANTLR4 using the ISO standard. I am encountering an issue during the lexing phase with some of the lexer rules. Specifically, certain keywords are being misclassified. Below is the minimal grammar demonstrating the problem:

Grammar: FortranTestF18.g4

grammar FortranTestF18;

//LEXER RULES

LINE_COMMENT : '!' .*? '\r'? '\n' -> skip ;

BLOCK_COMMENT: '/*' .*? '*/' -> skip;

WS: [ \t\r\n]+ -> skip;

PROGRAM: 'PROGRAM' | 'Program' | 'program';

END: 'END' | 'End' | 'end';

COMMA: ',';

LPAREN: '(';

RPAREN: ')';

ASTERIK: '*';

NONE: 'NONE' | 'None' | 'none';

IMPLICIT: 'IMPLICIT' | 'Implicit' | 'implicit';

FORMAT: 'FORMAT' | 'Format' | 'format';

PLUS: '+';

// R765 binary-constant -> B ' digit [digit]... ' | B " digit [digit]... "
BINARYCONSTANT: B APOSTROPHE DIGIT+ APOSTROPHE | B QUOTE DIGIT+ QUOTE;

// R766 octal-constant -> O ' digit [digit]... ' | O " digit [digit]... "
OCTALCONSTANT: O APOSTROPHE DIGIT+ APOSTROPHE | O QUOTE DIGIT+ QUOTE;

//R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F\u0027])*  APOSTROPHE;

QUOTEREPCHAR: QUOTE (~[\u0000-\u001F\u0022])*  QUOTE;

APOSTROPHE: '\'';

QUOTE: '"';

DOT: '.';

C: 'C';

// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;

// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+; 

MINUS: '-';

B: 'B';

O: 'O';

Z: 'Z';

A: 'A';

F: 'F';

D: 'D';

E: 'E';

I: 'I';

G: 'G';

L: 'L';

DT: 'DT';

EN: 'EN';

ES: 'ES';

EX: 'EX';

T: 'T';

TL: 'TL';

TR: 'TR';

X: 'X';

SS: 'SS';

SP: 'SP';

S: 'S';

BN: 'BN';

BZ: 'BZ';

RU: 'RU';

RD: 'RD';

RZ: 'RZ';

RN: 'RN';

RC: 'RC';

RP: 'RP';

DC: 'DC';

DP: 'DP';

P: 'P';

// R602 UNDERSCORE -> _
UNDERSCORE: '_';

// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;

// R0002 Letter ->
//         A | B | C | D | E | F | G | H | I | J | K | L | M |
//         N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z'; 

// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';

//PARSER RULES

programName: NAME;

// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;

typeName: NAME;

// R516 keyword -> name
keyword: NAME;

// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
		IMPLICIT NONE;

// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;

// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;

// R712 sign -> + | -
sign: PLUS | MINUS;

// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;

// R1306 r -> int-literal-constant
r: intLiteralConstant;

// R1308 w -> int-literal-constant
w: intLiteralConstant;

// R1309 m -> int-literal-constant
m: intLiteralConstant;

// R1310 d -> int-literal-constant
d: intLiteralConstant;

// R1311 e -> int-literal-constant
e: intLiteralConstant;

// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;

vList: v (COMMA v)*;

// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant: 
		(kindParam UNDERSCORE)? APOSTROPHEREPCHAR
	| (kindParam UNDERSCORE)? QUOTEREPCHAR;

// R1307 data-edit-desc ->
//         I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
//         E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
//         G w [. d [E e]] | L w | A [w] | D w . d |
//         DT [char-literal-constant] [( v-list )]
dataEditDesc:
    I w (DOT m)? |
    B w (DOT m)? |
    O w (DOT m)? |
    Z w (DOT m)? |
    F w DOT d |
    E w DOT d ( E e )? |
    EN w DOT d ( E e )? |
    ES w DOT d ( E e )? |
    EX w DOT d ( E e )? |
    G w (DOT d ( E e )?)? |
    L w |
    A w? |
    D w DOT d |
    DT charLiteralConstant? ( LPAREN vList RPAREN )?;

// R1304 format-item ->
//         [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;

// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;

// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;

// R1302 format-specification ->
//         ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
    LPAREN formatItems? RPAREN |  LPAREN (formatItems COMMA)? unlimitedFormatItem  RPAREN;

// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMAT formatSpecification;

//R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
	  implicitStmt
	| formatStmt;

//R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;

//R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
  specificationPart:
    (implicitPart)?;

// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;

// R1401 main-program ->
//         [program-stmt] [specification-part] [execution-part]
//         [internal-subprogram-part] end-program-stmt
///COMMENT: WHY ? after programStmt
  mainProgram:
      programStmt? specificationPart? endProgramStmt;

//R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
    mainProgram;

//R501 program -> program-unit [program-unit]...    
program: programUnit (programUnit)*;

Test File: FortranTest.f90

FORMAT(I 12)

Commands:

antlr4 FortranTestF18.g4 
javac *.java
grun FortranTestF18 formatStmt -tokens FortranTest.f90

Grun Output:

[@0,0:5='FORMAT',<FORMAT>,1:0]
[@1,6:6='(',<'('>,1:6]
[@2,7:7='I',<NAME>,1:7]
[@3,9:10='12',<DIGITSTRING>,1:9]
[@4,11:11=')',<')'>,1:11]
[@5,12:11='<EOF>',<EOF>,1:12]
line 1:7 no viable alternative at input '(I'

Here the input I is tokenized as NAME, but I want it to be matched by the lexer rule I: 'I';. If I move the rule I above NAME, identifiers can no longer be named 'I'. How do I solve this problem?


jimidle · 2024-06-08T16:59:59Z

You need to take a step back before attempting this. You are trying to construct a grammar directly from a normative spec instead of thinking about how the grammar should be made compatible with that spec. Start with this:

The lexer has no idea about context. It will always return the longest match, or the first matching rule when lengths tie, and it is entirely independent of the parser.
So make your lexer rules match exactly one type of token. If "I" means one thing at one place in the parse and something else at another place, then choose a generic name.
Do not try to enforce semantic rules in the parser. Make it sloppy and as small as possible, and do verification after parsing, in a parse tree walker or listener.
If an input sequence is valid in more than one context, do not duplicate that sequence. Try to give the ANTLR parser just one path through the possible token sequence. Ignore the spec, which is written to explain the language, not how to parse it.
Remember that the parser does not influence the lexer: tokens are created before the parser even starts.
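As a rough illustration of the points above (not something posted in the thread), the format rules could be reduced to a sketch like the following. The grammar name FormatSketch and the single collapsed dataEditDesc rule are assumptions made for this example. The edit descriptor letter is deliberately left as a generic NAME, so the lexer never has to choose between I and NAME; checking that the letter is one of I, B, O, Z, F and so on would then happen after parsing, in a listener or tree walker.

```antlr
grammar FormatSketch;

// Parser rules: intentionally loose. Any NAME is accepted where an edit
// descriptor is expected; which names are legal is verified after parsing.
formatStmt   : FORMAT LPAREN formatItems? RPAREN EOF ;
formatItems  : formatItem (COMMA? formatItem)* ;
formatItem   : DIGITSTRING? dataEditDesc ;
dataEditDesc : NAME (DIGITSTRING (DOT DIGITSTRING)?)? ;

// Lexer rules: each input sequence maps to exactly one token type.
// FORMAT is listed before NAME, so it wins the tie when both rules
// match the same six characters.
FORMAT      : 'FORMAT' | 'Format' | 'format' ;
LPAREN      : '(' ;
RPAREN      : ')' ;
COMMA       : ',' ;
DOT         : '.' ;
NAME        : LETTER (LETTER | DIGIT | '_')* ;
DIGITSTRING : DIGIT+ ;
WS          : [ \t\r\n]+ -> skip ;

// fragment rules never produce tokens of their own; they only serve
// as building blocks for the rules above.
fragment LETTER : [A-Za-z] ;
fragment DIGIT  : [0-9] ;
```

With the test input FORMAT(I 12), the I is still tokenized as NAME, just as in the grun output, but the parser now accepts it because dataEditDesc expects a NAME there. Note that this sketch only covers the blank-separated form used in the test file: an input such as I12 with no blank would be lexed as a single NAME token and would need further handling.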

You will get nowhere with the lexer and parser you have right now, and it will frustrate you. Take a look at some existing grammars to get a feel for it, and write yourself some small parsers, such as a calculator or equation parser or something else simple. The mistakes you make on the small tasks will guide you in the creation of a larger system such as a Fortran parser.

Beware of X Y questions, which is what you have here.

kaby76 mentioned this issue Jun 11, 2024

[fortran] Add latest Fortran grammar, addressing issues in older versions antlr/grammars-v4#4096

Open
