Lexing Issue in ANTLR4 Grammar for Fortran 2018: Token Misclassification #4640

Open
AkhilAkkapelli opened this issue Jun 8, 2024 · 1 comment

AkhilAkkapelli commented Jun 8, 2024

I am developing a Fortran 2018 grammar in ANTLR4, based on the ISO standard. During lexing, some of the lexer rules misclassify certain keywords. Below is a minimal grammar demonstrating the problem:

Grammar: FortranTestF18.g4

grammar FortranTestF18;

//LEXER RULES

LINE_COMMENT : '!' .*? '\r'? '\n' -> skip ;

BLOCK_COMMENT: '/*' .*? '*/' -> skip;

WS: [ \t\r\n]+ -> skip;

PROGRAM: 'PROGRAM' | 'Program' | 'program';

END: 'END' | 'End' | 'end';

COMMA: ',';

LPAREN: '(';

RPAREN: ')';

ASTERIK: '*';

NONE: 'NONE' | 'None' | 'none';

IMPLICIT: 'IMPLICIT' | 'Implicit' | 'implicit';

FORMAT: 'FORMAT' | 'Format' | 'format';

PLUS: '+';

// R765 binary-constant -> B ' digit [digit]... ' | B " digit [digit]... "
BINARYCONSTANT: B APOSTROPHE DIGIT+ APOSTROPHE | B QUOTE DIGIT+ QUOTE;

// R766 octal-constant -> O ' digit [digit]... ' | O " digit [digit]... "
OCTALCONSTANT: O APOSTROPHE DIGIT+ APOSTROPHE | O QUOTE DIGIT+ QUOTE;

//R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F\u0027])*  APOSTROPHE;

QUOTEREPCHAR: QUOTE (~[\u0000-\u001F\u0022])*  QUOTE;

APOSTROPHE: '\'';

QUOTE: '"';

DOT: '.';

C: 'C';

// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;

// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+; 

MINUS: '-';

B: 'B';

O: 'O';

Z: 'Z';

A: 'A';

F: 'F';

D: 'D';

E: 'E';

I: 'I';

G: 'G';

L: 'L';

DT: 'DT';

EN: 'EN';

ES: 'ES';

EX: 'EX';

T: 'T';

TL: 'TL';

TR: 'TR';

X: 'X';

SS: 'SS';

SP: 'SP';

S: 'S';

BN: 'BN';

BZ: 'BZ';

RU: 'RU';

RD: 'RD';

RZ: 'RZ';

RN: 'RN';

RC: 'RC';

RP: 'RP';

DC: 'DC';

DP: 'DP';

P: 'P';

// R602 UNDERSCORE -> _
UNDERSCORE: '_';

// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;

// R0002 Letter ->
//         A | B | C | D | E | F | G | H | I | J | K | L | M |
//         N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z'; 

// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';

//PARSER RULES

programName: NAME;

// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;

typeName: NAME;

// R516 keyword -> name
keyword: NAME;

// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
		IMPLICIT NONE;

// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;

// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;

// R712 sign -> + | -
sign: PLUS | MINUS;

// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;

// R1306 r -> int-literal-constant
r: intLiteralConstant;

// R1308 w -> int-literal-constant
w: intLiteralConstant;

// R1309 m -> int-literal-constant
m: intLiteralConstant;

// R1310 d -> int-literal-constant
d: intLiteralConstant;

// R1311 e -> int-literal-constant
e: intLiteralConstant;

// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;

vList: v (COMMA v)*;

// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant: 
		(kindParam UNDERSCORE)? APOSTROPHEREPCHAR
	| (kindParam UNDERSCORE)? QUOTEREPCHAR;

// R1307 data-edit-desc ->
//         I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
//         E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
//         G w [. d [E e]] | L w | A [w] | D w . d |
//         DT [char-literal-constant] [( v-list )]
dataEditDesc:
    I w (DOT m)? |
    B w (DOT m)? |
    O w (DOT m)? |
    Z w (DOT m)? |
    F w DOT d |
    E w DOT d ( E e )? |
    EN w DOT d ( E e )? |
    ES w DOT d ( E e )? |
    EX w DOT d ( E e )? |
    G w (DOT d ( E e )?)? |
    L w |
    A w? |
    D w DOT d |
    DT charLiteralConstant? ( LPAREN vList RPAREN )?;

// R1304 format-item ->
//         [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;

// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;

// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;

// R1302 format-specification ->
//         ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
    LPAREN formatItems? RPAREN |  LPAREN (formatItems COMMA)? unlimitedFormatItem  RPAREN;

// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMAT formatSpecification;

//R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
	  implicitStmt
	| formatStmt;

//R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;

//R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
  specificationPart:
    (implicitPart)?;

// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;

// R1401 main-program ->
//         [program-stmt] [specification-part] [execution-part]
//         [internal-subprogram-part] end-program-stmt
///COMMENT: WHY ? after programStmt
  mainProgram:
      programStmt? specificationPart? endProgramStmt;

//R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
    mainProgram;

//R501 program -> program-unit [program-unit]...    
program: programUnit (programUnit)*;      

Test File: FortranTest.f90

FORMAT(I 12)

Commands:

antlr4 FortranTestF18.g4 
javac *.java
grun FortranTestF18 formatStmt -tokens FortranTest.f90 

Grun Output:

[@0,0:5='FORMAT',<FORMAT>,1:0]
[@1,6:6='(',<'('>,1:6]
[@2,7:7='I',<NAME>,1:7]
[@3,9:10='12',<DIGITSTRING>,1:9]
[@4,11:11=')',<')'>,1:11]
[@5,12:11='<EOF>',<EOF>,1:12]
line 1:7 no viable alternative at input '(I'

Here, the input I is tokenized as NAME, but I want it to be matched by the lexer rule I: 'I';. If I move the lexer rule I above NAME, identifiers can no longer be named 'I'. How do I solve this problem?

Collaborator

jimidle commented Jun 8, 2024

You need to take a step back before attempting this. You are trying to construct a grammar directly from a normative spec instead of thinking about how the grammar should be compatible with the normative spec. Start with this:

  • The lexer has no idea about context. It always returns the longest match, or the first rule listed when two rules match the same length, and it is entirely independent of the parser.
  • So make your lexer rules match exactly one type of token. If "I" means something at one place in the parse and something else at another place, then choose a generic name.
  • Do not try to enforce semantic rules in the parser. Make it sloppy and as small as possible, and do verification after parsing, in a parse tree walker or listener.
  • If an input sequence is valid in more than one context, do not duplicate that sequence. Try to give the ANTLR parser just one path through the possible token sequence. Ignore the spec, which is written to explain the language, not how to parse it.
  • Remember that the parser does not influence the lexer - tokens are created before the parser even starts.
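
The points above can be sketched as a small grammar. This is only an illustration under stated assumptions, not a complete Fortran format grammar: the grammar name FormatSketch and the flattened formatItem rule are made up for this example. The lexer has a single generic NAME token and no one-letter rules such as I or F, and the parser accepts any NAME where an edit descriptor letter is expected. Checking that the letter is really I, B, O and so on is left to a listener or visitor after parsing.

```antlr
grammar FormatSketch;   // hypothetical name, for illustration only

// PARSER RULES (deliberately permissive)

formatStmt  : FORMAT LPAREN formatItems? RPAREN EOF ;

formatItems : formatItem (COMMA? formatItem)* ;

// [r] descriptor [w [. d]]
// The descriptor letter is just a NAME here. Whether it is a legal
// edit descriptor is verified after parsing, not by the grammar.
formatItem  : DIGITSTRING? NAME (DIGITSTRING (DOT DIGITSTRING)?)? ;

// LEXER RULES (each rule matches exactly one kind of token)

// FORMAT and NAME both match the text "FORMAT" with the same length,
// so the rule listed first wins. Keywords therefore go above NAME.
FORMAT      : 'FORMAT' | 'Format' | 'format' ;
LPAREN      : '(' ;
RPAREN      : ')' ;
COMMA       : ',' ;
DOT         : '.' ;

// One generic token for every identifier-like word, including "I".
NAME        : [A-Za-z] [A-Za-z0-9_]* ;
DIGITSTRING : [0-9]+ ;
WS          : [ \t\r\n]+ -> skip ;
```

With the test file from the question, FORMAT(I 12), the I is still tokenized as NAME, exactly as in the Grun output above, but the parser now expects NAME at that position, so the "no viable alternative" error should not occur. One caveat of this sketch: a descriptor written without a blank, such as I12, is lexed as a single NAME because of the longest-match rule, so the post-parse step would also have to split the letter from the width.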

You will get nowhere with the lexer and parser you have right now, and it will frustrate you. Take a look at some existing grammars to get a feel for it, and write yourself some small parsers, like a calculator or equation parser or something else simple. The mistakes you make on the small tasks will guide you in creating a larger system such as a Fortran parser.

Beware of XY questions, which is what you have here.
