I am developing a Fortran 2018 grammar in ANTLR4 using the ISO standard. I am encountering an issue during the lexing phase with some of the lexer rules. Specifically, certain keywords are being misclassified. Below is the minimal grammar demonstrating the problem:
Grammar: FortranTestF18.g4
grammar FortranTestF18;
// LEXER RULES
LINE_COMMENT: '!' .*? '\r'? '\n' -> skip;
BLOCK_COMMENT: '/*' .*? '*/' -> skip;
WS: [ \t\r\n]+ -> skip;
PROGRAM: 'PROGRAM' | 'Program' | 'program';
END: 'END' | 'End' | 'end';
COMMA: ',';
LPAREN: '(';
RPAREN: ')';
ASTERIK: '*';
NONE: 'NONE' | 'None' | 'none';
IMPLICIT: 'IMPLICIT' | 'Implicit' | 'implicit';
FORMAT: 'FORMAT' | 'Format' | 'format';
PLUS: '+';
// R765 binary-constant -> B ' digit [digit]... ' | B " digit [digit]... "
BINARYCONSTANT: B APOSTROPHE DIGIT+ APOSTROPHE | B QUOTE DIGIT+ QUOTE;
// R766 octal-constant -> O ' digit [digit]... ' | O " digit [digit]... "
OCTALCONSTANT: O APOSTROPHE DIGIT+ APOSTROPHE | O QUOTE DIGIT+ QUOTE;
// R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F\u0027])* APOSTROPHE;
QUOTEREPCHAR: QUOTE (~[\u0000-\u001F\u0022])* QUOTE;
APOSTROPHE: '\'';
QUOTE: '"';
DOT: '.';
C: 'C';
// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;
// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+;
MINUS: '-';
B: 'B';
O: 'O';
Z: 'Z';
A: 'A';
F: 'F';
D: 'D';
E: 'E';
I: 'I';
G: 'G';
L: 'L';
DT: 'DT';
EN: 'EN';
ES: 'ES';
EX: 'EX';
T: 'T';
TL: 'TL';
TR: 'TR';
X: 'X';
SS: 'SS';
SP: 'SP';
S: 'S';
BN: 'BN';
BZ: 'BZ';
RU: 'RU';
RD: 'RD';
RZ: 'RZ';
RN: 'RN';
RC: 'RC';
RP: 'RP';
DC: 'DC';
DP: 'DP';
P: 'P';
// R602 UNDERSCORE -> _
UNDERSCORE: '_';
// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;
// R0002 Letter ->
// A | B | C | D | E | F | G | H | I | J | K | L | M |
// N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z';
// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';
// PARSER RULES
programName: NAME;
// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;
typeName: NAME;
// R516 keyword -> name
keyword: NAME;
// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
IMPLICIT NONE;
// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;
// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;
// R712 sign -> + | -
sign: PLUS | MINUS;
// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;
// R1306 r -> int-literal-constant
r: intLiteralConstant;
// R1308 w -> int-literal-constant
w: intLiteralConstant;
// R1309 m -> int-literal-constant
m: intLiteralConstant;
// R1310 d -> int-literal-constant
d: intLiteralConstant;
// R1311 e -> int-literal-constant
e: intLiteralConstant;
// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;
vList: v (COMMA v)*;
// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant:
(kindParam UNDERSCORE)? APOSTROPHEREPCHAR
| (kindParam UNDERSCORE)? QUOTEREPCHAR;
// R1307 data-edit-desc ->
// I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
// E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
// G w [. d [E e]] | L w | A [w] | D w . d |
// DT [char-literal-constant] [( v-list )]
dataEditDesc:
I w (DOT m)? |
B w (DOT m)? |
O w (DOT m)? |
Z w (DOT m)? |
F w DOT d |
E w DOT d ( E e )? |
EN w DOT d ( E e )? |
ES w DOT d ( E e )? |
EX w DOT d ( E e )? |
G w (DOT d ( E e )?)? |
L w |
A w? |
D w DOT d |
DT charLiteralConstant? ( LPAREN vList RPAREN )?;
// R1304 format-item ->
// [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;
// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;
// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;
// R1302 format-specification ->
// ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
LPAREN formatItems? RPAREN | LPAREN (formatItems COMMA)? unlimitedFormatItem RPAREN;
// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMAT formatSpecification;
// R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
implicitStmt
| formatStmt;
// R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;
// R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
specificationPart:
(implicitPart)?;
// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;
// R1401 main-program ->
// [program-stmt] [specification-part] [execution-part]
// [internal-subprogram-part] end-program-stmt
// COMMENT: WHY ? after programStmt
mainProgram:
programStmt? specificationPart? endProgramStmt;
// R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
mainProgram;
// R501 program -> program-unit [program-unit]...
program: programUnit (programUnit)*;
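Test File: FortranTest.f90
Commands:
antlr4 FortranTestF18.g4
javac *.java
grun FortranTestF18 formatStmt -tokens FortranTest.f90
Grun Output:
[@0,0:5='FORMAT',<FORMAT>,1:0]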
[@1,6:6='(',<'('>,1:6]
[@2,7:7='I',<NAME>,1:7]
[@3,9:10='12',<DIGITSTRING>,1:9]
[@4,11:11=')',<')'>,1:11]
[@5,12:11='<EOF>',<EOF>,1:12]
line 1:7 no viable alternative at input '(I'
Here, the token I is recognized as NAME, but I want it to be recognized as the token I: 'I';. However, if I move the lexer rule I above NAME, then identifiers can no longer be named 'I'. How do I solve this problem?
You need to take a step back before attempting this. You are trying to construct a grammar directly from a normative spec instead of thinking about how the grammar should be compatible with the normative spec. Start with this:
The lexer knows nothing about context. It always returns the longest match and, when two rules match the same text, the rule defined first. It is entirely independent of the parser.
So make each lexer rule match exactly one type of token. If "I" means one thing at one place in the parse and something else at another place, then choose a generic name.
Do not try to enforce semantic rules in the parser. Make it sloppy and as small as possible, and do the verification after parsing, in a parse tree walker or listener.
If an input sequence is valid in more than one context, do not duplicate that sequence; try to give the ANTLR parser just one path through the possible token sequence. Ignore the spec, which is written to explain the language, not how to parse it.
Remember that the parser does not influence the lexer - tokens are created before the parser even starts.
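As a rough illustration of the points above, here is a minimal sketch of how the FORMAT fragment might look with one generic word token. It is not taken from any existing Fortran grammar. The names FormatSketch and editLetter are hypothetical, and it assumes that checking the letter (I, B, O, Z, and so on) is deferred to a listener or visitor that runs after parsing.

```antlr
grammar FormatSketch;

// Parser: deliberately loose. Any NAME is accepted where an edit-descriptor
// letter is expected; whether it is a legal descriptor is checked after parsing.
formatStmt: FORMAT LPAREN formatItem (COMMA? formatItem)* RPAREN;
formatItem: DIGITSTRING? editLetter DIGITSTRING (DOT DIGITSTRING)?;
editLetter: NAME;

// Lexer: FORMAT is listed before NAME, so it wins the tie when both rules
// match the same six characters. Every other word becomes a NAME, and there
// are no single-letter tokens such as I or B to compete with it.
FORMAT: [Ff][Oo][Rr][Mm][Aa][Tt];
COMMA: ',';
LPAREN: '(';
RPAREN: ')';
DOT: '.';
DIGITSTRING: [0-9]+;
NAME: [A-Za-z] [A-Za-z0-9_]*;
WS: [ \t\r\n]+ -> skip;
```

With this sketch, the input FORMAT(I 12) should lex as FORMAT, '(', NAME, DIGITSTRING, ')' and parse through formatStmt without the single-letter rules. One caveat: a descriptor written without a space, such as I12, would be lexed as a single NAME, because the longest match wins. Handling that form needs more work than this sketch covers.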
You will get nowhere with the lexer and parser you have right now, and it will frustrate you. Take a look at some existing grammars to get a feel for it, and write yourself some small parsers such as a calculator, an equation parser, or something else simple. The mistakes you make on the small tasks will guide you in the creation of a larger system such as a Fortran parser.
Beware of XY questions, which is what you have here.