Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Fix complaints from ErrorProne static analysis #3380

Merged
merged 1 commit into from
Dec 1, 2021

Conversation

jcking
Copy link
Collaborator

@jcking jcking commented Dec 1, 2021

This pull request addresses issues identified by https://github.com/google/error-prone.

@jcking
Copy link
Collaborator Author

jcking commented Dec 1, 2021

@parrt PTAL

@@ -61,7 +61,7 @@ public ST getMessageTemplate(ANTLRMessage msg) {
}
if (msg.fileName != null) {
String displayFileName = msg.fileName;
if (format.equals("antlr")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow. weird that we never ran into this. seems like this would never be true.

@parrt
Copy link
Member

parrt commented Dec 1, 2021

thanks guys...

@parrt parrt merged commit a5dfd61 into antlr:master Dec 1, 2021
parrt added a commit to parrt/antlr4 that referenced this pull request Dec 1, 2021
@parrt
Copy link
Member

parrt commented Dec 1, 2021

I just push to commit to mimic the 2^31 fix for JavaScript; added a comment that java class file format might not require the same limit for other targets..

@ericvergnaud
Copy link
Contributor

Hey, there is no such limit in JavaScript, it's plain text! I put 2 ^31 as max positive int = 2 gb

@KvanTTT
Copy link
Member

KvanTTT commented Dec 3, 2021

But ^ operation is xor in JavaScript, it's not exponentiation. I think you meant 1 << 31 = 2147483648 but not 29.

@ericvergnaud
Copy link
Contributor

This code is in the tool, in Java

@ericvergnaud
Copy link
Contributor

The big lexer test shows that this value is not causing any problem.

@KvanTTT
Copy link
Member

KvanTTT commented Dec 3, 2021

This code is in the tool, in Java

Yes, my mistake. But in Java it's also xor but not exponentiation.

The big lexer test shows that this value is not causing any problem.

It does not mean the value is correct and shouldn't be fixed.

@ericvergnaud
Copy link
Contributor

I don't quite remember how I ended up writing this misleading code, but what's certain is that the result is desired i.e. an array of readable strings, as opposed to a veeeeeryyyyy long string that no IDE is able to display nicely.
I kind of recall an horrible moment where splitting the veeeeeryyyyy long string into a + b + c ... small strings was producing a stack overflow.
So suggest to change the code to return 32

@KvanTTT
Copy link
Member

KvanTTT commented Dec 3, 2021

I don't quite remember how I ended up writing this misleading code

It was not you: https://github.com/antlr/antlr4/blame/2d445f5466e096c39af315697fd4d5f531020281/tool/src/org/antlr/v4/codegen/target/JavaScriptTarget.java#L91

I kind of recall an horrible moment where splitting the veeeeeryyyyy long string into a + b + c ... small strings was producing a stack overflow.

Yes, it's possible is any runtime. It also depends on the size of generated code.

So suggest to change the code to return 32

In this case, I don't understand such a big difference between JavaScript (32) and other runtimes (65535 / 3). It looks weird.

@parrt
Copy link
Member

parrt commented Dec 3, 2021

Should I just make it Math.pow(2, 31) for javaScript? I can make a comment to explain

@ericvergnaud
Copy link
Contributor

I would just return 32 or 64.
Deserializing the ATN only happens once, and optimizing the serialized data matters less than being able to load the generated file in dev tools.

@parrt
Copy link
Member

parrt commented Dec 4, 2021

Do you mean 2 to the power 32 or 64? It's the len of the string not num bits right?

@ericvergnaud
Copy link
Contributor

yes: return 32; not return Math.pow(2, 32)

@parrt
Copy link
Member

parrt commented Dec 4, 2021

sorry i am not following. why would we limit strings to 32 char?

@ericvergnaud
Copy link
Contributor

Because it's much more convenient when opening the generated file.

Most words require 6 chars.
So you get (actual example):
const serializedATN = ["\u0003\u608b\ua72a\u8133\ub9ed\u417c\u3be7\u7786",
"\u5964\u0002\u00bc\u06fc\b\u0001\u0004\u0002\t\u0002\u0004\u0003\t\u0003",
"\u0004\u0004\t\u0004\u0004\u0005\t\u0005\u0004\u0006\t\u0006\u0004\u0007",
... 2000 lines here
"\t\u0007\u0004\b\t\b\u0004\t\t\t\u0004\n\t\n\u0004\u000b\t\u000b\u0004",
"\u069b\u069e\u06a3\u06a7\u06ac\u06b2\u06b8\u06be\u06c4\u06ca\u06d0\u06d7",
"\u06dc\u06f5\u06fa\u0003\u0002\u0003\u0002"].join("");

The above is much easier to load by editors than if we don't limit:
const serializedATN = "\u0003\u608b\ua72a\u8133\ub9ed\u417c\u3be7\u7786\u5964\u0002\u00bc\u06fc\b\u0001\u0004\u0002\t\u0002\u0004\u0003\t\u0003\u0004\u0004\t\u0004\u0004\u0005\t\u0005\u0004\u0006\t\u0006\u0004\u0007\t\u0007\u00... 100000 characters here .... 04\b\t\b\u0004\t\t\t\u0004\n\t\n\u0004\u000b\t\u000b\u0004\u069b\u069e\u06a3\u06a7\u06ac\u06b2\u06b8\u06be\u06c4\u06ca\u06d0\u06d7\u06dc\u06f5\u06fa\u0003\u0002\u0003\u0002";
and comes with no measurable penalty (join is ultrafast)

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2021

In this case, I suggest using the most widespread values of recommended line length in text editors: 80 or 120. Also, it's still unclear why the value for JavaScript differs from other runtimes, they also can be changed.

@parrt
Copy link
Member

parrt commented Dec 4, 2021

okay, I spent a few minutes and looked at the source code again. I had completely forgotten what this does. It is a segment not the entire maximum length so it's fine to do whatever the target developer wants for a particular target I guess.

@ericvergnaud I think that Java is a special case here because it has such fundamental limits on strings in the class file. Should we go for consistency across all targets other than Java and then make Java a special case? there's definitely something about the maximum length of a single string in a class file, which I guess is why we had to break them up... it also might slow things down if we had, say, 2000 strings in the class file versus one big one. The initialization of strings pulled from the class file is weird if I remember.

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2021

I think that Java is a special case here because it has such fundamental limits on strings in the class file

ANTLR generator returns concatenation for serialized ATN for Java. Java compiler folds constant string concatenation to one single string for efficiency. So, class file contains one big string instead of concatenation. Thus, it does not make sense to take care of string length in Java source files.

Also, string concatenation could be replaced with StringBuilder. But actually, I don't think it's a good idea. I don't think there is a limit in class file since we haven't encountered such a problem before. Maybe it's a limit on string in Java compiler itself, but not in class file.

@parrt
Copy link
Member

parrt commented Dec 4, 2021

I remember the class file format limiting the length of any single string and I doubt that has changed for backward compatibility reasons. Maybe take a look at this to see if I'm still correct?

https://stackoverflow.com/questions/8323082/size-of-initialisation-string-in-java

https://stackoverflow.com/questions/45275732/increase-string-literal-length-limit

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2021

The second link doesn't relate to Java because it's about the C language. The first link is also unlikely related to string size in class file, it's about bytecode:

It shows that the problem does not relate to string literals, but to array initializers.

Also, there are other quotes from the answer:

So, a pure ASCII string literal can have up to 65535 characters, while a string consisting of characters in the range U+0800 ...U+FFFF have only one third of these. And the ones encoded as surrogate pairs in UTF-8 (i.e. U+10000 to U+10FFFF) take up 6 bytes each.
The Java Language Specification does not mention any limit for string literals:

So, the current value of 65535 / 3 looks correct for Java (ideally it should be checked). But if it's a limit in class file I'm not sure this limit affects result generation because of compiler string folding.

I've found another answer: https://stackoverflow.com/a/18361275/1046374:

Note that a String object can have up to 2^31 - 1 characters. The 2^16 -1 limit is for String literals; e.g. String constants that are embedded in the source code of a Java program.

Maybe the size of generated code was not so big all the time that's why we haven't encountered such a problem.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Dec 5, 2021 via email

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Dec 5, 2021 via email

@parrt
Copy link
Member

parrt commented Dec 27, 2021

Heh, I think we left this discussion incomplete. we need to update the JavaScript per @ericvergnaud right? just trying to get my to do list together.

@ericvergnaud
Copy link
Contributor

I think a small size of strings is highly desirable because the ability to read code in an idea is worth the microseconds spent joining those strings once at startup.
So I'd return 64 or 80 or whatever small number that fits on screen.

@parrt
Copy link
Member

parrt commented Dec 27, 2021

For python3 it says:

// set to something stupid to avoid segmentation
return Integer.MAX_VALUE;

Should all but java be about 80 char?

@parrt
Copy link
Member

parrt commented Dec 27, 2021

Please check out #3438

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants