Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiler should error when encountering invalid regular expressions #3432

Closed
DanielRosenwasser opened this issue Jun 9, 2015 · 15 comments · Fixed by #55600
Closed

Compiler should error when encountering invalid regular expressions #3432

DanielRosenwasser opened this issue Jun 9, 2015 · 15 comments · Fixed by #55600
Labels
Bug A bug in TypeScript Help Wanted You can do this

Comments

@DanielRosenwasser
Copy link
Member

In tests/cases/conformance/parser/ecmascript5/RegressionTests/parser579071.ts we have an invalid regex:

var x = /fo(o/;

This is not valid JavaScript, but we don't give any errors. It would be helpful to let users know when their regular expressions are invalid.

@DanielRosenwasser DanielRosenwasser added Suggestion An idea for TypeScript Help Wanted You can do this labels Jun 9, 2015
@DanielRosenwasser DanielRosenwasser added this to the Community milestone Jun 9, 2015
@mhegazy mhegazy added Bug A bug in TypeScript and removed Suggestion An idea for TypeScript labels Jun 9, 2015
@RyanCavanaugh RyanCavanaugh modified the milestones: Community, Backlog Mar 7, 2019
@pygy
Copy link

pygy commented May 4, 2022

I had missed this by only searching for RegExp...

I'm working on a Spec-based RegExp tokenizer written in JS, hopefully you'll be able to make use of it.

@graphemecluster
Copy link
Contributor

graphemecluster commented Jun 23, 2023

Moved from #54744:

Previously worked on #51837, I found that TypeScript gives almost no syntax errors for regular expressions. I would like to file a PR about it.

Something I would like to do are:

  • Check for duplicated or unknown flags
  • Check for unbalanced parentheses, which is the most common mistake people make
  • Check for invalid escapes
    But this should be done only for RegExps with u or v flag, i.e. in UnicodeMode, that means if we encounter a u or v flag we will need to rescan the whole RegExp again (!!) (i.e. redoing what is done in the current reScanSlashToken method)
    And to check for invalid DecimalEscapes and k<GroupName>s we will also need to count the number of capture groups and record the names of all named capture groups along the way.

Am I doing too much or too less? I know doing too much may cause serious performace regressions (well, luckily regular expression literals are not that common compared with string literals). It should be better than doing nothing after all though.

@RyanCavanaugh
Copy link
Member

RyanCavanaugh commented Jun 23, 2023

Imbalanced parens / bracket seems like where 99% of the value is.

Duplicate flags seems like an error no one's ever made before; usually its /g or /m, /gmi, etc, I can't imagine writing /mgm unless a cat walked on my keyboard.

Erroring on escapes that are invalid regardless of flags seems like a fine compromise.

IMO it's really actually fine if once a year your program unconditionally crashes on startup in the cases where you made an extremely rare mistake. The value is in flagging errors that are made every day.

@graphemecluster
Copy link
Contributor

Yup, duplicate flags checking is valueless 😅, but unknown seems worth a bit (like if someone accidentally typed the Cyrillic у instead of the Latin y for some reason with the "Editor > Unicode Highlight: Ambiguous Characters" config turned off or any of the "Editor > Unicode Highlight: Allowed Locales" permit the character). (Plus, there are no reasons not to show errors if the flag part isn’t really flags (like foo_bar).) So far checking for flag availability according to target language version seems to be the most worthwhile part for flag check.

I am still wondering if a full parse should be done. Doing that does affect performance, but it would be helpful to further TypeScript extensions like #41160. Of course, this requires sub-nodes to be added under RegularExpressionLiteral. The actual implementation is not that tough, as we can take engine262 as a reference. (If we go with this, where should the class be put? scanner.ts? parser.ts? Or a new file?)

@graphemecluster
Copy link
Contributor

Pinging @RyanCavanaugh and @DanielRosenwasser for opinions.

@RyanCavanaugh
Copy link
Member

I feel like we should be able to do a parse-only pass (i.e. just scan and descend in order to validate) of the regex without creating nodes as you go, since there's no consumer of that output, just the production of errors as a side effect

@graphemecluster
Copy link
Contributor

The output are useful to TypeScript API users for creating RegExp-related type utilities. We could make use of the parsed result to make methods like String.prototype.replace safer too.
After all, I think we should at least store the number of capture groups and the names of the named captured groups.

@RyanCavanaugh
Copy link
Member

The downstream tools can re-parse if they really need that data.

We're very sensitive to perf papercuts and not likely to accept the feature if the perf cost is nontrivial, and allocating more objects is something that is likely to incur broad perf hits due to slowing down GC, etc..

@zm-cttae
Copy link

zm-cttae commented Jul 4, 2023

Surely this is a case of attempting to reserialise the raw regexp to the string representation then into RegExp class? If it blows up, the string is invalid! That should be fast too.

@zm-cttae
Copy link

zm-cttae commented Jul 5, 2023

I see Ryan has suggested the same thing but in CS speak 😛

@RyanCavanaugh
Copy link
Member

I actually hadn't thought of it in that much simpler way!

@graphemecluster
Copy link
Contributor

@zm-cttae @RyanCavanaugh There are already attempts like #4387 and #35957 using this approach but was closed. And this is definitely not a good solution, because:

  1. If there are any errors, the whole RegExp expression is underlined, it’s not useful especially when the expression is long. Plus, the platform-specified error messages are not always clear, and we can’t translate them into other locales.
  2. There might be multiple errors in the RegExp, but only the first error is revealed.
  3. TypeScript is not limited to being executed with Node.js. It may also be run, for example, in a web browser via monaco-editor. That means by using the built-in RegExp constructor the behavior is not guaranteed and may differ due to features implemented in the JS engine. Actually, I plan to include some features that are currently Stage 3 proposals into my implementation in advance.

Luckily a simple parser without node generation has little effect on the performance, and I plan to make my PR available at the end of this month.

Besides I have a follow-up proposal to further enforce type safety of RegExp-related methods and enhance UX (providing auto-completion) that make use of the implementation, but that would be another separate issue and PR afterwards.

@zm-cttae
Copy link

zm-cttae commented Jul 7, 2023

To be fair there are only two types of error that can be assessed independently of each other - bad escapes, and invalid grouping

@zm-cttae
Copy link

zm-cttae commented Jul 7, 2023

I agree however that the error isn't too useful for assessing what broke the regex in the first place.
Combine it with the platform-specificity problem and yes we will be needing a parser

@graphemecluster
Copy link
Contributor

I didn’t manage to finish the implementation within this month due to my work but it’s coming along nicely. I’ll be back on it shortly and make the PR available ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug A bug in TypeScript Help Wanted You can do this
Projects
None yet
6 participants