Any plan on adding some support for regular expressions to the language? #241

fenollp · 2019-09-09T20:32:07Z

This has many parts:

a builtin type
the usual methods
special syntax (sigils, ...)
what set of regexps to support
what amount of backtracking should be allowed if at all

One can definitely implement the first 2 with minimum friction, but as I find Starlark's strength to be in its Turing-incompleteness guarantees I am wondering if this was / is being discussed as an addition to the language.

alandonovan · 2019-09-09T20:56:34Z

After time and json,

time https://github.com/google/starlark-go/tree/wip-skylark-time
json https://github.com/google/starlark-go/tree/json

regexp is the most requested package for the Starlark "standard library". No-one has sketched a prototype yet, but it shouldn't be hard to come up with a sound Starlark API for regular expressions that can be backed by the Go, Java, and C++ implementations of RE2, which is linear time (non-backtracking; see github.com/google/re2), though we will need to disallow \C.

I imagine the basic API would look something like the Go regexp API:

   load("regexp")
   rx = regexp.compile("b(an)*a")
   rx.matches(str src) # returns bool
   rx.find(str src) # returns text of match
   rx.find_all(str src, int max) # returns list of matches
   rx.replace_all(str src , str or function) # returns replaced text
   rx.find_submatches(str) # returns list of complete match and each subpattern match

We might also want variants that return the indices (not text) of each match.

bobjackman · 2020-07-09T20:12:34Z

Any update on this, yet?

alandonovan · 2020-07-09T20:25:31Z

Sorry, no.

tv42 · 2020-07-23T19:59:27Z

For what it's worth, I stumbled on https://godoc.org/github.com/qri-io/starlib/re

essobedo · 2021-03-16T08:31:24Z

@b5 What about this one? Ready for it? :-)

b5 · 2021-04-08T00:16:37Z

I think this would be great to get to attack next given the roll we're on with the time and math packages, but the design & implementation of any regex package or language level support requires more of upfront design discussion:

do we write this a package? write a proposal for inclusion in the language? both?
If we're designing a package, does it take inspiration from go's regexp or python's re?
What's providing the underlying regex parser in go?

My answers to these questions:

It should be a package. We should avoid introducing overhead into the language, even if it adds parity to python. I'm thinking specifically of the cost of implementing a starlark interpreter in another language that doesn't have easy support for RE2-style regex syntax. Having it as a package makes it easier to opt out.
Our initial regex implementation follows the python re style, but I think sitting down & working through a minimal design draft is in order.
In my opinion using anything other than go's regexp would be a mistake. But with the choice to use regexp it's important to document any differences for users coming from python.

That said, I'd love to get input from others. Would a re package suit your needs? Any reason we should instead be dedicating time to writing a regex proposal for language inclusion instead?

@alandonovan, would you accept a package if we could get one to pass muster? What's your bandwidth like for either review and/or design input?

b5 · 2021-04-08T00:21:56Z

🤦 I literally hadn't read the comments above where @adonovan laying out answers to all of my questions. Ok @essobedo I think we can take a stab at implementing the prototype laid out above & get a PR going from there.

mahmoudimus · 2021-04-08T22:31:06Z

I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here:

It exposed a bug in re2j implementation (PR is coming soon-ish).

adonovan · 2021-04-09T03:09:38Z

I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here: ...

Interesting, thanks.

It's funny, I wrote RE2/J a decade back for use in the regexp package of a Google in-house embedded configuration language that had implementations in multiple languages and needed a consistent regexp interface. That description now applies equally to Starlark, and it's using RE2/J for the same reason.

BTW, although RE2/J is asymptotically faster than j.u.regex, in practice it's usually significantly slower. Not that that really matters here.

It appear that you are using Starlark as a "secure" embedded language within your VGS product. By design, Starlark is hermetic in many ways, but we've never claimed that it is secure for running untrusted code. Scripts can easily cause denial of service by exhausting all memory, or by hash flooding. Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)

mahmoudimus · 2021-04-09T03:16:53Z

Scripts can easily cause denial of service by exhausting all memory, or by hash flooding

Yes, I'm aware of this (actually noticed this discussion re: hash flooding in the Fnv32 hash function discussions in the bytes PR :D). We've enlisted Trail of Bits for a code audit (and we'll be happy to contribute whatever findings upstream) and what you've mentioned so far was on our roadmap. This isn't in production yet, not until we've checked all the bells and whistles, but it's actually a pretty neat idea if it works for what we want to use it for.

Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)

I was not aware of this problem, that's good to know, thank you!

essobedo · 2021-04-23T12:29:18Z

@b5 please, let me know if you cannot work on this feature because if so, no worries, I can do it alone this time 😄

b5 · 2021-04-23T12:41:33Z

@essobedo sounds great! I'd be delighted to jump on peer review if you can write the implementation! I think it'll get done much quicker that way 😊

essobedo · 2021-04-23T12:55:20Z

@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?

adonovan · 2021-04-23T13:07:54Z

Some preliminary questions:

What regexp syntax are implementations to use? I think the only sensible answer, for reasons of portability and asymptotic efficiency, is RE2. We should link to the RE2 Syntax page.
We should also make clear that byte-oriented features (i.e. just \C) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.
Should we offer separate compile/match operations? In a compiled language this is the norm, for reasons of efficiency and early detection of errors, but interpreted languages tend to offer the "convenience" of recompiling the expression every time. (And presumably gaining back the cost by caching.)
Do we need to offer variants of the match function that return indices, not substrings? What does Python do here?
We should offer a function to quote RE metacharacters so that quote(s) is a regexp matching exactly and only s.

b5 · 2021-04-23T13:48:10Z

@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?

Absolutely I see two ways to get started:

If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.
We can use the same workflow we've been using, I'll open an initial PR that you can take over.

Either works for me, just let me know which you'd prefer @essobedo.

@adonovan:

What regexp syntax are implementations to use? I think the only sensible answer, for reasons of portability and asymptotic efficiency, is RE2. We should link to the RE2 Syntax page.

Very much agreed here. The PR we file should use RE2 syntax and link to current spec in package documentation.

We should also make clear that byte-oriented features (i.e. just \C) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.

Noted. Bonus points if we can provide a clear error when a user attempts to construct a regex with \C.

Should we offer separate compile/match operations?

I think we should go the interpreted language route, at least at first. It's a non-breaking change to add regex compilation later. Python's re package offers compile.

My question for you @adonovan: can we compile expressions on .Freeze()?

Do we need to offer variants of the match function that return indices, not substrings? What does Python do here?

I need to do a little more digging on this & report back. re.Match objects (the result of calling re.match) can be addressed by index. Your initial package API from above in this thread does not spec any kind of Match return type, which may allow us to sidestep the issue at first?

We should offer a function to quote RE metacharacters so that quote(s) is a regexp matching exactly and only s.

Noted!

Another thing to consider long-term is a cache of compiled regular expressions, my question here is more about where that should be stored, and establishing a policy on how packages should handle something like a cache. I'm assuming this should be stored on the starlark thread. I don't think this is needed for an initial implementation, but would be great to get some initial guidance

essobedo · 2021-04-23T14:09:42Z

Absolutely I see two ways to get started:

If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.

We can use the same workflow we've been using, I'll open an initial PR that you can take over.

The way #2 would be great if it is possible for you

adonovan · 2021-04-23T14:42:17Z

can we compile expressions on .Freeze()?

I'm not sure I understand what you're getting at. Patterns are immutable, so their freeze method is a no-op. And compilation can fail, so you need to report the error immediately. Also, you can compile and run a pattern in the same module, before freeze comes into play.

b5 · 2021-04-23T15:01:27Z

Your confusion answers my question 😄. I was having trouble building a mental model of the point where compilation is happening

b5 · 2021-04-23T15:59:24Z

Before moving forward, I think we should double check that this is the API we'd like to target for an initial package. This is @adonovan's suggested API, and the one I think we should go with:

load("regexp")
rx = regexp.compile("b(an)*a")
rx.matches(str src)                       # returns bool
rx.find(str src)                          # returns text of match
rx.find_all(str src, int max)             # returns list of matches
rx.replace_all(str src , str or function) # returns replaced text
rx.find_submatches(str)                   # returns list of complete match and each subpattern match

I made a mistake in my earlier comment regarding compile/match. I should have said that I think se can slip the match operations at first, and just use this compile-only API to start with.

jacobbogdanov mentioned this issue Dec 30, 2020

Replace com_apt_itude_rules_pip with com_github_ali5h_rules_pip 128technology/rules_128tech#16

Merged

essobedo mentioned this issue Apr 2, 2021

Add support of timestamps in nanoseconds #365

Merged

b5 mentioned this issue Apr 23, 2021

feat: introduce regexp package #369

Closed

5 tasks

adam-azarchs mentioned this issue Jun 11, 2022

Allow starlark rules to filter File sets using glob. bazelbuild/bazel#15659

Open

afq984 mentioned this issue Apr 4, 2023

Clarify safety on untrusted code bazelbuild/starlark#249

Open

jawilking mentioned this issue Aug 30, 2023

Feature Request: Official Lib for re/regex #500

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any plan on adding some support for regular expressions to the language? #241

Any plan on adding some support for regular expressions to the language? #241

fenollp commented Sep 9, 2019

alandonovan commented Sep 9, 2019 •

edited

Loading

bobjackman commented Jul 9, 2020

alandonovan commented Jul 9, 2020

tv42 commented Jul 23, 2020

essobedo commented Mar 16, 2021

b5 commented Apr 8, 2021 •

edited

Loading

b5 commented Apr 8, 2021

mahmoudimus commented Apr 8, 2021

adonovan commented Apr 9, 2021

mahmoudimus commented Apr 9, 2021

essobedo commented Apr 23, 2021

b5 commented Apr 23, 2021

essobedo commented Apr 23, 2021

adonovan commented Apr 23, 2021 •

edited

Loading

b5 commented Apr 23, 2021 •

edited

Loading

essobedo commented Apr 23, 2021

adonovan commented Apr 23, 2021

b5 commented Apr 23, 2021

b5 commented Apr 23, 2021

Any plan on adding some support for regular expressions to the language? #241

Any plan on adding some support for regular expressions to the language? #241

Comments

fenollp commented Sep 9, 2019

alandonovan commented Sep 9, 2019 • edited Loading

bobjackman commented Jul 9, 2020

alandonovan commented Jul 9, 2020

tv42 commented Jul 23, 2020

essobedo commented Mar 16, 2021

b5 commented Apr 8, 2021 • edited Loading

b5 commented Apr 8, 2021

mahmoudimus commented Apr 8, 2021

adonovan commented Apr 9, 2021

mahmoudimus commented Apr 9, 2021

essobedo commented Apr 23, 2021

b5 commented Apr 23, 2021

essobedo commented Apr 23, 2021

adonovan commented Apr 23, 2021 • edited Loading

b5 commented Apr 23, 2021 • edited Loading

essobedo commented Apr 23, 2021

adonovan commented Apr 23, 2021

b5 commented Apr 23, 2021

b5 commented Apr 23, 2021

alandonovan commented Sep 9, 2019 •

edited

Loading

b5 commented Apr 8, 2021 •

edited

Loading

adonovan commented Apr 23, 2021 •

edited

Loading

b5 commented Apr 23, 2021 •

edited

Loading