Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any plan on adding some support for regular expressions to the language? #241

Open
fenollp opened this issue Sep 9, 2019 · 19 comments
Open

Comments

@fenollp
Copy link
Contributor

fenollp commented Sep 9, 2019

This has many parts:

  • a builtin type
  • the usual methods
  • special syntax (sigils, ...)
  • what set of regexps to support
  • what amount of backtracking should be allowed if at all

One can definitely implement the first 2 with minimum friction, but as I find Starlark's strength to be in its Turing-incompleteness guarantees I am wondering if this was / is being discussed as an addition to the language.

@alandonovan
Copy link
Contributor

alandonovan commented Sep 9, 2019

After time and json,

time https://github.com/google/starlark-go/tree/wip-skylark-time
json https://github.com/google/starlark-go/tree/json

regexp is the most requested package for the Starlark "standard library". No-one has sketched a prototype yet, but it shouldn't be hard to come up with a sound Starlark API for regular expressions that can be backed by the Go, Java, and C++ implementations of RE2, which is linear time (non-backtracking; see github.com/google/re2), though we will need to disallow \C.

I imagine the basic API would look something like the Go regexp API:

   load("regexp")
   rx = regexp.compile("b(an)*a")
   rx.matches(str src) # returns bool
   rx.find(str src) # returns text of match
   rx.find_all(str src, int max) # returns list of matches
   rx.replace_all(str src , str or function) # returns replaced text
   rx.find_submatches(str) # returns list of complete match and each subpattern match

We might also want variants that return the indices (not text) of each match.

@bobjackman
Copy link

Any update on this, yet?

@alandonovan
Copy link
Contributor

Sorry, no.

@tv42
Copy link

tv42 commented Jul 23, 2020

For what it's worth, I stumbled on https://godoc.org/github.com/qri-io/starlib/re

@essobedo
Copy link
Contributor

@b5 What about this one? Ready for it? :-)

@b5
Copy link
Contributor

b5 commented Apr 8, 2021

I think this would be great to get to attack next given the roll we're on with the time and math packages, but the design & implementation of any regex package or language level support requires more of upfront design discussion:

  1. do we write this a package? write a proposal for inclusion in the language? both?
  2. If we're designing a package, does it take inspiration from go's regexp or python's re?
  3. What's providing the underlying regex parser in go?

My answers to these questions:

  1. It should be a package. We should avoid introducing overhead into the language, even if it adds parity to python. I'm thinking specifically of the cost of implementing a starlark interpreter in another language that doesn't have easy support for RE2-style regex syntax. Having it as a package makes it easier to opt out.
  2. Our initial regex implementation follows the python re style, but I think sitting down & working through a minimal design draft is in order.
  3. In my opinion using anything other than go's regexp would be a mistake. But with the choice to use regexp it's important to document any differences for users coming from python.

That said, I'd love to get input from others. Would a re package suit your needs? Any reason we should instead be dedicating time to writing a regex proposal for language inclusion instead?

@alandonovan, would you accept a package if we could get one to pass muster? What's your bandwidth like for either review and/or design input?

@b5
Copy link
Contributor

b5 commented Apr 8, 2021

🤦 I literally hadn't read the comments above where @adonovan laying out answers to all of my questions. Ok @essobedo I think we can take a stab at implementing the prototype laid out above & get a PR going from there.

@adonovan
Copy link
Collaborator

adonovan commented Apr 9, 2021

I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here: ...

Interesting, thanks.

It's funny, I wrote RE2/J a decade back for use in the regexp package of a Google in-house embedded configuration language that had implementations in multiple languages and needed a consistent regexp interface. That description now applies equally to Starlark, and it's using RE2/J for the same reason.

BTW, although RE2/J is asymptotically faster than j.u.regex, in practice it's usually significantly slower. Not that that really matters here.

It appear that you are using Starlark as a "secure" embedded language within your VGS product. By design, Starlark is hermetic in many ways, but we've never claimed that it is secure for running untrusted code. Scripts can easily cause denial of service by exhausting all memory, or by hash flooding. Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)

@mahmoudimus
Copy link

Scripts can easily cause denial of service by exhausting all memory, or by hash flooding

Yes, I'm aware of this (actually noticed this discussion re: hash flooding in the Fnv32 hash function discussions in the bytes PR :D). We've enlisted Trail of Bits for a code audit (and we'll be happy to contribute whatever findings upstream) and what you've mentioned so far was on our roadmap. This isn't in production yet, not until we've checked all the bells and whistles, but it's actually a pretty neat idea if it works for what we want to use it for.

Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)

I was not aware of this problem, that's good to know, thank you!

@essobedo
Copy link
Contributor

@b5 please, let me know if you cannot work on this feature because if so, no worries, I can do it alone this time 😄

@b5
Copy link
Contributor

b5 commented Apr 23, 2021

@essobedo sounds great! I'd be delighted to jump on peer review if you can write the implementation! I think it'll get done much quicker that way 😊

@essobedo
Copy link
Contributor

@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?

@adonovan
Copy link
Collaborator

adonovan commented Apr 23, 2021

Some preliminary questions:

  • What regexp syntax are implementations to use? I think the only sensible answer, for reasons of portability and asymptotic efficiency, is RE2. We should link to the RE2 Syntax page.
  • We should also make clear that byte-oriented features (i.e. just \C) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.
  • Should we offer separate compile/match operations? In a compiled language this is the norm, for reasons of efficiency and early detection of errors, but interpreted languages tend to offer the "convenience" of recompiling the expression every time. (And presumably gaining back the cost by caching.)
  • Do we need to offer variants of the match function that return indices, not substrings? What does Python do here?
  • We should offer a function to quote RE metacharacters so that quote(s) is a regexp matching exactly and only s.

@b5
Copy link
Contributor

b5 commented Apr 23, 2021

@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?

Absolutely I see two ways to get started:

  1. If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.
  2. We can use the same workflow we've been using, I'll open an initial PR that you can take over.

Either works for me, just let me know which you'd prefer @essobedo.

@adonovan:

What regexp syntax are implementations to use? I think the only sensible answer, for reasons of portability and asymptotic efficiency, is RE2. We should link to the RE2 Syntax page.

Very much agreed here. The PR we file should use RE2 syntax and link to current spec in package documentation.

We should also make clear that byte-oriented features (i.e. just \C) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.

Noted. Bonus points if we can provide a clear error when a user attempts to construct a regex with \C.

Should we offer separate compile/match operations?

I think we should go the interpreted language route, at least at first. It's a non-breaking change to add regex compilation later. Python's re package offers compile.

My question for you @adonovan: can we compile expressions on .Freeze()?

Do we need to offer variants of the match function that return indices, not substrings? What does Python do here?

I need to do a little more digging on this & report back. re.Match objects (the result of calling re.match) can be addressed by index. Your initial package API from above in this thread does not spec any kind of Match return type, which may allow us to sidestep the issue at first?

We should offer a function to quote RE metacharacters so that quote(s) is a regexp matching exactly and only s.

Noted!

Another thing to consider long-term is a cache of compiled regular expressions, my question here is more about where that should be stored, and establishing a policy on how packages should handle something like a cache. I'm assuming this should be stored on the starlark thread. I don't think this is needed for an initial implementation, but would be great to get some initial guidance

@essobedo
Copy link
Contributor

Absolutely I see two ways to get started:

  1. If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.
  2. We can use the same workflow we've been using, I'll open an initial PR that you can take over.

The way #2 would be great if it is possible for you

@adonovan
Copy link
Collaborator

can we compile expressions on .Freeze()?

I'm not sure I understand what you're getting at. Patterns are immutable, so their freeze method is a no-op. And compilation can fail, so you need to report the error immediately. Also, you can compile and run a pattern in the same module, before freeze comes into play.

@b5
Copy link
Contributor

b5 commented Apr 23, 2021

Your confusion answers my question 😄. I was having trouble building a mental model of the point where compilation is happening

@b5
Copy link
Contributor

b5 commented Apr 23, 2021

Before moving forward, I think we should double check that this is the API we'd like to target for an initial package. This is @adonovan's suggested API, and the one I think we should go with:

load("regexp")
rx = regexp.compile("b(an)*a")
rx.matches(str src)                       # returns bool
rx.find(str src)                          # returns text of match
rx.find_all(str src, int max)             # returns list of matches
rx.replace_all(str src , str or function) # returns replaced text
rx.find_submatches(str)                   # returns list of complete match and each subpattern match

I made a mistake in my earlier comment regarding compile/match. I should have said that I think se can slip the match operations at first, and just use this compile-only API to start with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants