Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve document about the absence operator #87

Open
k-takata opened this issue Mar 27, 2017 · 8 comments
Open

Improve document about the absence operator #87

k-takata opened this issue Mar 27, 2017 · 8 comments

Comments

@k-takata
Copy link
Owner

k-takata commented Mar 27, 2017

After the release of Ruby 2.4.1, some blog posts about the absent operator were written, and they caused some discussions.

Blog posts:

Discussions:

Unfortunately, some people might not understand the advantage of the absent operator. Maybe one of the reason is that the original paper by Tanaka Akira is written in Japanese. Another reason would be that the document of the operator in Onigmo is not enough.

@k-takata
Copy link
Owner Author

k-takata commented Mar 27, 2017

First, I'd like to translate the important points of his paper very very roughly.


  1. Introduction

Complement sets are useful to express C-style comments, CR LF terminated lines,
Python's """...""" strings, HTTP header terminated by CR LF CR LF, SMTP body
terminated by CR LF . CR LF, etc.
These can be expressed with the complement set of (any string)(terminator)(any string). However, it it not easy to construct a regex which matches such set.
This proposes the absent operator which is easy to write and fits well to regular language theory.

  1. Regular expressions

Regular expressions consist of:
e: empty string
c: character
rr: concatenation
r|r: alternation
r*: Kleene closure

  1. Implementation of regular expressions engine

A regular expressions engine can be implemented with DFA or backtracking.
Script languages nowadays like Perl, Python and Ruby uses backtracking,
because it is not easy to implement capturing with DFA.
He implemented a basic backtracking engine using Ruby.

3.1. Abstract Syntax Tree of regular expressions

3.2. A basic regular expressions engine

A basic implementation by Ruby.

# re: AST of a regex
# str: An array of characters
# pos: Start position
# block: A block executed when it is matched
#        (Callback)
def try(re, str, pos, &block)
  1. Proposal for the absent operator

If a regex engine uses DFA, complement sets can be handled easily.
However it is not easy for backtracking engine. (See 7.3 for detail.)

  1. The absent operator !r

(Note: Onigmo actually uses (?~r) instead of !r, because !r is not compatible
with current syntax.)
The language generated by !r can be defined as the following:

L(!r) = \Sigma * - L(.*r.*)

L(!r) is a complement set of L(.*r.*).

  1. Implementation of the absent operator

This is also described in pp.26-27 of his slide:
http://www.a-k-r.org/pub/prosym49-akr-presen.pdf
See also the next comment.

  1. Advantage of the absent operator

7.1. Easy to write

C-style comments can be expressed with the following:

ab[^b]*b+(([^ba][^b]*)b+)*a

("a" and "b" are used here instead of "/" and "*" to reduce the complexity of
escaping.)

If the absent operator is used, it can be:

ab!(ba)ba

7.2. Fit with regular language theory

7.2.1. Repetition of lazy match

ab.*?ba

This doesn't work well when concatenated with other regex. E.g.

ab.*?ba\n

This wrongly matches "/*2*/3/*4*/\n".
However,

ab!(ba)ba\n

this correctly matches "/*4*/\n".

7.2.2. No backtracking

ab(?>.*?ba)

This works well when concatenated with "\n":

ab(?>.*?ba)\n

However this hardly depends on the strategy of backtracking, and it doesn't
have theoretical backgrounds. E.g. it cannot be converted to a DFA.

7.2.3. Negative lookahead

ab((?!ba).)*ba

This works well.
However, it is not possible to define a language L((?!r)) which is generated
by (?!r).

Note: I'm not sure that the negative lookahead is regular or not.
There is a paper (in Japanese) which says that it is regular.

7.3. Inefficiency of complement sets

It is possible to implement a negation operator which matches the complement
set of r. This is, however, inefficient to implement on a backtracking engine.
The absent operator is much efficient than the negation operator.

  1. Related researches

8.1. Ragel

Ragel has the following operators:

  • negation operator !r
    This is the same as the complement set operator discussed in 7.3.
  • difference operator r1-r2
    This is the same as the complement set operator when r1 generates all the possible strings.
  • strong difference operator r1--r2
    This is the same as the absent operator when r1 generates all the possible strings.

8.2. Perl

8.3. Grail

  1. Conclusions

@k-takata
Copy link
Owner Author

k-takata commented Mar 27, 2017

And partial translation of his slide.


p.26
Behavior of the absent operator

Search - not match
  Move start pos and search
    Move again and search
      Found
      Move again and search
        Found another one
	  Move and search
	    No need to search after this

The range which matches are not included.

p.27
Behavior of the absent operator

  • Search r in the right part of the given start position pos.
  • Search it starting from pos and go to right one by one character.
  • If nothing found, all part of the right of pos are the possible points which the absent operator can match.
  • If one or more found, select one which the right end is most left.
  • Just one character left before the right end of the selected one is the last possible point which the absent operator can match.
  • Right part of the found one is not needed to be searched.

@k-takata
Copy link
Owner Author

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

@hachi8833
Copy link

Good work! I'd help translate them if needed.

@jaynetics
Copy link

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

I think so. "absent operator" is a confusing name, because it could just as well refer to an operator that is not there. "absence operator" does not have this problem.

On the other hand, would you say that it is also a group? In the docs you have put it under "7. Extended groups". If so, it might be more consistent to use the adjective form, as with "passive" or "atomic". I'd still choose clarity over consistency, though.

@k-takata
Copy link
Owner Author

k-takata commented Apr 4, 2017

Thank you for the explanation.

would you say that it is also a group?

In Akira's paper, ! is used for the operator. However, I decided to use (?~...) in Onigmo, and yes, it is also a group (unlike !).
So "absence operator" or "absence group" seems better than "absent operator".

k-takata added a commit that referenced this issue Apr 5, 2017
Rename absent operator to absence operator.
Add more description.
@k-takata
Copy link
Owner Author

k-takata commented Apr 5, 2017

I have slightly updated the document: 7911409
I changed the name of the operator to "absence operator".
(PR for fixing document (including grammar or typo) is always welcome.)

@k-takata
Copy link
Owner Author

k-takata commented Apr 5, 2017

(Upvoting all comments is meaningless, I think...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants