Skip to content
Devin Jeanpierre edited this page May 5, 2013 · 34 revisions

1. Announcement to mailing list

Notes from discussion on mailing list

2. Research of standards and techniques

  1. Standard: Unicode Technical Standard #18: UNICODE REGULAR EXPRESSIONS - http://www.unicode.org/reports/tr18/
  2. Standard: Open Group Base Specification (Regular Expressions) - http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html - There are other specifications for UNIX/POSIX, not sure which are preferred (and free)
  3. Technique: Thompson NFA - http://swtch.com/~rsc/regexp/regexp1.html
  4. Technique: Pike VM / Laurikari TNFA - http://swtch.com/~rsc/regexp/regexp2.html (VM, with submatch extraction) - http://swtch.com/~rsc/regexp/regexp3.html (tour of RE2, a production implementation of VM) - http://laurikari.net/ville/regex-submatch.pdf (TNFA, thesis) - http://laurikari.net/ville/spire2000-tnfa.ps (TNFA and compilation to TDFA)
  5. Technique: Backtracking Search - http://en.wikipedia.org/wiki/Backtracking (not a very hard algorithm, this is sufficient)
  6. Technique: Memoized Backtracking Search - Discussed in Russ Cox Regexp2 above ( http://swtch.com/~rsc/regexp/regexp2.html )
  7. Technique: OBDD NFA simulation: - http://link.springer.com/chapter/10.1007%2F978-3-642-15512-3_4 (without submatch extraction) - http://www.hpl.hp.com/techreports/2012/HPL-2012-215.pdf (with submatch extraction)

Summary of research on standards and leading techniques

Classification of techniques.

Pike VM is an extension of Thompson NFA to handle submatch extraction. We care about that, so let's ignore Thompson NFA as a separate topic.

Pike VM and Laurikari TNFA are apparently similar techniques. I am not very familiar with TNFA currently, so for now I will lump them together.

Summary of techniques

(Performance details are given where m is the size of the regex, and n is the size of the string being matched.)

Backtracking search is the simplest to implement and has the least overhead when nothing goes wrong. However, in the presence of a few regex operators and bad input, it also has the worst performance, with worst case O(2^n) time and O(n) space.

Pike VM / TNFA have more overhead, but worst-case O(nm) time and O(m) space complexity. They can't implement backreferences, and may not be able to implement zero width assertions.

Memoized backtracking search is O(nm) time and O(nm) space. It can't implement backreferences, but can implement zero width assertions. It is not difficult to imagine combining this approach with Pike/TNFA so that the memo cache is only used for keeping track of zero width assertions, so as to get the benefits of both approaches.

It is also not difficult to imagine falling back to a backtracking implementation in the presence of backreferences, if those are to be supported.

<FIXME: summarize OBDD methods>

Relevant standards and techniques exist?

Those intended to follow (and why)

Those intended to ignore (and why)

3. Research of libraries from other languages

  1. Language: Perl - http://perldoc.perl.org/perlre.html
  2. Language: C (PCRE) - http://www.pcre.org/
  3. Language: C++ (RE2) - http://code.google.com/p/re2/
  4. Language: D - http://dlang.org/regular-expression.html
  5. Language: Python - http://docs.python.org/2/library/re.html

Summary of research from other languages:

Structures and functions commonly appearing

Variations on implementation seen

Pitfalls and hazards associated with each variant

Relationship to other libraries and/or abstract interfaces

4. Module writing

  • Pull request: NOT YET MADE

API

####re!() syntax extension

Pro:

  • Compile-time syntax and type checking (can fail these: re!("["), let x, y = re!("(1)(2)(3)").fullmatch("123").groups())
  • Compile to rust/LLVM bitcode at build time, don't even need to JIT at runtime.

Cons:

  • Syntax extensions must be defined in the parser currently.

Additional implementation notes

  • note

All Categories:

Clone this wiki locally