pub struct Extractor { /* private fields */ }
Expand description
Extracts prefix or suffix literal sequences from Hir
expressions.
Literal extraction is based on the following observations:
- Many regexes start with one or a small number of literals.
- Substring search for literals is often much faster (sometimes by an order of magnitude) than a regex search.
Thus, in many cases, one can search for literals to find candidate starting locations of a match, and then only run the full regex engine at each such location instead of over the full haystack.
The main downside of literal extraction is that it can wind up causing a search to be slower overall. For example, if there are many matches or if there are many candidates that don’t ultimately lead to a match, then a lot of overhead will be spent in shuffing back-and-forth between substring search and the regex engine. This is the fundamental reason why literal optimizations for regex patterns is sometimes considered a “black art.”
Look-around assertions
Literal extraction treats all look-around assertions as-if they match every
empty string. So for example, the regex \bquux\b
will yield a sequence
containing a single exact literal quux
. However, not all occurrences
of quux
correspond to a match a of the regex. For example, \bquux\b
does not match ZquuxZ
anywhere because quux
does not fall on a word
boundary.
In effect, if your regex contains look-around assertions, then a match of an exact literal does not necessarily mean the regex overall matches. So you may still need to run the regex engine in such cases to confirm the match.
The precise guarantee you get from a literal sequence is: if every literal in the sequence is exact and the original regex contains zero look-around assertions, then a preference-order multi-substring search of those literals will precisely match a preference-order search of the original regex.
Example
This shows how to extract prefixes:
use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo")?;
let got = Extractor::new().extract(&hir);
// All literals returned are "inexact" because none of them reach the
// match state.
let expected = Seq::from_iter([
Literal::inexact("ax"),
Literal::inexact("ay"),
Literal::inexact("az"),
Literal::inexact("bx"),
Literal::inexact("by"),
Literal::inexact("bz"),
Literal::inexact("cx"),
Literal::inexact("cy"),
Literal::inexact("cz"),
]);
assert_eq!(expected, got);
This shows how to extract suffixes:
use regex_syntax::{
hir::literal::{Extractor, ExtractKind, Literal, Seq},
parse,
};
let hir = parse(r"foo|[A-Z]+bar")?;
let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir);
// Since 'foo' gets to a match state, it is considered exact. But 'bar'
// does not because of the '[A-Z]+', and thus is marked inexact.
let expected = Seq::from_iter([
Literal::exact("foo"),
Literal::inexact("bar"),
]);
assert_eq!(expected, got);
Implementations
sourceimpl Extractor
impl Extractor
sourcepub fn new() -> Extractor
pub fn new() -> Extractor
Create a new extractor with a default configuration.
The extractor can be optionally configured before calling
Extractor::extract
to get a literal sequence.
sourcepub fn extract(&self, hir: &Hir) -> Seq
pub fn extract(&self, hir: &Hir) -> Seq
Execute the extractor and return a sequence of literals.
sourcepub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor
pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor
Set the kind of literal sequence to extract from an Hir
expression.
The default is to extract prefixes, but suffixes can be selected
instead. The contract for prefixes is that every match of the
corresponding Hir
must start with one of the literals in the sequence
returned. Moreover, the order of the sequence returned corresponds to
the preference order.
Suffixes satisfy a similar contract in that every match of the
corresponding Hir
must end with one of the literals in the sequence
returned. However, there is no guarantee that the literals are in
preference order.
Remember that a sequence can be infinite. For example, unless the
limits are configured to be impractically large, attempting to extract
prefixes (or suffixes) for the pattern [A-Z]
will return an infinite
sequence. Generally speaking, if the sequence returned is infinite,
then it is presumed to be unwise to do prefix (or suffix) optimizations
for the pattern.
sourcepub fn limit_class(&mut self, limit: usize) -> &mut Extractor
pub fn limit_class(&mut self, limit: usize) -> &mut Extractor
Configure a limit on the length of the sequence that is permitted for a character class. If a character class exceeds this limit, then the sequence returned for it is infinite.
This prevents classes like [A-Z]
or \pL
from getting turned into
huge and likely unproductive sequences of literals.
Example
This example shows how this limit can be lowered to decrease the tolerance for character classes being turned into literal sequences.
use regex_syntax::{hir::literal::{Extractor, Seq}, parse};
let hir = parse(r"[0-9]")?;
let got = Extractor::new().extract(&hir);
let expected = Seq::new([
"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
]);
assert_eq!(expected, got);
// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_class(4).extract(&hir);
let expected = Seq::infinite();
assert_eq!(expected, got);
sourcepub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor
pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor
Configure a limit on the total number of repetitions that is permitted before literal extraction is stopped.
This is useful for limiting things like (abcde){50}
, or more
insidiously, (?:){1000000000}
. This limit prevents any one single
repetition from adding too much to a literal sequence.
With this limit set, repetitions that exceed it will be stopped and any literals extracted up to that point will be made inexact.
Example
This shows how to decrease the limit and compares it with the default.
use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
let hir = parse(r"(abc){8}")?;
let got = Extractor::new().extract(&hir);
let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
assert_eq!(expected, got);
// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_repeat(4).extract(&hir);
let expected = Seq::from_iter([
Literal::inexact("abcabcabcabc"),
]);
assert_eq!(expected, got);
sourcepub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor
pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor
Configure a limit on the maximum length of any literal in a sequence.
This is useful for limiting things like (abcde){5}{5}{5}{5}
. While
each repetition or literal in that regex is small, when all the
repetitions are applied, one ends up with a literal of length 5^4 = 625
.
With this limit set, literals that exceed it will be made inexact and thus prevented from growing.
Example
This shows how to decrease the limit and compares it with the default.
use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
let hir = parse(r"(abc){2}{2}{2}")?;
let got = Extractor::new().extract(&hir);
let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
assert_eq!(expected, got);
// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_literal_len(14).extract(&hir);
let expected = Seq::from_iter([
Literal::inexact("abcabcabcabcab"),
]);
assert_eq!(expected, got);
sourcepub fn limit_total(&mut self, limit: usize) -> &mut Extractor
pub fn limit_total(&mut self, limit: usize) -> &mut Extractor
Configure a limit on the total number of literals that will be returned.
This is useful as a practical measure for avoiding the creation of
large sequences of literals. While the extractor will automatically
handle local creations of large sequences (for example, [A-Z]
yields
an infinite sequence by default), large sequences can be created
through non-local means as well.
For example, [ab]{3}{3}
would yield a sequence of length 512 = 2^9
despite each of the repetitions being small on their own. This limit
thus represents a “catch all” for avoiding locally small sequences from
combining into large sequences.
Example
This example shows how reducing the limit will change the literal sequence returned.
use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
let hir = parse(r"[ab]{2}{2}")?;
let got = Extractor::new().extract(&hir);
let expected = Seq::new([
"aaaa", "aaab", "aaba", "aabb",
"abaa", "abab", "abba", "abbb",
"baaa", "baab", "baba", "babb",
"bbaa", "bbab", "bbba", "bbbb",
]);
assert_eq!(expected, got);
// The default limit is not too big, but big enough to extract all
// literals from '[ab]{2}{2}'. If we shrink the limit to less than 16,
// then we'll get a truncated set. Notice that it returns a sequence of
// length 4 even though our limit was 10. This is because the sequence
// is difficult to increase without blowing the limit. Notice also
// that every literal in the sequence is now inexact because they were
// stripped of some suffix.
let got = Extractor::new().limit_total(10).extract(&hir);
let expected = Seq::from_iter([
Literal::inexact("aa"),
Literal::inexact("ab"),
Literal::inexact("ba"),
Literal::inexact("bb"),
]);
assert_eq!(expected, got);