pub struct Extractor { /* private fields */ }
Expand description

Extracts prefix or suffix literal sequences from Hir expressions.

Literal extraction is based on the following observations:

  • Many regexes start with one or a small number of literals.
  • Substring search for literals is often much faster (sometimes by an order of magnitude) than a regex search.

Thus, in many cases, one can search for literals to find candidate starting locations of a match, and then only run the full regex engine at each such location instead of over the full haystack.

The main downside of literal extraction is that it can wind up causing a search to be slower overall. For example, if there are many matches or if there are many candidates that don’t ultimately lead to a match, then a lot of overhead will be spent in shuffing back-and-forth between substring search and the regex engine. This is the fundamental reason why literal optimizations for regex patterns is sometimes considered a “black art.”

Look-around assertions

Literal extraction treats all look-around assertions as-if they match every empty string. So for example, the regex \bquux\b will yield a sequence containing a single exact literal quux. However, not all occurrences of quux correspond to a match a of the regex. For example, \bquux\b does not match ZquuxZ anywhere because quux does not fall on a word boundary.

In effect, if your regex contains look-around assertions, then a match of an exact literal does not necessarily mean the regex overall matches. So you may still need to run the regex engine in such cases to confirm the match.

The precise guarantee you get from a literal sequence is: if every literal in the sequence is exact and the original regex contains zero look-around assertions, then a preference-order multi-substring search of those literals will precisely match a preference-order search of the original regex.

Example

This shows how to extract prefixes:

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo")?;
let got = Extractor::new().extract(&hir);
// All literals returned are "inexact" because none of them reach the
// match state.
let expected = Seq::from_iter([
    Literal::inexact("ax"),
    Literal::inexact("ay"),
    Literal::inexact("az"),
    Literal::inexact("bx"),
    Literal::inexact("by"),
    Literal::inexact("bz"),
    Literal::inexact("cx"),
    Literal::inexact("cy"),
    Literal::inexact("cz"),
]);
assert_eq!(expected, got);

This shows how to extract suffixes:

use regex_syntax::{
    hir::literal::{Extractor, ExtractKind, Literal, Seq},
    parse,
};

let hir = parse(r"foo|[A-Z]+bar")?;
let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir);
// Since 'foo' gets to a match state, it is considered exact. But 'bar'
// does not because of the '[A-Z]+', and thus is marked inexact.
let expected = Seq::from_iter([
    Literal::exact("foo"),
    Literal::inexact("bar"),
]);
assert_eq!(expected, got);

Implementations

Create a new extractor with a default configuration.

The extractor can be optionally configured before calling Extractor::extract to get a literal sequence.

Execute the extractor and return a sequence of literals.

Set the kind of literal sequence to extract from an Hir expression.

The default is to extract prefixes, but suffixes can be selected instead. The contract for prefixes is that every match of the corresponding Hir must start with one of the literals in the sequence returned. Moreover, the order of the sequence returned corresponds to the preference order.

Suffixes satisfy a similar contract in that every match of the corresponding Hir must end with one of the literals in the sequence returned. However, there is no guarantee that the literals are in preference order.

Remember that a sequence can be infinite. For example, unless the limits are configured to be impractically large, attempting to extract prefixes (or suffixes) for the pattern [A-Z] will return an infinite sequence. Generally speaking, if the sequence returned is infinite, then it is presumed to be unwise to do prefix (or suffix) optimizations for the pattern.

Configure a limit on the length of the sequence that is permitted for a character class. If a character class exceeds this limit, then the sequence returned for it is infinite.

This prevents classes like [A-Z] or \pL from getting turned into huge and likely unproductive sequences of literals.

Example

This example shows how this limit can be lowered to decrease the tolerance for character classes being turned into literal sequences.

use regex_syntax::{hir::literal::{Extractor, Seq}, parse};

let hir = parse(r"[0-9]")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new([
    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
]);
assert_eq!(expected, got);

// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_class(4).extract(&hir);
let expected = Seq::infinite();
assert_eq!(expected, got);

Configure a limit on the total number of repetitions that is permitted before literal extraction is stopped.

This is useful for limiting things like (abcde){50}, or more insidiously, (?:){1000000000}. This limit prevents any one single repetition from adding too much to a literal sequence.

With this limit set, repetitions that exceed it will be stopped and any literals extracted up to that point will be made inexact.

Example

This shows how to decrease the limit and compares it with the default.

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"(abc){8}")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
assert_eq!(expected, got);

// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_repeat(4).extract(&hir);
let expected = Seq::from_iter([
    Literal::inexact("abcabcabcabc"),
]);
assert_eq!(expected, got);

Configure a limit on the maximum length of any literal in a sequence.

This is useful for limiting things like (abcde){5}{5}{5}{5}. While each repetition or literal in that regex is small, when all the repetitions are applied, one ends up with a literal of length 5^4 = 625.

With this limit set, literals that exceed it will be made inexact and thus prevented from growing.

Example

This shows how to decrease the limit and compares it with the default.

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"(abc){2}{2}{2}")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
assert_eq!(expected, got);

// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_literal_len(14).extract(&hir);
let expected = Seq::from_iter([
    Literal::inexact("abcabcabcabcab"),
]);
assert_eq!(expected, got);

Configure a limit on the total number of literals that will be returned.

This is useful as a practical measure for avoiding the creation of large sequences of literals. While the extractor will automatically handle local creations of large sequences (for example, [A-Z] yields an infinite sequence by default), large sequences can be created through non-local means as well.

For example, [ab]{3}{3} would yield a sequence of length 512 = 2^9 despite each of the repetitions being small on their own. This limit thus represents a “catch all” for avoiding locally small sequences from combining into large sequences.

Example

This example shows how reducing the limit will change the literal sequence returned.

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"[ab]{2}{2}")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new([
    "aaaa", "aaab", "aaba", "aabb",
    "abaa", "abab", "abba", "abbb",
    "baaa", "baab", "baba", "babb",
    "bbaa", "bbab", "bbba", "bbbb",
]);
assert_eq!(expected, got);

// The default limit is not too big, but big enough to extract all
// literals from '[ab]{2}{2}'. If we shrink the limit to less than 16,
// then we'll get a truncated set. Notice that it returns a sequence of
// length 4 even though our limit was 10. This is because the sequence
// is difficult to increase without blowing the limit. Notice also
// that every literal in the sequence is now inexact because they were
// stripped of some suffix.
let got = Extractor::new().limit_total(10).extract(&hir);
let expected = Seq::from_iter([
    Literal::inexact("aa"),
    Literal::inexact("ab"),
    Literal::inexact("ba"),
    Literal::inexact("bb"),
]);
assert_eq!(expected, got);

Trait Implementations

Returns a copy of the value. Read more
Performs copy-assignment from source. Read more
Formats the value using the given formatter. Read more
Returns the “default value” for a type. Read more

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self. Read more
Immutably borrows from an owned value. Read more
Mutably borrows from an owned value. Read more

Returns the argument unchanged.

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

The resulting type after obtaining ownership.
Creates owned data from borrowed data, usually by cloning. Read more
Uses borrowed data to replace owned data, usually by cloning. Read more
The type returned in the event of a conversion error.
Performs the conversion.
The type returned in the event of a conversion error.
Performs the conversion.