Enum regex_syntax::hir::Class
pub enum Class {
    Unicode(ClassUnicode),
    Bytes(ClassBytes),
}
The high-level intermediate representation of a character class.
A character class corresponds to a set of characters. A character is either defined by a Unicode scalar value or a byte. Unicode characters are used by default, while bytes are used when Unicode mode (via the u flag) is disabled.
A character class, regardless of its character type, is represented by a sequence of non-overlapping non-adjacent ranges of characters.
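The range invariant can be illustrated with a small standalone sketch. It uses only the standard library and models ranges as inclusive (u8, u8) pairs; the function name canonicalize is hypothetical and is not part of regex_syntax, and the crate's own normalization may differ in detail.

```rust
/// Hypothetical helper (not part of regex_syntax): normalize inclusive byte
/// ranges into a sorted, non-overlapping, non-adjacent sequence.
/// Assumes every input pair satisfies start <= end.
fn canonicalize(mut ranges: Vec<(u8, u8)>) -> Vec<(u8, u8)> {
    // Sort by start so that mergeable ranges end up next to each other.
    ranges.sort();
    let mut out: Vec<(u8, u8)> = Vec::new();
    for (start, end) in ranges {
        if let Some(last) = out.last_mut() {
            // Merge when the ranges overlap or touch (e.g. a-c and d-f).
            // saturating_add avoids overflow when the previous end is 0xFF.
            if start <= last.1.saturating_add(1) {
                last.1 = last.1.max(end);
                continue;
            }
        }
        out.push((start, end));
    }
    out
}
```

Under this model, the ranges d-f, a-c, x-z and b-e collapse to a-f and x-z: the first three overlap or touch, while x-z stays separate.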
Note that the Bytes variant may be produced even when it exclusively matches valid UTF-8. This is because a Bytes variant represents an intention by the author of the regular expression to disable Unicode mode, which in turn impacts the semantics of case insensitive matching. For example, (?i)k and (?i-u)k will not match the same set of strings.
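One character that separates the two patterns is U+212A KELVIN SIGN, which is equivalent to k under Unicode case rules but not under ASCII-only rules. The sketch below shows the difference using only the standard library. Both function names are hypothetical, and std's lowercasing is used here as a stand-in for simple case folding; regex_syntax relies on its own case folding tables.

```rust
/// Hypothetical helper: is `c` equal to 'k' under ASCII-only case rules,
/// roughly what (?i-u)k is limited to?
fn matches_k_ascii(c: char) -> bool {
    c.eq_ignore_ascii_case(&'k')
}

/// Hypothetical helper: is `c` equal to 'k' after Unicode lowercasing?
/// std lowercasing is only an approximation of simple case folding.
fn matches_k_unicode(c: char) -> bool {
    c.to_lowercase().eq("k".chars())
}
```

Both helpers accept k and K, but only the Unicode-aware one accepts '\u{212A}'.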
Variants
Unicode(ClassUnicode)
A set of characters represented by Unicode scalar values.
Bytes(ClassBytes)
A set of characters represented by arbitrary bytes (one byte per character).
Implementations
impl Class
pub fn case_fold_simple(&mut self)
Apply Unicode simple case folding to this character class, in place. The character class will be expanded to include all simple case folded character variants.
If this is a byte oriented character class, then this will be limited to the ASCII ranges A-Z and a-z.
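As a rough illustration of the byte oriented case, the following sketch models a class as one membership flag per byte value and folds only the ASCII letters. The function name fold_ascii and the [bool; 256] representation are assumptions made for this example; the crate itself works on ranges.

```rust
/// Hypothetical model of byte-oriented case folding (not the crate's code):
/// `set[b]` is true when byte `b` is a member of the class.
fn fold_ascii(set: &mut [bool; 256]) {
    for lower in b'a'..=b'z' {
        let upper = lower.to_ascii_uppercase();
        // If either case of an ASCII letter is present, include both.
        let present = set[lower as usize] || set[upper as usize];
        set[lower as usize] = present;
        set[upper as usize] = present;
    }
    // Bytes outside A-Z and a-z, such as 0xE9, are left untouched.
}
```

Starting from a set holding k, Q and the byte 0xE9, folding adds K and q but does not add 0xC9, because non-ASCII bytes have no case counterpart in this mode.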
Panics
This routine panics when the case mapping data necessary for this
routine to complete is unavailable. This occurs when the unicode-case
feature is not enabled and the underlying class is Unicode oriented.
Callers should prefer try_case_fold_simple, which returns an error instead of panicking.
pub fn try_case_fold_simple(&mut self) -> Result<(), CaseFoldError>
Apply Unicode simple case folding to this character class, in place. The character class will be expanded to include all simple case folded character variants.
If this is a byte oriented character class, then this will be limited to the ASCII ranges A-Z and a-z.
Error
This routine returns an error when the case mapping data necessary for this routine to complete is unavailable. This occurs when the unicode-case feature is not enabled and the underlying class is Unicode oriented.
pub fn negate(&mut self)
Negate this character class in place.
After completion, this character class will contain precisely the characters that weren’t previously in the class.
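Negation over a canonical range sequence amounts to collecting the gaps between ranges. Here is a minimal sketch for the byte case, assuming the input is sorted, non-overlapping and non-adjacent. It returns a new vector rather than mutating in place, and the function shown is a hypothetical free function, not the Class::negate method.

```rust
/// Hypothetical helper: complement of sorted, non-overlapping inclusive
/// byte ranges over the full domain 0x00..=0xFF.
fn negate(ranges: &[(u8, u8)]) -> Vec<(u8, u8)> {
    let mut out = Vec::new();
    // First byte not yet covered; u16 so that 256 can mean "past 0xFF".
    let mut next: u16 = 0;
    for &(start, end) in ranges {
        if u16::from(start) > next {
            // The gap before this range belongs to the complement.
            out.push((next as u8, start - 1));
        }
        next = u16::from(end) + 1;
    }
    if next <= 0xFF {
        // Everything after the last range belongs to the complement.
        out.push((next as u8, 0xFF));
    }
    out
}
```

For example, negating a-z yields 0x00-0x60 and 0x7B-0xFF, and negating that result gives back a-z.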
pub fn is_utf8(&self) -> bool
Returns true if and only if this character class will only ever match valid UTF-8.
A character class can match invalid UTF-8 only when the following conditions are met:
- The translator was configured to permit generating an expression that can match invalid UTF-8. (By default, this is disabled.)
- Unicode mode (via the u flag) was disabled either in the concrete syntax or in the parser builder. By default, Unicode mode is enabled.
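For intuition about why only byte oriented classes are affected: a class matches exactly one character, and a single byte is valid UTF-8 only when it is ASCII. The sketch below encodes that rule for byte ranges. It is an assumption about how such a check could be written, with a hypothetical function name, and is not taken from the crate's source.

```rust
/// Hypothetical check: a byte class can only ever match valid UTF-8 when
/// every member byte is ASCII, since a lone byte >= 0x80 is never a
/// complete UTF-8 sequence. Ranges are inclusive (start, end) pairs.
fn byte_ranges_only_match_utf8(ranges: &[(u8, u8)]) -> bool {
    ranges.iter().all(|&(_, end)| end <= 0x7F)
}
```

Under this rule, the byte class a-z is UTF-8 safe while 0x00-0xFF is not. The standard library agrees at the single-byte level: std::str::from_utf8 accepts a one-byte slice only when that byte is at most 0x7F.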
pub fn minimum_len(&self) -> Option<usize>
Returns the length, in bytes, of the smallest string matched by this character class.
For non-empty byte oriented classes, this always returns 1. For non-empty Unicode oriented classes, this can return 1, 2, 3 or 4. For empty classes, None is returned. It is impossible for 0 to be returned.
Example
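This example shows several regexes and their corresponding minimum length, if any. It queries whole expressions through hir.properties(), which is why a minimum of 0 appears for regexes that can match the empty string, something a lone character class can never report.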
use regex_syntax::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The empty string has a min length of 0.
    let hir = parse(r"")?;
    assert_eq!(Some(0), hir.properties().minimum_len());
    // As do other types of regexes that only match the empty string.
    let hir = parse(r"^$\b\B")?;
    assert_eq!(Some(0), hir.properties().minimum_len());
    // A regex that can match the empty string but match more is still 0.
    let hir = parse(r"a*")?;
    assert_eq!(Some(0), hir.properties().minimum_len());
    // A regex that matches nothing has no minimum defined.
    let hir = parse(r"[a&&b]")?;
    assert_eq!(None, hir.properties().minimum_len());
    // Character classes usually have a minimum length of 1.
    let hir = parse(r"\w")?;
    assert_eq!(Some(1), hir.properties().minimum_len());
    // But sometimes Unicode classes might be bigger!
    let hir = parse(r"\p{Cyrillic}")?;
    assert_eq!(Some(2), hir.properties().minimum_len());
    Ok(())
}
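The 1 to 4 byte results for Unicode oriented classes follow from UTF-8 encoding lengths. A minimal standalone sketch, assuming the class is modeled as sorted inclusive (char, char) ranges: because the encoded length never decreases as the scalar value grows, the start of the first range decides the minimum. The function name min_utf8_len is hypothetical, and the Cyrillic range in the note below covers only the basic U+0400 to U+04FF block for illustration.

```rust
/// Hypothetical helper: smallest UTF-8 encoded length, in bytes, of any
/// character in a class given as sorted inclusive (start, end) ranges.
fn min_utf8_len(ranges: &[(char, char)]) -> Option<usize> {
    // An empty class matches nothing, so there is no minimum (None).
    // Otherwise the lowest scalar value is the shortest to encode.
    ranges.first().map(|&(start, _)| start.len_utf8())
}
```

With this model, a-z gives Some(1), the range U+0400 to U+04FF gives Some(2), and an empty slice gives None.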
pub fn maximum_len(&self) -> Option<usize>
Returns the length, in bytes, of the longest string matched by this character class.
For non-empty byte oriented classes, this always returns 1. For non-empty Unicode oriented classes, this can return 1, 2, 3 or 4. For empty classes, None is returned. It is impossible for 0 to be returned.
Example
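This example shows several regexes and their corresponding maximum length, if any. It queries whole expressions through hir.properties(), which is why results such as Some(0), Some(10) and an unbounded None appear, none of which a non-empty character class can report.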
use regex_syntax::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The empty string has a max length of 0.
    let hir = parse(r"")?;
    assert_eq!(Some(0), hir.properties().maximum_len());
    // As do other types of regexes that only match the empty string.
    let hir = parse(r"^$\b\B")?;
    assert_eq!(Some(0), hir.properties().maximum_len());
    // A regex that matches nothing has no maximum defined.
    let hir = parse(r"[a&&b]")?;
    assert_eq!(None, hir.properties().maximum_len());
    // Bounded repeats work as you expect.
    let hir = parse(r"x{2,10}")?;
    assert_eq!(Some(10), hir.properties().maximum_len());
    // An unbounded repeat means there is no maximum.
    let hir = parse(r"x{2,}")?;
    assert_eq!(None, hir.properties().maximum_len());
    // With Unicode enabled, \w can match up to 4 bytes!
    let hir = parse(r"\w")?;
    assert_eq!(Some(4), hir.properties().maximum_len());
    // Without Unicode enabled, \w matches at most 1 byte.
    let hir = parse(r"(?-u)\w")?;
    assert_eq!(Some(1), hir.properties().maximum_len());
    Ok(())
}
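The contrast between the two kinds of class can be sketched with a small stand-in enum. ClassModel and max_len are hypothetical names introduced for this example and are not the crate's types; ranges are assumed to be sorted and inclusive.

```rust
/// Hypothetical stand-in for the two kinds of class (not regex_syntax types).
enum ClassModel {
    Unicode(Vec<(char, char)>),
    Bytes(Vec<(u8, u8)>),
}

/// Longest match, in bytes, of a single character from the class.
fn max_len(class: &ClassModel) -> Option<usize> {
    match class {
        // The end of the last range is the highest scalar value, and
        // higher scalar values never encode to fewer UTF-8 bytes.
        ClassModel::Unicode(ranges) => ranges.last().map(|&(_, end)| end.len_utf8()),
        // Every member of a non-empty byte class is exactly one byte.
        ClassModel::Bytes(ranges) => ranges.last().map(|_| 1),
    }
}
```

A Unicode model whose last range ends in a supplementary-plane character reports Some(4), any non-empty byte model reports Some(1), and an empty model of either kind reports None.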