Module regex_syntax::utf8
source · [−]Expand description
Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes.
This is sub-module is useful for constructing byte based automatons that need
to embed UTF-8 decoding. The most common use of this module is in conjunction
with the hir::ClassUnicodeRange
type.
See the documentation on the Utf8Sequences
iterator for more details and
an example.
Wait, what is this?
This is simplest to explain with an example. Let’s say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is [0400-04FF]
. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:
[D0-D3][80-BF]
This is simple enough: simply encode the boundaries, 0400
encodes to
D0 80
and 04FF
encodes to D3 BF
, and create ranges from each
corresponding pair of bytes: D0
to D3
and 80
to BF
.
However, what if you wanted to add the Cyrillic Supplementary characters to
your range? Your range might then become [0400-052F]
. The same procedure
as above doesn’t quite work because 052F
encodes to D4 AF
. The byte ranges
you’d get from the previous transformation would be [D0-D4][80-AF]
. However,
this isn’t quite correct because this range doesn’t capture many characters,
for example, 04FF
(because its last byte, BF
isn’t in the range 80-AF
).
Instead, you need multiple sequences of byte ranges:
[D0-D3][80-BF] # matches codepoints 0400-04FF
[D4][80-AF] # matches codepoints 0500-052F
This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequence of byte
ranges for the basic multilingual plane ([0000-FFFF]
) look like this:
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
Note that the byte ranges above will not match any erroneous encoding of UTF-8, including encodings of surrogate codepoints.
And, of course, for all of Unicode ([000000-10FFFF]
):
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]
This module automates the process of creating these byte ranges from ranges of Unicode scalar values.
Lineage
I got the idea and general implementation strategy from Russ Cox in his
article on regexps and RE2.
Russ Cox got it from Ken Thompson’s grep
(no source, folk lore?).
I also got the idea from
Lucene,
which uses it for executing automata on their term index.