Module encoding_rs::mem
Functions for converting between different in-RAM representations of text and for quickly checking if the Unicode Bidirectional Algorithm can be avoided.
By using slices for output, the functions here seek to enable by-register (ALU register or SIMD register as available) operations in order to outperform iterator-based conversions available in the Rust standard library.
Note: “Latin1” in this module refers to the Unicode range from U+0000 to U+00FF, inclusive, and does not refer to the windows-1252 range. This in-memory encoding is sometimes used as a storage optimization of text when UTF-16 indexing and length semantics are exposed.
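To make the distinction concrete, here is a minimal std-only sketch of decoding "Latin1" in the sense described above. The function name `latin1_to_string` is hypothetical and is not part of this module; the point is only that byte 0x80 maps to U+0080, not to the euro sign as it would in windows-1252.

```rust
/// Hypothetical helper: decodes Latin1 in this module's sense,
/// where each byte value is the code point itself.
fn latin1_to_string(bytes: &[u8]) -> String {
    // `u8 as char` maps 0x00..=0xFF directly to U+0000..=U+00FF.
    bytes.iter().map(|&b| b as char).collect()
}
```

Under windows-1252, byte 0x80 would decode to U+20AC instead, which is why the two must not be confused.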
The FFI bindings for this module are in the encoding_c_mem crate.
Enums
Classification of text as Latin1 (all code points are below U+0100),
left-to-right with some non-Latin1 characters or as containing at least
some right-to-left characters.
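The three-way classification can be sketched as follows. The names `TextClass`, `is_rtl_simplified` and `classify` are hypothetical, and the right-to-left test is a deliberate simplification (a single Hebrew-through-Arabic range plus a few directional controls); the set of code points the module actually treats as triggering right-to-left processing is assumed to be larger.

```rust
/// Hypothetical stand-in for the classification described above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TextClass {
    /// All code points are below U+0100.
    Latin1,
    /// Some non-Latin1 code points, none of them right-to-left.
    LeftToRight,
    /// At least one right-to-left code point.
    Bidi,
}

/// Simplified assumption: only U+0590..=U+08FF and a few
/// right-to-left control characters count as right-to-left.
fn is_rtl_simplified(c: char) -> bool {
    matches!(
        c,
        '\u{0590}'..='\u{08FF}' | '\u{200F}' | '\u{202B}' | '\u{202E}' | '\u{2067}'
    )
}

fn classify(s: &str) -> TextClass {
    let mut all_latin1 = true;
    for c in s.chars() {
        if is_rtl_simplified(c) {
            // One right-to-left code point is enough to decide.
            return TextClass::Bidi;
        }
        if c > '\u{FF}' {
            all_latin1 = false;
        }
    }
    if all_latin1 {
        TextClass::Latin1
    } else {
        TextClass::LeftToRight
    }
}
```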
Functions
Checks whether a valid UTF-8 buffer contains code points
that trigger right-to-left processing or is all-Latin1.
Checks whether a potentially invalid UTF-8 buffer contains code points
that trigger right-to-left processing or is all-Latin1.
Checks whether a potentially invalid UTF-16 buffer contains code points
that trigger right-to-left processing or is all-Latin1.
Converts bytes whose unsigned value is interpreted as a Unicode code point
(i.e. U+0000 to U+00FF, inclusive) to UTF-8 such that the validity of the
output is signaled using the Rust type system.
Converts bytes whose unsigned value is interpreted as a Unicode code point
(i.e. U+0000 to U+00FF, inclusive) to UTF-8 such that the validity of the
output is signaled using the Rust type system with potentially insufficient
output space.
Converts bytes whose unsigned value is interpreted as a Unicode code point
(i.e. U+0000 to U+00FF, inclusive) to UTF-8.
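As an illustration of slice-based output, here is a minimal std-only sketch of a Latin1 to UTF-8 conversion. The name `latin1_to_utf8` and the panic on a short destination are assumptions of this sketch, not a statement about the module's implementation. Each input byte needs at most two output bytes, so the sketch asks for a destination of twice the source length.

```rust
/// Hypothetical sketch: writes the UTF-8 form of Latin1 `src` into `dst`
/// and returns the number of bytes written.
fn latin1_to_utf8(src: &[u8], dst: &mut [u8]) -> usize {
    // Worst case: every input byte is >= 0x80 and expands to two bytes.
    assert!(dst.len() >= src.len() * 2, "destination too short");
    let mut written = 0;
    for &b in src {
        if b < 0x80 {
            // ASCII is copied through unchanged.
            dst[written] = b;
            written += 1;
        } else {
            // U+0080..=U+00FF becomes a two-byte sequence: 110xxxxx 10xxxxxx.
            dst[written] = 0xC0 | (b >> 6);
            dst[written + 1] = 0x80 | (b & 0x3F);
            written += 2;
        }
    }
    written
}
```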
Converts bytes whose unsigned value is interpreted as a Unicode code point
(i.e. U+0000 to U+00FF, inclusive) to UTF-8 with potentially insufficient
output space.
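When the output may be too small, a conversion has to report how far it got. The following is a sketch under the assumption that progress is reported as a (bytes read, bytes written) pair; the name `latin1_to_utf8_partial` and the return shape are hypothetical here.

```rust
/// Hypothetical sketch: converts as much of Latin1 `src` as fits in `dst`.
/// Returns (input bytes consumed, output bytes written).
fn latin1_to_utf8_partial(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let mut read = 0;
    let mut written = 0;
    while read < src.len() {
        let b = src[read];
        let needed = if b < 0x80 { 1 } else { 2 };
        if dst.len() - written < needed {
            // Never emit half of a two-byte sequence.
            break;
        }
        if b < 0x80 {
            dst[written] = b;
        } else {
            dst[written] = 0xC0 | (b >> 6);
            dst[written + 1] = 0x80 | (b & 0x3F);
        }
        written += needed;
        read += 1;
    }
    (read, written)
}
```

A caller can resume from `src[read..]` with a fresh output buffer.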
Converts bytes whose unsigned value is interpreted as a Unicode code point
(i.e. U+0000 to U+00FF, inclusive) to UTF-16.
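Because every Latin1 byte is already a code point below U+0100, conversion to UTF-16 amounts to zero-extension. A minimal sketch, with the hypothetical name `latin1_to_utf16` and an assumed panic when the destination is shorter than the source:

```rust
/// Hypothetical sketch: zero-extends each Latin1 byte to a UTF-16 code unit.
fn latin1_to_utf16(src: &[u8], dst: &mut [u16]) {
    assert!(dst.len() >= src.len(), "destination too short");
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = u16::from(s);
    }
}
```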
Converts valid UTF-8 to valid UTF-16.
If the input is valid UTF-8 representing only Unicode code points from
U+0000 to U+00FF, inclusive, converts the input into output that
represents the value of each code point as the unsigned byte value of
each output byte.
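The reverse direction can be sketched with the standard library alone. The description above does not say what happens when the precondition is violated, so this sketch makes its own choice and returns `None` in that case; the name `utf8_to_latin1` is hypothetical.

```rust
/// Hypothetical sketch: narrows UTF-8 text to one byte per code point.
/// Returns the number of bytes written, or `None` if a code point is
/// above U+00FF or `dst` runs out of space (a choice made for this sketch).
fn utf8_to_latin1(src: &str, dst: &mut [u8]) -> Option<usize> {
    let mut written = 0;
    for c in src.chars() {
        let cp = u32::from(c);
        if cp > 0xFF || written == dst.len() {
            return None;
        }
        // Safe to truncate: cp was just checked to fit in a byte.
        dst[written] = cp as u8;
        written += 1;
    }
    Some(written)
}
```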
Converts potentially-invalid UTF-8 to valid UTF-16 with errors replaced
with the REPLACEMENT CHARACTER.
Converts potentially-invalid UTF-8 to valid UTF-16 signaling on error.
If the input is valid UTF-16 representing only Unicode code points from
U+0000 to U+00FF, inclusive, converts the input into output that
represents the value of each code point as the unsigned byte value of
each output byte.
Converts potentially-invalid UTF-16 to valid UTF-8 with errors replaced
with the REPLACEMENT CHARACTER such that the validity of the output is
signaled using the Rust type system.
Converts potentially-invalid UTF-16 to valid UTF-8 with errors replaced
with the REPLACEMENT CHARACTER such that the validity of the output is
signaled using the Rust type system with potentially insufficient output
space.
Converts potentially-invalid UTF-16 to valid UTF-8 with errors replaced
with the REPLACEMENT CHARACTER.
Converts potentially-invalid UTF-16 to valid UTF-8 with errors replaced
with the REPLACEMENT CHARACTER with potentially insufficient output
space.
Copies ASCII from source to destination up to the first non-ASCII byte
(or the end of the input if it is ASCII in its entirety).
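A minimal sketch of this "copy the ASCII prefix" behavior, assuming the function reports how many bytes it copied. The name `copy_ascii_prefix` and the panic on a short destination are assumptions of the sketch.

```rust
/// Hypothetical sketch: copies the leading ASCII run of `src` into `dst`
/// and returns its length.
fn copy_ascii_prefix(src: &[u8], dst: &mut [u8]) -> usize {
    assert!(dst.len() >= src.len(), "destination too short");
    // Index of the first byte with the high bit set, or the full length.
    let n = src.iter().position(|&b| b >= 0x80).unwrap_or(src.len());
    dst[..n].copy_from_slice(&src[..n]);
    n
}
```

The returned count tells the caller where a slower, non-ASCII-aware path would need to take over.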
Copies ASCII from source to destination zero-extending it to UTF-16 up to
the first non-ASCII byte (or the end of the input if it is ASCII in its
entirety).
Copies Basic Latin from source to destination narrowing it to ASCII up to
the first non-Basic Latin code unit (or the end of the input if it is
Basic Latin in its entirety).
Converts bytes whose unsigned value is interpreted as a Unicode code point
(i.e. U+0000 to U+00FF, inclusive) to UTF-8.
If the input is valid UTF-8 representing only Unicode code points from
U+0000 to U+00FF, inclusive, converts the input into output that
represents the value of each code point as the unsigned byte value of
each output byte.
Replaces unpaired surrogates in the input with the REPLACEMENT CHARACTER.
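In-place repair of a UTF-16 buffer can be sketched as below. The name `replace_unpaired_surrogates` is hypothetical; the sketch treats a high surrogate (0xD800..=0xDBFF) immediately followed by a low surrogate (0xDC00..=0xDFFF) as the only well-formed use of surrogates.

```rust
/// Hypothetical sketch: overwrites every unpaired surrogate with U+FFFD.
fn replace_unpaired_surrogates(buf: &mut [u16]) {
    let mut i = 0;
    while i < buf.len() {
        let unit = buf[i];
        let is_high = (0xD800..=0xDBFF).contains(&unit);
        let is_low = (0xDC00..=0xDFFF).contains(&unit);
        if is_high && i + 1 < buf.len() && (0xDC00..=0xDFFF).contains(&buf[i + 1]) {
            // Well-formed surrogate pair: skip both code units.
            i += 2;
            continue;
        }
        if is_high || is_low {
            // Lone surrogate: replace with the REPLACEMENT CHARACTER.
            buf[i] = 0xFFFD;
        }
        i += 1;
    }
}
```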
Checks whether the buffer is all-ASCII.
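The by-register idea mentioned in the module description can be illustrated with an ASCII check that examines eight bytes per step in an ordinary 64-bit register. This is only a sketch of the general technique, not the module's code; the name `is_ascii_by_word` is hypothetical.

```rust
/// Hypothetical sketch: checks for ASCII one 64-bit word at a time.
fn is_ascii_by_word(buf: &[u8]) -> bool {
    // A byte is non-ASCII exactly when its high bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = buf.chunks_exact(8);
    for chunk in &mut chunks {
        let mut word_bytes = [0u8; 8];
        word_bytes.copy_from_slice(chunk);
        // Byte order does not matter: the mask is the same in every lane.
        let word = u64::from_ne_bytes(word_bytes);
        if (word & HIGH_BITS) != 0 {
            return false;
        }
    }
    // Fewer than eight bytes remain; check them individually.
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```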
Checks whether the buffer is all-Basic Latin (i.e. UTF-16 representing
only ASCII characters).
Checks whether a scalar value triggers right-to-left processing.
Checks whether a valid UTF-8 buffer contains code points that trigger
right-to-left processing.
Checks whether the buffer represents only code points less than or equal
to U+00FF.
Checks whether a potentially-invalid UTF-8 buffer contains code points
that trigger right-to-left processing.
Checks whether the buffer is valid UTF-8 representing only code points
less than or equal to U+00FF.
Checks whether a UTF-16 buffer contains code points that trigger
right-to-left processing.
Checks whether a UTF-16 code unit triggers right-to-left processing.
Checks whether the buffer represents only code points less than or equal
to U+00FF.
Returns the index of the first byte that starts a non-Latin1 byte
sequence, or the length of the string if there are none.
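A std-only sketch of this "up to" style of result for valid UTF-8 input. The name `str_latin1_up_to` is hypothetical.

```rust
/// Hypothetical sketch: byte index of the first code point above U+00FF,
/// or the length of the string if there is none.
fn str_latin1_up_to(s: &str) -> usize {
    s.char_indices()
        .find(|&(_, c)| c > '\u{FF}')
        .map_or(s.len(), |(i, _)| i)
}
```

Returning an index rather than a boolean lets a caller convert the Latin1 prefix cheaply and fall back to a general path only for the remainder.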
Returns the index of the first byte that starts an invalid byte
sequence or a non-Latin1 byte sequence, or the length of the
string if there are neither.
Returns the index of the first unpaired surrogate or, if the input is
valid UTF-16 in its entirety, the length of the input.
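The same idea for UTF-16 can be sketched as a scan for the first surrogate that is not part of a well-formed pair. The name `utf16_valid_up_to` is hypothetical.

```rust
/// Hypothetical sketch: index of the first unpaired surrogate,
/// or `buf.len()` if the whole buffer is valid UTF-16.
fn utf16_valid_up_to(buf: &[u16]) -> usize {
    let mut i = 0;
    while i < buf.len() {
        match buf[i] {
            // High surrogate: valid only if a low surrogate follows.
            0xD800..=0xDBFF => {
                if i + 1 < buf.len() && (0xDC00..=0xDFFF).contains(&buf[i + 1]) {
                    i += 2;
                } else {
                    return i;
                }
            }
            // Low surrogate without a preceding high surrogate.
            0xDC00..=0xDFFF => return i,
            _ => i += 1,
        }
    }
    buf.len()
}
```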