Tokenizers are in charge of chopping text into a stream of tokens ready for indexing.
You must define in your schema which tokenizer should be used for each of your fields:
use tantivy::schema::*;

let mut schema_builder = Schema::builder();
// Stemmed, searchable text fields.
let text_options = TextOptions::default()
    .set_indexing_options(
        TextFieldIndexing::default()
            .set_tokenizer("en_stem")
            .set_index_option(IndexRecordOption::Basic)
    )
    .set_stored();
// Note: "raw_ids" is not one of the tokenizers tantivy offers by default;
// a tokenizer has to be registered under that name on the index.
let id_options = TextOptions::default()
    .set_indexing_options(
        TextFieldIndexing::default()
            .set_tokenizer("raw_ids")
            .set_index_option(IndexRecordOption::WithFreqsAndPositions)
    )
    .set_stored();
schema_builder.add_text_field("title", text_options.clone());
schema_builder.add_text_field("text", text_options);
schema_builder.add_text_field("uuid", id_options);
let schema = schema_builder.build();
By default, tantivy offers the following tokenizers:
default
default is the tokenizer that will be used if you do not assign a specific tokenizer to your text field.
It chops your text on punctuation and whitespace, removes tokens that are longer than 40 chars, and lowercases your text.
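The three steps described for default can be sketched with the standard library alone. This is an illustration under stated assumptions, not tantivy's implementation: the function name default_like_tokens is hypothetical, the text is split on every non-alphanumeric character, and the 40 limit is measured in UTF-8 bytes.

```rust
/// Std-only approximation of the three steps described for `default`.
/// Hypothetical helper, not part of tantivy.
fn default_like_tokens(text: &str) -> Vec<String> {
    // Assumption: the limit is measured in UTF-8 bytes.
    const MAX_LEN: usize = 40;
    text.split(|c: char| !c.is_alphanumeric()) // chop on punctuation and whitespace
        .filter(|t| !t.is_empty()) // drop the empty pieces between separators
        .filter(|t| t.len() <= MAX_LEN) // remove tokens longer than the limit
        .map(|t| t.to_lowercase()) // lowercase what is left
        .collect()
}

fn main() {
    let tokens = default_like_tokens("Hello, Tantivy! Tokenizers are FAST.");
    println!("{:?}", tokens);
}
```

Because the split happens on every non-alphanumeric character, a word such as "It's" yields two tokens in this sketch.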
raw
Does not actually tokenize your text. It keeps it entirely unprocessed. It can be useful to index uuids or urls, for instance.
en_stem
In addition to what default does, the en_stem tokenizer also applies stemming to your tokens. Stemming consists of trimming words to remove their inflection. This tokenizer is slower than the default one, but is recommended to improve recall.
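To see why trimming inflections helps recall, here is a deliberately naive sketch. The toy_stem function is hypothetical and only strips a few English suffixes; it is not the stemming algorithm tantivy uses. It only shows how different inflected forms can collapse to one indexed term.

```rust
/// Deliberately naive suffix stripping; NOT the algorithm behind `en_stem`.
/// Hypothetical helper for illustration only.
fn toy_stem(token: &str) -> String {
    for suffix in ["ing", "ed", "s"] {
        if let Some(stem) = token.strip_suffix(suffix) {
            // Keep very short words intact so that e.g. "is" is not reduced to "i".
            if stem.len() >= 3 {
                return stem.to_string();
            }
        }
    }
    token.to_string()
}

fn main() {
    // All four forms map to the same term, so a query for one form
    // can match documents containing any of the others.
    for word in ["walking", "walked", "walks", "walk"] {
        println!("{} -> {}", word, toy_stem(word));
    }
}
```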
Custom tokenizers
You can write your own tokenizer by implementing the Tokenizer trait, or you can extend an existing Tokenizer by chaining it with several TokenFilters.
For instance, the en_stem tokenizer is defined as follows.
use tantivy::tokenizer::*;

let en_stem = TextAnalyzer::from(SimpleTokenizer)
    .filter(RemoveLongFilter::limit(40))
    .filter(LowerCaser)
    .filter(Stemmer::new(Language::English));
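The chaining pattern itself can be sketched without tantivy. In the sketch below, ToyAnalyzer, Filter, split_words and build_analyzer are all hypothetical names, and tokens are plain Strings rather than tantivy's token type. The point is only the shape: one tokenizer, followed by filters applied in the order they were chained.

```rust
/// A token filter in this sketch: takes the tokens so far, returns the filtered tokens.
type Filter = Box<dyn Fn(Vec<String>) -> Vec<String>>;

/// Hypothetical stand-in for an analyzer: one tokenizer followed by a chain of filters.
struct ToyAnalyzer {
    tokenizer: fn(&str) -> Vec<String>,
    filters: Vec<Filter>,
}

impl ToyAnalyzer {
    fn new(tokenizer: fn(&str) -> Vec<String>) -> Self {
        ToyAnalyzer { tokenizer, filters: Vec::new() }
    }

    /// Appends a filter and returns the analyzer, which is what makes chaining possible.
    fn filter(mut self, f: impl Fn(Vec<String>) -> Vec<String> + 'static) -> Self {
        self.filters.push(Box::new(f));
        self
    }

    /// Runs the tokenizer, then applies each filter in the order it was added.
    fn analyze(&self, text: &str) -> Vec<String> {
        let tokens = (self.tokenizer)(text);
        self.filters.iter().fold(tokens, |acc, f| f(acc))
    }
}

/// Splits on every non-alphanumeric character (an assumption of this sketch).
fn split_words(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(String::from)
        .collect()
}

fn build_analyzer() -> ToyAnalyzer {
    ToyAnalyzer::new(split_words)
        // Drop tokens longer than 40 bytes.
        .filter(|tokens| tokens.into_iter().filter(|t| t.len() <= 40).collect())
        // Lowercase the remaining tokens.
        .filter(|tokens| tokens.into_iter().map(|t| t.to_lowercase()).collect())
}

fn main() {
    let analyzer = build_analyzer();
    println!("{:?}", analyzer.analyze("Chaining TokenFilters, step by step"));
}
```

Filter order matters in such a chain: a stemming step placed before the lowercasing step would see mixed-case tokens.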
Once your tokenizer is defined, you need to register it with a name in your index's TokenizerManager.
let custom_en_tokenizer = SimpleTokenizer;
let index = Index::create_in_ram(schema);
index.tokenizers()
    .register("custom_en", custom_en_tokenizer);
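Registration by name is what connects the string given in the schema to an actual tokenizer. A minimal sketch of that lookup, assuming a plain HashMap behind a hypothetical ToyTokenizerManager (tantivy's own TokenizerManager is not implemented this way, and its register is called through a shared reference rather than `&mut self`):

```rust
use std::collections::HashMap;

/// In this sketch an analyzer is just a function from text to tokens.
type Analyzer = fn(&str) -> Vec<String>;

/// Hypothetical stand-in for a tokenizer manager: a map from names to analyzers.
#[derive(Default)]
struct ToyTokenizerManager {
    analyzers: HashMap<String, Analyzer>,
}

impl ToyTokenizerManager {
    /// Stores the analyzer under `name`, replacing any previous entry.
    fn register(&mut self, name: &str, analyzer: Analyzer) {
        self.analyzers.insert(name.to_string(), analyzer);
    }

    /// Looks up the analyzer a field refers to; `None` if the name was never registered.
    fn get(&self, name: &str) -> Option<Analyzer> {
        self.analyzers.get(name).copied()
    }
}

/// Emits the whole text as a single, unprocessed token.
fn raw(text: &str) -> Vec<String> {
    vec![text.to_string()]
}

/// Splits on whitespace and lowercases each token.
fn whitespace_lower(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

fn main() {
    let mut manager = ToyTokenizerManager::default();
    manager.register("raw", raw);
    manager.register("custom_en", whitespace_lower);

    for name in ["raw", "custom_en", "not_registered"] {
        match manager.get(name) {
            Some(analyzer) => println!("{}: {:?}", name, analyzer("Hello World")),
            None => println!("{}: no tokenizer registered under this name", name),
        }
    }
}
```

A name that was never registered has nothing to resolve to, which is why every tokenizer name used in a schema needs a matching registration on the index.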
If you built your schema programmatically, a complete example could look like the one below.
Note that tokens with a length greater than or equal to MAX_TOKEN_LEN are ignored.
Example
use tantivy::schema::{Schema, IndexRecordOption, TextOptions, TextFieldIndexing};
use tantivy::tokenizer::*;
use tantivy::Index;

let mut schema_builder = Schema::builder();
let text_field_indexing = TextFieldIndexing::default()
    .set_tokenizer("custom_en")
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_field_indexing)
    .set_stored();
schema_builder.add_text_field("title", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);

// We need to register our tokenizer:
let custom_en_tokenizer = TextAnalyzer::from(SimpleTokenizer)
    .filter(RemoveLongFilter::limit(40))
    .filter(LowerCaser);
index
    .tokenizers()
    .register("custom_en", custom_en_tokenizer);
Structs
TokenFilter that removes all tokens that contain non ascii alphanumeric characters.
Box<dyn TokenFilter + 'a>.
Box<dyn TokenStream + 'a>.
FacetTokenizer processes a Facet binary representation and emits a token for each of its parents.
TokenStream implementation which wraps PreTokenizedString.
RemoveLongFilter removes tokens that are longer than a given number of bytes (in UTF-8 representation).
TokenFilter which splits compound words into their parts based on a given dictionary.
Stemmer token filter. Several languages are supported, see Language for the available languages. Tokens are expected to be lowercased beforehand.
TokenFilter that removes stop words from a token stream.
TextAnalyzer tokenizes an input text into tokens and modifies the resulting TokenStream.
Enums
Constants
Traits
Tokenizers.
TokenStream is the result of the tokenization.
Tokenizers are in charge of splitting text into a stream of tokens before indexing.