enchant.tokenize: String tokenization functions for PyEnchant
An important task in spellchecking is breaking up large bodies of text into their constituent words, each of which is then checked for correctness. This package provides Python functions to split strings into words according to the rules of a particular language.
Each tokenization function accepts a string as its only positional argument, and returns an iterator that yields tuples of the following form, one for each word found:
(<word>,<pos>)
The meanings of these fields should be clear: word is the word that was found, and pos is the position within the text at which the word began (zero-indexed, of course). The function will work on any string-like object that supports array slicing; in particular, character-array objects from the array module may be used.
The iterator also provides the attribute offset, which gives the current position of the tokenizer inside the string being split, and the method set_offset() for manually adjusting this position. This can be used, for example, if the string's contents have changed during the tokenization process.
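For example, the following minimal sketch (using the basic_tokenize class documented below) prints each word together with the tokenizer's current offset:

from enchant.tokenize import basic_tokenize

tknzr = basic_tokenize("some simple text to split")
for (word, pos) in tknzr:
    # pos is where the word began; tknzr.offset is the tokenizer's
    # current position within the string being split
    print(word, pos, tknzr.offset)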
To obtain an appropriate tokenization function for the language identified by tag, use the function get_tokenizer():
from enchant.tokenize import get_tokenizer

tknzr = get_tokenizer("en_US")
for (word, pos) in tknzr("text to be tokenized goes here"):
    do_something(word)
This library is designed to be easily extendible by third-party authors. To register a tokenization function for the language tag, implement it as the function tokenize within the module enchant.tokenize.<tag>. The function get_tokenizer() will automatically detect it. Note that the underscore must be used as the tag component separator in this case, in order to form a valid Python module name (e.g. “en_US” rather than “en-US”).
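As an illustration only, a module registering a tokenizer for a hypothetical tag “xx_YY” might look like the following sketch; it simply reuses the basic whitespace tokenizer, whereas a real implementation would add language-specific rules:

# Contents of a module importable as enchant.tokenize.xx_YY
from enchant.tokenize import basic_tokenize

class tokenize(basic_tokenize):
    """Tokenizer for the hypothetical language tag 'xx_YY'."""
    pass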
Currently, a tokenizer has only been implemented for the English language. Based on the author’s limited experience, this should be at least partially suitable for other languages.
This module also provides various implementations of Chunkers and Filters. These classes are designed to make it easy to work with text in a variety of common formats, by detecting and excluding parts of the text that don’t need to be checked.
A Chunker is a class designed to break a body of text into large chunks of checkable content; for example, the HTMLChunker class extracts the text content from all HTML tags but excludes the tags themselves.

A Filter is a class designed to skip individual words during the checking process; for example, the URLFilter class skips over any words that have the format of a URL.
For example, to spellcheck an HTML document it is necessary to split the text into chunks based on HTML tags, and to filter out common word forms such as URLs and WikiWords. This would look something like the following:
tknzr = get_tokenizer("en_US",(HTMLChunker,),(URLFilter,WikiWordFilter)))
text = "<html><body>the url is http://example.com</body></html>"
for (word,pos) in tknzer(text):
...check each word and react accordingly...
- class enchant.tokenize.Chunker(text: str)
Base class for text chunking functions.
A chunker is designed to chunk text into large blocks of tokens. It has the same interface as a tokenizer but is for a different purpose.
- class enchant.tokenize.EmailFilter(tokenizer: Type[tokenize] | Filter)
Filter skipping over email addresses. This filter skips any words matching the following regular expression:
^.+@[^\.].*\.[a-z]{2,}$
That is, any words that resemble email addresses.
- class enchant.tokenize.Filter(tokenizer: Type[tokenize] | Filter)
Base class for token filtering functions.
A filter is designed to wrap a tokenizer (or another Filter) and do two things:
- skip over tokens
- split tokens into sub-tokens
Subclasses have two basic options for customising their behaviour. The method _skip() may be overridden to return True for words that should be skipped, and False otherwise. The method _split() may be overridden as a tokenization function that will be applied to further tokenize any words that aren't skipped. A minimal sketch follows this entry.
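As a sketch of the _skip() hook described above, the following hypothetical filter skips any token that consists entirely of digits:

from enchant.tokenize import Filter, get_tokenizer

class NumberFilter(Filter):
    """Hypothetical filter that skips purely numeric tokens."""
    def _skip(self, word):
        return word.isdigit()

tknzr = get_tokenizer("en_US", filters=(NumberFilter,))
for (word, pos) in tknzr("call me on 5550199"):
    print(word, pos)  # the digits are skipped; the other words are yielded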
- class enchant.tokenize.HTMLChunker(text: str)
Chunker for breaking up HTML documents into chunks of checkable text.
The operation of this chunker is very simple - anything between a “<” and a “>” will be ignored. Later versions may improve the algorithm slightly.
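Because a chunker has the same interface as a tokenizer, it can be applied directly to a string. A small sketch of the behaviour described above:

from enchant.tokenize import HTMLChunker

for (chunk, pos) in HTMLChunker("<p>hello <b>world</b></p>"):
    # yields the text outside the tags (e.g. "hello " and "world")
    # along with its position in the original string
    print(repr(chunk), pos)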
- class enchant.tokenize.HashtagFilter(tokenizer: Type[tokenize] | Filter)
Filter skipping over #hashtag. This filter skips any words matching the following regular expression:
(\A|\s)#(\w+)
That is, any words that are #hashtag.
- class enchant.tokenize.MentionFilter(tokenizer: Type[tokenize] | Filter)
Filter skipping over @mention. This filter skips any words matching the following regular expression:
(\A|\s)@(\w+)
That is, any words that are @mention.
- class enchant.tokenize.URLFilter(tokenizer: Type[tokenize] | Filter)
Filter skipping over URLs. This filter skips any words matching the following regular expression:
^[a-zA-Z]+:\/\/[^\s].*
That is, any words that are URLs.
- class enchant.tokenize.WikiWordFilter(tokenizer: Type[tokenize] | Filter)
Filter skipping over WikiWords. This filter skips any words matching the following regular expression:
^([A-Z]\w+[A-Z]+\w+)
That is, any words that are WikiWords.
- class enchant.tokenize.basic_tokenize(text: str)
Tokenizer class that performs very basic word-finding.
This tokenizer does the most basic thing that could work - it splits text into words based on whitespace boundaries, and removes basic punctuation symbols from the start and end of each word.
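For example, a quick sketch of its behaviour (the exact handling of punctuation may vary slightly):

from enchant.tokenize import basic_tokenize

for (word, pos) in basic_tokenize("Hello, world!"):
    print(word, pos)
# expected output:
#   Hello 0
#   world 7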
- class enchant.tokenize.empty_tokenize
Tokenizer class that yields no elements.
- enchant.tokenize.get_tokenizer(tag: str | None = None, chunkers: Iterable[Type[Chunker] | Type[Filter]] | None = None, filters: Iterable[Type[Filter]] | None = None) → tokenize
Locate an appropriate tokenizer by language tag.
This requires importing the function tokenize from an appropriate module. The modules tried are named after the language tag, in the following order:
- the entire tag (e.g. “en_AU.py”)
- the base language code of the tag (e.g. “en.py”)
If the language tag is None, a default tokenizer (actually the English one) is returned. It is Unicode-aware and should work reasonably well for most Latin-derived languages.
If a suitable function cannot be found, raises TokenizerNotFoundError.
If given and not None, chunkers and filters must be lists of chunker classes and filter classes respectively. These will be applied to the tokenizer during creation.
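A short sketch of the lookup and fallback behaviour described above, assuming TokenizerNotFoundError is importable from this module as referenced:

from enchant.tokenize import get_tokenizer, TokenizerNotFoundError

tknzr = get_tokenizer("en_AU")   # tries "en_AU" first, then falls back to the "en" module
default = get_tokenizer()        # the default (English) tokenizer
try:
    get_tokenizer("zz_ZZ")       # hypothetical tag with no tokenizer module
except TokenizerNotFoundError:
    print("no tokenizer available for zz_ZZ")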
- class enchant.tokenize.tokenize(text: str)
Base class for all tokenizer objects.
Each tokenizer must be an iterator and provide the offset attribute as described in the documentation for this module.
While tokenizers are in fact classes, they should be treated like functions, and so are named using lower_case rather than the CamelCase that is more traditional for class names.
- class enchant.tokenize.unit_tokenize(text: str)
Tokenizer class that yields the text as a single token.
- enchant.tokenize.wrap_tokenizer(tk1: Type[tokenize] | Filter, tk2: Type[tokenize] | Filter) → Filter
Wrap one tokenizer inside another.
This function takes two tokenizer functions tk1 and tk2, and returns a new tokenizer function that passes the output of tk1 through tk2 before yielding it to the calling code.
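For example, a minimal sketch that wraps the basic tokenizer with the URLFilter so that URL-shaped tokens are skipped:

from enchant.tokenize import wrap_tokenizer, basic_tokenize, URLFilter

tknzr = wrap_tokenizer(basic_tokenize, URLFilter)
for (word, pos) in tknzr("see http://example.com for details"):
    print(word, pos)  # the URL is skipped by the URLFilter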