annif.analyzer package
Submodules
annif.analyzer.analyzer module
Common functionality for analyzers.
- class annif.analyzer.analyzer.Analyzer(**kwargs)
Bases:
object
Base class for language-specific analyzers. Subclasses must override either tokenize_words or _normalize_word; other methods may be overridden when necessary.
- is_valid_token(word: str) → bool
Return True if the word is an acceptable token.
- name = None
- token_min_length = 3
- tokenize_sentences(text: str) → list[str]
Tokenize a piece of text (e.g. a document) into sentences.
- tokenize_words(text: str, filter: bool = True) → list[str]
Tokenize a piece of text (e.g. a sentence) into words. If filter=True (the default), return only valid tokens (e.g. not punctuation, numbers or very short words).
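The base-class contract above can be sketched as a minimal, self-contained stand-in (this is an illustration of the documented interface, not Annif's actual implementation; the tokenization regexes here are assumptions):

```python
import re


class Analyzer:
    """Minimal stand-in for the Analyzer base class described above."""

    name = None
    token_min_length = 3

    def _normalize_word(self, word: str) -> str:
        # Subclasses override this (or tokenize_words) to provide
        # language-specific normalization such as stemming or lemmatization.
        raise NotImplementedError

    def is_valid_token(self, word: str) -> bool:
        # Accept only alphabetic tokens of at least token_min_length characters.
        return len(word) >= self.token_min_length and word.isalpha()

    def tokenize_sentences(self, text: str) -> list[str]:
        # Naive sentence split on terminal punctuation.
        return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    def tokenize_words(self, text: str, filter: bool = True) -> list[str]:
        words = re.findall(r'\w+', text)
        if filter:
            words = [w for w in words if self.is_valid_token(w)]
        return [self._normalize_word(w) for w in words]


class SimpleAnalyzer(Analyzer):
    """In the spirit of annif.analyzer.simple: only folds words to lower case."""

    name = 'simple'

    def _normalize_word(self, word: str) -> str:
        return word.lower()


analyzer = SimpleAnalyzer()
print(analyzer.tokenize_words('The Quick Brown Fox jumped, twice in 2024!'))
# → ['the', 'quick', 'brown', 'fox', 'jumped', 'twice']
```

Note how the default filter drops "in" (shorter than token_min_length) and "2024" (non-alphabetic), while filter=False would keep every token.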
annif.analyzer.simple module
Simple analyzer for Annif. Only folds words to lower case.
annif.analyzer.simplemma module
Simplemma analyzer for Annif, based on the simplemma lemmatizer.
annif.analyzer.snowball module
Snowball analyzer for Annif, based on the nltk Snowball stemmer.
annif.analyzer.spacy module
spaCy analyzer for Annif, which uses spaCy for lemmatization.
- class annif.analyzer.spacy.SpacyAnalyzer(param: str, **kwargs)
Bases:
Analyzer
- name = 'spacy'
- tokenize_words(text: str, filter: bool = True) → list[str]
Tokenize a piece of text (e.g. a sentence) into words. If filter=True (the default), return only valid tokens (e.g. not punctuation, numbers or very short words).
annif.analyzer.voikko module
Voikko analyzer for Annif, based on the libvoikko library.
Module contents
Collection of language-specific analyzers and an analyzer registry for Annif.
- annif.analyzer.register_analyzer(analyzer)
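The register_analyzer function above suggests a name-keyed registry of analyzer classes. A hedged sketch of how such a mechanism might work (the `_analyzers` dict and the `get_analyzer` factory, including its `name(param)` spec format, are illustrative assumptions, not Annif's actual internals):

```python
# Illustrative analyzer registry; the internals here are assumptions,
# not Annif's actual implementation.
_analyzers: dict[str, type] = {}


def register_analyzer(analyzer):
    """Register an analyzer class under its `name` attribute."""
    _analyzers[analyzer.name] = analyzer
    return analyzer  # returning the class allows use as a decorator


def get_analyzer(spec: str):
    """Instantiate an analyzer from a spec such as 'simple' or 'snowball(english)'."""
    name, _, param = spec.partition('(')
    if name not in _analyzers:
        raise ValueError(f'No such analyzer: {name}')
    if param:
        return _analyzers[name](param.rstrip(')'))
    return _analyzers[name]()


@register_analyzer
class SimpleAnalyzer:
    name = 'simple'


print(type(get_analyzer('simple')).__name__)
# → SimpleAnalyzer
```

A registry like this lets projects refer to analyzers by a short string in configuration, with language parameters carried in the parenthesized part of the spec.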