annif.backend package

Submodules

annif.backend.backend module

Common functionality for backends.

class annif.backend.backend.AnnifBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: object

Base class for Annif backends that perform analysis. The non-implemented methods should be overridden in subclasses.

DEFAULT_PARAMETERS = {'limit': 100}
debug(message: str) → None

Log a debug message from this backend

default_params() → dict[str, Any]
info(message: str) → None

Log an info message from this backend

initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

property is_trained: bool
property modification_time: datetime | None
name = None
property params: dict[str, Any]
suggest(texts: list[str], params: dict[str, Any] | None = None) → SuggestionBatch

Suggest subjects for the input documents and return the results as a SuggestionBatch, with one set of SubjectSuggestion objects per input document.

train(corpus: DocumentCorpus, params: dict[str, Any] | None = None, jobs: int = 0) → None

Train the model on the given document or subject corpus.

warning(message: str) → None

Log a warning message from this backend
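The subclassing pattern described for this base class can be illustrated with a small, self-contained sketch. It does not import Annif: SketchBackend and KeywordBackend are hypothetical stand-ins, and the way params merges configured values over DEFAULT_PARAMETERS is an assumption made for illustration, not a statement about Annif's implementation.

```python
from __future__ import annotations

from typing import Any


class SketchBackend:
    """Hypothetical stand-in for a backend base class (not Annif's code)."""

    name = None
    DEFAULT_PARAMETERS = {"limit": 100}

    def __init__(self, backend_id: str, config_params: dict[str, Any]) -> None:
        self.backend_id = backend_id
        self.config_params = config_params

    def default_params(self) -> dict[str, Any]:
        return dict(self.DEFAULT_PARAMETERS)

    @property
    def params(self) -> dict[str, Any]:
        # Assumption: configured values override the class-level defaults.
        merged = self.default_params()
        merged.update(self.config_params)
        return merged

    def suggest(self, texts: list[str], params: dict[str, Any] | None = None):
        raise NotImplementedError("subclasses are expected to override this")


class KeywordBackend(SketchBackend):
    """Toy subclass: suggests one fixed subject when a keyword appears."""

    name = "keyword"

    def suggest(self, texts: list[str], params: dict[str, Any] | None = None):
        # Config values may arrive as strings, so convert before slicing.
        limit = int((params or self.params)["limit"])
        return [["cats"][:limit] if "cat" in text.lower() else [] for text in texts]


backend = KeywordBackend("demo", {"limit": "5"})
suggestions = backend.suggest(["A cat sat on the mat", "A dog ran"])
```

The toy suggest() returns plain lists rather than a SuggestionBatch, purely to keep the sketch free of dependencies.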

class annif.backend.backend.AnnifLearningBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Base class for Annif backends that can perform online learning

learn(corpus: DocumentCorpus, params: dict[str, Any] | None = None) → None

Further train the model on the given document or subject corpus.

annif.backend.dummy module

Dummy backend for testing basic interaction of projects and backends

class annif.backend.dummy.DummyBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifLearningBackend

initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

initialized = False
is_trained = True
modification_time = None
name = 'dummy'
subject_id = 0

annif.backend.ensemble module

Ensemble backend that combines results from multiple projects

class annif.backend.ensemble.BaseEnsembleBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Base class for ensemble backends

initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

class annif.backend.ensemble.EnsembleBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: BaseEnsembleBackend, AnnifHyperoptBackend

Ensemble backend that combines results from multiple projects

get_hp_optimizer(corpus: DocumentCorpus, metric: str) → EnsembleOptimizer

Get a HyperparameterOptimizer object that can look for optimal hyperparameter combinations for the given corpus, measured using the given metric

property is_trained: bool
property modification_time: datetime | None
name = 'ensemble'
class annif.backend.ensemble.EnsembleOptimizer(backend: EnsembleBackend, corpus: DocumentCorpus, metric: str)

Bases: HyperparameterOptimizer

Hyperparameter optimizer for the ensemble backend

annif.backend.fasttext module

Annif backend using the fastText classifier

class annif.backend.fasttext.FastTextBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: ChunkingBackend, AnnifBackend

fastText backend for Annif

DEFAULT_PARAMETERS = {'dim': 100, 'epoch': 5, 'loss': 'hs', 'lr': 0.25}
FASTTEXT_PARAMS = {'bucket': <class 'int'>, 'dim': <class 'int'>, 'epoch': <class 'int'>, 'loss': <class 'str'>, 'lr': <class 'float'>, 'lrUpdateRate': <class 'int'>, 'maxn': <class 'int'>, 'minCount': <class 'int'>, 'minn': <class 'int'>, 'neg': <class 'int'>, 'pretrainedVectors': <class 'str'>, 't': <class 'float'>, 'thread': <class 'int'>, 'wordNgrams': <class 'int'>, 'ws': <class 'int'>}
MODEL_FILE = 'fasttext-model'
TRAIN_FILE = 'fasttext-train.txt'
default_params() → dict[str, Any]
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'fasttext'
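FastTextBackend inherits from both ChunkingBackend and AnnifBackend, and each of the three classes lists its own DEFAULT_PARAMETERS. The sketch below shows one plausible way default_params() could combine them, and how a name-to-type mapping such as FASTTEXT_PARAMS could be used to convert raw configuration strings. The merge order and both helper functions are assumptions for illustration; only the dictionary values are taken from this reference.

```python
# Values copied from the attribute listings in this reference.
BASE_DEFAULTS = {"limit": 100}        # AnnifBackend.DEFAULT_PARAMETERS
CHUNKING_DEFAULTS = {"chunksize": 1}  # ChunkingBackend.DEFAULT_PARAMETERS
FASTTEXT_DEFAULTS = {"dim": 100, "epoch": 5, "loss": "hs", "lr": 0.25}
# A subset of FASTTEXT_PARAMS: parameter name -> converter.
FASTTEXT_PARAMS = {"dim": int, "epoch": int, "loss": str, "lr": float}


def merged_defaults() -> dict:
    """Assumed merge order: base class, then mixin, then backend-specific."""
    params = dict(BASE_DEFAULTS)
    params.update(CHUNKING_DEFAULTS)
    params.update(FASTTEXT_DEFAULTS)
    return params


def coerce(config: dict) -> dict:
    """Convert raw config strings using the converters in FASTTEXT_PARAMS.

    Keys without a converter are dropped, since they are not fastText options.
    """
    return {
        key: FASTTEXT_PARAMS[key](value)
        for key, value in config.items()
        if key in FASTTEXT_PARAMS
    }


defaults = merged_defaults()
typed = coerce({"dim": "300", "lr": "0.1", "limit": "10"})
```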

annif.backend.http module

HTTP/REST client backend that makes calls to a web service and returns the results

class annif.backend.http.HTTPBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

property headers: dict[str, str]
property is_trained: bool | None
property modification_time: datetime | None
name = 'http'

annif.backend.hyperopt module

Hyperparameter optimization functionality for backends

class annif.backend.hyperopt.AnnifHyperoptBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Base class for Annif backends that can perform hyperparameter optimization

abstract get_hp_optimizer(corpus: DocumentCorpus, metric: str)

Get a HyperparameterOptimizer object that can look for optimal hyperparameter combinations for the given corpus, measured using the given metric

class annif.backend.hyperopt.HPRecommendation(lines, score)

Bases: tuple

lines

Alias for field number 0

score

Alias for field number 1
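The base class and the "Alias for field number" entries are what a collections.namedtuple produces. A minimal sketch of an equivalent definition follows; the field contents used here are made-up sample values, not output from a real optimization run.

```python
from collections import namedtuple

# Two-field named tuple mirroring the documented field order.
HPRecommendation = namedtuple("HPRecommendation", "lines score")

# Hypothetical sample values for illustration only.
rec = HPRecommendation(lines=["min_samples_leaf=20", "max_leaf_nodes=1000"], score=0.42)

# Fields are reachable by name, by index, or by unpacking.
lines, score = rec
```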

class annif.backend.hyperopt.HyperparameterOptimizer(backend: AnnifBackend, corpus: DocumentCorpus, metric: str)

Bases: object

Base class for hyperparameter optimizers

optimize(n_trials: int, n_jobs: int, results_file: LazyFile | None) → HPRecommendation

Find the optimal hyperparameters by testing up to the given number of hyperparameter combinations

class annif.backend.hyperopt.TrialWriter(results_file: LazyFile, normalize_func: Callable)

Bases: object

Object that writes hyperparameter optimization trial results into a TSV file.

write(study: Study, trial: Trial) → None

Write the results of one trial into the results file. On the first run, write the header line first.
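The "header on the first run" behaviour can be sketched with the standard csv module. SketchTrialWriter below is a hypothetical stand-in with a simplified write() signature (plain numbers and a dict instead of Study and Trial objects), and the column names are assumptions.

```python
import csv
import io


class SketchTrialWriter:
    """Hypothetical stand-in; not Annif's TrialWriter."""

    def __init__(self, results_file) -> None:
        self.writer = csv.writer(results_file, delimiter="\t", lineterminator="\n")
        self.header_written = False

    def write(self, number: int, value: float, params: dict) -> None:
        # Emit the header exactly once, before the first data row.
        if not self.header_written:
            self.writer.writerow(["trial", "value", *params])
            self.header_written = True
        self.writer.writerow([number, value, *params.values()])


buffer = io.StringIO()
writer = SketchTrialWriter(buffer)
writer.write(0, 0.31, {"min_samples_leaf": 20})
writer.write(1, 0.35, {"min_samples_leaf": 10})
```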

annif.backend.mixins module

Annif backend mixins that can be used to implement features

class annif.backend.mixins.ChunkingBackend

Bases: object

Annif backend mixin that implements chunking of input

DEFAULT_PARAMETERS = {'chunksize': 1}
default_params() → dict[str, Any]
class annif.backend.mixins.TfidfVectorizerMixin

Bases: object

Annif backend mixin that implements TfidfVectorizer functionality

VECTORIZER_FILE = 'vectorizer'
create_vectorizer(input: Iterable[str], params: dict[str, Any] = None) → csr_matrix
initialize_vectorizer() → None
vectorizer = None

annif.backend.mllm module

Maui-like Lexical Matching backend

class annif.backend.mllm.MLLMBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifHyperoptBackend

Maui-like Lexical Matching backend for Annif

DEFAULT_PARAMETERS = {'max_leaf_nodes': 1000, 'max_samples': 0.9, 'min_samples_leaf': 20, 'use_hidden_labels': False}
MODEL_FILE = 'mllm-model.gz'
TRAIN_FILE = 'mllm-train.gz'
get_hp_optimizer(corpus: DocumentCorpus, metric: str) → MLLMOptimizer

Get a HyperparameterOptimizer object that can look for optimal hyperparameter combinations for the given corpus, measured using the given metric

initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'mllm'
class annif.backend.mllm.MLLMOptimizer(backend: AnnifBackend, corpus: DocumentCorpus, metric: str)

Bases: HyperparameterOptimizer

Hyperparameter optimizer for the MLLM backend

annif.backend.nn_ensemble module

Neural network based ensemble backend that combines results from multiple projects.

class annif.backend.nn_ensemble.LMDBSequence(txn, batch_size)

Bases: Sequence

A sequence of samples stored in an LMDB database.

add_sample(inputs: ndarray, targets: ndarray) → None
class annif.backend.nn_ensemble.MeanLayer(*args, **kwargs)

Bases: Layer

Custom Keras layer that calculates mean values along the 2nd axis.

call(inputs: EagerTensor) → EagerTensor

This is where the layer’s logic lives.

The call() method may not create state (except in its first invocation, wrapping the creation of variables or other resources in tf.init_scope()). It is recommended to create state, including tf.Variable instances and nested Layer instances, in __init__(), or in the build() method that is called automatically before call() executes for the first time.

Args:

inputs: Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules:

  • inputs must be explicitly passed. A layer cannot have zero arguments, and inputs cannot be provided via the default value of a keyword argument.

  • NumPy array or Python scalar values in inputs get cast as tensors.

  • Keras mask metadata is only collected from inputs.

  • Layers are built (build(input_shape) method) using shape info from inputs only.

  • input_spec compatibility is only checked against inputs.

  • Mixed precision input casting is only applied to inputs. If a layer has tensor arguments in *args or **kwargs, their casting behavior in mixed precision should be handled manually.

  • The SavedModel input specification is generated using inputs only.

  • Integration with various ecosystem packages like TFMOT, TFLite, TF.js, etc is only supported for inputs and not for tensors in positional and keyword arguments.

*args: Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above.

**kwargs: Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved:

  • training: Boolean scalar tensor or Python boolean indicating whether the call is meant for training or inference.

  • mask: Boolean input mask. If the layer’s call() method takes a mask argument, its default value will be set to the mask generated for inputs by the previous layer (if input did come from a layer that generated a corresponding mask, i.e. if it came from a Keras layer with masking support).

Returns:

A tensor or list/tuple of tensors.

class annif.backend.nn_ensemble.NNEnsembleBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifLearningBackend, BaseEnsembleBackend

Neural network ensemble backend that combines results from multiple projects

DEFAULT_PARAMETERS = {'dropout_rate': 0.2, 'epochs': 10, 'learn-epochs': 1, 'lmdb_map_size': 1073741824, 'nodes': 100, 'optimizer': 'adam'}
LMDB_FILE = 'nn-train.mdb'
MODEL_FILE = 'nn-model.keras'
get_model_metadata(model_filename: str) → dict | None

Read metadata from Keras model files.

initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'nn_ensemble'
annif.backend.nn_ensemble.idx_to_key(idx: int) → bytes

Convert an integer index to a binary key for use in LMDB.

annif.backend.nn_ensemble.key_to_idx(key: memoryview | bytes) → int

Convert a binary LMDB key to an integer index.
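The two helpers are inverses of each other. A minimal sketch follows, assuming keys are zero-padded ASCII decimal strings; the actual encoding is not stated in this reference. Fixed-width keys have the useful property that LMDB's default byte-wise key ordering matches numeric ordering.

```python
from __future__ import annotations


def idx_to_key(idx: int) -> bytes:
    """Encode an index as an 8-digit zero-padded ASCII key (assumed format)."""
    return b"%08d" % idx


def key_to_idx(key: memoryview | bytes) -> int:
    """Decode a key produced by idx_to_key back into an integer."""
    # bytes() copies a memoryview so that int() can parse the ASCII digits.
    return int(bytes(key))


key = idx_to_key(42)
round_trip = key_to_idx(memoryview(key))
```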

annif.backend.omikuji module

Annif backend using the Omikuji classifier

class annif.backend.omikuji.OmikujiBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: TfidfVectorizerMixin, AnnifBackend

Omikuji based backend for Annif

DEFAULT_PARAMETERS = {'cluster_balanced': True, 'cluster_k': 2, 'collapse_every_n_layers': 0, 'max_depth': 20, 'min_df': 1, 'ngram': 1}
MODEL_FILE = 'omikuji-model'
TRAIN_FILE = 'omikuji-train.txt'
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'omikuji'

annif.backend.pav module

PAV ensemble backend that combines results from multiple projects and learns which concept suggestions from each backend are trustworthy using the PAV algorithm, a.k.a. isotonic regression, to turn raw scores returned by individual backends into probabilities.
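The PAV (pool adjacent violators) algorithm itself fits in a few lines. The function below is a generic unit-weight isotonic regression written for illustration, not Annif's implementation, which this reference does not show. The sample input is made up: it stands for correctness labels (1 = correct suggestion, 0 = incorrect) ordered by increasing raw score.

```python
def pav(values: list) -> list:
    """Return the non-decreasing least-squares fit to values (unit weights)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Pool backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        # Every member of a pooled block receives the block mean.
        fitted.extend([total / count] * count)
    return fitted


# Labels sorted by raw score; the fit can be read as estimated probabilities.
probabilities = pav([0, 1, 0, 1, 1])
```

In this reading, a raw score that falls in the pooled middle block is mapped to an estimated probability of 0.5, while the highest scores map to 1.0.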

class annif.backend.pav.PAVBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: BaseEnsembleBackend

PAV ensemble backend that combines results from multiple projects

DEFAULT_PARAMETERS = {'min-docs': 10}
MODEL_FILE_PREFIX = 'pav-model-'
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'pav'

annif.backend.stwfsa module

class annif.backend.stwfsa.StwfsaBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

DEFAULT_PARAMETERS = {'concept_type_uri': 'http://www.w3.org/2004/02/skos/core#Concept', 'expand_abbreviation_with_punctuation': True, 'expand_ampersand_with_spaces': True, 'extract_any_case_from_braces': False, 'extract_upper_case_from_braces': True, 'handle_title_case': True, 'remove_deprecated': True, 'simple_english_plural_rules': False, 'sub_thesaurus_type_uri': 'http://www.w3.org/2004/02/skos/core#Collection', 'thesaurus_relation_is_specialisation': True, 'thesaurus_relation_type_uri': 'http://www.w3.org/2004/02/skos/core#member', 'use_txt_vec': False}
MODEL_FILE = 'stwfsa_predictor.zip'
STWFSA_PARAMETERS = {'concept_type_uri': <class 'str'>, 'expand_abbreviation_with_punctuation': <function boolean>, 'expand_ampersand_with_spaces': <function boolean>, 'extract_any_case_from_braces': <function boolean>, 'extract_upper_case_from_braces': <function boolean>, 'handle_title_case': <function boolean>, 'remove_deprecated': <function boolean>, 'simple_english_plural_rules': <function boolean>, 'sub_thesaurus_type_uri': <class 'str'>, 'thesaurus_relation_is_specialisation': <function boolean>, 'thesaurus_relation_type_uri': <class 'str'>, 'use_txt_vec': <class 'bool'>}
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'stwfsa'

annif.backend.svc module

Annif backend using an SVM classifier

class annif.backend.svc.SVCBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: TfidfVectorizerMixin, AnnifBackend

Support vector classifier backend for Annif

DEFAULT_PARAMETERS = {'min_df': 1, 'ngram': 1}
MODEL_FILE = 'svc-model.gz'
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'svc'

annif.backend.tfidf module

Backend that returns most similar subjects based on similarity in sparse TF-IDF normalized bag-of-words vector space
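The idea of ranking subjects by similarity in a normalized TF-IDF space can be sketched in pure Python. This is an illustration of the general technique only: the whitespace tokenizer, the sample subject texts, and all function names are assumptions, and the smoothed IDF formula is the one scikit-learn's TfidfVectorizer applies by default.

```python
import math
from collections import Counter


def fit_idf(texts: list) -> dict:
    """Smoothed inverse document frequency for every term in texts."""
    doc_freq = Counter(term for text in texts for term in set(text.lower().split()))
    total = len(texts)
    return {t: math.log((1 + total) / (1 + df)) + 1 for t, df in doc_freq.items()}


def vectorize(text: str, idf: dict) -> dict:
    """L2-normalized sparse TF-IDF vector; terms missing from idf are ignored."""
    counts = Counter(t for t in text.lower().split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a: dict, b: dict) -> float:
    """Dot product of two normalized sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Made-up subject texts standing in for per-subject training text.
subjects = {"cats": "cats purr and cats meow", "dogs": "dogs bark and dogs fetch"}
idf = fit_idf(list(subjects.values()))
index = {name: vectorize(text, idf) for name, text in subjects.items()}

query = vectorize("cats purr loudly", idf)
ranking = sorted(subjects, key=lambda name: cosine(query, index[name]), reverse=True)
```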

class annif.backend.tfidf.SubjectBuffer(tempdir: str, subject_id: int)

Bases: object

A file-backed buffer to store and retrieve subject text.

BUFFER_SIZE = 100
flush() → None
read() → str
write(text: str) → None
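A file-backed buffer of this shape can be sketched as follows. SketchBuffer is a hypothetical stand-in: the file naming, the newline separator, and the interpretation of BUFFER_SIZE as a count of pending writes are all assumptions, since this reference lists only the method names.

```python
import os
import tempfile


class SketchBuffer:
    """Hypothetical stand-in; not Annif's SubjectBuffer."""

    BUFFER_SIZE = 100  # assumed unit: number of pending writes held in memory

    def __init__(self, tempdir: str, subject_id: int) -> None:
        self._path = os.path.join(tempdir, f"{subject_id}.txt")
        self._pending = []

    def flush(self) -> None:
        # Append everything held in memory to the backing file.
        with open(self._path, "a", encoding="utf-8") as handle:
            for text in self._pending:
                handle.write(text + "\n")
        self._pending = []

    def write(self, text: str) -> None:
        self._pending.append(text)
        if len(self._pending) >= self.BUFFER_SIZE:
            self.flush()

    def read(self) -> str:
        # Flush first so the file reflects every write made so far.
        self.flush()
        with open(self._path, encoding="utf-8") as handle:
            return handle.read()


with tempfile.TemporaryDirectory() as tempdir:
    buffer = SketchBuffer(tempdir, subject_id=7)
    buffer.write("first document")
    buffer.write("second document")
    combined = buffer.read()
```

Buffering in memory and flushing in batches keeps the number of file operations low when text for many subjects is collected at once.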
class annif.backend.tfidf.TFIDFBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: TfidfVectorizerMixin, AnnifBackend

TF-IDF vector space similarity based backend for Annif

INDEX_FILE = 'tfidf-index'
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'tfidf'

annif.backend.yake module

Annif backend using Yake keyword extraction

class annif.backend.yake.YakeBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Yake based backend for Annif

DEFAULT_PARAMETERS = {'deduplication_algo': 'levs', 'deduplication_threshold': 0.9, 'features': None, 'label_types': ['prefLabel', 'altLabel'], 'max_ngram_size': 4, 'num_keywords': 100, 'remove_parentheses': False, 'window_size': 1}
INDEX_FILE = 'yake-index'
initialize(parallel: bool = False) → None

This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

property is_trained
property label_types: list[URIRef]
name = 'yake'

Module contents

Registry of backend types for Annif

annif.backend.get_backend(backend_id: str) → Type[AnnifBackend]
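A registry of this kind typically maps each backend's name attribute to its class. The sketch below uses hypothetical stand-in classes; the 'dummy' and 'tfidf' names come from this reference, while the dictionary lookup and the ValueError raised for unknown names are assumptions.

```python
from __future__ import annotations


class SketchBackend:
    """Hypothetical stand-in for the backend base class."""

    name = None


class DummySketch(SketchBackend):
    name = "dummy"


class TfidfSketch(SketchBackend):
    name = "tfidf"


# Registry keyed by each class's name attribute.
_REGISTRY = {cls.name: cls for cls in (DummySketch, TfidfSketch)}


def get_backend(backend_id: str) -> type[SketchBackend]:
    """Return the backend class (not an instance) registered under backend_id."""
    try:
        return _REGISTRY[backend_id]
    except KeyError:
        raise ValueError(f"No such backend type {backend_id}") from None


backend_class = get_backend("tfidf")
```

Returning the class rather than an instance matches the documented Type[AnnifBackend] return annotation and leaves construction to the caller.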