annif.backend package

Submodules

annif.backend.backend module

Common functionality for backends.

class annif.backend.backend.AnnifBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: object

Base class for Annif backends that perform analysis. The non-implemented methods should be overridden in subclasses.

DEFAULT_PARAMETERS = {'limit': 100}

debug(message: str) → None: Log a debug message from this backend

default_params() → dict[str, Any]

info(message: str) → None: Log an info message from this backend

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

property is_trained: bool

property modification_time: datetime | None

name = None

property params: dict[str, Any]

suggest(documents: list[Document], params: dict[str, Any] | None = None) → SuggestionBatch: Suggest subjects for the input documents and return the results as a SuggestionBatch, with one set of SubjectSuggestion objects per document.

train(corpus: DocumentCorpus, params: dict[str, Any] | None = None, jobs: int = 0) → None: Train the model on the given document or subject corpus.

warning(message: str) → None: Log a warning message from this backend
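
The interplay between DEFAULT_PARAMETERS, default_params() and the params property can be illustrated with a small stand-alone sketch. It assumes that configured values override the class defaults, which the reference above does not state explicitly. SketchBackend is a hypothetical stand-in and does not import Annif.

```python
from typing import Any


class SketchBackend:
    """Hypothetical stand-in for a backend; not the Annif implementation."""

    DEFAULT_PARAMETERS = {"limit": 100}

    def __init__(self, backend_id: str, config_params: dict[str, Any]) -> None:
        self.backend_id = backend_id
        self.config_params = config_params

    def default_params(self) -> dict[str, Any]:
        # Return a copy so callers cannot mutate the class-level defaults.
        return dict(self.DEFAULT_PARAMETERS)

    @property
    def params(self) -> dict[str, Any]:
        # Assumption: configured values take precedence over defaults.
        merged = self.default_params()
        merged.update(self.config_params)
        return merged


backend = SketchBackend("sketch", {"limit": 10, "language": "en"})
effective = backend.params
```

Here effective holds the configured limit rather than the default, while the class-level DEFAULT_PARAMETERS stays unchanged.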

class annif.backend.backend.AnnifLearningBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Base class for Annif backends that can perform online learning

learn(corpus: DocumentCorpus, params: dict[str, Any] | None = None) → None: Further train the model on the given document or subject corpus.

annif.backend.dummy module

Dummy backend for testing basic interaction of projects and backends

class annif.backend.dummy.DummyBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifLearningBackend

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

initialized = False

is_trained = True

modification_time = None

name = 'dummy'

subject_id = 0

annif.backend.ensemble module

Ensemble backend that combines results from multiple projects

class annif.backend.ensemble.BaseEnsembleBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Base class for ensemble backends

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

class annif.backend.ensemble.EnsembleBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: BaseEnsembleBackend, AnnifHyperoptBackend

Ensemble backend that combines results from multiple projects

get_hp_optimizer(corpus: DocumentCorpus, metric: str) → EnsembleOptimizer: Get a HyperparameterOptimizer object that can look for optimal hyperparameter combinations for the given corpus, measured using the given metric

property is_trained: bool

property modification_time: datetime | None

name = 'ensemble'

class annif.backend.ensemble.EnsembleHPObjective

Bases: HPObjective

Objective function of the ensemble hyperparameter optimizer

classmethod objective(trial: Trial, args) → float: Objective function to optimize. To be implemented by subclasses.

class annif.backend.ensemble.EnsembleOptimizer(backend: EnsembleBackend, corpus: DocumentCorpus, metric: str)

Bases: HyperparameterOptimizer

Hyperparameter optimizer for the ensemble backend

annif.backend.fasttext module

Annif backend using the fastText classifier

class annif.backend.fasttext.FastTextBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: ChunkingBackend, AnnifBackend

fastText backend for Annif

DEFAULT_PARAMETERS = {'dim': 100, 'epoch': 5, 'loss': 'hs', 'lr': 0.25}

FASTTEXT_PARAMS = {'bucket': <class 'int'>, 'dim': <class 'int'>, 'epoch': <class 'int'>, 'loss': <class 'str'>, 'lr': <class 'float'>, 'lrUpdateRate': <class 'int'>, 'maxn': <class 'int'>, 'minCount': <class 'int'>, 'minn': <class 'int'>, 'neg': <class 'int'>, 'pretrainedVectors': <class 'str'>, 't': <class 'float'>, 'thread': <class 'int'>, 'wordNgrams': <class 'int'>, 'ws': <class 'int'>}

MODEL_FILE = 'fasttext-model'

TRAIN_FILE = 'fasttext-train.txt'

default_params() → dict[str, Any]

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'fasttext'
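
FASTTEXT_PARAMS maps parameter names to Python types. One plausible use of such a mapping, shown here as an assumption rather than as the backend's actual code, is to convert string values from a configuration file into typed values and to ignore names that are not fastText parameters. The helper coerce_params and the sample input are hypothetical.

```python
# Subset of the documented name-to-type mapping.
FASTTEXT_PARAMS = {"dim": int, "epoch": int, "loss": str, "lr": float}


def coerce_params(raw: dict[str, str]) -> dict[str, object]:
    """Convert string values to typed ones, skipping unknown names."""
    return {
        name: FASTTEXT_PARAMS[name](value)
        for name, value in raw.items()
        if name in FASTTEXT_PARAMS
    }


# Hypothetical configuration values as they might arrive from a config file.
typed = coerce_params({"dim": "200", "lr": "0.5", "loss": "hs", "limit": "10"})
```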

annif.backend.http module

HTTP/REST client backend that makes calls to a web service and returns the results

class annif.backend.http.HTTPBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

property headers: dict[str, str]

property is_trained: bool | None

property modification_time: datetime | None

name = 'http'

annif.backend.hyperopt module

Hyperparameter optimization functionality for backends

class annif.backend.hyperopt.AnnifHyperoptBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Base class for Annif backends that can perform hyperparameter optimization

abstract get_hp_optimizer(corpus: DocumentCorpus, metric: str): Get a HyperparameterOptimizer object that can look for optimal hyperparameter combinations for the given corpus, measured using the given metric

class annif.backend.hyperopt.HPObjective

Bases: BaseWorker

Base class for hyperparameter optimizer objective functions

classmethod objective(trial: Trial, args) → float: Objective function to optimize. To be implemented by subclasses.

classmethod run_trial(trial_id: int, storage_url: str, study_name: str) → dict[str, Any]

class annif.backend.hyperopt.HPRecommendation(lines, score)

Bases: tuple

lines: Alias for field number 0

score: Alias for field number 1

class annif.backend.hyperopt.HyperparameterOptimizer(backend: AnnifBackend, corpus: DocumentCorpus, metric: str, objective: HPObjective)

Bases: object

Base class for hyperparameter optimizers

optimize(n_trials: int, n_jobs: int, results_file: LazyFile | None) → HPRecommendation: Find the optimal hyperparameters by testing up to the given number of hyperparameter combinations

class annif.backend.hyperopt.TrialWriter(results_file: LazyFile, normalize_func: Callable)

Bases: object

Object that writes hyperparameter optimization trial results into a TSV file.

write(trial_data: dict[str, Any]) → None: Write the results of one trial into the results file. On the first run, write the header line first.
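
The "header on the first write" behaviour can be sketched with the standard csv module. SketchTrialWriter is a hypothetical simplification: it omits normalize_func, takes any writable text file object in place of a LazyFile, and assumes the column order follows the keys of the first trial.

```python
import csv
import io
from typing import Any


class SketchTrialWriter:
    """Hypothetical TSV writer; header emitted before the first row only."""

    def __init__(self, results_file) -> None:
        self.results_file = results_file
        self.writer = None  # created lazily on the first write

    def write(self, trial_data: dict[str, Any]) -> None:
        if self.writer is None:
            # First call: fix the column order and write the header line.
            self.writer = csv.DictWriter(
                self.results_file,
                fieldnames=list(trial_data),
                delimiter="\t",
                lineterminator="\n",
            )
            self.writer.writeheader()
        self.writer.writerow(trial_data)


out = io.StringIO()
writer = SketchTrialWriter(out)
writer.write({"trial": 0, "value": 0.31, "limit": 50})
writer.write({"trial": 1, "value": 0.35, "limit": 80})
tsv_text = out.getvalue()
```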

annif.backend.mixins module

Annif backend mixins that can be used to implement features

class annif.backend.mixins.ChunkingBackend

Bases: object

Annif backend mixin that implements chunking of input

DEFAULT_PARAMETERS = {'chunksize': 1}

default_params() → dict[str, Any]
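
As a rough illustration of what chunking with a chunksize parameter means, the sketch below groups a list of items into consecutive chunks. The unit being chunked (sentences here) and the helper name chunk_items are assumptions; the mixin's own splitting logic is not documented above.

```python
from typing import Iterator


def chunk_items(items: list[str], chunksize: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `chunksize` items."""
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


sentences = ["First sentence.", "Second sentence.", "Third sentence."]
chunks = list(chunk_items(sentences, chunksize=2))
```

With the default chunksize of 1, every item would form its own chunk.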

class annif.backend.mixins.TfidfVectorizerMixin

Bases: object

Annif backend mixin that implements TfidfVectorizer functionality

VECTORIZER_FILE = 'vectorizer'

create_vectorizer(input: Iterable[str], params: dict[str, Any] = None) → csr_matrix

initialize_vectorizer() → None

vectorizer = None

annif.backend.mllm module

Maui-like Lexical Matching backend

class annif.backend.mllm.MLLMBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifHyperoptBackend

Maui-like Lexical Matching backend for Annif

DEFAULT_PARAMETERS = {'max_leaf_nodes': 1000, 'max_samples': 0.9, 'min_samples_leaf': 20, 'use_hidden_labels': False}

MODEL_FILE = 'mllm-model.gz'

TRAIN_FILE = 'mllm-train.gz'

get_hp_optimizer(corpus: DocumentCorpus, metric: str) → MLLMOptimizer: Get a HyperparameterOptimizer object that can look for optimal hyperparameter combinations for the given corpus, measured using the given metric

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'mllm'

class annif.backend.mllm.MLLMHPObjective

Bases: HPObjective

Objective function of the MLLM hyperparameter optimizer

classmethod objective(trial: Trial, args) → float: Objective function to optimize. To be implemented by subclasses.

class annif.backend.mllm.MLLMOptimizer(backend: AnnifBackend, corpus: DocumentCorpus, metric: str, objective: HPObjective)

Bases: HyperparameterOptimizer

Hyperparameter optimizer for the MLLM backend

annif.backend.mllm.prediction_to_result(prediction: list[tuple[np.float64, int]], params: dict[str, Any], subject_index: SubjectIndex) → Iterator

annif.backend.nn_ensemble module

Neural network based ensemble backend that combines results from multiple projects.

class annif.backend.nn_ensemble.EarlyStopping(patience: int)

Bases: object

class annif.backend.nn_ensemble.LMDBDataset(txn)

Bases: Dataset

A sequence of samples stored in an LMDB database.

add_sample(inputs: ndarray, targets: ndarray) → None

get_subset(indices: list[int]) → tuple[Tensor, Tensor]

Fetch a fixed set of samples by index and stack into batch tensors.

Returns (inputs, targets), where inputs is a torch.Tensor of shape (B, M, N) and targets is a torch.Tensor of shape (B, N).

class annif.backend.nn_ensemble.NNEnsembleBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifLearningBackend, BaseEnsembleBackend

Neural network ensemble backend that combines results from multiple projects

DEFAULT_PARAMETERS = {'batch-size': 256, 'learn-epochs': 1, 'lmdb_map_size': 1073741824, 'lr': 0.003, 'max-epochs': 50}

EARLY_STOPPING_PATIENCE = 2

EARLY_STOP_EVAL_ROWS = 512

EARLY_STOP_SEED = 1337

LMDB_FILE = 'nn-train.mdb'

MODEL_FILE = 'nn-model.pt'

PRED_SCALE = 20

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'nn_ensemble'

class annif.backend.nn_ensemble.NNEnsembleModel(n_sources: int, n_subjects: int, source_weights: list[float])

Bases: Module

forward(inputs: Tensor)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod load(filepath, map_location='cpu')

save(filepath)

annif.backend.nn_ensemble.idx_to_key(idx: int) → bytes: Convert an integer index to a binary key for use in LMDB

annif.backend.nn_ensemble.key_to_idx(key: memoryview | bytes) → int: Convert a binary LMDB key to an integer index
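
The exact key encoding is not specified above. One possible scheme, sketched here under the assumption of zero-padded decimal keys of width 8, keeps the round trip simple. The sketch_ prefixed names are hypothetical and independent of Annif.

```python
def sketch_idx_to_key(idx: int) -> bytes:
    """Encode an index as a fixed-width, zero-padded decimal key."""
    return b"%08d" % idx


def sketch_key_to_idx(key) -> int:
    """Decode a key produced by sketch_idx_to_key (bytes or memoryview)."""
    return int(bytes(key))


keys = [sketch_idx_to_key(i) for i in (2, 10, 1)]
```

A fixed width is a deliberate choice in this sketch: LMDB orders keys lexicographically by default, so padding keeps byte order and numeric order aligned.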

annif.backend.nn_ensemble.ndcg_batch(preds: Tensor, targets: Tensor) → float

preds: (B, N) float

targets: (B, N) binary {0, 1} relevance labels

Returns: mean NDCG across the batch (float)
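
For readers unfamiliar with the metric, the following pure-Python sketch computes mean NDCG over a batch using the standard log2 rank discount. It works on nested lists rather than tensors, and it assumes that a row without any relevant label scores 0; both points are simplifications and may differ from the function documented above.

```python
import math


def ndcg_row(preds: list[float], targets: list[int]) -> float:
    """NDCG for one row of scores and binary relevance labels."""
    # Rank label positions by descending predicted score.
    order = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    dcg = sum(targets[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    # The ideal ranking places all relevant labels first.
    n_relevant = sum(targets)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(n_relevant))
    # Assumption: a row with no relevant labels scores 0.
    return dcg / idcg if idcg > 0 else 0.0


def ndcg_batch_sketch(preds, targets) -> float:
    """Mean NDCG over a batch of rows."""
    rows = [ndcg_row(p, t) for p, t in zip(preds, targets)]
    return sum(rows) / len(rows)


perfect = ndcg_batch_sketch([[0.9, 0.1, 0.5]], [[1, 0, 0]])
swapped = ndcg_batch_sketch([[0.1, 0.9]], [[1, 0]])
```

In the first call the only relevant label is ranked first, so the score is 1.0; in the second it is ranked second and the score drops to 1 / log2(3).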

annif.backend.omikuji module

Annif backend using the Omikuji classifier

class annif.backend.omikuji.OmikujiBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: TfidfVectorizerMixin, AnnifBackend

Omikuji based backend for Annif

DEFAULT_PARAMETERS = {'cluster_balanced': True, 'cluster_k': 2, 'collapse_every_n_layers': 0, 'max_depth': 20, 'min_df': 1, 'ngram': 1}

MODEL_FILE = 'omikuji-model'

TRAIN_FILE = 'omikuji-train.txt'

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'omikuji'

annif.backend.pav module

PAV ensemble backend that combines results from multiple projects. It learns which concept suggestions from each backend are trustworthy by using the PAV (pool adjacent violators) algorithm, a.k.a. isotonic regression, to turn the raw scores returned by individual backends into probabilities.

class annif.backend.pav.PAVBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: BaseEnsembleBackend

PAV ensemble backend that combines results from multiple projects

DEFAULT_PARAMETERS = {'min-docs': 10}

MODEL_FILE_PREFIX = 'pav-model-'

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'pav'
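
The pool adjacent violators idea behind this backend can be shown in a few lines of plain Python. This is a minimal unweighted sketch of isotonic regression, not the backend's implementation, and the raw scores and correctness labels are hypothetical. Sorting the labels by raw score and fitting a non-decreasing sequence yields an estimated probability for each score.

```python
def pav_fit(values: list[float]) -> list[float]:
    """Non-decreasing least-squares fit via pool adjacent violators."""
    blocks: list[list[float]] = []  # each block holds [sum, count]
    for value in values:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: list[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Hypothetical data: raw scores for one concept and whether each was correct.
raw_scores = [0.1, 0.4, 0.5, 0.8, 0.9]
was_correct = [0, 1, 0, 1, 1]
pairs = sorted(zip(raw_scores, was_correct))
probabilities = pav_fit([label for _, label in pairs])
```

The out-of-order labels at scores 0.4 and 0.5 are pooled into a shared estimate of 0.5, so the fitted probabilities never decrease as the raw score grows.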

annif.backend.stwfsa module

class annif.backend.stwfsa.StwfsaBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

DEFAULT_PARAMETERS = {'concept_type_uri': 'http://www.w3.org/2004/02/skos/core#Concept', 'expand_abbreviation_with_punctuation': True, 'expand_ampersand_with_spaces': True, 'extract_any_case_from_braces': False, 'extract_upper_case_from_braces': True, 'handle_title_case': True, 'remove_deprecated': True, 'simple_english_plural_rules': False, 'sub_thesaurus_type_uri': 'http://www.w3.org/2004/02/skos/core#Collection', 'thesaurus_relation_is_specialisation': True, 'thesaurus_relation_type_uri': 'http://www.w3.org/2004/02/skos/core#member', 'use_txt_vec': False}

MODEL_FILE = 'stwfsa_predictor.zip'

STWFSA_PARAMETERS = {'concept_type_uri': <class 'str'>, 'expand_abbreviation_with_punctuation': <function boolean>, 'expand_ampersand_with_spaces': <function boolean>, 'extract_any_case_from_braces': <function boolean>, 'extract_upper_case_from_braces': <function boolean>, 'handle_title_case': <function boolean>, 'remove_deprecated': <function boolean>, 'simple_english_plural_rules': <function boolean>, 'sub_thesaurus_type_uri': <class 'str'>, 'thesaurus_relation_is_specialisation': <function boolean>, 'thesaurus_relation_type_uri': <class 'str'>, 'use_txt_vec': <class 'bool'>}

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'stwfsa'

annif.backend.svc module

Annif backend using an SVM classifier

class annif.backend.svc.SVCBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: TfidfVectorizerMixin, AnnifBackend

Support vector classifier backend for Annif

DEFAULT_PARAMETERS = {'min_df': 1, 'ngram': 1}

MODEL_FILE = 'svc-model.gz'

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'svc'

annif.backend.tfidf module

Backend that returns the most similar subjects based on similarity in a sparse, TF-IDF normalized bag-of-words vector space

class annif.backend.tfidf.SubjectBuffer(tempdir: str, subject_id: int)

Bases: object

A file-backed buffer to store and retrieve subject text.

BUFFER_SIZE = 100

flush() → None

read() → str

write(text: str) → None
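
A file-backed buffer of this kind might work as sketched below: texts accumulate in memory and are appended to a per-subject file once BUFFER_SIZE entries are pending. The file naming, the newline separator and the class name SketchSubjectBuffer are assumptions made for illustration only.

```python
import os
import tempfile


class SketchSubjectBuffer:
    """Hypothetical file-backed text buffer; not the Annif implementation."""

    BUFFER_SIZE = 100

    def __init__(self, tempdir: str, subject_id: int) -> None:
        # Assumption: one file per subject, named after the subject ID.
        self._path = os.path.join(tempdir, f"{subject_id:08d}.txt")
        self._pending: list[str] = []
        self._on_disk = False

    def write(self, text: str) -> None:
        self._pending.append(text)
        if len(self._pending) >= self.BUFFER_SIZE:
            self.flush()

    def flush(self) -> None:
        # Append after the first flush so earlier text is preserved.
        mode = "a" if self._on_disk else "w"
        with open(self._path, mode, encoding="utf-8") as handle:
            for text in self._pending:
                handle.write(text + "\n")
        self._pending = []
        self._on_disk = True

    def read(self) -> str:
        stored = ""
        if self._on_disk:
            with open(self._path, encoding="utf-8") as handle:
                stored = handle.read()
        return stored + "".join(text + "\n" for text in self._pending)


with tempfile.TemporaryDirectory() as tempdir:
    buffer = SketchSubjectBuffer(tempdir, subject_id=7)
    for n in range(150):  # crosses BUFFER_SIZE once, so one flush happens
        buffer.write(f"text {n}")
    contents = buffer.read()
    pending_count = len(buffer._pending)
```

In this sketch, read() combines flushed and pending text, so callers see everything written so far without forcing a flush.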

class annif.backend.tfidf.TFIDFBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: TfidfVectorizerMixin, AnnifBackend

TF-IDF vector space similarity based backend for Annif

MATRIX_FILE = 'tfidf-matrix.npz'

OLD_INDEX_FILE = 'tfidf-index'

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

name = 'tfidf'

annif.backend.yake module

Annif backend using Yake keyword extraction

class annif.backend.yake.YakeBackend(backend_id: str, config_params: dict[str, Any] | SectionProxy, project: AnnifProject)

Bases: AnnifBackend

Yake based backend for Annif

DEFAULT_PARAMETERS = {'deduplication_algo': 'levs', 'deduplication_threshold': 0.9, 'features': None, 'label_types': ['prefLabel', 'altLabel'], 'max_ngram_size': 4, 'num_keywords': 100, 'remove_parentheses': False, 'window_size': 1}

INDEX_FILE = 'yake-index'

initialize(parallel: bool = False) → None: This method can be overridden by backends. It should cause the backend to pre-load all data it needs during operation. If parallel is True, the backend should expect to be used for parallel operation.

property is_trained

property label_types: list[URIRef]

name = 'yake'

Module contents

Registry of backend types for Annif

annif.backend.get_backend(backend_id: str) → Type[AnnifBackend]
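
A registry lookup such as get_backend can be pictured as a mapping from backend IDs to classes. The sketch below assumes the registry stores zero-argument loader functions and that unknown IDs raise ValueError; neither detail is confirmed by the reference above, and all names in the block are hypothetical.

```python
from typing import Callable


class SketchDummyBackend:
    """Hypothetical backend class used only for this illustration."""

    name = "dummy"


# Assumption: the registry maps backend IDs to zero-argument loader
# functions, so a backend's dependencies load only when it is requested.
_SKETCH_REGISTRY: dict[str, Callable[[], type]] = {
    "dummy": lambda: SketchDummyBackend,
}


def get_backend_sketch(backend_id: str) -> type:
    """Return the backend class registered under backend_id."""
    try:
        loader = _SKETCH_REGISTRY[backend_id]
    except KeyError:
        raise ValueError(f"No such backend type {backend_id}") from None
    return loader()


backend_class = get_backend_sketch("dummy")
```

Note that the lookup returns a class, matching the documented Type[AnnifBackend] return annotation; the caller is responsible for instantiating it.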