annif package

Subpackages

Submodules

annif.cli module

Definitions for command-line (Click) commands for invoking Annif operations and printing the results to the console.

annif.cli_util module

Utility functions for Annif CLI commands

annif.cli_util.backend_param_option(f): Decorator to add an option for CLI commands to override backend parameters

annif.cli_util.common_options(f): Decorator to add common options for all CLI commands

annif.cli_util.complete_param(ctx: Context, param: Argument, incomplete: str) → list[str]

annif.cli_util.docs_limit_option(f): Decorator to add an option for CLI commands to limit the number of documents to use

annif.cli_util.format_datetime(dt: datetime | None) → str: Helper function to format a datetime object as a string in the local time.

annif.cli_util.generate_filter_params(filter_batch_max_limit: int) → list[tuple[int, float]]

annif.cli_util.get_output_stream(path: str, suffix: str, output: str | None, use_gzip: bool, force: bool) → TextIO | None: Return a writable output stream based on the output option.

annif.cli_util.get_project(project_id: str) → AnnifProject: Helper function to get a project by ID and bail out if it doesn’t exist

annif.cli_util.get_vocab(vocab_id: str) → AnnifVocabulary: Helper function to get a vocabulary by ID and bail out if it doesn’t exist

annif.cli_util.make_list_template(*rows) → str: Helper function to create a template for a list of entries with fields of variable width. The width of each field is determined by the longest item in the field in the given rows.
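The variable-width idea can be sketched as follows. This is a minimal illustration, not Annif's implementation: the name make_list_template_sketch, the two-space separator, the left alignment, and the sample project IDs are all assumptions.

```python
def make_list_template_sketch(*rows):
    """Build a str.format template with one left-aligned field per column."""
    widths = {}
    for row in rows:
        for index, item in enumerate(row):
            # Track the longest item seen so far in this column.
            widths[index] = max(widths.get(index, 0), len(item))
    # Assumption: fields are separated by two spaces.
    return "  ".join(f"{{{i}:<{w}}}" for i, w in sorted(widths.items()))


# Hypothetical rows; the header is measured together with the data rows.
header = ("Project ID", "Language")
rows = [("yso-en", "en"), ("yso-fi-long", "fi")]
template = make_list_template_sketch(header, *rows)
lines = [template.format(*row) for row in (header, *rows)]
```

Every row formatted with the same template lines up in columns, because each field is padded to the width of the longest item in that column.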

annif.cli_util.open_doc_path(path, subject_index, vocab_lang, require_subjects=True): open a single path and return it as a DocumentCorpus

annif.cli_util.open_documents(paths: tuple[str, ...], subject_index: SubjectIndex, vocab_lang: str, docs_limit: int | None) → DocumentCorpus: Helper function to open a document corpus from a list of pathnames, each of which is either a CSV, TSV or JSONL file or a directory of TXT or JSON files. For corpora with subjects expressed as labels, the given vocabulary language will be used to convert subject labels into URIs. The corpus will be returned as an instance of DocumentCorpus or LimitingDocumentCorpus.

annif.cli_util.open_text_documents(paths: tuple[str, ...], docs_limit: int | None) → DocumentList: Helper function to read text documents from the given file paths. Returns a DocumentList object with Documents having no subjects. If a path is “-”, the document text is read from standard input. The maximum number of documents to read is set by docs_limit parameter.

annif.cli_util.parse_backend_params(backend_param: tuple[str, ...] | tuple[()], project: AnnifProject) → collections.defaultdict[str, dict[str, str]]: Parse a list of backend parameters given with the --backend-param option into a nested dict structure
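A minimal sketch of the nested dict structure, assuming each parameter is written as backend.key=value. The function name parse_backend_params_sketch and the sample parameters are hypothetical, and the sketch leaves out the project argument and any validation that the real function performs.

```python
from collections import defaultdict


def parse_backend_params_sketch(backend_param):
    """Turn ('backend.key=value', ...) into {backend: {key: value}}."""
    params = defaultdict(dict)
    for item in backend_param:
        # Assumption: the backend ID ends at the first dot and the
        # parameter name ends at the first equals sign.
        backend, assignment = item.split(".", 1)
        key, value = assignment.split("=", 1)
        params[backend][key] = value  # values are kept as strings
    return params


# Hypothetical parameters for illustration only.
parsed = parse_backend_params_sketch(
    ("tfidf.limit=50", "fasttext.dim=100", "tfidf.min_df=2")
)
```

Grouping by backend ID first lets each backend pick up only its own overrides.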

annif.cli_util.parse_metadata(metadata: tuple[str, ...] | tuple[()]) → dict[str, str]: Parse a list of metadata parameters given with the --metadata option into a dictionary

annif.cli_util.project_id(f): Decorator to add a project ID parameter to a CLI command

annif.cli_util.show_hits(hits: SuggestionResult, project: AnnifProject, lang: str, file: io.TextIOWrapper | None = None) → None: Print subject suggestions to the console or a file. The suggestions are displayed as a table, with one row per hit. Each row contains the URI, label, possible notation, and score of the suggestion. The label is given in the specified language.

annif.config module

Configuration file handling

class annif.config.AnnifConfigCFG(filename: str | None = None, projstr: str | None = None)

Bases: object

Class for reading configuration in CFG/INI format

property project_ids: list[str]
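As a rough sketch of how project IDs could be read from CFG/INI text, the example below uses the standard library configparser. Treating each section name as a project ID is an assumption, and the section names and keys shown are made up for illustration.

```python
import configparser

# Hypothetical configuration text with two project sections.
CFG_TEXT = """
[demo-en]
name = Demo English
language = en

[demo-fi]
name = Demo Finnish
language = fi
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)
# Assumption: each section name is treated as a project ID.
project_ids = parser.sections()
demo_language = parser["demo-fi"]["language"]
```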

class annif.config.AnnifConfigDirectory(directory: str)

Bases: object

Class for reading configuration from directory

property project_ids

class annif.config.AnnifConfigTOML(filename: str)

Bases: object

Class for reading configuration in TOML format

property project_ids

annif.config.check_config(projects_config_path: str) → str | None

annif.config.find_config() → str | None

annif.config.parse_config(projects_config_path: str) → AnnifConfigDirectory | AnnifConfigCFG | AnnifConfigTOML | None

annif.datadir module

Mixin class for types that need a data directory

class annif.datadir.DatadirMixin(datadir: str, typename: str, identifier: str)

Bases: object

Mixin class for types that need a data directory for storing files

property datadir: str

annif.default_config module

A configuration module, where “Config” is a default configuration and the other classes are different configuration profiles overriding default settings.

class annif.default_config.Config

Bases: object

DATADIR = 'data'

DEBUG = False

INITIALIZE_PROJECTS = False

MAX_FORM_MEMORY_SIZE = 20000000

PROJECTS_CONFIG_PATH = ''

TESTING = False

class annif.default_config.DevelopmentConfig

Bases: Config

DEBUG = True

class annif.default_config.ProductionConfig

Bases: Config

INITIALIZE_PROJECTS = True

class annif.default_config.TestingConfig

Bases: Config

DATADIR = 'tests/data'

PROJECTS_CONFIG_PATH = 'tests/projects.cfg'

TESTING = True

class annif.default_config.TestingDirectoryConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.d'

class annif.default_config.TestingInitializeConfig

Bases: TestingConfig

INITIALIZE_PROJECTS = True

class annif.default_config.TestingInvalid2ProjectsConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects_invalid2.cfg'

class annif.default_config.TestingInvalidProjectsConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects_invalid.cfg'

class annif.default_config.TestingNoProjectsConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/notfound.cfg'

class annif.default_config.TestingTOMLConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.toml'

annif.eval module

Evaluation metrics for Annif

class annif.eval.EvaluationBatch(subject_index: SubjectIndex)

Bases: object

A class for evaluating batches of results using all available metrics. Call the evaluate() method once per document in the batch, or evaluate_many() with a list of documents from the batch. Final results can be queried using the results() method.

evaluate_many(suggestion_batch: list[list[SubjectSuggestion]] | SuggestionBatch | list[Iterator], gold_subject_batch: Sequence[SubjectSet]) → None

output_result_per_subject(y_true: csr_array, y_pred: csr_array, results_file: TextIOWrapper | LazyFile, language: str) → None: Write results per subject (non-aggregated) to the output file results_file, using labels in the given language

results(metrics: Iterable[str] = [], results_file: LazyFile | TextIOWrapper | None = None, language: str | None = None) → dict[str, float]: evaluate a set of selected subjects against a gold standard using different metrics. If metrics is empty, use all available metrics. If results_file (a file object) is given, write results per subject to it with labels expressed in the given language.

annif.eval.dcg_score(y_true: csr_array, y_pred: csr_array, limit: int | None = None) → np.float64: return the discounted cumulative gain (DCG) score for the selected labels vs. relevant labels

annif.eval.false_negatives(y_true: csr_array, y_pred: csr_array) → int: calculate the number of false negatives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.false_positives(y_true: csr_array, y_pred: csr_array) → int: calculate the number of false positives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.ndcg_score(y_true: csr_array, y_pred: csr_array, limit: int | None = None) → float: return the normalized discounted cumulative gain (nDCG) score for the selected labels vs. relevant labels
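The DCG and nDCG scores can be illustrated on plain Python lists. This is a sketch under stated assumptions rather than the library code: it uses binary relevance, a log2(position + 1) discount, a ranked list of label indices instead of csr_array inputs, and it returns 1.0 when no label is relevant. The names dcg_sketch and ndcg_sketch are hypothetical.

```python
import math


def dcg_sketch(relevance, ranking, limit=None):
    """DCG of a ranked list of label indices against per-label relevance."""
    if limit is not None:
        ranking = ranking[:limit]
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), ...
    return sum(
        relevance[label] / math.log2(pos + 2) for pos, label in enumerate(ranking)
    )


def ndcg_sketch(relevance, ranking, limit=None):
    """DCG divided by the DCG of an ideal ranking."""
    ideal = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    ideal_dcg = dcg_sketch(relevance, ideal, limit)
    if ideal_dcg == 0:
        return 1.0  # assumption: nothing relevant counts as a perfect score
    return dcg_sketch(relevance, ranking, limit) / ideal_dcg


relevance = [1, 0, 1, 0]   # labels 0 and 2 are relevant
ranking = [2, 1, 0]        # suggested labels, best first
dcg = dcg_sketch(relevance, ranking)
ndcg = ndcg_sketch(relevance, ranking)
```

Here the relevant labels land at positions 1 and 3, so the DCG is 1 + 1/2 = 1.5, while an ideal ranking would place them at positions 1 and 2.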

annif.eval.true_positives(y_true: csr_array, y_pred: csr_array) → int: calculate the number of true positives using bitwise operations, emulating the way sklearn evaluation metric functions work
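The bitwise counting idea behind these three functions can be sketched with plain boolean lists in place of csr_array inputs. The helper name confusion_counts_sketch is hypothetical, and the sketch assumes both lists have the same length.

```python
def confusion_counts_sketch(y_true, y_pred):
    """Count TP, FP and FN over two equal-length lists of booleans."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t & p for t, p in pairs)          # relevant and suggested
    fp = sum((not t) & p for t, p in pairs)    # suggested but not relevant
    fn = sum(t & (not p) for t, p in pairs)    # relevant but not suggested
    return tp, fp, fn


y_true = [True, True, False, False]
y_pred = [True, False, True, False]
tp, fp, fn = confusion_counts_sketch(y_true, y_pred)
```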

annif.exception module

Custom exceptions used by Annif

exception annif.exception.AnnifException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: ClickException

Base Annif exception. We define this as a subclass of ClickException so that the CLI can automatically handle exceptions. This exception cannot be instantiated directly - subclasses should be used instead.

format_message() → str

prefix = None

exception annif.exception.ConfigurationException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when a project or backend is misconfigured.

prefix = 'Misconfigured'

exception annif.exception.NotEnabledException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when an operation is not enabled.

prefix = 'Not enabled'

exception annif.exception.NotInitializedException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when attempting to use a project or backend that cannot be initialized, most likely because it lacks a vocabulary or has not been trained.

prefix = "Couldn't initialize"

exception annif.exception.NotSupportedException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when an operation is not supported by a project or backend.

prefix = 'Not supported'

exception annif.exception.OperationFailedException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when an operation fails for some unknown reason.

prefix = 'Operation failed'

annif.hfh_util module

Utility functions for interactions with Hugging Face Hub.

annif.hfh_util.check_is_download_allowed(trust_repo, repo_id): Check if downloading from the specified repository is allowed based on the trust option and cache status.

annif.hfh_util.copy_project_config(src_path: str, force: bool) → None: Copy a given project configuration file to projects.d/ directory.

annif.hfh_util.download_from_hf_hub(filename: str, repo_id: str, token: str, revision: str) → list[str]

annif.hfh_util.get_matching_project_ids_from_hf_hub(project_ids_pattern: str, repo_id: str, token, revision: str) → list[str]: Get project IDs of the projects in a Hugging Face Model Hub repository that match the given pattern.

annif.hfh_util.get_matching_projects(pattern: str) → list[AnnifProject]: Get projects that match the given pattern.

annif.hfh_util.get_vocab_id_from_config(config_path: str) → str: Get the vocabulary ID from a configuration file.

annif.hfh_util.prepare_commits(projects: list[AnnifProject], repo_id: str, token: str) → tuple[list, list]: Prepare and pre-upload data and config commit operations for projects to a Hugging Face Hub repository.

annif.hfh_util.unzip_archive(src_path: str, force: bool) → None: Unzip a zip archive of projects and vocabularies to a directory, by default data/ under current directory.

annif.hfh_util.upsert_modelcard(repo_id, projects, token, revision): This function creates or updates a Model Card in a Hugging Face Hub repository with some metadata in it.

annif.parallel module

Parallel processing functionality for Annif

class annif.parallel.BaseWorker

Bases: object

Base class for workers that implement tasks executed via multiprocessing. The init method can be used to store data objects that are necessary for the operation. They will be stored in a class attribute that is accessible to the static worker method. The storage solution is inspired by this blog post: https://thelaziestprogrammer.com/python/multiprocessing-pool-a-global-solution

args = None

classmethod init(args) → None
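The class-attribute storage pattern can be sketched as below. BaseWorkerSketch and SquareWorker are hypothetical names, and the demonstration runs in a single process; with a multiprocessing pool, the init method would typically be passed as the pool initializer so that it runs once in each worker process.

```python
class BaseWorkerSketch:
    """Keep shared data in a class attribute so static workers can read it."""

    args = None

    @classmethod
    def init(cls, args):
        # Intended to run once per worker process, e.g. as a Pool initializer.
        cls.args = args


class SquareWorker(BaseWorkerSketch):
    @staticmethod
    def work(x):
        # The static method reads the data stored by init().
        (offset,) = SquareWorker.args
        return x * x + offset


# In-process demonstration. With multiprocessing one would instead pass
# initializer=SquareWorker.init and initargs=((10,),) to the pool.
SquareWorker.init((10,))
results = [SquareWorker.work(x) for x in range(4)]
```

Storing the data on the class avoids pickling it again for every task, since only the small per-task argument is sent to the worker.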

class annif.parallel.ProjectSuggestMap(registry: AnnifRegistry, project_ids: list[str], backend_params: defaultdict[str, Any] | None, limit: int | None, threshold: float)

Bases: object

A utility class that can be used to wrap one or more projects and provide a mapping method that converts Document objects to suggestions. Intended to be used with the multiprocessing module.

suggest(doc: Document) → tuple[dict[str, SuggestionResult], SubjectSet]

suggest_batch(batch) → tuple[dict[str, SuggestionBatch], Iterator[SubjectSet]]

annif.parallel.get_pool(n_jobs: int) → tuple[int | None, Callable]: return a suitable constructor for multiprocessing pool class, and the correct jobs argument for it, for the given number of parallel jobs

annif.project module

Project management functionality for Annif

class annif.project.Access(value)

Bases: IntEnum

Enumeration of access levels for projects

hidden = 2

private = 1

public = 3
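Because Access is an IntEnum, access levels can be compared directly. The sketch below reuses the values listed above; the project table and the visible_projects helper are hypothetical, and filtering with >= is an assumption about how a minimum access level is applied.

```python
from enum import IntEnum


class Access(IntEnum):
    private = 1
    hidden = 2
    public = 3


# Hypothetical project table: project ID -> access level.
projects = {
    "internal": Access.private,
    "beta": Access.hidden,
    "main": Access.public,
}


def visible_projects(min_access=Access.private):
    """Return IDs of projects whose access level is at least min_access."""
    return [pid for pid, access in projects.items() if access >= min_access]
```

With the default of Access.private every project is returned, while Access.public narrows the result to public projects only.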

class annif.project.AnnifProject(project_id: str, config: dict[str, str] | SectionProxy, datadir: str, registry: AnnifRegistry)

Bases: DatadirMixin

Class representing the configuration of a single Annif project.

DEFAULT_ACCESS = 'public'

property analyzer: Analyzer

property backend: AnnifBackend | None

dump() → dict[str, str | dict | bool | datetime | None]: return this project as a dict

hyperopt(corpus: DocumentCorpus, trials: int, jobs: int, metric: str, results_file: LazyFile | None) → HPRecommendation: optimize the hyperparameters of the project using a validation corpus against a given metric

initialize(parallel: bool = False) → None: Initialize this project and its backend so that they are ready to be used. If parallel is True, expect that the project will be used for parallel processing.

initialized = False

property is_trained: bool | None

learn(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None) → None: further train the project using documents from a metadata source

property modification_time: datetime | None

remove_model_data() → None: remove the data of this project

property subjects: SubjectIndex

suggest(documents: list[Document], backend_params: defaultdict[str, dict] | None = None) → annif.suggestion.SuggestionBatch: Suggest subjects for the given documents batch.

suggest_corpus(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None) → annif.suggestion.SuggestionResults: Suggest subjects for the given documents corpus in batches of documents.

train(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None, jobs: int = 0) → None: train the project using documents from a metadata source

property transform: TransformChain

property vocab: AnnifVocabulary

property vocab_lang: str

annif.registry module

Registry that keeps track of Annif projects

class annif.registry.AnnifRegistry(projects_config_path: str, datadir: str, init_projects: bool)

Bases: object

Class that keeps track of the Annif projects and vocabularies

get_project(project_id: str, min_access: Access = Access.private) → AnnifProject: return the definition of a single Project by project_id

get_projects(min_access: Access = Access.private) → dict[str, AnnifProject]: Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.

get_vocab(vocab_id: str) → AnnifVocabulary: Return an AnnifVocabulary by vocab_id

annif.registry.get_project(project_id: str, min_access: Access = Access.private) → AnnifProject: return the definition of a single Project by project_id

annif.registry.get_projects(min_access: Access = Access.private) → dict[str, AnnifProject]: Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.

annif.registry.get_vocab(vocab_id: str, min_access: Access = Access.private) → AnnifVocabulary: return a single AnnifVocabulary by vocabulary id

annif.registry.get_vocabs(min_access: Access = Access.private) → dict[str, AnnifVocabulary]: Return the available vocabularies as a dict of vocab_id -> AnnifVocabulary. The min_access parameter may be used to set the minimum access level required for the returned vocabularies.

annif.registry.initialize_projects(app: Flask) → None

annif.rest module

Definitions for REST API operations. These are wired via Connexion to methods defined in the OpenAPI specification.

annif.rest.detect_language(body: dict[str, Any]): return scores for detected languages formatted according to Swagger spec

annif.rest.language_not_supported_error(lang: str) → ConnexionResponse: return a Connexion error object when attempting to use unsupported language

annif.rest.learn(project_id: str, body: list[dict[str, Any]]) → ConnexionResponse | tuple[None, int]: learn from documents and return an empty 204 response if successful

annif.rest.learning_not_enabled_error(project_id) → ConnexionResponse: return a Connexion error object when a project is not configured for learning

annif.rest.list_projects() → tuple: return a dict with projects formatted according to OpenAPI spec

annif.rest.list_vocabs() → tuple: return a dict with vocabularies formatted according to OpenAPI spec

annif.rest.project_not_found_error(project_id: str) → ConnexionResponse: return a Connexion error object when a project is not found

annif.rest.server_error(err: AnnifException) → ConnexionResponse: return a Connexion error object when there is a server error (project or backend problem)

annif.rest.show_info() → tuple: return the version of Annif and a title for the API according to OpenAPI spec

annif.rest.show_project(project_id: str) → dict | ConnexionResponse: return a single project formatted according to OpenAPI spec

annif.rest.suggest(project_id: str, body: dict[str, Any]) → dict[str, list] | ConnexionResponse: suggest subjects for the given text and return a dict with results formatted according to OpenAPI spec

annif.rest.suggest_batch(project_id: str, body: dict[str, list], **query_parameters) → list[dict[str, Any]] | ConnexionResponse: suggest subjects for the given documents and return a list of dicts with results formatted according to OpenAPI spec

annif.simplemma_util module

Wrapper code for using Simplemma functionality in Annif

annif.simplemma_util.detect_language(text: str, languages: Tuple[str, ...]) → Dict[str, float]

annif.simplemma_util.get_language_detector(lang: str | Tuple[str, ...]) → LanguageDetector

annif.suggestion module

Representing suggested subjects.

class annif.suggestion.SubjectSuggestion(subject_id, score)

Bases: tuple

score: Alias for field number 1

subject_id: Alias for field number 0

class annif.suggestion.SuggestionBatch(array: csr_array)

Bases: object

Subject suggestions for a batch of documents.

filter(limit: int | None = None, threshold: float = 0.0) → SuggestionBatch: Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionBatch object.

classmethod from_averaged(batches: list[SuggestionBatch], weights: list[float]) → SuggestionBatch: Create a new SuggestionBatch where the subject scores are the weighted average of scores in several SuggestionBatches

classmethod from_sequence(suggestion_results: Sequence[Iterable[SubjectSuggestion]], subject_index: SubjectIndex, limit: int | None = None) → SuggestionBatch: Create a new SuggestionBatch from a sequence where each item is a sequence of SubjectSuggestion objects.
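The weighted averaging described for from_averaged can be sketched on plain lists of scores. This is an illustration only: weighted_average_sketch is a hypothetical name, the inputs are dense lists rather than sparse arrays, and dividing by the sum of the weights is an assumption.

```python
def weighted_average_sketch(score_rows, weights):
    """Combine per-source score vectors into one weighted-average vector."""
    total = sum(weights)
    n_subjects = len(score_rows[0])
    return [
        sum(row[i] * w for row, w in zip(score_rows, weights)) / total
        for i in range(n_subjects)
    ]


# Two sources scoring the same three subjects; the first counts three times
# as much as the second.
averaged = weighted_average_sketch([[0.8, 0.0, 0.4], [0.4, 0.2, 0.0]], [3.0, 1.0])
```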

class annif.suggestion.SuggestionResult(array: csr_array, idx: int)

Bases: object

Suggestions for a single document, backed by a row of a sparse array.

as_vector() → ndarray

class annif.suggestion.SuggestionResults(batches: Iterable[SuggestionBatch])

Bases: object

Subject suggestions for a potentially very large number of documents.

filter(limit: int | None = None, threshold: float = 0.0) → SuggestionResults: Return a view of these suggestions, filtered by the given limit and/or threshold, as another SuggestionResults object.

annif.suggestion.filter_suggestion(preds: csr_array, limit: int | None = None, threshold: float = 0.0) → csr_array: filter a 2D sparse suggestion array (csr_array), retaining only the top K suggestions with a score above or equal to the threshold for each individual prediction; the rest will be left as zeros
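The limit and threshold filtering can be sketched for a single row of scores. The sketch uses a dense Python list instead of a csr_array, and the name filter_row_sketch and the tie-breaking by original position are assumptions.

```python
def filter_row_sketch(scores, limit=None, threshold=0.0):
    """Zero out everything except the top `limit` scores >= threshold."""
    # Candidates are nonzero scores that meet the threshold.
    candidates = [i for i, s in enumerate(scores) if s > 0 and s >= threshold]
    # Highest scores first; the sort is stable, so ties keep index order.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    if limit is not None:
        candidates = candidates[:limit]
    keep = set(candidates)
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


row = [0.1, 0.9, 0.0, 0.5, 0.3]
filtered = filter_row_sketch(row, limit=2, threshold=0.2)
```

The output keeps the same length as the input, so positions still correspond to subject indices and dropped suggestions simply become zeros.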

annif.suggestion.vector_to_suggestions(vector: np.ndarray, limit: int) → Iterator

annif.util module

Utility functions for Annif

class annif.util.DuplicateFilter

Bases: Filter

Filter out log messages that have already been displayed.

filter(record: LogRecord) → bool

Determine if the specified record is to be logged.

Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.
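A duplicate-suppressing filter can be sketched with the standard logging module. DuplicateFilterSketch is a hypothetical stand-in that remembers message texts in a set; how Annif's own class tracks seen messages is not shown in this reference. ListHandler is a helper added only to capture output.

```python
import logging


class DuplicateFilterSketch(logging.Filter):
    """Let each distinct log message through only once."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def filter(self, record):
        message = record.getMessage()
        if message in self.seen:
            return False  # already displayed, suppress it
        self.seen.add(message)
        return True


class ListHandler(logging.Handler):
    """Helper that collects emitted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("duplicate_filter_demo")
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(DuplicateFilterSketch())

logger.warning("model not trained")
logger.warning("model not trained")  # suppressed as a repeat
logger.warning("vocabulary missing")
```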

annif.util.atomic_save(obj: Any, dirname: str, filename: str, method: Callable | None = None) → None: Save the given object (which must have a .save() method, unless the method parameter is given) into the given directory with the given filename, using a temporary file and renaming the temporary file to the final name.
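The write-then-rename technique can be sketched with the standard library. Unlike atomic_save, this hypothetical atomic_write_sketch writes a text string directly instead of calling a .save() method on an object, so it shows only the temporary file and rename steps.

```python
import os
import tempfile


def atomic_write_sketch(text, dirname, filename):
    """Write text to dirname/filename via a temporary file and a rename."""
    # Create the temporary file in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix=filename, dir=dirname)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
        os.replace(tmp_path, os.path.join(dirname, filename))
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    atomic_write_sketch("model data", workdir, "model.txt")
    with open(os.path.join(workdir, "model.txt"), encoding="utf-8") as f:
        saved = f.read()
    leftovers = os.listdir(workdir)
```

A reader that opens the final filename sees either the old complete file or the new complete file, never a partially written one.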

annif.util.boolean(val: Any) → bool: Convert the given value to a boolean True/False value, if it isn’t already. True values are ‘1’, ‘yes’, ‘true’, and ‘on’ (case insensitive), everything else is False.
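The conversion rule can be sketched as follows. The name boolean_sketch is hypothetical, and converting non-string values with str() before comparing is an assumption.

```python
def boolean_sketch(val):
    """Convert a value to bool using the true values listed above."""
    if isinstance(val, bool):
        return val  # already a boolean
    # Assumption: other values are compared by their string form.
    return str(val).lower() in ("1", "yes", "true", "on")


checks = {value: boolean_sketch(value) for value in ("YES", "on", "1", "off", "0", "")}
```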

annif.util.cleanup_uri(uri: str) → str: remove angle brackets from a URI, if any

annif.util.identity(x: Any) → Any: Identity function: return the given argument unchanged

annif.util.metric_code(metric): Convert a human-readable metric name into an alphanumeric string

annif.util.parse_args(param_string: str) → tuple[list, dict]: Parse a string of comma separated arguments such as ‘42,43,key=abc’ into a list of positional args [42, 43] and a dict of keyword args {key: abc}
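A minimal sketch of the splitting logic, assuming that items containing an equals sign are keyword arguments and that all values are kept as strings rather than converted to numbers. The name parse_args_sketch is hypothetical.

```python
def parse_args_sketch(param_string):
    """Split 'a,b,key=value' into positional and keyword arguments."""
    posargs, kwargs = [], {}
    if not param_string:
        return posargs, kwargs
    for part in param_string.split(","):
        if "=" in part:
            key, value = part.split("=", 1)
            kwargs[key] = value
        else:
            posargs.append(part)  # assumption: kept as a string
    return posargs, kwargs


posargs, kwargs = parse_args_sketch("42,43,key=abc")
```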

annif.util.parse_sources(sourcedef: str) → list[tuple[str, float]]: parse a source definition such as ‘src1:1.0,src2’ into a sequence of tuples (src_id, weight)
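The source definition format can be sketched as below. The name parse_sources_sketch is hypothetical, and giving a source without an explicit weight the value 1.0 is an assumption. The real function may post-process the weights further (for example, by normalizing them); this sketch does not.

```python
def parse_sources_sketch(sourcedef):
    """Parse 'src1:1.0,src2' into a list of (src_id, weight) tuples."""
    sources = []
    for item in sourcedef.strip().split(","):
        src_id, _, weight = item.partition(":")
        # Assumption: a missing weight defaults to 1.0.
        sources.append((src_id.strip(), float(weight) if weight else 1.0))
    return sources


sources = parse_sources_sketch("src1:1.0,src2")
weighted = parse_sources_sketch("a:2,b:0.5")
```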

annif.util.suggestion_results_to_list(suggestion_results: SuggestionResults, subjects: SubjectIndex, lang: str) → list[dict[str, list]]

annif.util.suggestion_to_dict(suggestion: SubjectSuggestion, subject_index: SubjectIndex, language: str) → dict[str, str | float | None]

annif.views module

annif.views.home()

Module contents

annif.create_app(config_name: str | None = None) → FlaskApp: Create a Connexion app to be used for the API.

annif.create_cx_app(config_name: str | None = None) → FlaskApp: Create a Connexion app to be used for the API.

annif.create_flask_app(config_name: str | None = None) → Flask: Create a Flask app to be used by the CLI.