annif package

Subpackages

Submodules

annif.cli module

Definitions for command-line (Click) commands for invoking Annif operations and printing the results to the console.

annif.cli_util module

Utility functions for Annif CLI commands

annif.cli_util.backend_param_option(f)

Decorator to add an option for CLI commands to override backend parameters

annif.cli_util.common_options(f)

Decorator to add common options for all CLI commands

annif.cli_util.complete_param(ctx: Context, param: Argument, incomplete: str) list[str]
annif.cli_util.docs_limit_option(f)

Decorator to add an option for CLI commands to limit the number of documents to use

annif.cli_util.format_datetime(dt: datetime | None) str

Helper function to format a datetime object as a string in the local time.

annif.cli_util.generate_filter_params(filter_batch_max_limit: int) list[tuple[int, float]]
annif.cli_util.get_project(project_id: str) AnnifProject

Helper function to get a project by ID and bail out if it doesn’t exist

annif.cli_util.get_vocab(vocab_id: str) AnnifVocabulary

Helper function to get a vocabulary by ID and bail out if it doesn’t exist

annif.cli_util.make_list_template(*rows) str

Helper function to create a template for a list of entries with fields of variable width. The width of each field is determined by the longest item in the field in the given rows.
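The width calculation described above can be sketched as follows. This is a minimal illustration, not Annif's actual implementation: the two-space separator and the left alignment are assumptions, and the sample project IDs are made up.

```python
def make_list_template(*rows):
    """Build a str.format template whose field widths fit the longest item per column."""
    # zip(*rows) regroups the rows into columns
    widths = [max(len(str(item)) for item in column) for column in zip(*rows)]
    # Left-align each field and pad it to its column width (assumed layout)
    return "  ".join(f"{{:<{width}}}" for width in widths)


header = ("Project ID", "Language")
entries = [("tfidf-en", "en"), ("ensemble-fi", "fi")]  # hypothetical rows
template = make_list_template(header, *entries)
lines = [template.format(*row) for row in (header, *entries)]
```

Because the header row is passed in along with the data rows, a long column title widens its field in the same way a long value does.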

annif.cli_util.open_documents(paths: tuple[str, ...], subject_index: SubjectIndex, vocab_lang: str, docs_limit: int | None) DocumentCorpus

Helper function to open a document corpus from a list of pathnames, each of which is either a TSV file or a directory of TXT files. For directories with subjects in TSV files, the given vocabulary language will be used to convert subject labels into URIs. The corpus will be returned as an instance of DocumentCorpus or LimitingDocumentCorpus.

annif.cli_util.open_text_documents(paths: tuple[str, ...], docs_limit: int | None) DocumentList

Helper function to read text documents from the given file paths. Returns a DocumentList object with Documents having no subjects. If a path is “-”, the document text is read from standard input. The maximum number of documents to read is set by the docs_limit parameter.

annif.cli_util.parse_backend_params(backend_param: tuple[str, ...] | tuple[()], project: AnnifProject) collections.defaultdict[str, dict[str, str]]

Parse a list of backend parameters given with the --backend-param option into a nested dict structure
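A rough sketch of the nested structure this produces, assuming each parameter is written as `<backend>.<key>=<value>` (the exact syntax is not stated here, so treat it as an assumption). The real function also receives the project, presumably for validation, which this sketch leaves out.

```python
from collections import defaultdict


def parse_backend_params(backend_params):
    """Group 'backend.key=value' strings into {backend: {key: value}}."""
    parsed = defaultdict(dict)
    for item in backend_params:
        # Split on the first '.' and then the first '=' so values may contain both
        backend_id, _, rest = item.partition(".")
        key, _, value = rest.partition("=")
        parsed[backend_id][key] = value  # values stay strings
    return parsed


# Hypothetical backend IDs and parameter names
params = parse_backend_params(("tfidf.limit=50", "fasttext.dim=100", "fasttext.lr=0.25"))
```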

annif.cli_util.project_id(f)

Decorator to add a project ID parameter to a CLI command

annif.cli_util.show_hits(hits: SuggestionResult, project: AnnifProject, lang: str, file: io.TextIOWrapper | None = None) None

Print subject suggestions to the console or a file. The suggestions are displayed as a table, with one row per hit. Each row contains the URI, label, possible notation, and score of the suggestion. The label is given in the specified language.

annif.config module

Configuration file handling

class annif.config.AnnifConfigCFG(filename: str | None = None, projstr: str | None = None)

Bases: object

Class for reading configuration in CFG/INI format

property project_ids: list[str]
class annif.config.AnnifConfigDirectory(directory: str)

Bases: object

Class for reading configuration from directory

property project_ids
class annif.config.AnnifConfigTOML(filename: str)

Bases: object

Class for reading configuration in TOML format

property project_ids
annif.config.check_config(projects_config_path: str) str | None
annif.config.find_config() str | None
annif.config.parse_config(projects_config_path: str) AnnifConfigDirectory | AnnifConfigCFG | AnnifConfigTOML | None

annif.datadir module

Mixin class for types that need a data directory

class annif.datadir.DatadirMixin(datadir: str, typename: str, identifier: str)

Bases: object

Mixin class for types that need a data directory for storing files

property datadir: str

annif.default_config module

A configuration module, where “Config” is a default configuration and the other classes are different configuration profiles overriding default settings.

class annif.default_config.Config

Bases: object

DATADIR = 'data'
DEBUG = False
INITIALIZE_PROJECTS = False
PROJECTS_CONFIG_PATH = ''
TESTING = False
class annif.default_config.DevelopmentConfig

Bases: Config

DEBUG = True
class annif.default_config.ProductionConfig

Bases: Config

INITIALIZE_PROJECTS = True
class annif.default_config.TestingConfig

Bases: Config

DATADIR = 'tests/data'
PROJECTS_CONFIG_PATH = 'tests/projects.cfg'
TESTING = True
class annif.default_config.TestingDirectoryConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.d'
class annif.default_config.TestingInitializeConfig

Bases: TestingConfig

INITIALIZE_PROJECTS = True
class annif.default_config.TestingInvalidProjectsConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects_invalid.cfg'
class annif.default_config.TestingNoProjectsConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/notfound.cfg'
class annif.default_config.TestingTOMLConfig

Bases: TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.toml'
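The profile classes above rely on ordinary class inheritance: each profile states only the settings it changes and inherits the rest. A small sketch reproducing three of the listed classes shows the effect; `effective_settings` is a hypothetical helper added for illustration and is not part of Annif.

```python
class Config:
    DATADIR = "data"
    DEBUG = False
    INITIALIZE_PROJECTS = False
    PROJECTS_CONFIG_PATH = ""
    TESTING = False


class TestingConfig(Config):
    DATADIR = "tests/data"
    PROJECTS_CONFIG_PATH = "tests/projects.cfg"
    TESTING = True


class TestingTOMLConfig(TestingConfig):
    PROJECTS_CONFIG_PATH = "tests/projects.toml"


def effective_settings(profile):
    """Collect the upper-case settings visible on a profile class (hypothetical helper)."""
    return {name: getattr(profile, name) for name in dir(profile) if name.isupper()}


settings = effective_settings(TestingTOMLConfig)
```

Here `TestingTOMLConfig` overrides only the projects path; `DATADIR` and `TESTING` come from `TestingConfig`, and `DEBUG` from `Config`.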

annif.eval module

Evaluation metrics for Annif

class annif.eval.EvaluationBatch(subject_index: SubjectIndex)

Bases: object

A class for evaluating batches of results using all available metrics. Call evaluate() once per document in the batch, or evaluate_many() with a list of documents from the batch. Final results can be queried using the results() method.

evaluate_many(suggestion_batch: list[list[SubjectSuggestion]] | SuggestionBatch | list[Iterator], gold_subject_batch: Sequence[SubjectSet]) None
output_result_per_subject(y_true: csr_array, y_pred: csr_array, results_file: TextIOWrapper | LazyFile, language: str) None

Write results per subject (non-aggregated) to the output file results_file, using labels in the given language

results(metrics: Iterable[str] = [], results_file: LazyFile | TextIOWrapper | None = None, language: str | None = None) dict[str, float]

Evaluate a set of selected subjects against a gold standard using different metrics. If metrics is empty, use all available metrics. If results_file (a file object) is given, write results per subject to it with labels expressed in the given language.

annif.eval.dcg_score(y_true: csr_array, y_pred: csr_array, limit: int | None = None) np.float64

return the discounted cumulative gain (DCG) score for the selected labels vs. relevant labels

annif.eval.false_negatives(y_true: csr_array, y_pred: csr_array) int

calculate the number of false negatives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.false_positives(y_true: csr_array, y_pred: csr_array) int

calculate the number of false positives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.ndcg_score(y_true: csr_array, y_pred: csr_array, limit: int | None = None) float

return the normalized discounted cumulative gain (nDCG) score for the selected labels vs. relevant labels
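For orientation, the standard DCG and nDCG definitions for a single document can be sketched in plain Python. This is not the module's code: the real functions take sparse csr_array inputs, and the list-based inputs and handling of empty gold standards here are assumptions.

```python
import math


def dcg_score(y_true, y_pred, limit=None):
    """DCG of one document: y_true holds 0/1 relevance, y_pred holds scores."""
    # Rank subject indices by predicted score, best first
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    if limit is not None:
        order = order[:limit]
    # The gain at 0-based rank r is discounted by log2(r + 2)
    return sum(y_true[i] / math.log2(rank + 2) for rank, i in enumerate(order))


def ndcg_score(y_true, y_pred, limit=None):
    """DCG divided by the DCG of an ideal ranking (all relevant subjects first)."""
    ideal = dcg_score(y_true, y_true, limit)
    return dcg_score(y_true, y_pred, limit) / ideal if ideal > 0 else 0.0


relevant = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.1, 0.0]
dcg = dcg_score(relevant, scores)    # 1/log2(2) + 1/log2(4) = 1.5
ndcg = ndcg_score(relevant, scores)  # below 1 because one relevant subject is ranked third
```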

annif.eval.true_positives(y_true: csr_array, y_pred: csr_array) int

calculate the number of true positives using bitwise operations, emulating the way sklearn evaluation metric functions work
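The three counting functions differ only in how they combine the gold-standard and predicted indicators. A plain-Python sketch of that logic follows, using nested lists in place of the sparse arrays the real functions accept; the `_count` helper is hypothetical.

```python
def _count(y_true, y_pred, combine):
    """Apply combine() to every (gold, predicted) cell and count the hits."""
    return sum(
        combine(bool(t), bool(p))
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )


def true_positives(y_true, y_pred):
    return _count(y_true, y_pred, lambda t, p: t & p)


def false_positives(y_true, y_pred):
    return _count(y_true, y_pred, lambda t, p: (not t) & p)


def false_negatives(y_true, y_pred):
    return _count(y_true, y_pred, lambda t, p: t & (not p))


# Rows are documents, columns are subjects
gold = [[1, 0, 1], [0, 1, 0]]
pred = [[1, 1, 0], [0, 1, 1]]
counts = (
    true_positives(gold, pred),
    false_positives(gold, pred),
    false_negatives(gold, pred),
)
```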

annif.exception module

Custom exceptions used by Annif

exception annif.exception.AnnifException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: ClickException

Base Annif exception. We define this as a subclass of ClickException so that the CLI can automatically handle exceptions. This exception cannot be instantiated directly; subclasses should be used instead.

format_message() str
prefix = None
exception annif.exception.ConfigurationException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when a project or backend is misconfigured.

prefix = 'Misconfigured'
exception annif.exception.NotInitializedException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when attempting to use a project or backend that cannot be initialized, most likely because it lacks a vocabulary or has not yet been trained.

prefix = "Couldn't initialize"
exception annif.exception.NotSupportedException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when an operation is not supported by a project or backend.

prefix = 'Not supported'
exception annif.exception.OperationFailedException(message: str, project_id: str | None = None, backend_id: str | None = None)

Bases: AnnifException

Exception raised when an operation fails for some unknown reason.

prefix = 'Operation failed'
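The pattern in this module, a base class with a `prefix` attribute that each subclass overrides, can be sketched as below. The sketch derives from the built-in Exception rather than ClickException to stay self-contained, and the message layout produced by `format_message()` is an assumption, as are the class names ending in `Sketch`.

```python
class AnnifExceptionSketch(Exception):
    prefix = None  # subclasses supply the message prefix

    def __init__(self, message, project_id=None, backend_id=None):
        super().__init__(message)
        self.message = message
        self.project_id = project_id
        self.backend_id = backend_id

    def format_message(self):
        # Assumed layout: "<prefix> [project|backend '<id>']: <message>"
        if self.project_id is not None:
            return f"{self.prefix} project '{self.project_id}': {self.message}"
        if self.backend_id is not None:
            return f"{self.prefix} backend '{self.backend_id}': {self.message}"
        return f"{self.prefix}: {self.message}"


class NotSupportedSketch(AnnifExceptionSketch):
    prefix = "Not supported"


try:
    raise NotSupportedSketch("learning is not available", backend_id="tfidf")
except AnnifExceptionSketch as err:
    text = err.format_message()
```

Catching the base class is enough to handle every subclass, which is what lets a single handler report all of these errors uniformly.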

annif.hfh_util module

Utility functions for interactions with Hugging Face Hub.

annif.hfh_util.check_is_download_allowed(trust_repo, repo_id)

Check if downloading from the specified repository is allowed based on the trust option and cache status.

annif.hfh_util.copy_project_config(src_path: str, force: bool) None

Copy a given project configuration file to projects.d/ directory.

annif.hfh_util.download_from_hf_hub(filename: str, repo_id: str, token: str, revision: str) list[str]
annif.hfh_util.get_matching_project_ids_from_hf_hub(project_ids_pattern: str, repo_id: str, token, revision: str) list[str]

Get project IDs of the projects in a Hugging Face Model Hub repository that match the given pattern.

annif.hfh_util.get_matching_projects(pattern: str) list[AnnifProject]

Get projects that match the given pattern.

annif.hfh_util.get_vocab_id_from_config(config_path: str) str

Get the vocabulary ID from a configuration file.

annif.hfh_util.prepare_commits(projects: list[AnnifProject], repo_id: str, token: str) tuple[list, list]

Prepare and pre-upload data and config commit operations for projects to a Hugging Face Hub repository.

annif.hfh_util.unzip_archive(src_path: str, force: bool) None

Unzip a zip archive of projects and vocabularies to a directory, by default data/ under the current directory.

annif.hfh_util.upsert_modelcard(repo_id, projects, token, revision)

Create or update a Model Card, including some metadata, in a Hugging Face Hub repository.

annif.parallel module

Parallel processing functionality for Annif

class annif.parallel.BaseWorker

Bases: object

Base class for workers that implement tasks executed via multiprocessing. The init method can be used to store data objects that are necessary for the operation. They will be stored in a class attribute that is accessible to the static worker method. The storage solution is inspired by this blog post: https://thelaziestprogrammer.com/python/multiprocessing-pool-a-global-solution
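A minimal sketch of the class-attribute storage pattern described above. `LengthWorker` and its `work` method are hypothetical, and the calls are made directly here rather than through a process pool to keep the example simple.

```python
class BaseWorker:
    args = None  # shared data, set once per worker process

    @classmethod
    def init(cls, args):
        cls.args = args


class LengthWorker(BaseWorker):
    """Hypothetical worker that scales text lengths by a stored factor."""

    @staticmethod
    def work(text):
        # The static method reads the data stored on the class by init()
        (factor,) = LengthWorker.args
        return len(text) * factor


# With a real pool, init would be passed as the initializer, for example
# multiprocessing.Pool(initializer=LengthWorker.init, initargs=[(2,)]),
# so that each worker process stores the data before mapping begins.
LengthWorker.init((2,))
results = [LengthWorker.work(text) for text in ("cat", "horse")]
```

Storing the data once per process avoids pickling it again for every task sent to the pool.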

args = None
classmethod init(args) None
class annif.parallel.ProjectSuggestMap(registry: AnnifRegistry, project_ids: list[str], backend_params: defaultdict[str, Any] | None, limit: int | None, threshold: float)

Bases: object

A utility class that can be used to wrap one or more projects and provide a mapping method that converts Document objects to suggestions. Intended to be used with the multiprocessing module.

suggest(doc: Document) tuple[dict[str, SuggestionResult], SubjectSet]
suggest_batch(batch) tuple[dict[str, SuggestionBatch], Iterator[SubjectSet]]
annif.parallel.get_pool(n_jobs: int) tuple[int | None, Callable]

return a suitable constructor for multiprocessing pool class, and the correct jobs argument for it, for the given number of parallel jobs

annif.project module

Project management functionality for Annif

class annif.project.Access(value)

Bases: IntEnum

Enumeration of access levels for projects

hidden = 2
private = 1
public = 3
class annif.project.AnnifProject(project_id: str, config: dict[str, str] | SectionProxy, datadir: str, registry: AnnifRegistry)

Bases: DatadirMixin

Class representing the configuration of a single Annif project.

DEFAULT_ACCESS = 'public'
property analyzer: Analyzer
property backend: AnnifBackend | None
dump() dict[str, str | dict | bool | datetime | None]

return this project as a dict

hyperopt(corpus: DocumentCorpus, trials: int, jobs: int, metric: str, results_file: LazyFile | None) HPRecommendation

optimize the hyperparameters of the project using a validation corpus against a given metric

initialize(parallel: bool = False) None

Initialize this project and its backend so that they are ready to be used. If parallel is True, expect that the project will be used for parallel processing.

initialized = False
property is_trained: bool | None
learn(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None) None

further train the project using documents from a metadata source

property modification_time: datetime | None
remove_model_data() None

remove the data of this project

property subjects: SubjectIndex
suggest(texts: list[str], backend_params: defaultdict[str, dict] | None = None) annif.suggestion.SuggestionBatch

Suggest subjects for the given documents batch.

suggest_corpus(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None) annif.suggestion.SuggestionResults

Suggest subjects for the given documents corpus in batches of documents.

train(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None, jobs: int = 0) None

train the project using documents from a metadata source

property transform: TransformChain
property vocab: AnnifVocabulary
property vocab_lang: str

annif.registry module

Registry that keeps track of Annif projects

class annif.registry.AnnifRegistry(projects_config_path: str, datadir: str, init_projects: bool)

Bases: object

Class that keeps track of the Annif projects and vocabularies

get_project(project_id: str, min_access: Access = Access.private) AnnifProject

return the definition of a single Project by project_id

get_projects(min_access: Access = Access.private) dict[str, AnnifProject]

Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.

get_vocab(vocab_spec: str, default_language: str | None) tuple[AnnifVocabulary, None] | tuple[AnnifVocabulary, str]

Return an (AnnifVocabulary, language) pair corresponding to the vocab_spec. If no language information is specified, use the given default language.

annif.registry.get_project(project_id: str, min_access: Access = Access.private) AnnifProject

return the definition of a single Project by project_id

annif.registry.get_projects(min_access: Access = Access.private) dict[str, AnnifProject]

Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.
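Since Access is an IntEnum, a minimum access level can be applied with an ordinary comparison. The sketch below assumes that a project is returned when its level is greater than or equal to min_access; the `visible_projects` helper and the project IDs are hypothetical, and projects are reduced to their access level for brevity.

```python
from enum import IntEnum


class Access(IntEnum):
    private = 1
    hidden = 2
    public = 3


def visible_projects(projects, min_access=Access.private):
    """Keep projects whose access level is at least min_access (assumed rule)."""
    return {pid: level for pid, level in projects.items() if level >= min_access}


# Hypothetical project IDs mapped directly to their access levels
projects = {"draft": Access.private, "internal": Access.hidden, "demo": Access.public}
listed = visible_projects(projects, min_access=Access.hidden)
```

With the default of Access.private, the lowest level, every project passes the check.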

annif.registry.get_vocab(vocab_id: str, min_access: Access = Access.private) AnnifVocabulary

return a single AnnifVocabulary by vocabulary id

annif.registry.get_vocabs(min_access: Access = Access.private) dict[str, AnnifVocabulary]

Return the available vocabularies as a dict of vocab_id -> AnnifVocabulary. The min_access parameter may be used to set the minimum access level required for the returned vocabularies.

annif.registry.initialize_projects(app: Flask) None

annif.rest module

Definitions for REST API operations. These are wired via Connexion to methods defined in the OpenAPI specification.

annif.rest.detect_language(body: dict[str, Any])

return scores for detected languages formatted according to Swagger spec

annif.rest.language_not_supported_error(lang: str) ConnexionResponse

return a Connexion error object when attempting to use unsupported language

annif.rest.learn(project_id: str, body: list[dict[str, Any]]) ConnexionResponse | tuple[None, int]

learn from documents and return an empty 204 response if successful

annif.rest.list_projects() tuple

return a dict with projects formatted according to OpenAPI spec

annif.rest.project_not_found_error(project_id: str) ConnexionResponse

return a Connexion error object when a project is not found

annif.rest.server_error(err: AnnifException) ConnexionResponse

return a Connexion error object when there is a server error (project or backend problem)

annif.rest.show_info() tuple

return the version of Annif and a title for the API according to the OpenAPI spec

annif.rest.show_project(project_id: str) dict | ConnexionResponse

return a single project formatted according to OpenAPI spec

annif.rest.suggest(project_id: str, body: dict[str, Any]) dict[str, list] | ConnexionResponse

suggest subjects for the given text and return a dict with results formatted according to OpenAPI spec

annif.rest.suggest_batch(project_id: str, body: dict[str, list], **query_parameters) list[dict[str, Any]] | ConnexionResponse

suggest subjects for the given documents and return a list of dicts with results formatted according to OpenAPI spec

annif.simplemma_util module

Wrapper code for using Simplemma functionality in Annif

annif.simplemma_util.detect_language(text: str, languages: Tuple[str, ...]) Dict[str, float]
annif.simplemma_util.get_language_detector(lang: str | Tuple[str, ...]) LanguageDetector

annif.suggestion module

Representing suggested subjects.

class annif.suggestion.SubjectSuggestion(subject_id, score)

Bases: tuple

score

Alias for field number 1

subject_id

Alias for field number 0

class annif.suggestion.SuggestionBatch(array: csr_array)

Bases: object

Subject suggestions for a batch of documents.

filter(limit: int | None = None, threshold: float = 0.0) SuggestionBatch

Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionBatch object.

classmethod from_averaged(batches: list[SuggestionBatch], weights: list[float]) SuggestionBatch

Create a new SuggestionBatch where the subject scores are the weighted average of scores in several SuggestionBatches

classmethod from_sequence(suggestion_results: Sequence[Iterable[SubjectSuggestion]], subject_index: SubjectIndex, limit: int | None = None) SuggestionBatch

Create a new SuggestionBatch from a sequence where each item is a sequence of SubjectSuggestion objects.

class annif.suggestion.SuggestionResult(array: csr_array, idx: int)

Bases: object

Suggestions for a single document, backed by a row of a sparse array.

as_vector() ndarray
class annif.suggestion.SuggestionResults(batches: Iterable[SuggestionBatch])

Bases: object

Subject suggestions for a potentially very large number of documents.

filter(limit: int | None = None, threshold: float = 0.0) SuggestionResults

Return a view of these suggestions, filtered by the given limit and/or threshold, as another SuggestionResults object.

annif.suggestion.filter_suggestion(preds: csr_array, limit: int | None = None, threshold: float = 0.0) csr_array

filter a 2D sparse suggestion array (csr_array), retaining only the top K suggestions with a score above or equal to the threshold for each individual prediction; the rest will be left as zeros
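The filtering rule can be illustrated on dense nested lists. This sketch is not the module's implementation, which operates on csr_array; treating zero scores as absent suggestions is an assumption carried over from the sparse representation.

```python
def filter_suggestion(preds, limit=None, threshold=0.0):
    """Zero out all but the top `limit` scores >= threshold in each row."""
    filtered = []
    for row in preds:
        # Candidate columns: non-zero scores that reach the threshold
        keep = [i for i, score in enumerate(row) if score > 0 and score >= threshold]
        # Highest scores first, then cut to the limit if one is given
        keep.sort(key=lambda i: row[i], reverse=True)
        if limit is not None:
            keep = keep[:limit]
        kept = set(keep)
        filtered.append([s if i in kept else 0.0 for i, s in enumerate(row)])
    return filtered


# Rows are documents, columns are subjects
preds = [[0.9, 0.2, 0.5, 0.0], [0.1, 0.0, 0.3, 0.05]]
result = filter_suggestion(preds, limit=2, threshold=0.1)
```

In the first row the limit removes the 0.2 score even though it passes the threshold; in the second row the threshold removes 0.05 before the limit applies.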

annif.suggestion.vector_to_suggestions(vector: np.ndarray, limit: int) Iterator

annif.util module

Utility functions for Annif

class annif.util.DuplicateFilter

Bases: Filter

Filter out log messages that have already been displayed.

filter(record: LogRecord) bool

Determine if the specified record is to be logged.

Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.
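One way such a filter could work is to remember the messages already seen and reject repeats. The sketch below is an assumption about the approach, not the class's actual code; `ListHandler` and the logger name are hypothetical and exist only to make the effect observable.

```python
import logging


class DuplicateFilter(logging.Filter):
    """Let each distinct message through only once."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def filter(self, record):
        message = record.getMessage()
        if message in self.seen:
            return False  # already displayed, drop it
        self.seen.add(message)
        return True


class ListHandler(logging.Handler):
    """Hypothetical handler that collects emitted messages for inspection."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("duplicate_filter_demo")
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(DuplicateFilter())

for text in ("model missing", "model missing", "vocab missing"):
    logger.warning(text)
```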

annif.util.atomic_save(obj: Any, dirname: str, filename: str, method: Callable | None = None) None

Save the given object (which must have a .save() method, unless the method parameter is given) into the given directory with the given filename, using a temporary file and renaming the temporary file to the final name.
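The temporary-file-and-rename approach can be sketched with the standard library. The details here are assumptions rather than Annif's code: the temporary file naming, the cleanup on failure, and the `write_text` save callable are all illustrative.

```python
import os
import tempfile


def atomic_save(obj, dirname, filename, method=None):
    """Write via a temporary file in dirname, then rename it into place."""
    fd, tempname = tempfile.mkstemp(prefix=filename, dir=dirname)
    os.close(fd)  # the save callable reopens the file by name
    try:
        if method is not None:
            method(obj, tempname)
        else:
            obj.save(tempname)
        # os.replace is atomic when source and target share a filesystem
        os.replace(tempname, os.path.join(dirname, filename))
    except BaseException:
        # Leave no stray temporary file behind on failure
        if os.path.exists(tempname):
            os.remove(tempname)
        raise


def write_text(text, path):
    """Hypothetical save callable: store a string in a file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as workdir:
    atomic_save("hello", workdir, "greeting.txt", method=write_text)
    with open(os.path.join(workdir, "greeting.txt"), encoding="utf-8") as handle:
        saved = handle.read()
    leftovers = os.listdir(workdir)
```

Creating the temporary file in the target directory keeps both paths on one filesystem, so a reader sees either the old file or the complete new one, never a partial write.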

annif.util.boolean(val: Any) bool

Convert the given value to a boolean True/False value, if it isn’t already. True values are ‘1’, ‘yes’, ‘true’, and ‘on’ (case insensitive), everything else is False.
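The conversion rule above is small enough to sketch directly. Passing non-string values through `str()` is an assumption made so that the sketch accepts any input.

```python
def boolean(val):
    """Return val if it is already a bool, else test it against the true strings."""
    if isinstance(val, bool):
        return val
    # Case-insensitive match; anything not listed counts as False
    return str(val).lower() in ("1", "yes", "true", "on")


checks = [boolean("Yes"), boolean("ON"), boolean("off"), boolean(True), boolean("2")]
```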

annif.util.cleanup_uri(uri: str) str

remove angle brackets from a URI, if any

annif.util.identity(x: Any) Any

Identity function: return the given argument unchanged

annif.util.metric_code(metric)

Convert a human-readable metric name into an alphanumeric string

annif.util.parse_args(param_string: str) tuple[list, dict]

Parse a string of comma-separated arguments such as ‘42,43,key=abc’ into a list of positional args [42, 43] and a dict of keyword args {key: abc}
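A sketch of the parsing described above. It keeps every value as a string and returns empty containers for an empty input; both choices are assumptions, since the text does not say whether any type conversion takes place.

```python
def parse_args(param_string):
    """Split 'a,b,key=value' into positional and keyword arguments."""
    posargs, kwargs = [], {}
    if not param_string:
        return posargs, kwargs
    for item in param_string.split(","):
        item = item.strip()
        if "=" in item:
            # Split on the first '=' only, so values may contain '='
            key, value = item.split("=", 1)
            kwargs[key.strip()] = value.strip()
        else:
            posargs.append(item)
    return posargs, kwargs


posargs, kwargs = parse_args("42,43,key=abc")
```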

annif.util.parse_sources(sourcedef: str) list[tuple[str, float]]

parse a source definition such as ‘src1:1.0,src2’ into a sequence of tuples (src_id, weight)
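A sketch of how such a source definition might be split. The default weight of 1.0 for sources without an explicit weight is an assumption, and any further processing of the weights, such as normalization, is not shown.

```python
def parse_sources(sourcedef):
    """Split 'src1:1.0,src2' into (src_id, weight) tuples."""
    sources = []
    for item in sourcedef.split(","):
        src_id, _, weight = item.strip().partition(":")
        # Assumed default: a source without an explicit weight gets 1.0
        sources.append((src_id, float(weight) if weight else 1.0))
    return sources


sources = parse_sources("src1:1.0,src2")
```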

annif.views module

annif.views.home()

annif.vocab module

Vocabulary management functionality for Annif

class annif.vocab.AnnifVocabulary(vocab_id: str, datadir: str)

Bases: DatadirMixin

Class representing a subject vocabulary which can be used by multiple Annif projects.

INDEX_FILENAME_CSV = 'subjects.csv'
INDEX_FILENAME_DUMP = 'subjects.dump.gz'
INDEX_FILENAME_TTL = 'subjects.ttl'
as_graph() Graph

return the vocabulary as an rdflib graph

property languages: list[str]
load_vocabulary(subject_corpus: SubjectCorpus, force: bool = False) None

Load subjects from a subject corpus and save them into one or more subject index files as well as a SKOS/Turtle file for later use. If force=True, replace the existing subject index completely.

property skos: SubjectFileSKOS

return the subject vocabulary from SKOS file

property subjects: SubjectIndex

Module contents

annif.create_app(config_name: str | None = None) FlaskApp

Create a Connexion app to be used for the API.

annif.create_cx_app(config_name: str | None = None) FlaskApp

Create a Connexion app to be used for the API.

annif.create_flask_app(config_name: str | None = None) Flask

Create a Flask app to be used by the CLI.