annif package
Subpackages
- annif.analyzer package
- annif.backend package
- Submodules
- annif.backend.backend module
- annif.backend.dummy module
- annif.backend.ensemble module
- annif.backend.fasttext module
- annif.backend.http module
- annif.backend.hyperopt module
- annif.backend.mixins module
- annif.backend.mllm module
- annif.backend.nn_ensemble module
- annif.backend.omikuji module
- annif.backend.pav module
- annif.backend.stwfsa module
- annif.backend.svc module
- annif.backend.tfidf module
- annif.backend.yake module
- Module contents
- annif.corpus package
- Submodules
- annif.corpus.combine module
- annif.corpus.document module
- annif.corpus.skos module
- annif.corpus.subject module
- annif.corpus.types module
- Module contents
- annif.lexical package
- annif.openapi package
- annif.transform package
Submodules
annif.cli module
Definitions for command-line (Click) commands for invoking Annif operations and printing the results to console.
annif.cli_util module
Utility functions for Annif CLI commands
- annif.cli_util.backend_param_option(f)
Decorator to add an option for CLI commands to override backend parameters
- annif.cli_util.common_options(f)
Decorator to add common options for all CLI commands
- annif.cli_util.complete_param(ctx: Context, param: Argument, incomplete: str) list[str]
- annif.cli_util.docs_limit_option(f)
Decorator to add an option for CLI commands to limit the number of documents to use
- annif.cli_util.format_datetime(dt: datetime | None) str
Helper function to format a datetime object as a string in the local time.
- annif.cli_util.generate_filter_params(filter_batch_max_limit: int) list[tuple[int, float]]
- annif.cli_util.get_project(project_id: str) AnnifProject
Helper function to get a project by ID and bail out if it doesn’t exist
- annif.cli_util.get_vocab(vocab_id: str) AnnifVocabulary
Helper function to get a vocabulary by ID and bail out if it doesn’t exist
- annif.cli_util.make_list_template(*rows) str
Helper function to create a template for a list of entries with fields of variable width. The width of each field is determined by the longest item in the field in the given rows.
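A minimal sketch of how such a template could be built, assuming the template is a str.format string with left-aligned fields separated by two spaces. The name list_template_sketch, the separator, and the sample rows are illustrative assumptions, not necessarily what annif.cli_util does.

```python
def list_template_sketch(*rows: tuple[str, ...]) -> str:
    """Build a str.format template whose field widths fit the longest items."""
    widths: dict[int, int] = {}
    for row in rows:
        for index, field in enumerate(row):
            widths[index] = max(widths.get(index, 0), len(field))
    # Left-align every field and pad it to the widest item in its column.
    return "  ".join(f"{{{index}: <{width}}}" for index, width in widths.items())


header = ("Project ID", "Language")  # hypothetical sample rows
row = ("tfidf-en", "en")
template = list_template_sketch(header, row)
print(template.format(*header))
print(template.format(*row))
```

Every row formatted with the same template lines up in columns, because each width comes from the longest item seen in that field.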
- annif.cli_util.open_documents(paths: tuple[str, ...], subject_index: SubjectIndex, vocab_lang: str, docs_limit: int | None) DocumentCorpus
Helper function to open a document corpus from a list of pathnames, each of which is either a TSV file or a directory of TXT files. For directories with subjects in TSV files, the given vocabulary language will be used to convert subject labels into URIs. The corpus will be returned as an instance of DocumentCorpus or LimitingDocumentCorpus.
- annif.cli_util.open_text_documents(paths: tuple[str, ...], docs_limit: int | None) DocumentList
Helper function to read text documents from the given file paths. Returns a DocumentList object with Documents having no subjects. If a path is “-”, the document text is read from standard input. The maximum number of documents to read is set by docs_limit parameter.
- annif.cli_util.parse_backend_params(backend_param: tuple[str, ...] | tuple[()], project: AnnifProject) collections.defaultdict[str, dict[str, str]]
Parse a list of backend parameters given with the --backend-param option into a nested dict structure
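The nested structure can be illustrated with a short sketch. It assumes each parameter string has the form backend.key=value, which this page does not spell out, and it leaves out the project argument and any validation the real function may perform. The function name and the sample parameters are hypothetical.

```python
from collections import defaultdict


def parse_backend_params_sketch(backend_param):
    """Group 'backend.key=value' strings by backend ID (assumed input format)."""
    params = defaultdict(dict)
    for item in backend_param:
        backend_id, assignment = item.split(".", 1)
        key, value = assignment.split("=", 1)
        params[backend_id][key] = value  # values are kept as strings
    return params


# Hypothetical backend IDs and parameter names, for illustration only.
parsed = parse_backend_params_sketch(("tfidf.limit=10", "fasttext.dim=100"))
print(dict(parsed))
```

The outer key is the backend ID and the inner dict maps parameter names to string values, matching the defaultdict[str, dict[str, str]] return type in the signature.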
- annif.cli_util.project_id(f)
Decorator to add a project ID parameter to a CLI command
- annif.cli_util.show_hits(hits: SuggestionResult, project: AnnifProject, lang: str, file: io.TextIOWrapper | None = None) None
Print subject suggestions to the console or a file. The suggestions are displayed as a table, with one row per hit. Each row contains the URI, label, possible notation, and score of the suggestion. The label is given in the specified language.
annif.config module
Configuration file handling
- class annif.config.AnnifConfigCFG(filename: str | None = None, projstr: str | None = None)
Bases:
object
Class for reading configuration in CFG/INI format
- property project_ids: list[str]
- class annif.config.AnnifConfigDirectory(directory: str)
Bases:
object
Class for reading configuration from directory
- property project_ids
- class annif.config.AnnifConfigTOML(filename: str)
Bases:
object
Class for reading configuration in TOML format
- property project_ids
- annif.config.check_config(projects_config_path: str) str | None
- annif.config.find_config() str | None
- annif.config.parse_config(projects_config_path: str) AnnifConfigDirectory | AnnifConfigCFG | AnnifConfigTOML | None
annif.datadir module
Mixin class for types that need a data directory
annif.default_config module
A configuration module, where “Config” is a default configuration and the other classes are different configuration profiles overriding default settings.
- class annif.default_config.Config
Bases:
object
- DATADIR = 'data'
- DEBUG = False
- INITIALIZE_PROJECTS = False
- PROJECTS_CONFIG_PATH = ''
- TESTING = False
- class annif.default_config.TestingConfig
Bases:
Config
- DATADIR = 'tests/data'
- PROJECTS_CONFIG_PATH = 'tests/projects.cfg'
- TESTING = True
- class annif.default_config.TestingDirectoryConfig
Bases:
TestingConfig
- PROJECTS_CONFIG_PATH = 'tests/projects.d'
- class annif.default_config.TestingInitializeConfig
Bases:
TestingConfig
- INITIALIZE_PROJECTS = True
- class annif.default_config.TestingInvalidProjectsConfig
Bases:
TestingConfig
- PROJECTS_CONFIG_PATH = 'tests/projects_invalid.cfg'
- class annif.default_config.TestingNoProjectsConfig
Bases:
TestingConfig
- PROJECTS_CONFIG_PATH = 'tests/notfound.cfg'
- class annif.default_config.TestingTOMLConfig
Bases:
TestingConfig
- PROJECTS_CONFIG_PATH = 'tests/projects.toml'
annif.eval module
Evaluation metrics for Annif
- class annif.eval.EvaluationBatch(subject_index: SubjectIndex)
Bases:
object
A class for evaluating batches of results using all available metrics. Call the evaluate() method once per document in the batch, or evaluate_many() for a list of documents in the batch. Final results can be queried using the results() method.
- evaluate_many(suggestion_batch: list[list[SubjectSuggestion]] | SuggestionBatch | list[Iterator], gold_subject_batch: Sequence[SubjectSet]) None
- output_result_per_subject(y_true: csr_array, y_pred: csr_array, results_file: TextIOWrapper | LazyFile, language: str) None
Write results per subject (non-aggregated) to the output file results_file, using labels in the given language
- results(metrics: Iterable[str] = [], results_file: LazyFile | TextIOWrapper | None = None, language: str | None = None) dict[str, float]
Evaluate a set of selected subjects against a gold standard using different metrics. If metrics is empty, use all available metrics. If results_file (a file object) is given, write results per subject to it with labels expressed in the given language.
- annif.eval.dcg_score(y_true: csr_array, y_pred: csr_array, limit: int | None = None) np.float64
return the discounted cumulative gain (DCG) score for the selected labels vs. relevant labels
- annif.eval.false_negatives(y_true: csr_array, y_pred: csr_array) int
calculate the number of false negatives using bitwise operations, emulating the way sklearn evaluation metric functions work
- annif.eval.false_positives(y_true: csr_array, y_pred: csr_array) int
calculate the number of false positives using bitwise operations, emulating the way sklearn evaluation metric functions work
- annif.eval.ndcg_score(y_true: csr_array, y_pred: csr_array, limit: int | None = None) float
return the normalized discounted cumulative gain (nDCG) score for the selected labels vs. relevant labels
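As a worked illustration of the two scores, the sketch below uses the textbook definitions on plain Python lists rather than csr_array rows. It assumes binary relevance, ranks only labels with a positive score, and discounts the label at 0-based rank r by log2(r + 2). The function names are hypothetical and the details may differ from annif.eval.

```python
import math


def dcg_sketch(y_true, y_pred, limit=None):
    """DCG of the labels with a positive score, ranked by descending score."""
    ranked = sorted(
        (i for i, score in enumerate(y_pred) if score > 0),
        key=lambda i: y_pred[i],
        reverse=True,
    )
    if limit is not None:
        ranked = ranked[:limit]
    # The label at 0-based rank r is discounted by log2(r + 2).
    return sum(y_true[i] / math.log2(rank + 2) for rank, i in enumerate(ranked))


def ndcg_sketch(y_true, y_pred, limit=None):
    """DCG divided by the DCG of a perfect ranking of the relevant labels."""
    ideal = dcg_sketch(y_true, y_true, limit)
    return dcg_sketch(y_true, y_pred, limit) / ideal if ideal > 0 else 0.0


gold = [1, 0, 1, 0]            # relevant labels
scores = [0.9, 0.8, 0.1, 0.0]  # suggestion scores
print(dcg_sketch(gold, scores), ndcg_sketch(gold, scores))
```

Normalizing by the ideal ranking keeps the score between 0 and 1, so documents with different numbers of relevant labels can be compared.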
- annif.eval.true_positives(y_true: csr_array, y_pred: csr_array) int
calculate the number of true positives using bitwise operations, emulating the way sklearn evaluation metric functions work
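The bitwise counting described for true_positives, false_positives and false_negatives can be sketched on plain lists of 0/1 labels. This is an illustration of the idea only; the real functions take csr_array arguments, and the function name here is hypothetical.

```python
def confusion_counts_sketch(y_true, y_pred):
    """Return (true positives, false positives, false negatives) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t & p for t, p in pairs)        # relevant and suggested
    fp = sum((t ^ 1) & p for t, p in pairs)  # suggested but not relevant
    fn = sum(t & (p ^ 1) for t, p in pairs)  # relevant but not suggested
    return tp, fp, fn


print(confusion_counts_sketch([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))
```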
annif.exception module
Custom exceptions used by Annif
- exception annif.exception.AnnifException(message: str, project_id: str | None = None, backend_id: str | None = None)
Bases:
ClickException
Base Annif exception. We define this as a subclass of ClickException so that the CLI can automatically handle exceptions. This exception cannot be instantiated directly - subclasses should be used instead.
- format_message() str
- prefix = None
- exception annif.exception.ConfigurationException(message: str, project_id: str | None = None, backend_id: str | None = None)
Bases:
AnnifException
Exception raised when a project or backend is misconfigured.
- prefix = 'Misconfigured'
- exception annif.exception.NotInitializedException(message: str, project_id: str | None = None, backend_id: str | None = None)
Bases:
AnnifException
Exception raised when attempting to use a project or backend that cannot be initialized, most likely because it lacks a vocabulary or has not been trained yet.
- prefix = "Couldn't initialize"
- exception annif.exception.NotSupportedException(message: str, project_id: str | None = None, backend_id: str | None = None)
Bases:
AnnifException
Exception raised when an operation is not supported by a project or backend.
- prefix = 'Not supported'
- exception annif.exception.OperationFailedException(message: str, project_id: str | None = None, backend_id: str | None = None)
Bases:
AnnifException
Exception raised when an operation fails for some unknown reason.
- prefix = 'Operation failed'
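The prefix attributes above suggest how subclasses might customize the text returned by format_message(). The sketch below shows one way this could work. It subclasses the built-in Exception instead of ClickException to stay self-contained, and the message layout is an assumption, since this page does not document it. All class names are hypothetical.

```python
class BaseErrorSketch(Exception):
    """Stand-in for a base exception whose subclasses supply a prefix."""

    prefix = None

    def __init__(self, message, project_id=None, backend_id=None):
        super().__init__(message)
        self.message = message
        self.project_id = project_id
        self.backend_id = backend_id

    def format_message(self):
        # Assumed layout: "<prefix> <project|backend> '<id>': <message>".
        if self.project_id is not None:
            return f"{self.prefix} project '{self.project_id}': {self.message}"
        if self.backend_id is not None:
            return f"{self.prefix} backend '{self.backend_id}': {self.message}"
        return f"{self.prefix}: {self.message}"


class NotSupportedSketch(BaseErrorSketch):
    prefix = "Not supported"


error = NotSupportedSketch("training is not available", backend_id="dummy")
print(error.format_message())
```

Keeping the prefix as a class attribute means each subclass only has to declare one string, while the formatting logic lives in the base class.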
annif.hfh_util module
Utility functions for interactions with Hugging Face Hub.
- annif.hfh_util.check_is_download_allowed(trust_repo, repo_id)
Check if downloading from the specified repository is allowed based on the trust option and cache status.
- annif.hfh_util.copy_project_config(src_path: str, force: bool) None
Copy a given project configuration file to projects.d/ directory.
- annif.hfh_util.download_from_hf_hub(filename: str, repo_id: str, token: str, revision: str) list[str]
- annif.hfh_util.get_matching_project_ids_from_hf_hub(project_ids_pattern: str, repo_id: str, token, revision: str) list[str]
Get project IDs of the projects in a Hugging Face Model Hub repository that match the given pattern.
- annif.hfh_util.get_matching_projects(pattern: str) list[AnnifProject]
Get projects that match the given pattern.
- annif.hfh_util.get_vocab_id_from_config(config_path: str) str
Get the vocabulary ID from a configuration file.
- annif.hfh_util.prepare_commits(projects: list[AnnifProject], repo_id: str, token: str) tuple[list, list]
Prepare and pre-upload data and config commit operations for projects to a Hugging Face Hub repository.
- annif.hfh_util.unzip_archive(src_path: str, force: bool) None
Unzip a zip archive of projects and vocabularies to a directory, by default data/ under current directory.
- annif.hfh_util.upsert_modelcard(repo_id, projects, token, revision)
Create or update a Model Card with some metadata in a Hugging Face Hub repository.
annif.parallel module
Parallel processing functionality for Annif
- class annif.parallel.BaseWorker
Bases:
object
Base class for workers that implement tasks executed via multiprocessing. The init method can be used to store data objects that are necessary for the operation. They will be stored in a class attribute that is accessible to the static worker method. The storage solution is inspired by this blog post: https://thelaziestprogrammer.com/python/multiprocessing-pool-a-global-solution
- args = None
- classmethod init(args) None
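The class-attribute storage pattern described for BaseWorker can be sketched as follows. The example uses multiprocessing.dummy.Pool, a thread-backed pool with the same interface, to stay lightweight. The subclass OffsetWorker, its run method, and the "offset" key are hypothetical names chosen for illustration.

```python
from multiprocessing.dummy import Pool  # thread-backed pool with the Pool API


class BaseWorkerSketch:
    """Keeps shared data in a class attribute for use by static worker methods."""

    args = None

    @classmethod
    def init(cls, args):
        cls.args = args


class OffsetWorker(BaseWorkerSketch):  # hypothetical subclass
    @staticmethod
    def run(value):
        # The static method reads the data stored by init().
        return value * value + OffsetWorker.args["offset"]


# The pool calls init() in every worker before any task is run.
with Pool(2, initializer=OffsetWorker.init, initargs=({"offset": 10},)) as pool:
    results = pool.map(OffsetWorker.run, [1, 2, 3])
print(results)
```

Passing the data once through the pool initializer avoids sending it along with every task, which matters when the shared objects are large.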
- class annif.parallel.ProjectSuggestMap(registry: AnnifRegistry, project_ids: list[str], backend_params: defaultdict[str, Any] | None, limit: int | None, threshold: float)
Bases:
object
A utility class that can be used to wrap one or more projects and provide a mapping method that converts Document objects to suggestions. Intended to be used with the multiprocessing module.
- suggest(doc: Document) tuple[dict[str, SuggestionResult], SubjectSet]
- suggest_batch(batch) tuple[dict[str, SuggestionBatch], Iterator[SubjectSet]]
- annif.parallel.get_pool(n_jobs: int) tuple[int | None, Callable]
Return the correct jobs argument and a suitable constructor for a multiprocessing pool class, in that order, for the given number of parallel jobs
annif.project module
Project management functionality for Annif
- class annif.project.Access(value)
Bases:
IntEnum
Enumeration of access levels for projects
- private = 1
- public = 3
- class annif.project.AnnifProject(project_id: str, config: dict[str, str] | SectionProxy, datadir: str, registry: AnnifRegistry)
Bases:
DatadirMixin
Class representing the configuration of a single Annif project.
- DEFAULT_ACCESS = 'public'
- property backend: AnnifBackend | None
- dump() dict[str, str | dict | bool | datetime | None]
return this project as a dict
- hyperopt(corpus: DocumentCorpus, trials: int, jobs: int, metric: str, results_file: LazyFile | None) HPRecommendation
optimize the hyperparameters of the project using a validation corpus against a given metric
- initialize(parallel: bool = False) None
Initialize this project and its backend so that they are ready to be used. If parallel is True, expect that the project will be used for parallel processing.
- initialized = False
- property is_trained: bool | None
- learn(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None) None
further train the project using documents from a metadata source
- property modification_time: datetime | None
- remove_model_data() None
remove the data of this project
- property subjects: SubjectIndex
- suggest(texts: list[str], backend_params: defaultdict[str, dict] | None = None) annif.suggestion.SuggestionBatch
Suggest subjects for the given documents batch.
- suggest_corpus(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None) annif.suggestion.SuggestionResults
Suggest subjects for the given documents corpus in batches of documents.
- train(corpus: DocumentCorpus, backend_params: defaultdict[str, dict] | None = None, jobs: int = 0) None
train the project using documents from a metadata source
- property transform: TransformChain
- property vocab: AnnifVocabulary
- property vocab_lang: str
annif.registry module
Registry that keeps track of Annif projects
- class annif.registry.AnnifRegistry(projects_config_path: str, datadir: str, init_projects: bool)
Bases:
object
Class that keeps track of the Annif projects and vocabularies
- get_project(project_id: str, min_access: Access = Access.private) AnnifProject
return the definition of a single Project by project_id
- get_projects(min_access: Access = Access.private) dict[str, AnnifProject]
Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.
- get_vocab(vocab_spec: str, default_language: str | None) tuple[AnnifVocabulary, None] | tuple[AnnifVocabulary, str]
Return an (AnnifVocabulary, language) pair corresponding to the vocab_spec. If no language information is specified, use the given default language.
- annif.registry.get_project(project_id: str, min_access: Access = Access.private) AnnifProject
return the definition of a single Project by project_id
- annif.registry.get_projects(min_access: Access = Access.private) dict[str, AnnifProject]
Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.
- annif.registry.get_vocab(vocab_id: str, min_access: Access = Access.private) AnnifVocabulary
return a single AnnifVocabulary by vocabulary id
- annif.registry.get_vocabs(min_access: Access = Access.private) dict[str, AnnifVocabulary]
Return the available vocabularies as a dict of vocab_id -> AnnifVocabulary. The min_access parameter may be used to set the minimum access level required for the returned vocabularies.
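The min_access filtering described above relies on access levels being comparable as integers. A minimal sketch, assuming that an entry is returned when its access level is at least min_access; AccessSketch mirrors only the two levels listed for annif.project.Access, and the project IDs are made up.

```python
from enum import IntEnum


class AccessSketch(IntEnum):
    """Mirrors the two levels listed for annif.project.Access."""

    private = 1
    public = 3


def visible_projects(projects, min_access):
    # IntEnum members compare as integers, so >= expresses "at least this level".
    return [pid for pid, access in projects.items() if access >= min_access]


# Hypothetical project IDs, for illustration only.
projects = {"internal-test": AccessSketch.private, "demo-en": AccessSketch.public}
print(visible_projects(projects, AccessSketch.public))
```

With the default min_access of Access.private, every entry passes the check, which matches the defaults shown in the signatures.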
- annif.registry.initialize_projects(app: Flask) None
annif.rest module
Definitions for REST API operations. These are wired via Connexion to methods defined in the OpenAPI specification.
- annif.rest.detect_language(body: dict[str, Any])
return scores for detected languages formatted according to Swagger spec
- annif.rest.language_not_supported_error(lang: str) ConnexionResponse
return a Connexion error object when attempting to use unsupported language
- annif.rest.learn(project_id: str, body: list[dict[str, Any]]) ConnexionResponse | tuple[None, int]
learn from documents and return an empty 204 response if successful
- annif.rest.list_projects() tuple
return a dict with projects formatted according to OpenAPI spec
- annif.rest.project_not_found_error(project_id: str) ConnexionResponse
return a Connexion error object when a project is not found
- annif.rest.server_error(err: AnnifException) ConnexionResponse
return a Connexion error object when there is a server error (project or backend problem)
- annif.rest.show_info() tuple
return version of annif and a title for the api according to OpenAPI spec
- annif.rest.show_project(project_id: str) dict | ConnexionResponse
return a single project formatted according to OpenAPI spec
- annif.rest.suggest(project_id: str, body: dict[str, Any]) dict[str, list] | ConnexionResponse
suggest subjects for the given text and return a dict with results formatted according to OpenAPI spec
- annif.rest.suggest_batch(project_id: str, body: dict[str, list], **query_parameters) list[dict[str, Any]] | ConnexionResponse
suggest subjects for the given documents and return a list of dicts with results formatted according to OpenAPI spec
annif.simplemma_util module
Wrapper code for using Simplemma functionality in Annif
- annif.simplemma_util.detect_language(text: str, languages: Tuple[str, ...]) Dict[str, float]
- annif.simplemma_util.get_language_detector(lang: str | Tuple[str, ...]) LanguageDetector
annif.suggestion module
Representing suggested subjects.
- class annif.suggestion.SubjectSuggestion(subject_id, score)
Bases:
tuple
- score
Alias for field number 1
- subject_id
Alias for field number 0
- class annif.suggestion.SuggestionBatch(array: csr_array)
Bases:
object
Subject suggestions for a batch of documents.
- filter(limit: int | None = None, threshold: float = 0.0) SuggestionBatch
Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionBatch object.
- classmethod from_averaged(batches: list[SuggestionBatch], weights: list[float]) SuggestionBatch
Create a new SuggestionBatch where the subject scores are the weighted average of scores in several SuggestionBatches
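The weighted averaging can be shown on small dense lists. This sketch assumes every batch has the same shape (documents by subjects) and that the weighted sum is divided by the sum of the weights. It does not use SuggestionBatch or csr_array, and the function name is hypothetical.

```python
def weighted_average_sketch(batches, weights):
    """Average per-document score rows from several batches using weights."""
    total = sum(weights)
    n_docs, n_subjects = len(batches[0]), len(batches[0][0])
    return [
        [
            sum(w * batch[d][s] for batch, w in zip(batches, weights)) / total
            for s in range(n_subjects)
        ]
        for d in range(n_docs)
    ]


batch_a = [[0.2, 0.8]]  # one document, two subjects
batch_b = [[0.6, 0.0]]
averaged = weighted_average_sketch([batch_a, batch_b], [1.0, 3.0])
print(averaged)
```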
- classmethod from_sequence(suggestion_results: Sequence[Iterable[SubjectSuggestion]], subject_index: SubjectIndex, limit: int | None = None) SuggestionBatch
Create a new SuggestionBatch from a sequence where each item is a sequence of SubjectSuggestion objects.
- class annif.suggestion.SuggestionResult(array: csr_array, idx: int)
Bases:
object
Suggestions for a single document, backed by a row of a sparse array.
- as_vector() ndarray
- class annif.suggestion.SuggestionResults(batches: Iterable[SuggestionBatch])
Bases:
object
Subject suggestions for a potentially very large number of documents.
- filter(limit: int | None = None, threshold: float = 0.0) SuggestionResults
Return a view of these suggestions, filtered by the given limit and/or threshold, as another SuggestionResults object.
- annif.suggestion.filter_suggestion(preds: csr_array, limit: int | None = None, threshold: float = 0.0) csr_array
filter a 2D sparse suggestion array (csr_array), retaining for each individual prediction only the top K suggestions (K being the limit) whose score is greater than or equal to the threshold; the rest will be left as zeros
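The effect of combining a limit with a threshold can be sketched on dense Python lists, one row per prediction. This is an illustration only: the real function operates on a csr_array, and the name filter_rows_sketch is hypothetical.

```python
def filter_rows_sketch(rows, limit=None, threshold=0.0):
    """Keep the top `limit` scores per row that reach the threshold; zero the rest."""
    filtered = []
    for row in rows:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if limit is not None:
            ranked = ranked[:limit]
        keep = {i for i in ranked if row[i] >= threshold}
        filtered.append([score if i in keep else 0.0 for i, score in enumerate(row)])
    return filtered


rows = [[0.1, 0.9, 0.5, 0.0], [0.3, 0.2, 0.0, 0.8]]
print(filter_rows_sketch(rows, limit=2, threshold=0.4))
```

In the second row only one score survives, because the second-highest score is within the limit but below the threshold.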
- annif.suggestion.vector_to_suggestions(vector: np.ndarray, limit: int) Iterator
annif.util module
Utility functions for Annif
- class annif.util.DuplicateFilter
Bases:
Filter
Filter out log messages that have already been displayed.
- filter(record: LogRecord) bool
Determine if the specified record is to be logged.
Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.
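A duplicate-suppressing filter of this kind can be sketched with the standard logging module. The sketch assumes that messages are compared by their formatted text; the class name, the logger name, and the message are hypothetical.

```python
import logging


class DuplicateFilterSketch(logging.Filter):
    """Let each distinct message through once and drop later repeats."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def filter(self, record):
        message = record.getMessage()
        if message in self.seen:
            return False
        self.seen.add(message)
        return True


logger = logging.getLogger("duplicate-filter-demo")  # hypothetical logger name
logger.addFilter(DuplicateFilterSketch())
logger.warning("vocabulary not loaded")
logger.warning("vocabulary not loaded")  # suppressed by the filter
```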
- annif.util.atomic_save(obj: Any, dirname: str, filename: str, method: Callable | None = None) None
Save the given object (which must have a .save() method, unless the method parameter is given) into the given directory with the given filename, using a temporary file and renaming the temporary file to the final name.
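The temporary-file-and-rename technique can be sketched with the standard library alone. This simplified version writes a text string instead of calling an object's .save() method, and the function name is hypothetical.

```python
import os
import tempfile


def atomic_write_sketch(text, dirname, filename):
    """Write text to dirname/filename via a temporary file in the same directory."""
    fd, tmp_path = tempfile.mkstemp(prefix=filename + ".", dir=dirname)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            tmp_file.write(text)
        # Renaming within one directory swaps the final name to the new file.
        os.replace(tmp_path, os.path.join(dirname, filename))
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the final name only ever points at a completely written file, a reader never sees a half-written result, even if the process is interrupted during the write.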
- annif.util.boolean(val: Any) bool
Convert the given value to a boolean True/False value, if it isn’t already. True values are ‘1’, ‘yes’, ‘true’, and ‘on’ (case insensitive), everything else is False.
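The conversion rule can be sketched in a few lines. The function name is hypothetical, and treating non-string values via str() is an assumption of this sketch.

```python
def boolean_sketch(val):
    """Return val if it is already a bool, else compare its text to the true values."""
    if isinstance(val, bool):
        return val
    return str(val).lower() in ("1", "yes", "true", "on")


print(boolean_sketch("Yes"), boolean_sketch("off"))
```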
- annif.util.cleanup_uri(uri: str) str
remove angle brackets from a URI, if any
- annif.util.identity(x: Any) Any
Identity function: return the given argument unchanged
- annif.util.metric_code(metric)
Convert a human-readable metric name into an alphanumeric string
- annif.util.parse_args(param_string: str) tuple[list, dict]
Parse a string of comma separated arguments such as ‘42,43,key=abc’ into a list of positional args [42, 43] and a dict of keyword args {key: abc}
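A minimal sketch of this kind of parsing, assuming all values are kept as strings (the description does not say whether positional arguments such as 42 are converted to numbers). The function name is hypothetical.

```python
def parse_args_sketch(param_string):
    """Split 'a,b,key=value' into positional args and keyword args (as strings)."""
    posargs, kwargs = [], {}
    if not param_string:
        return posargs, kwargs
    for part in param_string.split(","):
        key, sep, value = part.partition("=")
        if sep:
            kwargs[key] = value
        else:
            posargs.append(part)
    return posargs, kwargs


print(parse_args_sketch("42,43,key=abc"))
```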
- annif.util.parse_sources(sourcedef: str) list[tuple[str, float]]
parse a source definition such as ‘src1:1.0,src2’ into a sequence of tuples (src_id, weight)
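A sketch of parsing such a source definition. It assumes that a source without an explicit weight gets 1.0 and returns the weights as given; whether the real function rescales or normalizes the weights is not stated on this page. The function name is hypothetical.

```python
def parse_sources_sketch(sourcedef):
    """Turn 'src1:1.0,src2' into a list of (src_id, weight) tuples."""
    sources = []
    for item in sourcedef.strip().split(","):
        src_id, sep, weight = item.partition(":")
        # Assumption: a missing weight defaults to 1.0.
        sources.append((src_id, float(weight) if sep else 1.0))
    return sources


print(parse_sources_sketch("src1:1.0,src2"))
```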
annif.views module
- annif.views.home()
annif.vocab module
Vocabulary management functionality for Annif
- class annif.vocab.AnnifVocabulary(vocab_id: str, datadir: str)
Bases:
DatadirMixin
Class representing a subject vocabulary which can be used by multiple Annif projects.
- INDEX_FILENAME_CSV = 'subjects.csv'
- INDEX_FILENAME_DUMP = 'subjects.dump.gz'
- INDEX_FILENAME_TTL = 'subjects.ttl'
- as_graph() Graph
return the vocabulary as an rdflib graph
- property languages: list[str]
- load_vocabulary(subject_corpus: SubjectCorpus, force: bool = False) None
Load subjects from a subject corpus and save them into one or more subject index files as well as a SKOS/Turtle file for later use. If force=True, replace the existing subject index completely.
- property skos: SubjectFileSKOS
return the subject vocabulary from SKOS file
- property subjects: SubjectIndex
Module contents
- annif.create_app(config_name: str | None = None) FlaskApp
Create a Connexion app to be used for the API.
- annif.create_cx_app(config_name: str | None = None) FlaskApp
Create a Connexion app to be used for the API.
- annif.create_flask_app(config_name: str | None = None) Flask
Create a Flask app to be used by the CLI.