annif package

Submodules

annif.cli module

Definitions for command-line (Click) commands for invoking Annif operations and printing the results to console.

annif.cli_util module

Utility functions for Annif CLI commands

annif.cli_util.backend_param_option(f)

Decorator to add an option for CLI commands to override backend parameters

annif.cli_util.common_options(f)

Decorator to add common options for all CLI commands

annif.cli_util.docs_limit_option(f)

Decorator to add an option for CLI commands to limit the number of documents to use

annif.cli_util.generate_filter_params(filter_batch_max_limit)
annif.cli_util.get_project(project_id)

Helper function to get a project by ID and bail out if it doesn’t exist

annif.cli_util.get_vocab(vocab_id)

Helper function to get a vocabulary by ID and bail out if it doesn’t exist

annif.cli_util.open_documents(paths, subject_index, vocab_lang, docs_limit)

Helper function to open a document corpus from a list of pathnames, each of which is either a TSV file or a directory of TXT files. For directories with subjects in TSV files, the given vocabulary language will be used to convert subject labels into URIs. The corpus will be returned as an instance of DocumentCorpus or LimitingDocumentCorpus.

annif.cli_util.open_text_documents(paths, docs_limit)

Helper function to read text documents from the given file paths. Returns a DocumentList object with Documents having no subjects. If a path is “-”, the document text is read from standard input. The maximum number of documents to read is set by the docs_limit parameter.
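The “-” convention and the document limit can be sketched as follows. This is a minimal illustration, not the library code: the name read_texts and its stdin parameter are hypothetical, and it returns a plain list of strings rather than a DocumentList.

```python
import io
import sys


def read_texts(paths, docs_limit=None, stdin=sys.stdin):
    """Read at most docs_limit texts; the path "-" means standard input.

    Hypothetical helper for illustration only.
    """
    texts = []
    for path in paths:
        if docs_limit is not None and len(texts) >= docs_limit:
            break  # stop once the limit has been reached
        if path == "-":
            texts.append(stdin.read())
        else:
            with open(path, encoding="utf-8") as handle:
                texts.append(handle.read())
    return texts


# Usage with an in-memory stream standing in for standard input
texts = read_texts(["-"], docs_limit=5, stdin=io.StringIO("some text"))
```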

annif.cli_util.parse_backend_params(backend_param, project)

Parse a list of backend parameters given with the --backend-param option into a nested dict structure
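A rough sketch of what such a nested structure could look like. It assumes each option value has the form backend.key=value, which is not stated above; the name nest_backend_params is hypothetical, and the validation against the project is left out.

```python
from collections import defaultdict


def nest_backend_params(backend_param):
    """Group 'backend.key=value' strings into {backend: {key: value}}.

    Hypothetical helper; the 'backend.key=value' format is an assumption.
    """
    nested = defaultdict(dict)
    for item in backend_param:
        backend, rest = item.split(".", 1)  # text before the first dot
        key, value = rest.split("=", 1)     # split on the first '=' only
        nested[backend][key] = value        # values are kept as strings
    return dict(nested)
```

For example, the list ["tfidf.limit=10", "fasttext.dim=100"] would become {"tfidf": {"limit": "10"}, "fasttext": {"dim": "100"}} under these assumptions.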

annif.cli_util.show_hits(hits, project, lang, file=None)

Print subject suggestions to the console or a file. The suggestions are displayed as a table, with one row per hit. Each row contains the URI, label, possible notation, and score of the suggestion. The label is given in the specified language.

annif.config module

Configuration file handling

class annif.config.AnnifConfigCFG(filename)

Bases: object

Class for reading configuration in CFG/INI format

property project_ids
class annif.config.AnnifConfigDirectory(directory)

Bases: object

Class for reading configuration from directory

property project_ids
class annif.config.AnnifConfigTOML(filename)

Bases: object

Class for reading configuration in TOML format

property project_ids
annif.config.check_config(projects_config_path)
annif.config.find_config()
annif.config.parse_config(projects_config_path)

annif.datadir module

Mixin class for types that need a data directory

class annif.datadir.DatadirMixin(datadir, typename, identifier)

Bases: object

Mixin class for types that need a data directory for storing files

property datadir

annif.default_config module

A configuration module, where “Config” is a default configuration and the other classes are different configuration profiles overriding default settings.

class annif.default_config.Config

Bases: object

DATADIR = 'data'
DEBUG = False
INITIALIZE_PROJECTS = False
PROJECTS_CONFIG_PATH = ''
TESTING = False
class annif.default_config.DevelopmentConfig

Bases: annif.default_config.Config

DEBUG = True
class annif.default_config.ProductionConfig

Bases: annif.default_config.Config

INITIALIZE_PROJECTS = True
class annif.default_config.TestingConfig

Bases: annif.default_config.Config

DATADIR = 'tests/data'
PROJECTS_CONFIG_PATH = 'tests/projects.cfg'
TESTING = True
class annif.default_config.TestingDirectoryConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.d'
class annif.default_config.TestingInitializeConfig

Bases: annif.default_config.TestingConfig

INITIALIZE_PROJECTS = True
class annif.default_config.TestingInvalidProjectsConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects_invalid.cfg'
class annif.default_config.TestingNoProjectsConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/notfound.cfg'
class annif.default_config.TestingTOMLConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.toml'
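
The profile mechanism is plain class inheritance: a profile only lists the settings it changes and inherits the rest. The sketch below reproduces three of the classes listed above with the values shown there, to illustrate how a lookup resolves.

```python
class Config:
    # Default settings
    DATADIR = "data"
    DEBUG = False
    INITIALIZE_PROJECTS = False
    PROJECTS_CONFIG_PATH = ""
    TESTING = False


class TestingConfig(Config):
    # Overrides three defaults, inherits DEBUG and INITIALIZE_PROJECTS
    DATADIR = "tests/data"
    PROJECTS_CONFIG_PATH = "tests/projects.cfg"
    TESTING = True


class TestingTOMLConfig(TestingConfig):
    # Overrides only the configuration path of TestingConfig
    PROJECTS_CONFIG_PATH = "tests/projects.toml"
```

With these definitions, TestingTOMLConfig.DATADIR resolves to 'tests/data' through TestingConfig, and TestingTOMLConfig.DEBUG resolves to False through Config.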

annif.eval module

Evaluation metrics for Annif

class annif.eval.EvaluationBatch(subject_index)

Bases: object

A class for evaluating batches of results using all available metrics. The evaluate() method is called once per document in the batch or evaluate_many() for a list of documents of the batch. Final results can be queried using the results() method.

evaluate_many(suggestion_batch, gold_subject_batch)
output_result_per_subject(y_true, y_pred, results_file, language)

Write results per subject (non-aggregated) to the output file results_file, using labels in the given language

results(metrics=[], results_file=None, language=None)

Evaluate a set of selected subjects against a gold standard using different metrics. If metrics is empty, use all available metrics. If results_file (a file object) is given, write results per subject to it with labels expressed in the given language.

annif.eval.dcg_score(y_true, y_pred, limit=None)

return the discounted cumulative gain (DCG) score for the selected labels vs. relevant labels

annif.eval.false_negatives(y_true, y_pred)

calculate the number of false negatives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.false_positives(y_true, y_pred)

calculate the number of false positives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.ndcg_score(y_true, y_pred, limit=None)

return the normalized discounted cumulative gain (nDCG) score for the selected labels vs. relevant labels
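
As an illustration of the two scores, here is a pure-Python sketch. It assumes binary relevance (a label is either relevant or not) and takes a set of relevant labels plus a list of selected labels in descending score order; the functions dcg and ndcg are hypothetical stand-ins and do not operate on the array types used by the module.

```python
import math


def dcg(relevant, ranked, limit=None):
    """Discounted cumulative gain with binary relevance (sketch)."""
    if limit is not None:
        ranked = ranked[:limit]
    # The label at rank r contributes 1 / log2(r + 1) if it is relevant
    return sum(
        (1.0 if label in relevant else 0.0) / math.log2(rank + 1)
        for rank, label in enumerate(ranked, start=1)
    )


def ndcg(relevant, ranked, limit=None):
    """DCG divided by the DCG of an ideal ranking (sketch)."""
    n_ideal = len(relevant) if limit is None else min(limit, len(relevant))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_ideal + 1))
    return dcg(relevant, ranked, limit) / ideal if ideal > 0 else 0.0
```

With relevant labels {"a", "c"} and the ranking ["a", "b", "c"], the DCG is 1/log2(2) + 0 + 1/log2(4) = 1.5, and the nDCG divides that by the ideal value 1 + 1/log2(3).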

annif.eval.true_positives(y_true, y_pred)

calculate the number of true positives using bitwise operations, emulating the way sklearn evaluation metric functions work
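
The bitwise approach can be illustrated on plain lists of booleans, one entry per subject. This is a sketch under that assumption; the count_* names are hypothetical and the module's own functions work on array types instead.

```python
def count_true_positives(y_true, y_pred):
    # Relevant AND selected
    return sum(t & p for t, p in zip(y_true, y_pred))


def count_false_positives(y_true, y_pred):
    # NOT relevant AND selected
    return sum((not t) & p for t, p in zip(y_true, y_pred))


def count_false_negatives(y_true, y_pred):
    # Relevant AND NOT selected
    return sum(t & (not p) for t, p in zip(y_true, y_pred))


y_true = [True, True, False, False]
y_pred = [True, False, True, False]
counts = (
    count_true_positives(y_true, y_pred),
    count_false_positives(y_true, y_pred),
    count_false_negatives(y_true, y_pred),
)
```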

annif.exception module

Custom exceptions used by Annif

exception annif.exception.AnnifException(message, project_id=None, backend_id=None)

Bases: click.exceptions.ClickException

Base Annif exception. We define this as a subclass of ClickException so that the CLI can automatically handle exceptions. This exception cannot be instantiated directly - subclasses should be used instead.

format_message()
prefix = None
exception annif.exception.ConfigurationException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when a project or backend is misconfigured.

prefix = 'Misconfigured'
exception annif.exception.NotInitializedException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when attempting to use a project or backend that cannot be initialized, most likely because it lacks a vocabulary or training.

prefix = "Couldn't initialize"
exception annif.exception.NotSupportedException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when an operation is not supported by a project or backend.

prefix = 'Not supported'
exception annif.exception.OperationFailedException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when an operation fails for some unknown reason.

prefix = 'Operation failed'
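
The prefix pattern can be sketched without Click as shown below. The class names are hypothetical, the base class is a plain Exception rather than ClickException, and both the exact message layout and the TypeError guard are assumptions made for illustration.

```python
class SketchAnnifError(Exception):
    """Base class; subclasses set a prefix used in the formatted message."""

    prefix = None

    def __init__(self, message, project_id=None, backend_id=None):
        if self.prefix is None:
            # One possible way to forbid direct instantiation (assumption)
            raise TypeError("use a subclass that defines a prefix")
        super().__init__(message)
        self.message = message
        self.project_id = project_id
        self.backend_id = backend_id

    def format_message(self):
        # The message layout below is an assumption, not the documented format
        if self.project_id is not None:
            return "{} project '{}': {}".format(
                self.prefix, self.project_id, self.message
            )
        if self.backend_id is not None:
            return "{} backend '{}': {}".format(
                self.prefix, self.backend_id, self.message
            )
        return "{}: {}".format(self.prefix, self.message)


class SketchConfigurationError(SketchAnnifError):
    prefix = "Misconfigured"
```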

annif.parallel module

Parallel processing functionality for Annif

class annif.parallel.BaseWorker

Bases: object

Base class for workers that implement tasks executed via multiprocessing. The init method can be used to store data objects that are necessary for the operation. They will be stored in a class attribute that is accessible to the static worker method. The storage solution is inspired by this blog post: https://thelaziestprogrammer.com/python/multiprocessing-pool-a-global-solution

args = None
classmethod init(args)
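
A minimal sketch of this storage pattern follows. BaseWorker mirrors the attribute and classmethod listed above; the SquareWorker subclass, its work method, and the run_in_pool helper are hypothetical additions for illustration.

```python
import multiprocessing


class BaseWorker:
    args = None  # shared data, set once per worker process

    @classmethod
    def init(cls, args):
        cls.args = args


class SquareWorker(BaseWorker):
    """Hypothetical worker: squares a number and adds a shared offset."""

    @staticmethod
    def work(x):
        (offset,) = SquareWorker.args  # read the data stored by init()
        return x * x + offset


def run_in_pool(values, offset, n_jobs=2):
    # init() runs once in every worker process before any task is mapped
    with multiprocessing.Pool(
        n_jobs, initializer=SquareWorker.init, initargs=((offset,),)
    ) as pool:
        return pool.map(SquareWorker.work, values)
```

Because init is a classmethod, calling it on the subclass sets the attribute on the subclass only, so different worker classes do not overwrite each other's data.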
class annif.parallel.ProjectSuggestMap(registry, project_ids, backend_params, limit, threshold)

Bases: object

A utility class that can be used to wrap one or more projects and provide a mapping method that converts Document objects to suggestions. Intended to be used with the multiprocessing module.

suggest(doc)
suggest_batch(batch)
annif.parallel.get_pool(n_jobs)

Return a suitable multiprocessing pool class, and the correct jobs argument for its constructor, for the given number of parallel jobs
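
One way such a selection could work is sketched below. The thresholds, the return order, and the name choose_pool are assumptions for illustration: a thread-based pool for a single job, a process pool otherwise, and None as the jobs argument to let the pool use all available CPUs.

```python
import multiprocessing
import multiprocessing.dummy


def choose_pool(n_jobs):
    """Return (jobs_argument, pool_factory) for n_jobs (sketch)."""
    if n_jobs < 1:
        # None makes multiprocessing.Pool use os.cpu_count() processes
        return None, multiprocessing.Pool
    if n_jobs == 1:
        # Thread-based pool with the same API, no child processes
        return 1, multiprocessing.dummy.Pool
    return n_jobs, multiprocessing.Pool
```

A caller would then write something like jobs, pool_factory = choose_pool(4) followed by with pool_factory(jobs) as pool: to get a pool of the chosen kind.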

annif.project module

Project management functionality for Annif

class annif.project.Access(value)

Bases: enum.IntEnum

Enumeration of access levels for projects

hidden = 2
private = 1
public = 3
class annif.project.AnnifProject(project_id, config, datadir, registry)

Bases: annif.datadir.DatadirMixin

Class representing the configuration of a single Annif project.

DEFAULT_ACCESS = 'public'
property analyzer
property backend
dump()

return this project as a dict

hyperopt(corpus, trials, jobs, metric, results_file)

optimize the hyperparameters of the project using a validation corpus against a given metric

initialize(parallel=False)

Initialize this project and its backend so that they are ready to be used. If parallel is True, expect that the project will be used for parallel processing.

initialized = False
property is_trained
learn(corpus, backend_params=None)

further train the project using documents from a metadata source

property modification_time
remove_model_data()

remove the data of this project

property subjects
suggest(texts, backend_params=None)

Suggest subjects for the given documents batch.

suggest_corpus(corpus, backend_params=None)

Suggest subjects for the given documents corpus in batches of documents.

train(corpus, backend_params=None, jobs=0)

train the project using documents from a metadata source

property transform
property vocab
property vocab_lang

annif.registry module

Registry that keeps track of Annif projects

class annif.registry.AnnifRegistry(projects_config_path, datadir, init_projects)

Bases: object

Class that keeps track of the Annif projects and vocabularies

get_project(project_id, min_access=Access.private)

return the definition of a single Project by project_id

get_projects(min_access=Access.private)

Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.
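
Since Access is an IntEnum, a minimum access level can be applied with an ordinary comparison. The sketch below redefines the enumeration with the values listed for annif.project.Access; the visible_projects function is hypothetical and maps project IDs to access levels instead of AnnifProject objects.

```python
from enum import IntEnum


class Access(IntEnum):
    private = 1
    hidden = 2
    public = 3


def visible_projects(projects, min_access=Access.private):
    """Keep the entries whose access level is at least min_access."""
    return {
        project_id: access
        for project_id, access in projects.items()
        if access >= min_access
    }


projects = {"a": Access.private, "b": Access.hidden, "c": Access.public}
at_least_hidden = visible_projects(projects, min_access=Access.hidden)
```

Here at_least_hidden contains "b" and "c" but not "a", while the default min_access of Access.private keeps all three.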

get_vocab(vocab_spec, default_language)

Return an (AnnifVocabulary, language) pair corresponding to the vocab_spec. If no language information is specified, use the given default language.

annif.registry.get_project(project_id, min_access=Access.private)

return the definition of a single Project by project_id

annif.registry.get_projects(min_access=Access.private)

Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.

annif.registry.get_vocab(vocab_id, min_access=Access.private)

return a single AnnifVocabulary by vocabulary id

annif.registry.get_vocabs(min_access=Access.private)

Return the available vocabularies as a dict of vocab_id -> AnnifVocabulary. The min_access parameter may be used to set the minimum access level required for the returned vocabularies.

annif.registry.initialize_projects(app)

annif.rest module

Definitions for REST API operations. These are wired via Connexion to methods defined in the OpenAPI specification.

annif.rest.language_not_supported_error(lang)

return a Connexion error object when attempting to use unsupported language

annif.rest.learn(project_id, body)

learn from documents and return an empty 204 response if successful

annif.rest.list_projects()

return a dict with projects formatted according to OpenAPI spec

annif.rest.project_not_found_error(project_id)

return a Connexion error object when a project is not found

annif.rest.server_error(err)

return a Connexion error object when there is a server error (project or backend problem)

annif.rest.show_info()

return version of annif and a title for the api according to OpenAPI spec

annif.rest.show_project(project_id)

return a single project formatted according to OpenAPI spec

annif.rest.suggest(project_id, body)

suggest subjects for the given text and return a dict with results formatted according to OpenAPI spec

annif.rest.suggest_batch(project_id, body, **query_parameters)

suggest subjects for the given documents and return a list of dicts with results formatted according to OpenAPI spec

annif.suggestion module

Representing suggested subjects.

class annif.suggestion.SubjectSuggestion(subject_id, score)

Bases: tuple

score

Alias for field number 1

subject_id

Alias for field number 0
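
A tuple subclass with field aliases like this behaves like a named tuple. The sketch below builds an equivalent type with collections.namedtuple to show the two ways of accessing the fields; the values are made up.

```python
from collections import namedtuple

# Equivalent stand-in: field 0 is subject_id, field 1 is score
SubjectSuggestion = namedtuple("SubjectSuggestion", ["subject_id", "score"])

suggestion = SubjectSuggestion(subject_id=42, score=0.87)

by_name = (suggestion.subject_id, suggestion.score)  # attribute access
by_index = (suggestion[0], suggestion[1])            # positional access
subject_id, score = suggestion                       # tuple unpacking
```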

class annif.suggestion.SuggestionBatch(array)

Bases: object

Subject suggestions for a batch of documents.

filter(limit=None, threshold=0.0)

Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionBatch object.

classmethod from_averaged(batches, weights)

Create a new SuggestionBatch where the subject scores are the weighted average of scores in several SuggestionBatches
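
The weighted average itself can be illustrated on plain lists, one list of subject scores per source. The function weighted_average is a hypothetical stand-in that handles a single document and assumes the weights do not sum to zero.

```python
def weighted_average(score_rows, weights):
    """Average several score lists element-wise, weighting each list."""
    total = sum(weights)
    n_subjects = len(score_rows[0])
    return [
        sum(weight * row[j] for row, weight in zip(score_rows, weights)) / total
        for j in range(n_subjects)
    ]


# Two sources scoring two subjects; the first source counts three times as much
averaged = weighted_average([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```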

classmethod from_sequence(suggestion_results, subject_index, limit=None)

Create a new SuggestionBatch from a sequence where each item is a sequence of SubjectSuggestion objects.

class annif.suggestion.SuggestionResult(array, idx)

Bases: object

Suggestions for a single document, backed by a row of a sparse array.

as_vector()
class annif.suggestion.SuggestionResults(batches)

Bases: object

Subject suggestions for a potentially very large number of documents.

filter(limit=None, threshold=0.0)

Return a view of these suggestions, filtered by the given limit and/or threshold, as another SuggestionResults object.

annif.suggestion.filter_suggestion(preds, limit=None, threshold=0.0)

filter a 2D sparse suggestion array (csr_array), retaining only the top K suggestions with a score above or equal to the threshold for each individual prediction; the rest will be left as zeros
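
The filtering rule can be sketched for a single prediction held in a dense list of scores. The function filter_row is a hypothetical illustration and does not use the sparse array type mentioned above.

```python
def filter_row(scores, limit=None, threshold=0.0):
    """Keep the top `limit` scores that are >= threshold; zero the rest."""
    # Indices ordered from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if limit is not None:
        ranked = ranked[:limit]
    keep = {i for i in ranked if scores[i] >= threshold}
    return [score if i in keep else 0.0 for i, score in enumerate(scores)]
```

For the scores [0.1, 0.9, 0.5, 0.0] with limit=2 and threshold=0.2, only 0.9 and 0.5 survive and the other positions become zero.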

annif.suggestion.vector_to_suggestions(vector, limit)

annif.util module

Utility functions for Annif

class annif.util.DuplicateFilter

Bases: logging.Filter

Filter out log messages that have already been displayed.

filter(record)

Determine if the specified record is to be logged.

Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.
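
A minimal sketch of a duplicate-suppressing filter follows. The class name SeenMessageFilter is hypothetical, and the choice to identify a message by its level and formatted text is an assumption made for illustration.

```python
import logging


class SeenMessageFilter(logging.Filter):
    """Let each distinct (level, message) pair through only once."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def filter(self, record):
        key = (record.levelno, record.getMessage())
        if key in self.seen:
            return False  # already displayed, drop the record
        self.seen.add(key)
        return True


# Attach the filter to a logger so that repeated warnings appear once
logger = logging.getLogger("sketch.duplicates")
logger.addFilter(SeenMessageFilter())
```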

annif.util.atomic_save(obj, dirname, filename, method=None)

Save the given object (which must have a .save() method, unless the method parameter is given) into the given directory with the given filename, using a temporary file and renaming the temporary file to the final name.
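
The write-then-rename technique can be sketched for plain text as follows. The function atomic_write_text is a hypothetical stand-in that writes a string instead of calling a .save() method on an object.

```python
import os
import tempfile


def atomic_write_text(text, dirname, filename):
    """Write text to dirname/filename via a temporary file in the same directory."""
    # Creating the temporary file in the target directory keeps the rename
    # on a single filesystem
    fd, tmp_path = tempfile.mkstemp(prefix=filename, dir=dirname)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace overwrites an existing target in one step
        os.replace(tmp_path, os.path.join(dirname, filename))
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave a partial file behind
        raise
```

Readers of the final file therefore see either the old content or the complete new content, never a partially written file.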

annif.util.boolean(val)

Convert the given value to a boolean True/False value, if it isn’t already. True values are ‘1’, ‘yes’, ‘true’, and ‘on’ (case insensitive), everything else is False.
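
The conversion rule can be written out as a short sketch. The name to_boolean is hypothetical, and the sketch assumes that any value that is not already a bool is a string.

```python
def to_boolean(val):
    """Return val if it is a bool, else compare its lowercase form."""
    if isinstance(val, bool):
        return val
    return val.lower() in ("1", "yes", "true", "on")
```

Under this rule "Yes" and "ON" convert to True, while "0", "no", and any other string convert to False.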

annif.util.cleanup_uri(uri)

remove angle brackets from a URI, if any

annif.util.identity(x)

Identity function: return the given argument unchanged

annif.util.metric_code(metric)

Convert a human-readable metric name into an alphanumeric string

annif.util.parse_args(param_string)

Parse a string of comma separated arguments such as ‘42,43,key=abc’ into a list of positional args [42, 43] and a dict of keyword args {key: abc}
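
A sketch of this kind of parsing is shown below. The name split_args is hypothetical, and the sketch makes no attempt at type conversion, so every value stays a string.

```python
def split_args(param_string):
    """Split 'a,b,key=value' into (positional list, keyword dict)."""
    posargs, kwargs = [], {}
    if not param_string:
        return posargs, kwargs
    for part in param_string.split(","):
        if "=" in part:
            key, value = part.split("=", 1)  # keyword argument
            kwargs[key] = value
        else:
            posargs.append(part)             # positional argument
    return posargs, kwargs
```

In this sketch the input "42,43,key=abc" gives the positional list ["42", "43"] and the keyword dict {"key": "abc"}.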

annif.util.parse_sources(sourcedef)

parse a source definition such as ‘src1:1.0,src2’ into a sequence of tuples (src_id, weight)
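
A sketch of parsing such a definition follows. The name split_sources is hypothetical, the default weight of 1.0 for a source without an explicit weight is an assumption, and the weights are returned as given, without any normalization.

```python
def split_sources(sourcedef):
    """Turn 'src1:1.0,src2' into [(src_id, weight), ...]."""
    sources = []
    for item in sourcedef.strip().split(","):
        # partition() yields an empty weight when no ':' is present
        src_id, _, weight = item.strip().partition(":")
        sources.append((src_id, float(weight) if weight else 1.0))
    return sources
```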

annif.views module

annif.views.home()

annif.vocab module

Vocabulary management functionality for Annif

class annif.vocab.AnnifVocabulary(vocab_id, datadir)

Bases: annif.datadir.DatadirMixin

Class representing a subject vocabulary which can be used by multiple Annif projects.

INDEX_FILENAME_CSV = 'subjects.csv'
INDEX_FILENAME_DUMP = 'subjects.dump.gz'
INDEX_FILENAME_TTL = 'subjects.ttl'
as_graph()

return the vocabulary as an rdflib graph

property languages
load_vocabulary(subject_corpus, force=False)

Load subjects from a subject corpus and save them into one or more subject index files as well as a SKOS/Turtle file for later use. If force=True, replace the existing subject index completely.

property skos

return the subject vocabulary from SKOS file

property subjects

Module contents

annif.create_app(config_name=None)