annif package

Submodules

annif.cli module

Definitions for command-line (Click) commands for invoking Annif operations and printing the results to console.

annif.cli.backend_param_option(f)

Decorator to add an option for CLI commands to override backend parameters

annif.cli.common_options(f)

Decorator to add common options for all CLI commands

annif.cli.generate_filter_batches(subjects)
annif.cli.get_project(project_id)

Helper function to get a project by ID and bail out if it doesn’t exist

annif.cli.get_vocab(vocab_id)

Helper function to get a vocabulary by ID and bail out if it doesn’t exist

annif.cli.open_documents(paths, subject_index, vocab_lang, docs_limit)

Helper function to open a document corpus from a list of pathnames, each of which is either a TSV file or a directory of TXT files. For directories with subjects in TSV files, the given vocabulary language will be used to convert subject labels into URIs. The corpus will be returned as an instance of DocumentCorpus or LimitingDocumentCorpus.

annif.cli.parse_backend_params(backend_param, project)

Parse a list of backend parameters given with the --backend-param option into a nested dict structure
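As a rough illustration of the nested dict structure, the sketch below assumes that each option value has the form <backend>.<parameter>=<value>. The function name and that syntax are assumptions made for this example; the documented function also receives the project, which this sketch ignores.

```python
def parse_backend_params_sketch(backend_params):
    """Turn ['tfidf.limit=10', ...] into {'tfidf': {'limit': '10'}, ...}."""
    result = {}
    for item in backend_params:
        # Split "backend.param=value" into key and value at the first "="
        key, value = item.split("=", 1)
        # Split the key into backend ID and parameter name at the first "."
        backend_id, param = key.split(".", 1)
        result.setdefault(backend_id, {})[param] = value
    return result


params = parse_backend_params_sketch(
    ["tfidf.limit=10", "fasttext.dim=100", "tfidf.min_df=2"]
)
print(params)
```

Values stay as strings in this sketch; any type conversion would be left to the backend that consumes them.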

annif.cli.set_project_config_file_path(ctx, param, value)

Override the default path, or the path given in an environment variable, with the path given as a CLI option

annif.cli.validate_backend_params(backend, beparam, project)

annif.config module

Configuration file handling

class annif.config.AnnifConfigCFG(filename)

Bases: object

Class for reading configuration in CFG/INI format

property project_ids
class annif.config.AnnifConfigDirectory(directory)

Bases: object

Class for reading configuration from directory

property project_ids
class annif.config.AnnifConfigTOML(filename)

Bases: object

Class for reading configuration in TOML format

property project_ids
annif.config.check_config(projects_config_path)
annif.config.find_config()
annif.config.parse_config(projects_config_path)

annif.datadir module

Mixin class for types that need a data directory

class annif.datadir.DatadirMixin(datadir, typename, identifier)

Bases: object

Mixin class for types that need a data directory for storing files

property datadir

annif.default_config module

A configuration module, where “Config” is a default configuration and the other classes are different configuration profiles overriding default settings.

class annif.default_config.Config

Bases: object

DATADIR = 'data'
DEBUG = False
INITIALIZE_PROJECTS = False
PROJECTS_CONFIG_PATH = ''
TESTING = False
class annif.default_config.DevelopmentConfig

Bases: annif.default_config.Config

DEBUG = True
class annif.default_config.ProductionConfig

Bases: annif.default_config.Config

INITIALIZE_PROJECTS = True
class annif.default_config.TestingConfig

Bases: annif.default_config.Config

DATADIR = 'tests/data'
PROJECTS_CONFIG_PATH = 'tests/projects.cfg'
TESTING = True
class annif.default_config.TestingDirectoryConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.d'
class annif.default_config.TestingInitializeConfig

Bases: annif.default_config.TestingConfig

INITIALIZE_PROJECTS = True
class annif.default_config.TestingInvalidProjectsConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects_invalid.cfg'
class annif.default_config.TestingNoProjectsConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/notfound.cfg'
class annif.default_config.TestingTOMLConfig

Bases: annif.default_config.TestingConfig

PROJECTS_CONFIG_PATH = 'tests/projects.toml'

annif.eval module

Evaluation metrics for Annif

class annif.eval.EvaluationBatch(subject_index)

Bases: object

A class for evaluating batches of results using all available metrics. The evaluate() method is called once per document in the batch. Final results can be queried using the results() method.

evaluate(hits, gold_subjects)
output_result_per_subject(y_true, y_pred, results_file, language)

Write results per subject (non-aggregated) to the output file results_file, using labels in the given language

results(metrics=[], results_file=None, language=None)

Evaluate a set of selected subjects against a gold standard using different metrics. If metrics is empty, use all available metrics. If results_file (a file object) is given, write results per subject to it with labels expressed in the given language.

annif.eval.dcg_score(y_true, y_pred, limit=None)

return the discounted cumulative gain (DCG) score for the selected labels vs. relevant labels

annif.eval.false_negatives(y_true, y_pred)

calculate the number of false negatives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.false_positives(y_true, y_pred)

calculate the number of false positives using bitwise operations, emulating the way sklearn evaluation metric functions work

annif.eval.filter_pred_top_k(preds, limit)

filter a 2D prediction vector, retaining only the top K suggestions for each individual prediction; the rest will be set to zeros
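The top-K filtering described above can be sketched in plain Python as follows. This is an illustrative stand-in that works on nested lists; the function name is hypothetical, and the real function presumably operates on NumPy arrays.

```python
def filter_top_k_sketch(preds, limit):
    """Keep the `limit` highest scores in each row and zero out the rest."""
    filtered = []
    for row in preds:
        # Indices of the highest-scoring entries in this row
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        keep = set(order[:limit])
        filtered.append(
            [score if i in keep else 0.0 for i, score in enumerate(row)]
        )
    return filtered


print(filter_top_k_sketch([[0.1, 0.5, 0.3], [0.9, 0.2, 0.4]], 2))
# → [[0.0, 0.5, 0.3], [0.9, 0.0, 0.4]]
```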

annif.eval.ndcg_score(y_true, y_pred, limit=None)

return the normalized discounted cumulative gain (nDCG) score for the selected labels vs. relevant labels
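For orientation, DCG and nDCG for a single document can be sketched as below, using the common definition in which the gain at rank r (counting from 0) is discounted by log2(r + 2). The function names, the single-document scope, and the rule that only nonzero predictions count as selected are assumptions of this sketch, not a description of the actual implementation.

```python
import math


def dcg_sketch(y_true, y_pred, limit=None):
    """Discounted cumulative gain of the predicted ranking."""
    # Rank item indices by predicted score, highest first
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    # Assumption: only items with a nonzero predicted score count as selected
    order = [i for i in order if y_pred[i] > 0]
    if limit is not None:
        order = order[:limit]
    return sum(y_true[i] / math.log2(rank + 2) for rank, i in enumerate(order))


def ndcg_sketch(y_true, y_pred, limit=None):
    """DCG divided by the DCG of an ideal ranking."""
    actual = dcg_sketch(y_true, y_pred, limit)
    # The ideal ranking places all relevant items first
    ideal = dcg_sketch(y_true, y_true, limit)
    return actual / ideal if ideal > 0 else 0.0


y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.8, 0.1, 0.0]
print(dcg_sketch(y_true, y_pred))
print(ndcg_sketch(y_true, y_pred))
```

A ranking that places every relevant label ahead of the irrelevant ones yields an nDCG of 1.0.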

annif.eval.precision_at_k_score(y_true, y_pred, limit)

calculate the precision at K, i.e. the proportion of relevant items among the top K predicted ones
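A minimal single-document sketch of precision at K follows, treating it as the share of relevant items among the K highest-scored predictions. The function name and the handling of an empty selection are assumptions made for this example.

```python
def precision_at_k_sketch(y_true, y_pred, limit):
    """Share of relevant items among the `limit` highest-scored items."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    top = order[:limit]
    if not top:
        # Assumption: an empty selection scores 0.0
        return 0.0
    relevant = sum(1 for i in top if y_true[i])
    return relevant / len(top)


print(precision_at_k_sketch([1, 0, 1, 0], [0.9, 0.8, 0.1, 0.0], 2))
# → 0.5
```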

annif.eval.true_positives(y_true, y_pred)

calculate the number of true positives using bitwise operations, emulating the way sklearn evaluation metric functions work
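The three counting functions above can be pictured with Python booleans and the & operator, as in the sketch below. The function names are hypothetical, and the real functions presumably work on boolean NumPy arrays rather than lists.

```python
def true_positives_sketch(y_true, y_pred):
    # Relevant and selected
    return sum(t & p for t, p in zip(y_true, y_pred))


def false_positives_sketch(y_true, y_pred):
    # Selected but not relevant
    return sum((not t) & p for t, p in zip(y_true, y_pred))


def false_negatives_sketch(y_true, y_pred):
    # Relevant but not selected
    return sum(t & (not p) for t, p in zip(y_true, y_pred))


y_true = [True, False, True, False]
y_pred = [True, True, False, False]
print(
    true_positives_sketch(y_true, y_pred),
    false_positives_sketch(y_true, y_pred),
    false_negatives_sketch(y_true, y_pred),
)
# → 1 1 1
```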

annif.exception module

Custom exceptions used by Annif

exception annif.exception.AnnifException(message, project_id=None, backend_id=None)

Bases: click.exceptions.ClickException

Base Annif exception. We define this as a subclass of ClickException so that the CLI can automatically handle exceptions. This exception cannot be instantiated directly - subclasses should be used instead.

format_message()
prefix = None
exception annif.exception.ConfigurationException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when a project or backend is misconfigured.

prefix = 'Misconfigured'
exception annif.exception.NotInitializedException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised for attempting to use a project or backend that cannot be initialized, most likely because it lacks a vocabulary or training and is therefore not yet functional.

prefix = "Couldn't initialize"
exception annif.exception.NotSupportedException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when an operation is not supported by a project or backend.

prefix = 'Not supported'
exception annif.exception.OperationFailedException(message, project_id=None, backend_id=None)

Bases: annif.exception.AnnifException

Exception raised when an operation fails for some unknown reason.

prefix = 'Operation failed'
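The way the prefix attribute and format_message() might work together is sketched below without depending on Click. The class names, and in particular the exact layout of the formatted message, are assumptions of this example; only the prefix values and constructor parameters come from the reference above.

```python
class AnnifExceptionSketch(Exception):
    """Stand-in base class; subclasses supply the prefix."""

    prefix = None

    def __init__(self, message, project_id=None, backend_id=None):
        super().__init__(message)
        self.message = message
        self.project_id = project_id
        self.backend_id = backend_id

    def format_message(self):
        # Assumed layout: "<prefix> <kind> '<id>': <message>"
        if self.project_id is not None:
            return "{} project '{}': {}".format(
                self.prefix, self.project_id, self.message
            )
        if self.backend_id is not None:
            return "{} backend '{}': {}".format(
                self.prefix, self.backend_id, self.message
            )
        return "{}: {}".format(self.prefix, self.message)


class ConfigurationExceptionSketch(AnnifExceptionSketch):
    prefix = "Misconfigured"


err = ConfigurationExceptionSketch("missing vocab setting", project_id="demo")
print(err.format_message())
# → Misconfigured project 'demo': missing vocab setting
```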

annif.parallel module

Parallel processing functionality for Annif

class annif.parallel.BaseWorker

Bases: object

Base class for workers that implement tasks executed via multiprocessing. The init method can be used to store data objects that are necessary for the operation. They will be stored in a class attribute that is accessible to the static worker method. The storage solution is inspired by this blog post: https://thelaziestprogrammer.com/python/multiprocessing-pool-a-global-solution

args = None
classmethod init(args)
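The class-attribute storage pattern described for BaseWorker can be sketched as follows. All class and function names here are hypothetical; the sketch shows how init could serve as a pool initializer so that a static worker method can reach the shared data.

```python
import multiprocessing


class BaseWorkerSketch:
    # Shared data lives in a class attribute so static methods can reach it
    args = None

    @classmethod
    def init(cls, args):
        cls.args = args


class ScaleWorker(BaseWorkerSketch):
    @staticmethod
    def work(value):
        # Read the data stored by init() in this process
        (factor,) = ScaleWorker.args
        return value * factor


def run_scaled(values, factor, processes=2):
    # The initializer runs once per worker process and stores the shared data
    with multiprocessing.Pool(
        processes, initializer=ScaleWorker.init, initargs=((factor,),)
    ) as pool:
        return pool.map(ScaleWorker.work, values)


# In-process demonstration of the storage pattern, without starting a pool
ScaleWorker.init((10,))
print(ScaleWorker.work(3))
# → 30
```

Passing the data through the initializer means it is sent to each worker process once, instead of being pickled again with every task.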
class annif.parallel.ProjectSuggestMap(registry, project_ids, backend_params, limit, threshold)

Bases: object

A utility class that can be used to wrap one or more projects and provide a mapping method that converts Document objects to suggestions. Intended to be used with the multiprocessing module.

suggest(doc)
annif.parallel.get_pool(n_jobs)

return a suitable multiprocessing pool class, and the correct jobs argument for its constructor, for the given number of parallel jobs
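One plausible shape for such a helper is sketched below. The specific rules (values below 1 meaning "all CPUs", a thread-based pool for a single job) and the order of the returned pair are assumptions of this sketch, not confirmed behavior of get_pool.

```python
import multiprocessing
import multiprocessing.dummy


def get_pool_sketch(n_jobs):
    """Return (jobs_argument, pool_class) for the requested parallelism."""
    if n_jobs < 1:
        # Assumption: Pool(None) lets multiprocessing use all available CPUs
        return None, multiprocessing.Pool
    if n_jobs == 1:
        # A thread-based pool avoids subprocess overhead for a single job
        return 1, multiprocessing.dummy.Pool
    return n_jobs, multiprocessing.Pool


jobs, pool_class = get_pool_sketch(1)
with pool_class(jobs) as pool:
    print(pool.map(abs, [-1, 2, -3]))
# → [1, 2, 3]
```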

annif.project module

Project management functionality for Annif

class annif.project.Access(value)

Bases: enum.IntEnum

Enumeration of access levels for projects

hidden = 2
private = 1
public = 3
class annif.project.AnnifProject(project_id, config, datadir, registry)

Bases: annif.datadir.DatadirMixin

Class representing the configuration of a single Annif project.

DEFAULT_ACCESS = 'public'
property analyzer
property backend
dump()

return this project as a dict

hyperopt(corpus, trials, jobs, metric, results_file)

optimize the hyperparameters of the project using a validation corpus against a given metric

initialize(parallel=False)

Initialize this project and its backend so that they are ready to be used. If parallel is True, expect that the project will be used for parallel processing.

initialized = False
property is_trained
learn(corpus, backend_params=None)

further train the project using documents from a metadata source

property modification_time
remove_model_data()

remove the data of this project

property subjects
suggest(text, backend_params=None)

Suggest subjects for the given text by passing it to the backend. Returns a list of SubjectSuggestion objects ordered by decreasing score.

train(corpus, backend_params=None, jobs=0)

train the project using documents from a metadata source

property transform
property vocab
property vocab_lang

annif.registry module

Registry that keeps track of Annif projects

class annif.registry.AnnifRegistry(projects_config_path, datadir, init_projects)

Bases: object

Class that keeps track of the Annif projects and vocabularies

get_project(project_id, min_access=Access.private)

return the definition of a single Project by project_id

get_projects(min_access=Access.private)

Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.
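To illustrate how min_access could narrow the result, the sketch below reuses the access levels listed for annif.project.Access and filters a plain dict. The dict of project_id -> access level and the helper name are stand-ins invented for this example, and it assumes a project is returned when its level is at least min_access.

```python
from enum import IntEnum


class Access(IntEnum):
    """Access levels, with the values listed for annif.project.Access."""

    private = 1
    hidden = 2
    public = 3


def filter_by_access(project_access, min_access=Access.private):
    # IntEnum members compare as integers, so >= expresses "at least"
    return {
        project_id: access
        for project_id, access in project_access.items()
        if access >= min_access
    }


levels = {"a": Access.public, "b": Access.hidden, "c": Access.private}
print(sorted(filter_by_access(levels, min_access=Access.hidden)))
# → ['a', 'b']
```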

get_vocab(vocab_spec, default_language)

Return an (AnnifVocabulary, language) pair corresponding to the vocab_spec. If no language information is specified, use the given default language.

annif.registry.get_project(project_id, min_access=Access.private)

return the definition of a single Project by project_id

annif.registry.get_projects(min_access=Access.private)

Return the available projects as a dict of project_id -> AnnifProject. The min_access parameter may be used to set the minimum access level required for the returned projects.

annif.registry.get_vocab(vocab_id, min_access=Access.private)

return a single AnnifVocabulary by vocabulary id

annif.registry.get_vocabs(min_access=Access.private)

Return the available vocabularies as a dict of vocab_id -> AnnifVocabulary. The min_access parameter may be used to set the minimum access level required for the returned vocabularies.

annif.registry.initialize_projects(app)

annif.rest module

Definitions for REST API operations. These are wired via Connexion to methods defined in the Swagger specification.

annif.rest.learn(project_id, body)

learn from documents and return an empty 204 response if successful

annif.rest.list_projects()

return a dict with projects formatted according to Swagger spec

annif.rest.project_not_found_error(project_id)

return a Connexion error object when a project is not found

annif.rest.server_error(err)

return a Connexion error object when there is a server error (project or backend problem)

annif.rest.show_project(project_id)

return a single project formatted according to Swagger spec

annif.rest.suggest(project_id, body)

suggest subjects for the given text and return a dict with results formatted according to Swagger spec

annif.suggestion module

Representing suggested subjects.

class annif.suggestion.LazySuggestionResult(construct)

Bases: annif.suggestion.SuggestionResult

SuggestionResult implementation that wraps another SuggestionResult which is initialized lazily only when it is actually accessed. Method calls will be proxied to the wrapped SuggestionResult.

as_list()

Return the hits as an ordered sequence of SubjectSuggestion objects, highest scores first.

as_vector(size, destination=None)

Return the hits as a one-dimensional score vector of given size. If destination array is given (not None) it will be used, otherwise a new array will be created.

filter(subject_index, limit=None, threshold=0.0)

Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionResult object.

class annif.suggestion.ListSuggestionResult(hits)

Bases: annif.suggestion.SuggestionResult

SuggestionResult implementation based primarily on lists of hits.

as_list()

Return the hits as an ordered sequence of SubjectSuggestion objects, highest scores first.

as_vector(size, destination=None)

Return the hits as a one-dimensional score vector of given size. If destination array is given (not None) it will be used, otherwise a new array will be created.

filter(subject_index, limit=None, threshold=0.0)

Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionResult object.

class annif.suggestion.SubjectSuggestion(subject_id, score)

Bases: tuple

score

Alias for field number 1

subject_id

Alias for field number 0

class annif.suggestion.SuggestionFilter(subject_index, limit=None, threshold=0.0)

Bases: object

A reusable filter for filtering SubjectSuggestion objects.

class annif.suggestion.SuggestionResult

Bases: object

Abstract base class for a set of hits returned by an analysis operation.

abstract as_list()

Return the hits as an ordered sequence of SubjectSuggestion objects, highest scores first.

abstract as_vector(size, destination=None)

Return the hits as a one-dimensional score vector of given size. If destination array is given (not None) it will be used, otherwise a new array will be created.

abstract filter(subject_index, limit=None, threshold=0.0)

Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionResult object.
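The effect of limit and threshold in filter() can be pictured with the following sketch over a plain list of hits. It ignores the subject_index argument, uses a hypothetical helper name, and assumes that a hit passes when its score is at least the threshold; none of this is taken from the actual implementation.

```python
from collections import namedtuple

# Stand-in with the same field order as annif.suggestion.SubjectSuggestion
SubjectSuggestion = namedtuple("SubjectSuggestion", "subject_id score")


def filter_hits_sketch(hits, limit=None, threshold=0.0):
    # Highest scores first, as in as_list()
    ordered = sorted(hits, key=lambda hit: hit.score, reverse=True)
    # Assumption: scores equal to the threshold are kept
    kept = [hit for hit in ordered if hit.score >= threshold]
    return kept if limit is None else kept[:limit]


hits = [
    SubjectSuggestion(subject_id=0, score=0.2),
    SubjectSuggestion(subject_id=1, score=0.9),
    SubjectSuggestion(subject_id=2, score=0.5),
]
print([hit.subject_id for hit in filter_hits_sketch(hits, limit=2, threshold=0.3)])
# → [1, 2]
```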

class annif.suggestion.VectorSuggestionResult(vector)

Bases: annif.suggestion.SuggestionResult

SuggestionResult implementation based primarily on NumPy vectors.

as_list()

Return the hits as an ordered sequence of SubjectSuggestion objects, highest scores first.

as_vector(size, destination=None)

Return the hits as a one-dimensional score vector of given size. If destination array is given (not None) it will be used, otherwise a new array will be created.

filter(subject_index, limit=None, threshold=0.0)

Return a subset of the hits, filtered by the given limit and score threshold, as another SuggestionResult object.

property subject_order
class annif.suggestion.WeightedSuggestion(hits, weight, subjects)

Bases: tuple

hits

Alias for field number 0

subjects

Alias for field number 2

weight

Alias for field number 1

annif.util module

Utility functions for Annif

annif.util.atomic_save(obj, dirname, filename, method=None)

Save the given object (which must have a .save() method, unless the method parameter is given) into the given directory with the given filename, using a temporary file and renaming the temporary file to the final name.
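The write-then-rename approach described above might look like the sketch below. The function name, the choice of tempfile.mkstemp and os.replace, and the assumption that method is a callable taking (obj, path) are all illustrative rather than taken from the actual implementation.

```python
import os
import tempfile


def atomic_save_sketch(obj, dirname, filename, method=None):
    # Create the temporary file in the target directory so that the final
    # rename stays on the same filesystem
    fd, tempname = tempfile.mkstemp(prefix=filename, dir=dirname)
    os.close(fd)
    try:
        if method is not None:
            method(obj, tempname)
        else:
            obj.save(tempname)
        # Readers see either the old file or the complete new one
        os.replace(tempname, os.path.join(dirname, filename))
    finally:
        # Clean up the temporary file if saving failed before the rename
        if os.path.exists(tempname):
            os.remove(tempname)


class TextModel:
    """Tiny example object with a .save() method."""

    def __init__(self, text):
        self.text = text

    def save(self, path):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(self.text)


with tempfile.TemporaryDirectory() as workdir:
    atomic_save_sketch(TextModel("hello"), workdir, "model.txt")
    print(os.listdir(workdir))
```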

annif.util.boolean(val)

Convert the given value to a boolean True/False value, if it isn’t already. True values are ‘1’, ‘yes’, ‘true’, and ‘on’ (case insensitive), everything else is False.
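A direct reading of that rule, written as a small sketch (the function name is hypothetical):

```python
def boolean_sketch(val):
    # Values that are already booleans pass through unchanged
    if isinstance(val, bool):
        return val
    # Case-insensitive match against the accepted true values
    return str(val).lower() in ("1", "yes", "true", "on")


print(boolean_sketch("Yes"), boolean_sketch("off"), boolean_sketch(True))
# → True False True
```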

annif.util.cleanup_uri(uri)

remove angle brackets from a URI, if any

annif.util.identity(x)

Identity function: return the given argument unchanged

annif.util.merge_hits(weighted_hits, size)

Merge hits from multiple sources. Input is a sequence of WeightedSuggestion objects. The size parameter determines the length of the subject vector. Returns a SuggestionResult object.
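One way to picture the merge is as a weighted average of per-source score vectors, as in the sketch below. That averaging rule is an assumption of this example, as are the helper name, the simplified two-field stand-in for WeightedSuggestion, and the use of plain lists where the real function works with SuggestionResult objects.

```python
from collections import namedtuple

# Simplified stand-in: hits is already a score vector (a list of floats)
WeightedHits = namedtuple("WeightedHits", "hits weight")


def merge_hits_sketch(weighted_hits, size):
    total_weight = sum(item.weight for item in weighted_hits)
    merged = [0.0] * size
    for item in weighted_hits:
        for index in range(size):
            # Each source contributes in proportion to its weight
            merged[index] += item.hits[index] * item.weight / total_weight
    return merged


sources = [
    WeightedHits(hits=[1.0, 0.0], weight=3.0),
    WeightedHits(hits=[0.0, 1.0], weight=1.0),
]
print(merge_hits_sketch(sources, 2))
# → [0.75, 0.25]
```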

annif.util.metric_code(metric)

Convert a human-readable metric name into an alphanumeric string

annif.util.parse_args(param_string)

Parse a string of comma separated arguments such as ‘42,43,key=abc’ into a list of positional args [42, 43] and a dict of keyword args {key: abc}
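A minimal sketch of this parsing is shown below. The function name is hypothetical, and the sketch keeps every value as a string, which is an assumption; the description above does not say whether values are converted.

```python
def parse_args_sketch(param_string):
    if not param_string:
        return [], {}
    posargs = []
    kwargs = {}
    for part in param_string.split(","):
        if "=" in part:
            # key=value pairs become keyword arguments
            key, value = part.split("=", 1)
            kwargs[key] = value
        else:
            posargs.append(part)
    return posargs, kwargs


print(parse_args_sketch("42,43,key=abc"))
# → (['42', '43'], {'key': 'abc'})
```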

annif.util.parse_sources(sourcedef)

parse a source definition such as ‘src1:1.0,src2’ into a sequence of tuples (src_id, weight)
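The source definition syntax can be parsed along the lines of the sketch below. The function name is hypothetical and the default weight of 1.0 for sources without an explicit weight is an assumption. The real function may also post-process the weights (for example by normalizing them), which this sketch does not attempt.

```python
def parse_sources_sketch(sourcedef):
    sources = []
    for part in sourcedef.strip().split(","):
        fields = part.strip().split(":")
        src_id = fields[0]
        # Assumption: a source without an explicit weight gets 1.0
        weight = float(fields[1]) if len(fields) > 1 else 1.0
        sources.append((src_id, weight))
    return sources


print(parse_sources_sketch("src1:2.0,src2"))
# → [('src1', 2.0), ('src2', 1.0)]
```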

annif.views module

annif.views.home()

annif.vocab module

Vocabulary management functionality for Annif

class annif.vocab.AnnifVocabulary(vocab_id, datadir)

Bases: annif.datadir.DatadirMixin

Class representing a subject vocabulary which can be used by multiple Annif projects.

INDEX_FILENAME_CSV = 'subjects.csv'
INDEX_FILENAME_DUMP = 'subjects.dump.gz'
INDEX_FILENAME_TTL = 'subjects.ttl'
as_graph()

return the vocabulary as an rdflib graph

property languages
load_vocabulary(subject_corpus, force=False)

Load subjects from a subject corpus and save them into one or more subject index files as well as a SKOS/Turtle file for later use. If force=True, replace the existing subject index completely.

property skos

return the subject vocabulary from SKOS file

property subjects

Module contents

annif.create_app(config_name=None)