annif.corpus package


annif.corpus.combine module

Class for combining multiple corpora so they behave like a single corpus

class annif.corpus.combine.CombinedCorpus(corpora)

Bases: annif.corpus.types.DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents

Iterate through the document corpus, yielding Document objects.
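The combining behaviour can be illustrated with a minimal, self-contained sketch. The classes `ListCorpus` and `ChainedCorpus` below are hypothetical stand-ins, not Annif's implementation. They assume only that each wrapped corpus exposes a `documents` property, and they use plain strings in place of Document objects.

```python
import itertools


class ListCorpus:
    """Hypothetical stand-in for a corpus backed by an in-memory list."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        # Yield the stored documents one at a time.
        yield from self._docs


class ChainedCorpus:
    """Hypothetical stand-in: presents several corpora as a single one."""

    def __init__(self, corpora):
        self._corpora = list(corpora)

    @property
    def documents(self):
        # Chain the document iterators of the wrapped corpora in order.
        return itertools.chain.from_iterable(
            corpus.documents for corpus in self._corpora
        )


combined = ChainedCorpus([ListCorpus(["a", "b"]), ListCorpus(["c"])])
combined_docs = list(combined.documents)
```

Chaining lazily means no wrapped corpus is read until iteration reaches it.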

annif.corpus.document module

Classes for supporting document corpora

class annif.corpus.document.DocumentDirectory(path, subject_index, language, require_subjects=False)

Bases: annif.corpus.types.DocumentCorpus

A directory of files as a full text document corpus

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentFile(path, subject_index)

Bases: annif.corpus.types.DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentList(documents)

Bases: annif.corpus.types.DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but yields at most docs_limit documents

property documents

Iterate through the document corpus, yielding Document objects.
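One way to cap the number of documents without reading the whole wrapped corpus is `itertools.islice`. The helper `limit_documents` below is a hypothetical illustration under that assumption, not Annif's own code, and it uses plain strings in place of Document objects.

```python
import itertools


def limit_documents(documents, docs_limit):
    """Return an iterator over at most docs_limit items of documents."""
    # islice stops after docs_limit items without consuming the rest.
    return itertools.islice(documents, docs_limit)


limited = list(limit_documents(iter(["d1", "d2", "d3", "d4"]), 2))
```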

class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.
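The sketch below shows one plausible reading of the transforming wrapper. It assumes that the transform function is applied to the text of each document and that the subject set is passed through unchanged; the reference above does not state this. `Doc` and `transform_documents` are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the Document tuple in annif.corpus.types.
Doc = namedtuple("Doc", ["text", "subject_set"])


def transform_documents(documents, transform_fn):
    """Yield new documents whose text has been passed through transform_fn."""
    for doc in documents:
        # Assumption: only the text is transformed; the subject set is kept.
        yield Doc(text=transform_fn(doc.text), subject_set=doc.subject_set)


source = [Doc("Hello World", {1}), Doc("SKOS Files", {2, 3})]
lowered = list(transform_documents(source, str.lower))
```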

annif.corpus.skos module

Support for subjects loaded from a SKOS/RDF file

class annif.corpus.skos.SubjectFileSKOS(path)

Bases: annif.corpus.types.SubjectCorpus

A subject corpus that uses SKOS files

PREF_LABEL_PROPERTIES = (rdflib.term.URIRef(''), rdflib.term.URIRef(''))

property concepts

get_concept_labels(concept, label_types)

Return all the labels of the given concept that use the given label properties, as a dict-like object whose keys are language codes and whose values are lists of labels in that language.
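The shape of that return value can be illustrated without an RDF graph. The helper `group_labels_by_language` and its input of (label, language) pairs are assumptions made for this sketch; the real method reads the labels from the loaded SKOS data.

```python
from collections import defaultdict


def group_labels_by_language(labels):
    """Group (label, language) pairs into {language: [label, ...]}."""
    by_lang = defaultdict(list)
    for label, language in labels:
        by_lang[language].append(label)
    return by_lang


grouped = group_labels_by_language(
    [("cat", "en"), ("kissa", "fi"), ("feline", "en")]
)
```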

static is_rdf_file(path)

Return True if the path looks like an RDF file that can be loaded as SKOS.

property languages

Provide a list of language codes supported by this subject corpus.


save_skos(path)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

annif.corpus.skos.serialize_subjects_to_skos(subjects, path)

Create a SKOS representation of the given subjects and serialize it into a SKOS/Turtle file with the given path name.

annif.corpus.subject module

Classes for supporting subject corpora expressed as directories or files

class annif.corpus.subject.SubjectFileCSV(path)

Bases: annif.corpus.types.SubjectCorpus

A multilingual subject vocabulary stored in a CSV file.

static is_csv_file(path)

Return True if the path looks like a CSV file.

property languages

Provide a list of language codes supported by this subject corpus.


save_skos(path)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.subject.SubjectFileTSV(path, language)

Bases: annif.corpus.types.SubjectCorpus

A monolingual subject vocabulary stored in a TSV file.

property languages

Provide a list of language codes supported by this subject corpus.


save_skos(path)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.subject.SubjectIndex

Bases: object

An index that remembers the associations between integer subject IDs and their URIs and labels.

property active

Return a list of (subject_id, subject) tuples of all subjects that are not deprecated.

by_label(label, language)

Return the subject ID of a subject by its label in a given language.

by_uri(uri, warnings=True)

Return the subject ID of a subject by its URI, or None if not found. If warnings is True, log a warning message when the URI cannot be found.
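The lookup-with-optional-warning pattern can be sketched as follows. `TinySubjectIndex`, the logger name, and the example.org URIs are hypothetical and exist only for this illustration; the sketch assumes that a subject's ID is its position in the list used to build the index.

```python
import logging

logger = logging.getLogger("subject_index_sketch")


class TinySubjectIndex:
    """Hypothetical stand-in mapping subject URIs to integer IDs."""

    def __init__(self, uris):
        # Assumption: the position of each URI becomes its subject ID.
        self._uri_to_id = {uri: idx for idx, uri in enumerate(uris)}

    def by_uri(self, uri, warnings=True):
        subject_id = self._uri_to_id.get(uri)
        if subject_id is None and warnings:
            # Report the failed lookup but still return None.
            logger.warning("Unknown subject URI <%s>", uri)
        return subject_id


index = TinySubjectIndex(["http://example.org/a", "http://example.org/b"])
found = index.by_uri("http://example.org/b")
missing = index.by_uri("http://example.org/zzz", warnings=False)
```

Passing `warnings=False` is useful when missing URIs are expected and the caller handles the None result itself.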


Return indices of deprecated subjects.

property languages

classmethod load(path)

Load a subject index from a CSV file and return it.


Initialize the subject index from a subject corpus


Save this subject index into a file with the given path name.

class annif.corpus.subject.SubjectSet(subject_ids=None)

Bases: object

Represents a set of subjects for a document.

as_vector(size=None, destination=None)

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
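In multilabel indicator format, position i of the vector is set when subject ID i belongs to the set. The sketch below illustrates the size/destination behaviour described above; to stay free of third-party dependencies it uses a plain Python list in place of the NumPy array, and the function name `as_indicator_vector` is hypothetical.

```python
def as_indicator_vector(subject_ids, size=None, destination=None):
    """Return a 0/1 vector with ones at the positions in subject_ids."""
    if destination is None:
        # No buffer supplied: create a new zero-filled vector of `size`.
        vector = [0] * size
    else:
        # Reuse the caller's buffer.
        # Assumption: entries already present in it are left untouched.
        vector = destination
    for subject_id in subject_ids:
        vector[subject_id] = 1
    return vector


fresh = as_indicator_vector([0, 3], size=5)
buffer = [0, 0, 0, 0]
reused = as_indicator_vector([2], destination=buffer)
```

Supplying a destination avoids allocating a new vector for every document, which matters when many subject sets are converted in a loop.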

classmethod from_string(subj_data, subject_index, language)

annif.corpus.types module

Basic types for document and subject corpora

class annif.corpus.types.Document(text, subject_set)

Bases: tuple


subject_set

Alias for field number 1


text

Alias for field number 0
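Because the class derives from tuple and its fields are aliases for positions, each field can be read by name or by index. The sketch below re-creates a comparable two-field type with `collections.namedtuple`; the name `DocumentSketch` is hypothetical, and the real class may be defined differently.

```python
from collections import namedtuple

# Hypothetical re-creation of a two-field tuple type like Document.
DocumentSketch = namedtuple("DocumentSketch", ["text", "subject_set"])

doc = DocumentSketch(text="A short abstract", subject_set={4, 7})

# Fields are reachable both by name and by position.
text_by_name = doc.text
text_by_position = doc[0]
subjects_by_position = doc[1]
```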

class annif.corpus.types.DocumentCorpus

Bases: object

Abstract base class for document corpora

property doc_batches

Iterate through the document corpus in batches, yielding lists of Document objects.
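Batched iteration can be sketched with `itertools.islice`. The helper `iter_batches` and its `batch_size` parameter are assumptions made for this example; the reference above does not say how the batch size is chosen.

```python
import itertools


def iter_batches(documents, batch_size):
    """Yield successive lists of up to batch_size documents."""
    iterator = iter(documents)
    while True:
        # Take the next batch_size items; the last batch may be shorter.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(["d1", "d2", "d3", "d4", "d5"], 2))
```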

abstract property documents

Iterate through the document corpus, yielding Document objects.


Check if there are no documents to iterate.

class annif.corpus.types.Subject(uri, labels, notation)

Bases: tuple


labels

Alias for field number 1


notation

Alias for field number 2


uri

Alias for field number 0

class annif.corpus.types.SubjectCorpus

Bases: object

Abstract base class for subject corpora

abstract property languages

Provide a list of language codes supported by this subject corpus.

abstract save_skos(path)

Save the contents of the subject corpus into a SKOS/Turtle file with the given path name.

abstract property subjects

Iterate through the subject corpus, yielding Subject objects.

Module contents

Annif corpus operations

class annif.corpus.CombinedCorpus(corpora)

Bases: annif.corpus.types.DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.Document(text, subject_set)

Bases: tuple


subject_set

Alias for field number 1


text

Alias for field number 0

class annif.corpus.DocumentDirectory(path, subject_index, language, require_subjects=False)

Bases: annif.corpus.types.DocumentCorpus

A directory of files as a full text document corpus

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentFile(path, subject_index)

Bases: annif.corpus.types.DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentList(documents)

Bases: annif.corpus.types.DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but yields at most docs_limit documents

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.Subject(uri, labels, notation)

Bases: tuple


labels

Alias for field number 1


notation

Alias for field number 2


uri

Alias for field number 0

class annif.corpus.SubjectFileCSV(path)

Bases: annif.corpus.types.SubjectCorpus

A multilingual subject vocabulary stored in a CSV file.

static is_csv_file(path)

Return True if the path looks like a CSV file.

property languages

Provide a list of language codes supported by this subject corpus.


save_skos(path)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectFileSKOS(path)

Bases: annif.corpus.types.SubjectCorpus

A subject corpus that uses SKOS files

PREF_LABEL_PROPERTIES = (rdflib.term.URIRef(''), rdflib.term.URIRef(''))

property concepts

get_concept_labels(concept, label_types)

Return all the labels of the given concept that use the given label properties, as a dict-like object whose keys are language codes and whose values are lists of labels in that language.

static is_rdf_file(path)

Return True if the path looks like an RDF file that can be loaded as SKOS.

property languages

Provide a list of language codes supported by this subject corpus.


save_skos(path)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectFileTSV(path, language)

Bases: annif.corpus.types.SubjectCorpus

A monolingual subject vocabulary stored in a TSV file.

property languages

Provide a list of language codes supported by this subject corpus.


save_skos(path)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectIndex

Bases: object

An index that remembers the associations between integer subject IDs and their URIs and labels.

property active

Return a list of (subject_id, subject) tuples of all subjects that are not deprecated.

by_label(label, language)

Return the subject ID of a subject by its label in a given language.

by_uri(uri, warnings=True)

Return the subject ID of a subject by its URI, or None if not found. If warnings is True, log a warning message when the URI cannot be found.


Return indices of deprecated subjects.

property languages

classmethod load(path)

Load a subject index from a CSV file and return it.


Initialize the subject index from a subject corpus


Save this subject index into a file with the given path name.

class annif.corpus.SubjectSet(subject_ids=None)

Bases: object

Represents a set of subjects for a document.

as_vector(size=None, destination=None)

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.

classmethod from_string(subj_data, subject_index, language)

class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.