annif.corpus package

Submodules

annif.corpus.combine module

Class for combining multiple corpora so they behave like a single corpus

class annif.corpus.combine.CombinedCorpus(corpora: list[DocumentFile])

Bases: annif.corpus.types.DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents: itertools.chain

Iterate through the document corpus, yielding Document objects.
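The itertools.chain return type suggests that the documents of the wrapped corpora are simply yielded one corpus after another. A minimal sketch of that behaviour, not the Annif implementation: ListCorpus and CombinedSketch are hypothetical stand-ins, and Document is rebuilt here as a plain namedtuple.

```python
import itertools
from collections import namedtuple

# Stand-in for annif.corpus.types.Document (fields: text, subject_set).
Document = namedtuple("Document", ["text", "subject_set"])


class ListCorpus:
    """Hypothetical minimal corpus exposing a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class CombinedSketch:
    """Illustrative only: chains the documents of several corpora."""

    def __init__(self, corpora):
        self._corpora = corpora

    @property
    def documents(self):
        # chain.from_iterable walks each corpus in turn, lazily.
        return itertools.chain.from_iterable(c.documents for c in self._corpora)


first = ListCorpus([Document("cats and dogs", None)])
second = ListCorpus([Document("rivers", None), Document("lakes", None)])
combined = CombinedSketch([first, second])
texts = [doc.text for doc in combined.documents]
print(texts)
```

Because the chain is lazy, no corpus is read until iteration reaches it.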

annif.corpus.document module

Classes for supporting document corpora

class annif.corpus.document.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)

Bases: annif.corpus.types.DocumentCorpus

A directory of files as a full text document corpus

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentFile(path: str, subject_index: SubjectIndex)

Bases: annif.corpus.types.DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentList(documents)

Bases: annif.corpus.types.DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but limits the number of documents to a given limit

property documents

Iterate through the document corpus, yielding Document objects.
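One way to picture the limiting wrapper is as a slice over the wrapped iterator. The sketch below assumes that the limit counts documents from the start of the corpus; ListCorpus and LimitingSketch are hypothetical stand-ins, not Annif classes.

```python
import itertools
from collections import namedtuple

# Stand-in for annif.corpus.types.Document (fields: text, subject_set).
Document = namedtuple("Document", ["text", "subject_set"])


class ListCorpus:
    """Hypothetical minimal corpus exposing a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class LimitingSketch:
    """Illustrative only: yields at most docs_limit documents."""

    def __init__(self, corpus, docs_limit):
        self._corpus = corpus
        self._docs_limit = docs_limit

    @property
    def documents(self):
        # islice stops after docs_limit items without reading the rest.
        return itertools.islice(self._corpus.documents, self._docs_limit)


corpus = ListCorpus(Document(f"document {i}", None) for i in range(5))
limited_texts = [doc.text for doc in LimitingSketch(corpus, 2).documents]
all_texts = [doc.text for doc in LimitingSketch(corpus, 10).documents]
print(limited_texts)
```

A limit larger than the corpus simply yields every document.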

class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.
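A sketch of the wrapping idea, under the assumption that transform_fn maps a document's text to new text and leaves the subject set untouched. The reference above does not state the function's signature, so treat this as one plausible reading. ListCorpus and TransformingSketch are hypothetical stand-ins.

```python
from collections import namedtuple

# Stand-in for annif.corpus.types.Document (fields: text, subject_set).
Document = namedtuple("Document", ["text", "subject_set"])


class ListCorpus:
    """Hypothetical minimal corpus exposing a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class TransformingSketch:
    """Illustrative only: applies transform_fn to each document's text."""

    def __init__(self, corpus, transform_fn):
        self._corpus = corpus
        self._transform_fn = transform_fn

    @property
    def documents(self):
        for doc in self._corpus.documents:
            # Assumption: only the text changes; _replace returns a new tuple.
            yield doc._replace(text=self._transform_fn(doc.text))


corpus = ListCorpus([Document("  Cats AND Dogs ", "subjects-a")])
transformed = list(TransformingSketch(corpus, lambda t: t.strip().lower()).documents)
print(transformed[0].text)
```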

annif.corpus.skos module

Support for subjects loaded from a SKOS/RDF file

class annif.corpus.skos.SubjectFileSKOS(path: str)

Bases: annif.corpus.types.SubjectCorpus

A subject corpus that uses SKOS files

PREF_LABEL_PROPERTIES = (rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#prefLabel'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'))

property concepts: Iterator[URIRef]

get_concept_labels(concept: URIRef, label_types: Sequence[URIRef]) → collections.defaultdict[str | None, list[str]]

Return all the labels of the given concept with the given label properties as a dict-like object where the keys are language codes and the values are lists of labels in that language.
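The shape of the returned object can be illustrated without an RDF graph. In this sketch the labels are given as plain (text, language) pairs, which is an assumption made for the example; group_labels_by_language is a hypothetical helper, not part of Annif or rdflib.

```python
from collections import defaultdict


def group_labels_by_language(labels):
    """Group (text, language) pairs into {language: [texts]}.

    A language of None stands for a label without a language tag.
    """
    by_lang = defaultdict(list)
    for text, lang in labels:
        by_lang[lang].append(text)
    return by_lang


labels = group_labels_by_language(
    [("cat", "en"), ("kissa", "fi"), ("feline", "en"), ("Felis catus", None)]
)
print(labels["en"])
```

Because the result is a defaultdict, looking up a language that has no labels gives an empty list instead of raising KeyError.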

static is_rdf_file(path: str) → bool

Return True if the path looks like an RDF file that can be loaded as SKOS.

property languages: set[str]

Provide the set of language codes supported by this subject corpus.

save_skos(path: str) → None

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects: Iterator[Subject]

Iterate through the subject corpus, yielding Subject objects.

annif.corpus.skos.serialize_subjects_to_skos(subjects: Iterator, path: str) → None

Create a SKOS representation of the given subjects and serialize it into a SKOS/Turtle file with the given path name.

annif.corpus.subject module

Classes for supporting subject corpora expressed as directories or files

class annif.corpus.subject.SubjectFileCSV(path: str)

Bases: annif.corpus.types.SubjectCorpus

A multilingual subject vocabulary stored in a CSV file.

static is_csv_file(path: str) → bool

Return True if the path looks like a CSV file.

property languages: list[str]

Provide a list of language codes supported by this subject corpus.

save_skos(path: str) → None

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects: Generator

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.subject.SubjectFileTSV(path: str, language: str)

Bases: annif.corpus.types.SubjectCorpus

A monolingual subject vocabulary stored in a TSV file.

property languages: list[str]

Provide a list of language codes supported by this subject corpus.

save_skos(path: str) → None

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects: Generator

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.subject.SubjectIndex

Bases: object

An index that remembers the associations between integer subject IDs and their URIs and labels.

property active: list[tuple[int, annif.corpus.types.Subject]]

Return a list of (subject_id, subject) tuples of all subjects that are not deprecated.

append(subject: annif.corpus.types.Subject) → None

by_label(label: str | None, language: str) → int | None

Return the subject ID of a subject by its label in a given language.

by_uri(uri: str, warnings: bool = True) → int | None

Return the subject ID of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.

contains_uri(uri: str) → bool

deprecated_ids() → list[int]

Return indices of deprecated subjects.

property languages: list[str] | None

classmethod load(path: str) → annif.corpus.subject.SubjectIndex

Load a subject index from a CSV file and return it.

load_subjects(corpus: annif.corpus.types.SubjectCorpus) → None

Initialize the subject index from a subject corpus.

save(path: str) → None

Save this subject index into a file with the given path name.
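The lookup methods can be pictured as two dictionaries kept alongside a list of subjects. The sketch below is a toy stand-in, not the Annif class: TinySubjectIndex is hypothetical, and it assumes that a subject's labels are a dict from language code to label and that IDs are assigned in insertion order.

```python
import logging
from collections import namedtuple

# Stand-in for annif.corpus.types.Subject (fields: uri, labels, notation).
Subject = namedtuple("Subject", ["uri", "labels", "notation"])
logger = logging.getLogger("subject_index_sketch")


class TinySubjectIndex:
    """Illustrative only: maps integer IDs to subjects, URIs and labels."""

    def __init__(self):
        self._subjects = []
        self._uri_idx = {}
        self._label_idx = {}

    def append(self, subject):
        # Assumption: the next free integer becomes the subject ID.
        subject_id = len(self._subjects)
        self._uri_idx[subject.uri] = subject_id
        # Assumption: labels is a dict of language code -> label.
        for lang, label in (subject.labels or {}).items():
            self._label_idx[(label, lang)] = subject_id
        self._subjects.append(subject)

    def contains_uri(self, uri):
        return uri in self._uri_idx

    def by_uri(self, uri, warnings=True):
        subject_id = self._uri_idx.get(uri)
        if subject_id is None and warnings:
            logger.warning("Unknown subject URI <%s>", uri)
        return subject_id

    def by_label(self, label, language):
        return self._label_idx.get((label, language))


index = TinySubjectIndex()
index.append(Subject("http://example.org/cat", {"en": "cat", "fi": "kissa"}, None))
index.append(Subject("http://example.org/dog", {"en": "dog", "fi": "koira"}, None))
print(index.by_uri("http://example.org/dog"))
print(index.by_uri("http://example.org/missing", warnings=False))
```

Passing warnings=False is useful when a missing URI is an expected case and should not clutter the log.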

class annif.corpus.subject.SubjectSet(subject_ids: Optional[Any] = None)

Bases: object

Represents a set of subjects for a document.

as_vector(size: int | None = None, destination: np.ndarray | None = None) → np.ndarray

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
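In multilabel indicator format, the vector has one position per subject ID and a true value wherever that subject is present. The sketch below shows the idea with a plain Python list standing in for the NumPy array, so it runs without dependencies. The function as_indicator_vector is hypothetical, and raising ValueError when neither argument is given is a choice made for this sketch.

```python
def as_indicator_vector(subject_ids, size=None, destination=None):
    """Return a list with True at every index listed in subject_ids.

    If destination is given it is filled in place and returned;
    otherwise a new all-False list of length `size` is created.
    """
    if destination is None:
        if size is None:
            raise ValueError("either size or destination must be given")
        destination = [False] * size
    for subject_id in subject_ids:
        destination[subject_id] = True
    return destination


# A new vector of length 5 for a document tagged with subjects 0 and 3.
fresh = as_indicator_vector({0, 3}, size=5)

# Reusing a preallocated buffer avoids creating a new vector per document.
buffer = [False] * 4
reused = as_indicator_vector({1}, destination=buffer)
print(fresh)
```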

classmethod from_string(subj_data: str, subject_index: annif.corpus.subject.SubjectIndex, language: str) → annif.corpus.subject.SubjectSet

annif.corpus.types module

Basic types for document and subject corpora

class annif.corpus.types.Document(text, subject_set)

Bases: tuple

subject_set

Alias for field number 1

text

Alias for field number 0
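The "Alias for field number" entries are the standard documentation of a named tuple: each field can be read by name or by position. A small illustration, using a locally defined namedtuple with the same field names as a stand-in for the Annif class:

```python
from collections import namedtuple

# Stand-in with the same field order as annif.corpus.types.Document.
Document = namedtuple("Document", ["text", "subject_set"])

doc = Document(text="A short abstract about cats.", subject_set=None)

# Field access by name and by position refers to the same values.
same_text = doc.text == doc[0]
same_subjects = doc.subject_set is doc[1]

# Being a tuple, a Document can also be unpacked directly.
text, subject_set = doc
print(text)
```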

class annif.corpus.types.DocumentCorpus

Bases: object

Abstract base class for document corpora

DOC_BATCH_SIZE = 32
property doc_batches: collections.abc.Iterator[list[annif.corpus.types.Document]]

Iterate through the document corpus in batches, yielding lists of Document objects.
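Batching of this kind can be sketched with itertools.islice. The function below is a hypothetical stand-alone version that takes any iterable; it assumes that the final batch may be shorter than the batch size, which the reference above does not state explicitly.

```python
import itertools

DOC_BATCH_SIZE = 32  # same value as the class constant listed above


def doc_batches(documents, batch_size=DOC_BATCH_SIZE):
    """Yield lists of up to batch_size items from an iterable."""
    it = iter(documents)
    while True:
        # Take the next batch_size items; an empty list means we are done.
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


# 70 placeholder documents split into batches of 32.
sizes = [len(batch) for batch in doc_batches(f"document {i}" for i in range(70))]
print(sizes)
```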

abstract property documents

Iterate through the document corpus, yielding Document objects.

is_empty() → bool

Check if there are no documents to iterate.

class annif.corpus.types.Subject(uri, labels, notation)

Bases: tuple

labels

Alias for field number 1

notation

Alias for field number 2

uri

Alias for field number 0

class annif.corpus.types.SubjectCorpus

Bases: object

Abstract base class for subject corpora

abstract property languages

Provide a list of language codes supported by this subject corpus.

abstract save_skos(path)

Save the contents of the subject corpus into a SKOS/Turtle file with the given path name.

abstract property subjects

Iterate through the subject corpus, yielding Subject objects.

Module contents

Annif corpus operations

class annif.corpus.CombinedCorpus(corpora: list[DocumentFile])

Bases: annif.corpus.types.DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents: itertools.chain

Iterate through the document corpus, yielding Document objects.

class annif.corpus.Document(text, subject_set)

Bases: tuple

subject_set

Alias for field number 1

text

Alias for field number 0

class annif.corpus.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)

Bases: annif.corpus.types.DocumentCorpus

A directory of files as a full text document corpus

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentFile(path: str, subject_index: SubjectIndex)

Bases: annif.corpus.types.DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentList(documents)

Bases: annif.corpus.types.DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but limits the number of documents to a given limit

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.Subject(uri, labels, notation)

Bases: tuple

labels

Alias for field number 1

notation

Alias for field number 2

uri

Alias for field number 0

class annif.corpus.SubjectFileCSV(path: str)

Bases: annif.corpus.types.SubjectCorpus

A multilingual subject vocabulary stored in a CSV file.

static is_csv_file(path: str) → bool

Return True if the path looks like a CSV file.

property languages: list[str]

Provide a list of language codes supported by this subject corpus.

save_skos(path: str) → None

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects: Generator

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectFileSKOS(path: str)

Bases: annif.corpus.types.SubjectCorpus

A subject corpus that uses SKOS files

PREF_LABEL_PROPERTIES = (rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#prefLabel'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'))

property concepts: Iterator[URIRef]

get_concept_labels(concept: URIRef, label_types: Sequence[URIRef]) → collections.defaultdict[str | None, list[str]]

Return all the labels of the given concept with the given label properties as a dict-like object where the keys are language codes and the values are lists of labels in that language.

static is_rdf_file(path: str) → bool

Return True if the path looks like an RDF file that can be loaded as SKOS.

property languages: set[str]

Provide the set of language codes supported by this subject corpus.

save_skos(path: str) → None

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects: Iterator[Subject]

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectFileTSV(path: str, language: str)

Bases: annif.corpus.types.SubjectCorpus

A monolingual subject vocabulary stored in a TSV file.

property languages: list[str]

Provide a list of language codes supported by this subject corpus.

save_skos(path: str) → None

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects: Generator

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectIndex

Bases: object

An index that remembers the associations between integer subject IDs and their URIs and labels.

property active: list[tuple[int, annif.corpus.types.Subject]]

Return a list of (subject_id, subject) tuples of all subjects that are not deprecated.

append(subject: annif.corpus.types.Subject) → None

by_label(label: str | None, language: str) → int | None

Return the subject ID of a subject by its label in a given language.

by_uri(uri: str, warnings: bool = True) → int | None

Return the subject ID of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.

contains_uri(uri: str) → bool

deprecated_ids() → list[int]

Return indices of deprecated subjects.

property languages: list[str] | None

classmethod load(path: str) → annif.corpus.subject.SubjectIndex

Load a subject index from a CSV file and return it.

load_subjects(corpus: annif.corpus.types.SubjectCorpus) → None

Initialize the subject index from a subject corpus.

save(path: str) → None

Save this subject index into a file with the given path name.

class annif.corpus.SubjectSet(subject_ids: Optional[Any] = None)

Bases: object

Represents a set of subjects for a document.

as_vector(size: int | None = None, destination: np.ndarray | None = None) → np.ndarray

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.

classmethod from_string(subj_data: str, subject_index: annif.corpus.subject.SubjectIndex, language: str) → annif.corpus.subject.SubjectSet

class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.