annif.corpus package

Submodules

annif.corpus.combine module

Class for combining multiple corpora so they behave like a single corpus

class annif.corpus.combine.CombinedCorpus(corpora: list[DocumentFile])

Bases: DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents: chain

Iterate through the document corpus, yielding Document objects.
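
The `chain` return type suggests the member corpora are concatenated lazily, one after another. A minimal sketch of that idea, assuming only that each corpus exposes a `documents` iterable; `ListCorpus` and `CombinedSketch` are hypothetical stand-ins, not Annif classes:

```python
from itertools import chain


class ListCorpus:
    """Hypothetical stand-in for a corpus: exposes a `documents` iterable."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class CombinedSketch:
    """Hypothetical sketch: chains the `documents` of several corpora."""

    def __init__(self, corpora):
        self._corpora = corpora

    @property
    def documents(self):
        # chain.from_iterable walks each corpus in order, without
        # materializing all documents in memory first
        return chain.from_iterable(c.documents for c in self._corpora)


combined = CombinedSketch([ListCorpus(["a", "b"]), ListCorpus(["c"])])
combined_docs = list(combined.documents)
```

Because the property builds a fresh chain on every access, the combined corpus can be iterated more than once as long as the member corpora can.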

annif.corpus.document module

Classes for supporting document corpora

class annif.corpus.document.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)

Bases: DocumentCorpus

A directory of files as a full text document corpus

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentFileCSV(path: str, subject_index: SubjectIndex, require_subjects=True)

Bases: DocumentCorpus

A CSV file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

static is_csv_file(path: str) → bool

Return True if the path looks like a CSV file.

class annif.corpus.document.DocumentFileJSONL(path: str, subject_index: SubjectIndex, language: str, require_subjects=True)

Bases: DocumentCorpus

A JSON Lines file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

static is_jsonl_file(path: str) → bool

Return True if the path looks like a JSONL file.

class annif.corpus.document.DocumentFileTSV(path: str, subject_index: SubjectIndex, require_subjects=True)

Bases: DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentList(documents)

Bases: DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)

Bases: DocumentCorpus

A document corpus that wraps another document corpus but limits the number of documents to a given limit

property documents

Iterate through the document corpus, yielding Document objects.
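
Limiting a corpus amounts to cutting off iteration after a fixed number of documents. A rough sketch with `itertools.islice`; the helper name `limit_documents` is hypothetical, and this is not necessarily how the class is implemented:

```python
from itertools import islice


def limit_documents(documents, docs_limit):
    """Yield at most `docs_limit` items from any iterable of documents."""
    # islice stops early, so the wrapped iterable is never fully consumed
    return islice(documents, docs_limit)


limited = list(limit_documents(iter(["d1", "d2", "d3", "d4"]), 2))
```

This is convenient for quick experiments on a large corpus, since only the first `docs_limit` documents are ever read.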

class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)

Bases: DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.
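
The wrapping described above can be pictured as a lazy map over the underlying documents. A small sketch, assuming `transform_fn` takes one document and returns one document; plain strings stand in for Document objects, and `transform_documents` is a hypothetical helper:

```python
def transform_documents(documents, transform_fn):
    """Lazily apply `transform_fn` to each document of the wrapped iterable."""
    for doc in documents:
        yield transform_fn(doc)


# Illustrative transform: lowercase the text and keep the first 5 characters
shortened = list(
    transform_documents(["Alpha beta", "Gamma"], lambda text: text.lower()[:5])
)
```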

annif.corpus.json module

Support for document corpora in JSON format

annif.corpus.json.json_file_to_document(filename: str, subject_index: SubjectIndex | None, language: str, require_subjects: bool) → Document | None

annif.corpus.json.json_to_document(filename: str, json_data: str, subject_index: SubjectIndex | None, language: str, require_subjects: bool) → Document | None

annif.corpus.types module

Basic types for document corpora

class annif.corpus.types.Document(text: str, subject_set: SubjectSet | None = None, metadata: dict[str, Any] | None = None, document_id: str | None = None, file_path: str | None = None)

Bases: object

as_dict(subject_index: SubjectIndex, language: str) → dict[str, Any]

class annif.corpus.types.DocumentCorpus

Bases: object

Abstract base class for document corpora

DOC_BATCH_SIZE = 32

property doc_batches: Iterator[list[Document]]

Iterate through the document corpus in batches, yielding lists of Document objects.
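
Batched iteration of this kind can be sketched as follows. The batch size of 32 comes from `DOC_BATCH_SIZE` above; the standalone function and its `batch_size` parameter are assumptions made for illustration, and the final batch may be shorter than the others:

```python
from itertools import islice

DOC_BATCH_SIZE = 32  # same value as the class attribute documented above


def doc_batches(documents, batch_size=DOC_BATCH_SIZE):
    """Yield lists of up to `batch_size` documents from an iterable."""
    it = iter(documents)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            # The underlying iterator is exhausted
            return
        yield batch


# 70 placeholder documents split into batches of 32, 32 and 6
batch_sizes = [len(batch) for batch in doc_batches(range(70))]
```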

abstract property documents

Iterate through the document corpus, yielding Document objects.

is_empty() → bool

Check if there are no documents to iterate.

class annif.corpus.types.SubjectSet(subject_ids: Any | None = None)

Bases: object

Represents a set of subjects for a document.

as_list(subject_index: SubjectIndex, language: str) → list[dict[str, str]]

as_vector(size: int | None = None, destination: np.ndarray | None = None) → np.ndarray

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
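
In the multilabel indicator format, position i of the vector is set when subject i is assigned to the document. The sketch below illustrates that layout with NumPy. The helper name `ids_to_vector`, the boolean dtype, and the choice to write into `destination` without clearing it first are all assumptions, not details confirmed by this reference:

```python
import numpy as np


def ids_to_vector(subject_ids, size=None, destination=None):
    """Hypothetical helper: mark each subject ID in an indicator vector."""
    if destination is None:
        # Assumption: a boolean vector; the reference does not state the dtype
        destination = np.zeros(size, dtype=bool)
    # Convert to an integer index array so an empty ID list is also valid
    index = np.asarray(list(subject_ids), dtype=np.intp)
    destination[index] = True
    return destination


# Subjects 1 and 3 assigned, out of a vocabulary of 5 subjects
vec = ids_to_vector([1, 3], size=5)

# Reusing a preallocated buffer avoids a new allocation per document
buf = np.zeros(4, dtype=bool)
out = ids_to_vector([0], destination=buf)
```

Passing a `destination` is mainly useful when filling one row of a larger preallocated matrix per document.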

classmethod from_string(subj_data: str, subject_index: SubjectIndex, language: str) → SubjectSet

Module contents

Annif corpus operations

class annif.corpus.CombinedCorpus(corpora: list[DocumentFile])

Bases: DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents: chain

Iterate through the document corpus, yielding Document objects.

class annif.corpus.Document(text: str, subject_set: SubjectSet | None = None, metadata: dict[str, Any] | None = None, document_id: str | None = None, file_path: str | None = None)

Bases: object

as_dict(subject_index: SubjectIndex, language: str) → dict[str, Any]

class annif.corpus.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)

Bases: DocumentCorpus

A directory of files as a full text document corpus

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentFileCSV(path: str, subject_index: SubjectIndex, require_subjects=True)

Bases: DocumentCorpus

A CSV file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

static is_csv_file(path: str) → bool

Return True if the path looks like a CSV file.

class annif.corpus.DocumentFileJSONL(path: str, subject_index: SubjectIndex, language: str, require_subjects=True)

Bases: DocumentCorpus

A JSON Lines file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

static is_jsonl_file(path: str) → bool

Return True if the path looks like a JSONL file.

class annif.corpus.DocumentFileTSV(path: str, subject_index: SubjectIndex, require_subjects=True)

Bases: DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents: Iterator[Document]

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentList(documents)

Bases: DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)

Bases: DocumentCorpus

A document corpus that wraps another document corpus but limits the number of documents to a given limit

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.SubjectSet(subject_ids: Any | None = None)

Bases: object

Represents a set of subjects for a document.

as_list(subject_index: SubjectIndex, language: str) → list[dict[str, str]]

as_vector(size: int | None = None, destination: np.ndarray | None = None) → np.ndarray

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.

classmethod from_string(subj_data: str, subject_index: SubjectIndex, language: str) → SubjectSet

class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)

Bases: DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.