annif.corpus package
Submodules
annif.corpus.combine module
Class for combining multiple corpora so they behave like a single corpus
- class annif.corpus.combine.CombinedCorpus(corpora: list[DocumentFile])
Bases: DocumentCorpus
Class for combining multiple corpora so they behave like a single corpus
- property documents: chain
Iterate through the document corpus, yielding Document objects.
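The `chain` return type suggests that the combined corpus concatenates the document iterators of its members lazily. The following is a minimal sketch of that idea, not Annif's actual implementation; `ListCorpus` and `CombinedSketch` are hypothetical stand-ins introduced only for illustration.

```python
from itertools import chain


class ListCorpus:
    """Hypothetical stand-in for a corpus exposing a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class CombinedSketch:
    """Illustrative only: combines several corpora into one iterator."""

    def __init__(self, corpora):
        self._corpora = corpora

    @property
    def documents(self):
        # Chain the per-corpus iterators into a single lazy iterator.
        return chain.from_iterable(c.documents for c in self._corpora)


combined = CombinedSketch([ListCorpus(["a", "b"]), ListCorpus(["c"])])
combined_docs = list(combined.documents)
```

Because the iterators are chained rather than copied, no member corpus is read until iteration reaches it.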
annif.corpus.document module
Classes for supporting document corpora
- class annif.corpus.document.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)
Bases: DocumentCorpus
A directory of files as a full text document corpus
- class annif.corpus.document.DocumentFileCSV(path: str, subject_index: SubjectIndex, require_subjects=True)
Bases: DocumentCorpus
A CSV file as a corpus of documents with subjects
- property documents: Iterator[Document]
Iterate through the document corpus, yielding Document objects.
- static is_csv_file(path: str) -> bool
Return True if the path looks like a CSV file.
- class annif.corpus.document.DocumentFileJSONL(path: str, subject_index: SubjectIndex, language: str, require_subjects=True)
Bases: DocumentCorpus
A JSON Lines file as a corpus of documents with subjects
- property documents: Iterator[Document]
Iterate through the document corpus, yielding Document objects.
- static is_jsonl_file(path: str) -> bool
Return True if the path looks like a JSONL file.
- class annif.corpus.document.DocumentFileTSV(path: str, subject_index: SubjectIndex, require_subjects=True)
Bases: DocumentCorpus
A TSV file as a corpus of documents with subjects
- class annif.corpus.document.DocumentList(documents)
Bases: DocumentCorpus
A document corpus based on a list or other iterable of Document objects
- property documents
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)
Bases: DocumentCorpus
A document corpus that wraps another document corpus but limits the number of documents to a given limit
- property documents
Iterate through the document corpus, yielding Document objects.
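One way to picture the limiting wrapper is as a slice over the wrapped corpus's iterator. The sketch below assumes that behavior; `SimpleCorpus` and `LimitingSketch` are hypothetical names, and the real class may be implemented differently.

```python
from itertools import islice


class SimpleCorpus:
    """Hypothetical stand-in for a corpus exposing a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class LimitingSketch:
    """Illustrative only: yields at most `docs_limit` documents."""

    def __init__(self, corpus, docs_limit):
        self._corpus = corpus
        self._docs_limit = docs_limit

    @property
    def documents(self):
        # islice stops after docs_limit items without reading the rest.
        return islice(self._corpus.documents, self._docs_limit)


limited = LimitingSketch(SimpleCorpus(["a", "b", "c", "d"]), 2)
limited_docs = list(limited.documents)
```

If the limit exceeds the size of the wrapped corpus, the sketch simply yields every document.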
- class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)
Bases: DocumentCorpus
A document corpus that wraps another document corpus but transforms the documents using a given transform function
- property documents
Iterate through the document corpus, yielding Document objects.
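The transforming wrapper can be sketched as a generator that applies a function to each document as it is read. This example assumes the function maps one document to another; whether Annif's `transform_fn` receives the whole document or only its text is not stated here. `Doc` and `TransformingSketch` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Doc:
    """Hypothetical minimal document holding only text."""

    text: str


class TransformingSketch:
    """Illustrative only: applies transform_fn to each wrapped document."""

    def __init__(self, docs, transform_fn):
        self._docs = docs
        self._transform_fn = transform_fn

    @property
    def documents(self):
        # Transform lazily, one document at a time.
        for doc in self._docs:
            yield self._transform_fn(doc)


lowered = TransformingSketch(
    [Doc("Hello World"), Doc("ANNIF")],
    lambda d: replace(d, text=d.text.lower()),
)
lowered_texts = [d.text for d in lowered.documents]
```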
annif.corpus.json module
Support for document corpora in JSON format
- annif.corpus.json.json_file_to_document(filename: str, subject_index: SubjectIndex | None, language: str, require_subjects: bool) -> Document | None
- annif.corpus.json.json_to_document(filename: str, json_data: str, subject_index: SubjectIndex | None, language: str, require_subjects: bool) -> Document | None
annif.corpus.types module
Basic types for document corpora
- class annif.corpus.types.Document(text: str, subject_set: SubjectSet | None = None, metadata: dict[str, Any] | None = None, document_id: str | None = None, file_path: str | None = None)
Bases: object
- as_dict(subject_index: SubjectIndex, language: str) -> dict[str, Any]
- class annif.corpus.types.DocumentCorpus
Bases: object
Abstract base class for document corpora
- DOC_BATCH_SIZE = 32
- property doc_batches: Iterator[list[Document]]
Iterate through the document corpus in batches, yielding lists of Document objects.
- abstract property documents
Iterate through the document corpus, yielding Document objects.
- is_empty() -> bool
Check if there are no documents to iterate.
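The relationship between `documents`, `doc_batches`, `DOC_BATCH_SIZE`, and `is_empty()` can be sketched as follows. This is an assumption-based illustration rather than the library's code: `CorpusSketch` and `RangeCorpus` are hypothetical, and only the batch size of 32 comes from the listing above.

```python
from abc import ABC, abstractmethod
from itertools import islice

_MISSING = object()  # sentinel marking "no document available"


class CorpusSketch(ABC):
    """Illustrative only: base class with batching and emptiness check."""

    DOC_BATCH_SIZE = 32

    @property
    @abstractmethod
    def documents(self):
        """Yield documents one at a time."""

    @property
    def doc_batches(self):
        it = iter(self.documents)
        while True:
            # Take up to DOC_BATCH_SIZE documents; stop when none remain.
            batch = list(islice(it, self.DOC_BATCH_SIZE))
            if not batch:
                return
            yield batch

    def is_empty(self):
        # Empty if a fresh iterator yields nothing at all.
        return next(iter(self.documents), _MISSING) is _MISSING


class RangeCorpus(CorpusSketch):
    """Hypothetical corpus whose documents are the integers 0..n-1."""

    def __init__(self, n):
        self._n = n

    @property
    def documents(self):
        return iter(range(self._n))


batch_sizes = [len(b) for b in RangeCorpus(70).doc_batches]
```

With 70 documents the sketch produces two full batches and one partial batch, so the last batch may be shorter than `DOC_BATCH_SIZE`.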
- class annif.corpus.types.SubjectSet(subject_ids: Any | None = None)
Bases: object
Represents a set of subjects for a document.
- as_list(subject_index: SubjectIndex, language: str) -> list[dict[str:str]]
- as_vector(size: int | None = None, destination: np.ndarray | None = None) -> np.ndarray
Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
- classmethod from_string(subj_data: str, subject_index: SubjectIndex, language: str) -> SubjectSet
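The multilabel indicator format described for `as_vector` is a one-dimensional array with one position per subject, set where the subject is present. The sketch below illustrates the `size` and `destination` behavior under stated assumptions: `ids_to_vector` is a hypothetical helper, the subject IDs are assumed to be integer positions, and the boolean dtype is a guess.

```python
import numpy as np


def ids_to_vector(subject_ids, size=None, destination=None):
    """Illustrative only: mark the given integer subject IDs in a 1-D array."""
    if destination is None:
        # No array supplied: create a new one of the requested size.
        destination = np.zeros(size, dtype=bool)
    # Existing values in a supplied destination are left untouched.
    destination[list(subject_ids)] = True
    return destination


vec = ids_to_vector([1, 3], size=5)

# Reuse a preallocated buffer instead of allocating a new array.
buf = np.zeros(4, dtype=bool)
out = ids_to_vector([0], destination=buf)
```

Passing `destination` lets a caller fill one row of a larger preallocated matrix without creating a temporary array per document.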
Module contents
Annif corpus operations
- class annif.corpus.CombinedCorpus(corpora: list[DocumentFile])
Bases: DocumentCorpus
Class for combining multiple corpora so they behave like a single corpus
- property documents: chain
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.Document(text: str, subject_set: SubjectSet | None = None, metadata: dict[str, Any] | None = None, document_id: str | None = None, file_path: str | None = None)
Bases: object
- as_dict(subject_index: SubjectIndex, language: str) -> dict[str, Any]
- class annif.corpus.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)
Bases: DocumentCorpus
A directory of files as a full text document corpus
- class annif.corpus.DocumentFileCSV(path: str, subject_index: SubjectIndex, require_subjects=True)
Bases: DocumentCorpus
A CSV file as a corpus of documents with subjects
- property documents: Iterator[Document]
Iterate through the document corpus, yielding Document objects.
- static is_csv_file(path: str) -> bool
Return True if the path looks like a CSV file.
- class annif.corpus.DocumentFileJSONL(path: str, subject_index: SubjectIndex, language: str, require_subjects=True)
Bases: DocumentCorpus
A JSON Lines file as a corpus of documents with subjects
- property documents: Iterator[Document]
Iterate through the document corpus, yielding Document objects.
- static is_jsonl_file(path: str) -> bool
Return True if the path looks like a JSONL file.
- class annif.corpus.DocumentFileTSV(path: str, subject_index: SubjectIndex, require_subjects=True)
Bases: DocumentCorpus
A TSV file as a corpus of documents with subjects
- class annif.corpus.DocumentList(documents)
Bases: DocumentCorpus
A document corpus based on a list or other iterable of Document objects
- property documents
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)
Bases: DocumentCorpus
A document corpus that wraps another document corpus but limits the number of documents to a given limit
- property documents
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.SubjectSet(subject_ids: Any | None = None)
Bases: object
Represents a set of subjects for a document.
- as_list(subject_index: SubjectIndex, language: str) -> list[dict[str:str]]
- as_vector(size: int | None = None, destination: np.ndarray | None = None) -> np.ndarray
Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
- classmethod from_string(subj_data: str, subject_index: SubjectIndex, language: str) -> SubjectSet
- class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)
Bases: DocumentCorpus
A document corpus that wraps another document corpus but transforms the documents using a given transform function
- property documents
Iterate through the document corpus, yielding Document objects.