annif.corpus package
Submodules
annif.corpus.combine module
Class for combining multiple corpora so they behave like a single corpus
- class annif.corpus.combine.CombinedCorpus(corpora: list[DocumentFile])
Bases:
DocumentCorpus
Class for combining multiple corpora so they behave like a single corpus
- property documents: chain
Iterate through the document corpus, yielding Document objects.
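The `chain` return type suggests that the combined corpus walks its member corpora one after another. The following standalone sketch illustrates that idea with `itertools.chain`; `ToyCorpus` and `ToyCombinedCorpus` are hypothetical stand-ins written for this example, not Annif classes, and the real implementation may differ.

```python
from collections import namedtuple
from itertools import chain

# Stand-in for annif.corpus.Document, which has the fields (text, subject_set).
Document = namedtuple("Document", ["text", "subject_set"])


class ToyCorpus:
    """Hypothetical minimal corpus exposing a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class ToyCombinedCorpus:
    """Hypothetical sketch: present several corpora as a single corpus."""

    def __init__(self, corpora):
        self.corpora = corpora

    @property
    def documents(self):
        # chain.from_iterable visits each corpus in order, lazily.
        return chain.from_iterable(c.documents for c in self.corpora)


first = ToyCorpus([Document("doc one", {1}), Document("doc two", {2})])
second = ToyCorpus([Document("doc three", {3})])
combined = ToyCombinedCorpus([first, second])
combined_texts = [doc.text for doc in combined.documents]
print(combined_texts)
```

Because the chaining is lazy, no documents are read from the second corpus until the first is exhausted.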
annif.corpus.document module
Classes for supporting document corpora
- class annif.corpus.document.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)
Bases:
DocumentCorpus
A directory of files as a full text document corpus
- class annif.corpus.document.DocumentFile(path: str, subject_index: SubjectIndex)
Bases:
DocumentCorpus
A TSV file as a corpus of documents with subjects
- class annif.corpus.document.DocumentList(documents)
Bases:
DocumentCorpus
A document corpus based on a list or other iterable of Document objects
- property documents
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)
Bases:
DocumentCorpus
A document corpus that wraps another document corpus but limits the number of documents to a given limit
- property documents
Iterate through the document corpus, yielding Document objects.
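A limiting wrapper of this kind can be sketched with `itertools.islice`. The classes below are hypothetical stand-ins for illustration only; they assume the wrapped corpus exposes a `documents` property and do not reproduce Annif's actual code.

```python
from itertools import islice


class ToyNumberedCorpus:
    """Hypothetical corpus that yields `count` placeholder documents."""

    def __init__(self, count):
        self.count = count

    @property
    def documents(self):
        return (f"document {i}" for i in range(self.count))


class ToyLimitingCorpus:
    """Hypothetical sketch: cap the number of documents from a wrapped corpus."""

    def __init__(self, corpus, docs_limit):
        self._corpus = corpus
        self.docs_limit = docs_limit

    @property
    def documents(self):
        # islice stops after docs_limit items without reading the rest.
        return islice(self._corpus.documents, self.docs_limit)


limited = ToyLimitingCorpus(ToyNumberedCorpus(1000), docs_limit=3)
limited_docs = list(limited.documents)
print(limited_docs)
```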
- class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)
Bases:
DocumentCorpus
A document corpus that wraps another document corpus but transforms the documents using a given transform function
- property documents
Iterate through the document corpus, yielding Document objects.
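The sketch below shows one way such a transforming wrapper could work. It assumes that the transform function receives the document text and returns new text, leaving the subject set untouched; that is an assumption made for this example, and all class names are hypothetical.

```python
from collections import namedtuple

# Stand-in for annif.corpus.Document, which has the fields (text, subject_set).
Document = namedtuple("Document", ["text", "subject_set"])


class ToyListCorpus:
    """Hypothetical corpus backed by a list of Document tuples."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)


class ToyTransformingCorpus:
    """Hypothetical sketch: apply transform_fn to each document's text."""

    def __init__(self, corpus, transform_fn):
        self._corpus = corpus
        self._transform_fn = transform_fn

    @property
    def documents(self):
        for doc in self._corpus.documents:
            # Assumption: only the text is transformed; subjects are kept.
            yield Document(self._transform_fn(doc.text), doc.subject_set)


source = ToyListCorpus([Document("  Mixed CASE text ", {7})])
cleaned = ToyTransformingCorpus(source, lambda text: text.strip().lower())
cleaned_docs = list(cleaned.documents)
print(cleaned_docs)
```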
annif.corpus.skos module
Support for subjects loaded from a SKOS/RDF file
- class annif.corpus.skos.SubjectFileSKOS(path: str)
Bases:
SubjectCorpus
A subject corpus that uses SKOS files
- PREF_LABEL_PROPERTIES = (rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#prefLabel'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'))
- property concepts: Iterator[URIRef]
- get_concept_labels(concept: URIRef, label_types: Sequence[URIRef]) collections.defaultdict[str | None, list[str]]
Return all the labels of the given concept with the given label properties as a dict-like object in which the keys are language codes and the values are lists of labels in that language.
- static is_rdf_file(path: str) bool
Return True if the path looks like an RDF file that can be loaded as SKOS.
- property languages: set[str]
Provide a list of language codes supported by this subject corpus.
- save_skos(path: str) None
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- annif.corpus.skos.serialize_subjects_to_skos(subjects: Iterator, path: str) None
Create a SKOS representation of the given subjects and serialize it into a SKOS/Turtle file with the given path name.
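To make the return shape of `get_concept_labels` concrete, the following standalone sketch groups labels by language code in a `collections.defaultdict`. It uses plain `(text, language)` tuples instead of rdflib literals, and the helper name `group_labels_by_language` is hypothetical.

```python
from collections import defaultdict


def group_labels_by_language(labels):
    """Group (text, language) pairs into {language: [text, ...]}.

    The language may be None for labels without a language tag,
    matching the `str | None` key type in the signature above.
    """
    by_lang = defaultdict(list)
    for text, lang in labels:
        by_lang[lang].append(text)
    return by_lang


grouped = group_labels_by_language(
    [("cat", "en"), ("kissa", "fi"), ("feline", "en"), ("Felis catus", None)]
)
print(dict(grouped))
```

Looking up a language with no labels in a `defaultdict(list)` returns an empty list rather than raising `KeyError`, which is convenient for callers that iterate over the result.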
annif.corpus.subject module
Classes for supporting subject corpora expressed as directories or files
- class annif.corpus.subject.SubjectFileCSV(path: str)
Bases:
SubjectCorpus
A multilingual subject vocabulary stored in a CSV file.
- static is_csv_file(path: str) bool
Return True if the path looks like a CSV file.
- property languages: list[str]
Provide a list of language codes supported by this subject corpus.
- save_skos(path: str) None
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects: Generator
Iterate through the subject corpus, yielding Subject objects.
- class annif.corpus.subject.SubjectFileTSV(path: str, language: str)
Bases:
SubjectCorpus
A monolingual subject vocabulary stored in a TSV file.
- property languages: list[str]
Provide a list of language codes supported by this subject corpus.
- save_skos(path: str) None
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects: Generator
Iterate through the subject corpus, yielding Subject objects.
- class annif.corpus.subject.SubjectIndex
Bases:
object
An index that remembers the associations between integer subject IDs and their URIs and labels.
- property active: list[tuple[int, Subject]]
Return a list of (subject_id, subject) tuples of all subjects that are not deprecated.
- by_label(label: str | None, language: str) int | None
Return the subject ID of a subject by its label in a given language.
- by_uri(uri: str, warnings: bool = True) int | None
Return the subject ID of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.
- contains_uri(uri: str) bool
- deprecated_ids() list[int]
Return the indices of deprecated subjects.
- property languages: list[str] | None
- classmethod load(path: str) SubjectIndex
Load a subject index from a CSV file and return it.
- load_subjects(corpus: SubjectCorpus) None
Initialize the subject index from a subject corpus
- save(path: str) None
Save this subject index into a file with the given path name.
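The lookup behaviour described for `by_uri`, `by_label` and `contains_uri` can be illustrated with a small in-memory index. This is a hypothetical sketch, not Annif's implementation: `ToySubjectIndex`, its `append` method, and the choice of a `{language: label}` dict for `labels` are assumptions made for the example.

```python
from collections import namedtuple

# Stand-in for annif.corpus.Subject, which has the fields (uri, labels, notation).
Subject = namedtuple("Subject", ["uri", "labels", "notation"])


class ToySubjectIndex:
    """Hypothetical sketch: map integer subject IDs to URIs and labels."""

    def __init__(self):
        self._subjects = []
        self._uri_idx = {}
        self._label_idx = {}

    def append(self, subject):
        # The subject ID is simply the position in the list.
        subject_id = len(self._subjects)
        self._subjects.append(subject)
        self._uri_idx[subject.uri] = subject_id
        # Assumption: labels is a dict of {language: label}.
        for language, label in subject.labels.items():
            self._label_idx[(label, language)] = subject_id

    def by_uri(self, uri):
        return self._uri_idx.get(uri)

    def by_label(self, label, language):
        return self._label_idx.get((label, language))

    def contains_uri(self, uri):
        return uri in self._uri_idx


index = ToySubjectIndex()
index.append(Subject("http://example.org/cat", {"en": "cat", "fi": "kissa"}, None))
index.append(Subject("http://example.org/dog", {"en": "dog", "fi": "koira"}, None))
print(index.by_uri("http://example.org/dog"), index.by_label("kissa", "fi"))
```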
- class annif.corpus.subject.SubjectSet(subject_ids: Any | None = None)
Bases:
object
Represents a set of subjects for a document.
- as_vector(size: int | None = None, destination: np.ndarray | None = None) np.ndarray
Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
- classmethod from_string(subj_data: str, subject_index: SubjectIndex, language: str) SubjectSet
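In the multilabel indicator format, a document's subjects become a fixed-length vector with a nonzero entry at each assigned subject ID. The sketch below illustrates the `size` and `destination` behaviour described for `as_vector`, using plain Python lists so that it runs without NumPy; the real method returns a NumPy array, and the function name `as_indicator_vector` is hypothetical.

```python
def as_indicator_vector(subject_ids, size=None, destination=None):
    """Return a 0/1 list with 1 at each index listed in subject_ids.

    If destination is given it is filled in place and returned;
    otherwise a new list of the given size is created.
    """
    if destination is None:
        if size is None:
            raise ValueError("either size or destination must be given")
        destination = [0] * size
    for subject_id in subject_ids:
        destination[subject_id] = 1
    return destination


fresh = as_indicator_vector({0, 2}, size=4)

buffer = [0, 0, 0]
reused = as_indicator_vector({1}, destination=buffer)
print(fresh, reused)
```

Reusing a destination buffer avoids allocating a new vector per document, which matters when building a large document-by-subject matrix row by row.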
annif.corpus.types module
Basic types for document and subject corpora
- class annif.corpus.types.Document(text, subject_set)
Bases:
tuple
- subject_set
Alias for field number 1
- text
Alias for field number 0
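The "Bases: tuple" and "Alias for field number" entries are what Sphinx generates for a named tuple. As a standalone illustration, the sketch below builds an equivalent type with `collections.namedtuple`; it mirrors the documented field order but is defined locally rather than imported from Annif.

```python
from collections import namedtuple

# Same field order as documented: text is field 0, subject_set is field 1.
Document = namedtuple("Document", ["text", "subject_set"])

doc = Document(text="A short text about cats.", subject_set={3, 17})

# Fields can be read by name or by position, and the tuple can be unpacked.
text, subject_set = doc
print(doc.text == doc[0], doc.subject_set == doc[1])
```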
- class annif.corpus.types.DocumentCorpus
Bases:
object
Abstract base class for document corpora
- DOC_BATCH_SIZE = 32
- property doc_batches: Iterator[list[Document]]
Iterate through the document corpus in batches, yielding lists of Document objects.
- abstract property documents
Iterate through the document corpus, yielding Document objects.
- is_empty() bool
Check if there are no documents to iterate.
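Batching and the emptiness check can both be expressed in terms of a plain document iterator. The following sketch shows one possible approach using `itertools.islice`; the function names are hypothetical module-level helpers, whereas in Annif these are members of `DocumentCorpus`, whose implementation may differ.

```python
from itertools import islice

DOC_BATCH_SIZE = 32  # Same value as the documented class attribute.


def doc_batches(documents, batch_size=DOC_BATCH_SIZE):
    """Yield lists of up to batch_size documents from an iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def is_empty(documents):
    """Return True if the iterable yields no documents at all."""
    sentinel = object()
    return next(iter(documents), sentinel) is sentinel


batch_sizes = [len(batch) for batch in doc_batches(range(70))]
print(batch_sizes, is_empty([]), is_empty(["doc"]))
```

The last batch may be shorter than `DOC_BATCH_SIZE` when the corpus size is not a multiple of it.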
- class annif.corpus.types.Subject(uri, labels, notation)
Bases:
tuple
- labels
Alias for field number 1
- notation
Alias for field number 2
- uri
Alias for field number 0
- class annif.corpus.types.SubjectCorpus
Bases:
object
Abstract base class for subject corpora
- abstract property languages
Provide a list of language codes supported by this subject corpus.
- abstract save_skos(path)
Save the contents of the subject corpus into a SKOS/Turtle file with the given path name.
- abstract property subjects
Iterate through the subject corpus, yielding Subject objects.
Module contents
Annif corpus operations
- class annif.corpus.CombinedCorpus(corpora: list[DocumentFile])
Bases:
DocumentCorpus
Class for combining multiple corpora so they behave like a single corpus
- property documents: chain
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.Document(text, subject_set)
Bases:
tuple
- subject_set
Alias for field number 1
- text
Alias for field number 0
- class annif.corpus.DocumentDirectory(path: str, subject_index: SubjectIndex | None = None, language: str | None = None, require_subjects: bool = False)
Bases:
DocumentCorpus
A directory of files as a full text document corpus
- class annif.corpus.DocumentFile(path: str, subject_index: SubjectIndex)
Bases:
DocumentCorpus
A TSV file as a corpus of documents with subjects
- class annif.corpus.DocumentList(documents)
Bases:
DocumentCorpus
A document corpus based on a list or other iterable of Document objects
- property documents
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)
Bases:
DocumentCorpus
A document corpus that wraps another document corpus but limits the number of documents to a given limit
- property documents
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.Subject(uri, labels, notation)
Bases:
tuple
- labels
Alias for field number 1
- notation
Alias for field number 2
- uri
Alias for field number 0
- class annif.corpus.SubjectFileCSV(path: str)
Bases:
SubjectCorpus
A multilingual subject vocabulary stored in a CSV file.
- static is_csv_file(path: str) bool
Return True if the path looks like a CSV file.
- property languages: list[str]
Provide a list of language codes supported by this subject corpus.
- save_skos(path: str) None
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects: Generator
Iterate through the subject corpus, yielding Subject objects.
- class annif.corpus.SubjectFileSKOS(path: str)
Bases:
SubjectCorpus
A subject corpus that uses SKOS files
- PREF_LABEL_PROPERTIES = (rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#prefLabel'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'))
- property concepts: Iterator[URIRef]
- get_concept_labels(concept: URIRef, label_types: Sequence[URIRef]) collections.defaultdict[str | None, list[str]]
Return all the labels of the given concept with the given label properties as a dict-like object in which the keys are language codes and the values are lists of labels in that language.
- static is_rdf_file(path: str) bool
Return True if the path looks like an RDF file that can be loaded as SKOS.
- property languages: set[str]
Provide a list of language codes supported by this subject corpus.
- save_skos(path: str) None
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- class annif.corpus.SubjectFileTSV(path: str, language: str)
Bases:
SubjectCorpus
A monolingual subject vocabulary stored in a TSV file.
- property languages: list[str]
Provide a list of language codes supported by this subject corpus.
- save_skos(path: str) None
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects: Generator
Iterate through the subject corpus, yielding Subject objects.
- class annif.corpus.SubjectIndex
Bases:
object
An index that remembers the associations between integer subject IDs and their URIs and labels.
- property active: list[tuple[int, Subject]]
Return a list of (subject_id, subject) tuples of all subjects that are not deprecated.
- by_label(label: str | None, language: str) int | None
Return the subject ID of a subject by its label in a given language.
- by_uri(uri: str, warnings: bool = True) int | None
Return the subject ID of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.
- contains_uri(uri: str) bool
- deprecated_ids() list[int]
Return the indices of deprecated subjects.
- property languages: list[str] | None
- classmethod load(path: str) SubjectIndex
Load a subject index from a CSV file and return it.
- load_subjects(corpus: SubjectCorpus) None
Initialize the subject index from a subject corpus
- save(path: str) None
Save this subject index into a file with the given path name.
- class annif.corpus.SubjectSet(subject_ids: Any | None = None)
Bases:
object
Represents a set of subjects for a document.
- as_vector(size: int | None = None, destination: np.ndarray | None = None) np.ndarray
Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format. Use destination array if given (not None), otherwise create and return a new one of the given size.
- classmethod from_string(subj_data: str, subject_index: SubjectIndex, language: str) SubjectSet
- class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)
Bases:
DocumentCorpus
A document corpus that wraps another document corpus but transforms the documents using a given transform function
- property documents
Iterate through the document corpus, yielding Document objects.