annif.corpus package¶
Submodules¶
annif.corpus.combine module¶
Class for combining multiple corpora so they behave like a single corpus
- class annif.corpus.combine.CombinedCorpus(corpora)¶
Bases:
annif.corpus.types.DocumentCorpus
Class for combining multiple corpora so they behave like a single corpus
- property documents¶
Iterate through the document corpus, yielding Document objects.
- set_subject_index(subject_index)¶
Set a subject index for looking up labels that are necessary for conversion
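The idea of treating several corpora as one can be sketched with standard-library tools. This is a minimal illustration, not the library's implementation: `ListCorpus` and `CombinedSketch` are hypothetical stand-ins, and `Document` is rebuilt locally with the documented field order so the example runs without Annif installed.

```python
from collections import namedtuple
from itertools import chain

# Local stand-in for annif.corpus.types.Document (same field order as documented).
Document = namedtuple("Document", "text uris labels")


class ListCorpus:
    """Hypothetical minimal corpus: exposes a `documents` iterable."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        yield from self._docs


class CombinedSketch:
    """Illustrative only: chains the documents of several corpora."""

    def __init__(self, corpora):
        self._corpora = list(corpora)

    @property
    def documents(self):
        # Exhaust each corpus in turn, in the order the corpora were given.
        return chain.from_iterable(c.documents for c in self._corpora)


first = ListCorpus([Document("cats", {"http://example.org/c1"}, set())])
second = ListCorpus([Document("dogs", {"http://example.org/c2"}, set())])
combined = CombinedSketch([first, second])
texts = [doc.text for doc in combined.documents]
```

Because `documents` is recomputed on each access, the combined corpus can be iterated more than once, provided the underlying corpora allow it.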
annif.corpus.document module¶
Classes for supporting document corpora
- class annif.corpus.document.DocumentDirectory(path, require_subjects=False)¶
Bases:
annif.corpus.types.DocumentCorpus
A directory of files as a full text document corpus
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.document.DocumentFile(path)¶
Bases:
annif.corpus.types.DocumentCorpus
A TSV file as a corpus of documents with subjects
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.document.DocumentList(documents)¶
Bases:
annif.corpus.types.DocumentCorpus
A document corpus based on a list or other iterable of Document objects
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)¶
Bases:
annif.corpus.types.DocumentCorpus
A document corpus that wraps another document corpus but limits the number of documents to a given limit
- property documents¶
Iterate through the document corpus, yielding Document objects.
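Limiting a corpus to a fixed number of documents can be done lazily, without reading the whole source. The sketch below uses `itertools.islice` to show the idea; the helper names `limited` and `numbered_docs` are hypothetical, and the real class may be implemented differently.

```python
from itertools import islice


def limited(documents, docs_limit):
    """Yield at most `docs_limit` items from any iterable of documents."""
    return islice(documents, docs_limit)


def numbered_docs():
    """Hypothetical endless source, to show that nothing past the limit is read."""
    n = 0
    while True:
        yield f"document {n}"
        n += 1


first_three = list(limited(numbered_docs(), 3))
```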
- class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)¶
Bases:
annif.corpus.types.DocumentCorpus
A document corpus that wraps another document corpus but transforms the documents using a given transform function
- property documents¶
Iterate through the document corpus, yielding Document objects.
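The reference text does not say what `transform_fn` receives. The sketch below assumes it is applied to each document's text while the URIs and labels pass through unchanged; treat that as an assumption to check against the source. `Document` is a local stand-in and `transformed` is a hypothetical helper.

```python
from collections import namedtuple

# Local stand-in for annif.corpus.types.Document (same field order as documented).
Document = namedtuple("Document", "text uris labels")


def transformed(documents, transform_fn):
    """Yield copies of each document with transform_fn applied to its text.

    Assumption: the transform operates on the text field only.
    """
    for doc in documents:
        yield doc._replace(text=transform_fn(doc.text))


docs = [Document("  Hello World  ", {"http://example.org/c1"}, {"greeting"})]
result = list(transformed(docs, lambda text: text.strip().lower()))
```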
annif.corpus.skos module¶
Support for subjects loaded from a SKOS/RDF file
- class annif.corpus.skos.SubjectFileSKOS(path, language)¶
Bases:
annif.corpus.types.SubjectCorpus
A subject corpus that uses SKOS files
- property concepts¶
- get_concept_labels(concept, label_types, language)¶
- static is_rdf_file(path)¶
Return True if the path looks like an RDF file that can be loaded as SKOS.
- save_skos(path, language)¶
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects¶
Iterate through the subject corpus, yielding Subject objects.
- annif.corpus.skos.serialize_subjects_to_skos(subjects, language, path)¶
Create a SKOS representation of the given subjects and serialize it into a SKOS/Turtle file with the given path name.
annif.corpus.subject module¶
Classes for supporting subject corpora expressed as directories or files
- class annif.corpus.subject.SubjectFileTSV(path)¶
Bases:
object
A subject vocabulary stored in a TSV file.
- save_skos(path, language)¶
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects¶
- class annif.corpus.subject.SubjectIndex(corpus=None)¶
Bases:
object
An index that remembers the associations between integer subject IDs and their URIs and labels.
- property active¶
Return a list of (subject_id, uri, label, notation) tuples for all subjects that are not deprecated.
- append(uri, label, notation)¶
- by_label(label)¶
Return the subject index of a subject by its label.
- by_uri(uri, warnings=True)¶
Return the subject index of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.
- contains_uri(uri)¶
- deprecated_ids()¶
Return the indices of deprecated subjects.
- labels_to_uris(labels)¶
Return a list of URIs corresponding to the given labels; unknown labels are ignored.
- classmethod load(path)¶
Load a subject index from a TSV file and return it.
- save(path)¶
Save this subject index into a file.
- uris_to_labels(uris)¶
Return a list of labels corresponding to the given URIs; unknown URIs are ignored.
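The lookup behaviour described for the subject index (integer IDs assigned in insertion order, `None` for an unknown URI, unknown labels silently skipped) can be sketched as follows. `SubjectIndexSketch` is a hypothetical stand-in, not the library class: it omits notations, deprecation, logging and persistence, and it assumes `by_label` also returns `None` for an unknown label, which the reference text does not state.

```python
class SubjectIndexSketch:
    """Illustrative two-way mapping between integer IDs, URIs and labels."""

    def __init__(self):
        self._uris = []
        self._labels = []
        self._uri_idx = {}
        self._label_idx = {}

    def append(self, uri, label):
        # The next free position becomes the subject's integer ID.
        subject_id = len(self._uris)
        self._uri_idx[uri] = subject_id
        self._label_idx[label] = subject_id
        self._uris.append(uri)
        self._labels.append(label)

    def by_uri(self, uri):
        # None signals "not found", as documented for by_uri.
        return self._uri_idx.get(uri)

    def by_label(self, label):
        # Assumption: unknown labels also map to None.
        return self._label_idx.get(label)

    def labels_to_uris(self, labels):
        # Unknown labels are dropped rather than raising an error.
        ids = (self.by_label(label) for label in labels)
        return [self._uris[i] for i in ids if i is not None]


index = SubjectIndexSketch()
index.append("http://example.org/c1", "cats")
index.append("http://example.org/c2", "dogs")
uris = index.labels_to_uris(["dogs", "unknown", "cats"])
```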
- class annif.corpus.subject.SubjectSet(subj_data=None)¶
Bases:
object
Represents a set of subjects for a document.
- as_vector(subject_index, destination=None, warnings=True)¶
Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format, using a subject index as the source of subjects. Use destination array if given (not None), otherwise create and return a new one. If warnings=True, log warnings for unknown URIs.
- classmethod from_string(subj_data)¶
- has_uris()¶
Return True if the URIs for all subjects are known.
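The multilabel indicator format mentioned for `as_vector` has one position per subject in the index, set where the document has that subject. The sketch below illustrates the layout and the optional `destination` buffer. To stay dependency-free it uses a plain Python list where the real method returns a NumPy array, and a plain dict (`uri_to_id`, hypothetical) where the real method takes a subject index.

```python
def as_indicator(subject_uris, uri_to_id, destination=None):
    """Mark the positions of the given subject URIs in an indicator vector."""
    if destination is None:
        # No buffer supplied: create a new all-False vector.
        destination = [False] * len(uri_to_id)
    for uri in subject_uris:
        subject_id = uri_to_id.get(uri)
        if subject_id is not None:  # unknown URIs are skipped
            destination[subject_id] = True
    return destination


uri_to_id = {
    "http://example.org/c1": 0,
    "http://example.org/c2": 1,
    "http://example.org/c3": 2,
}
vector = as_indicator(
    {"http://example.org/c3", "http://example.org/unknown"}, uri_to_id
)

# Reusing a preallocated buffer avoids creating a new vector per document.
buffer = [False] * len(uri_to_id)
same = as_indicator({"http://example.org/c1"}, uri_to_id, destination=buffer)
```

Note that this sketch does not clear a supplied buffer before writing, so the caller is responsible for resetting it between documents.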
annif.corpus.types module¶
Basic types for document and subject corpora
- class annif.corpus.types.Document(text, uris, labels)¶
Bases:
tuple
- labels¶
Alias for field number 2
- text¶
Alias for field number 0
- uris¶
Alias for field number 1
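The "Alias for field number N" entries indicate that Document behaves like a named tuple: each field can be read by name or by position. A small sketch, using a stand-in built with `collections.namedtuple` rather than the library's own definition, and made-up example values:

```python
from collections import namedtuple

# Stand-in with the documented field order: text (0), uris (1), labels (2).
Document = namedtuple("Document", ["text", "uris", "labels"])

doc = Document(
    text="A short text about cats.",
    uris={"http://example.org/c1"},
    labels={"cats"},
)

# Fields can be unpacked positionally, like any tuple.
text, uris, labels = doc
```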
- class annif.corpus.types.DocumentCorpus¶
Bases:
object
Abstract base class for document corpora
- abstract property documents¶
Iterate through the document corpus, yielding Document objects.
- is_empty()¶
Check if there are no documents to iterate.
- set_subject_index(subject_index)¶
Set a subject index for looking up labels that are necessary for conversion
Module contents¶
Annif corpus operations
- class annif.corpus.CombinedCorpus(corpora)¶
Bases:
annif.corpus.types.DocumentCorpus
Class for combining multiple corpora so they behave like a single corpus
- property documents¶
Iterate through the document corpus, yielding Document objects.
- set_subject_index(subject_index)¶
Set a subject index for looking up labels that are necessary for conversion
- class annif.corpus.Document(text, uris, labels)¶
Bases:
tuple
- labels¶
Alias for field number 2
- text¶
Alias for field number 0
- uris¶
Alias for field number 1
- class annif.corpus.DocumentDirectory(path, require_subjects=False)¶
Bases:
annif.corpus.types.DocumentCorpus
A directory of files as a full text document corpus
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.DocumentFile(path)¶
Bases:
annif.corpus.types.DocumentCorpus
A TSV file as a corpus of documents with subjects
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.DocumentList(documents)¶
Bases:
annif.corpus.types.DocumentCorpus
A document corpus based on a list or other iterable of Document objects
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)¶
Bases:
annif.corpus.types.DocumentCorpus
A document corpus that wraps another document corpus but limits the number of documents to a given limit
- property documents¶
Iterate through the document corpus, yielding Document objects.
- class annif.corpus.Subject(uri, label, notation, text)¶
Bases:
tuple
- label¶
Alias for field number 1
- notation¶
Alias for field number 2
- text¶
Alias for field number 3
- uri¶
Alias for field number 0
- class annif.corpus.SubjectFileSKOS(path, language)¶
Bases:
annif.corpus.types.SubjectCorpus
A subject corpus that uses SKOS files
- property concepts¶
- get_concept_labels(concept, label_types, language)¶
- static is_rdf_file(path)¶
Return True if the path looks like an RDF file that can be loaded as SKOS.
- save_skos(path, language)¶
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects¶
Iterate through the subject corpus, yielding Subject objects.
- class annif.corpus.SubjectFileTSV(path)¶
Bases:
object
A subject vocabulary stored in a TSV file.
- save_skos(path, language)¶
Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.
- property subjects¶
- class annif.corpus.SubjectIndex(corpus=None)¶
Bases:
object
An index that remembers the associations between integer subject IDs and their URIs and labels.
- property active¶
Return a list of (subject_id, uri, label, notation) tuples for all subjects that are not deprecated.
- append(uri, label, notation)¶
- by_label(label)¶
Return the subject index of a subject by its label.
- by_uri(uri, warnings=True)¶
Return the subject index of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.
- contains_uri(uri)¶
- deprecated_ids()¶
Return the indices of deprecated subjects.
- labels_to_uris(labels)¶
Return a list of URIs corresponding to the given labels; unknown labels are ignored.
- classmethod load(path)¶
Load a subject index from a TSV file and return it.
- save(path)¶
Save this subject index into a file.
- uris_to_labels(uris)¶
Return a list of labels corresponding to the given URIs; unknown URIs are ignored.
- class annif.corpus.SubjectSet(subj_data=None)¶
Bases:
object
Represents a set of subjects for a document.
- as_vector(subject_index, destination=None, warnings=True)¶
Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format, using a subject index as the source of subjects. Use destination array if given (not None), otherwise create and return a new one. If warnings=True, log warnings for unknown URIs.
- classmethod from_string(subj_data)¶
- has_uris()¶
Return True if the URIs for all subjects are known.
- class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)¶
Bases:
annif.corpus.types.DocumentCorpus
A document corpus that wraps another document corpus but transforms the documents using a given transform function
- property documents¶
Iterate through the document corpus, yielding Document objects.