annif.corpus package

Submodules

annif.corpus.combine module

Class for combining multiple corpora so they behave like a single corpus

class annif.corpus.combine.CombinedCorpus(corpora)

Bases: annif.corpus.types.DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents

Iterate through the document corpus, yielding Document objects.

set_subject_index(subject_index)

Set a subject index for looking up labels that are necessary for conversion
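The combining behaviour described above can be pictured with a small standalone sketch. It does not import Annif: `ListCorpus` and `CombinedSketch` are hypothetical stand-ins, plain strings take the place of Document objects, and chaining the corpora in order is an assumption about how such a class could work, not a statement of Annif's implementation.

```python
from itertools import chain


class ListCorpus:
    """Hypothetical minimal corpus that exposes a `documents` property."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        yield from self._docs


class CombinedSketch:
    """Illustrative only: presents several corpora as a single one."""

    def __init__(self, corpora):
        self._corpora = list(corpora)

    @property
    def documents(self):
        # Lazily walk each corpus in turn, in the order they were given.
        return chain.from_iterable(c.documents for c in self._corpora)


first = ListCorpus(["cats and dogs"])
second = ListCorpus(["stars and planets", "rivers and lakes"])
combined = CombinedSketch([first, second])
texts = list(combined.documents)
```

Because the property returns a lazy iterator, no corpus is read until the caller starts iterating.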

annif.corpus.document module

Classes for supporting document corpora

class annif.corpus.document.DocumentDirectory(path, require_subjects=False)

Bases: annif.corpus.types.DocumentCorpus

A directory of files as a full text document corpus

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentFile(path)

Bases: annif.corpus.types.DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.DocumentList(documents)

Bases: annif.corpus.types.DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.document.LimitingDocumentCorpus(corpus, docs_limit)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but limits the number of documents to a given limit

property documents

Iterate through the document corpus, yielding Document objects.
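As a rough illustration of the limiting wrapper, the sketch below truncates an underlying corpus with `itertools.islice`. `RangeCorpus` and `LimitingSketch` are hypothetical names, strings stand in for Document objects, and the use of `islice` is an assumption made for the example.

```python
from itertools import islice


class RangeCorpus:
    """Hypothetical corpus that generates `n` placeholder documents."""

    def __init__(self, n):
        self._n = n

    @property
    def documents(self):
        for i in range(self._n):
            yield f"document {i}"


class LimitingSketch:
    """Illustrative wrapper that yields at most `docs_limit` documents."""

    def __init__(self, corpus, docs_limit):
        self._corpus = corpus
        self._docs_limit = docs_limit

    @property
    def documents(self):
        # islice stops after docs_limit items without reading the rest.
        return islice(self._corpus.documents, self._docs_limit)


limited = LimitingSketch(RangeCorpus(10), 3)
result = list(limited.documents)
```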

class annif.corpus.document.TransformingDocumentCorpus(corpus, transform_fn)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.
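The transforming wrapper can be sketched in the same spirit. The reference above does not say what `transform_fn` receives, so this example assumes it maps a document's text to new text and leaves the subjects untouched. `ListCorpus` and `TransformingSketch` are hypothetical, and `Document` is a local stand-in with the documented field order.

```python
from collections import namedtuple

# Stand-in for annif.corpus.types.Document, with the documented field order.
Document = namedtuple("Document", ["text", "uris", "labels"])


class ListCorpus:
    """Hypothetical helper that serves documents from an in-memory list."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        yield from self._docs


class TransformingSketch:
    """Illustrative wrapper; not Annif's actual implementation."""

    def __init__(self, corpus, transform_fn):
        self._corpus = corpus
        self._transform_fn = transform_fn

    @property
    def documents(self):
        # Assumption: transform_fn takes the text and returns new text.
        for doc in self._corpus.documents:
            yield doc._replace(text=self._transform_fn(doc.text))


source = ListCorpus(
    [Document("  Hello World  ", {"http://example.org/greeting"}, {"greeting"})]
)
cleaned = TransformingSketch(source, lambda text: text.strip().lower())
docs = list(cleaned.documents)
```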

annif.corpus.skos module

Support for subjects loaded from a SKOS/RDF file

class annif.corpus.skos.SubjectFileSKOS(path, language)

Bases: annif.corpus.types.SubjectCorpus

A subject corpus that uses SKOS files

property concepts

get_concept_labels(concept, label_types, language)

static is_rdf_file(path)

return True if the path looks like an RDF file that can be loaded as SKOS

save_skos(path, language)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

annif.corpus.skos.serialize_subjects_to_skos(subjects, language, path)

Create a SKOS representation of the given subjects and serialize it into a SKOS/Turtle file with the given path name.

annif.corpus.subject module

Classes for supporting subject corpora expressed as directories or files

class annif.corpus.subject.SubjectFileTSV(path)

Bases: object

A subject vocabulary stored in a TSV file.

save_skos(path, language)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

class annif.corpus.subject.SubjectIndex(corpus=None)

Bases: object

An index that remembers the associations between integer subject IDs and their URIs and labels.

property active

return a list of (subject_id, uri, label, notation) tuples of all subjects that are not deprecated

append(uri, label, notation)

by_label(label)

return the subject index of a subject by its label

by_uri(uri, warnings=True)

return the subject index of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.

contains_uri(uri)

deprecated_ids()

return indices of deprecated subjects

labels_to_uris(labels)

return a list of URIs corresponding to the given labels; unknown labels are ignored

classmethod load(path)

Load a subject index from a TSV file and return it.

save(path)

Save this subject index into a file.

uris_to_labels(uris)

return a list of labels corresponding to the given URIs; unknown URIs are ignored
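The lookups documented for SubjectIndex can be pictured as two parallel lists plus reverse dictionaries. The class below is a hypothetical stand-in written for this page, not Annif's code. It covers only `append`, `by_uri` and `labels_to_uris`, and omits warnings, deprecation handling and file I/O.

```python
class SubjectIndexSketch:
    """Illustrative index mapping integer IDs to URIs and labels."""

    def __init__(self):
        self._uris = []
        self._labels = []
        self._notations = []
        self._uri_idx = {}
        self._label_idx = {}

    def append(self, uri, label, notation):
        # The next free position in the lists becomes the subject ID.
        subject_id = len(self._uris)
        self._uri_idx[uri] = subject_id
        self._label_idx[label] = subject_id
        self._uris.append(uri)
        self._labels.append(label)
        self._notations.append(notation)

    def by_uri(self, uri):
        # Returns None when the URI is unknown, as the reference describes.
        return self._uri_idx.get(uri)

    def labels_to_uris(self, labels):
        # Unknown labels are skipped rather than raising an error.
        return [
            self._uris[self._label_idx[label]]
            for label in labels
            if label in self._label_idx
        ]


index = SubjectIndexSketch()
index.append("http://example.org/c1", "cats", None)
index.append("http://example.org/c2", "dogs", None)
uris = index.labels_to_uris(["dogs", "unicorns", "cats"])
```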

class annif.corpus.subject.SubjectSet(subj_data=None)

Bases: object

Represents a set of subjects for a document.

as_vector(subject_index, destination=None, warnings=True)

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format, using a subject index as the source of subjects. Use destination array if given (not None), otherwise create and return a new one. If warnings=True, log warnings for unknown URIs.
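A multilabel indicator vector has one position per subject in the index, set to 1 where the document has that subject and 0 elsewhere. The sketch below shows the idea with a plain Python list instead of the NumPy array the real method returns. The function name `as_indicator`, its parameters, and the assumption that a supplied destination is already zero-filled are all made up for this illustration.

```python
def as_indicator(uris, uri_to_id, n_subjects, destination=None):
    """Illustrative only: build a 0/1 vector with one slot per subject."""
    # Reuse the caller's buffer if given (assumed to be zero-filled),
    # otherwise create a fresh vector of zeros.
    vector = destination if destination is not None else [0] * n_subjects
    for uri in uris:
        subject_id = uri_to_id.get(uri)
        if subject_id is not None:  # unknown URIs are skipped
            vector[subject_id] = 1
    return vector


uri_to_id = {
    "http://example.org/c1": 0,
    "http://example.org/c2": 1,
    "http://example.org/c3": 2,
}
vector = as_indicator(
    {"http://example.org/c1", "http://example.org/c3", "http://example.org/unknown"},
    uri_to_id,
    n_subjects=3,
)

buffer = [0, 0, 0]
same = as_indicator({"http://example.org/c2"}, uri_to_id, 3, destination=buffer)
```

Passing a destination avoids allocating a new vector per document, which matters when many documents are vectorized in a loop.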

classmethod from_string(subj_data)

has_uris()

returns True if the URIs for all subjects are known

annif.corpus.types module

Basic types for document and subject corpora

class annif.corpus.types.Document(text, uris, labels)

Bases: tuple

labels

Alias for field number 2

text

Alias for field number 0

uris

Alias for field number 1
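The "Alias for field number" entries are what Sphinx prints for the fields of a named tuple. The sketch below reproduces the same layout with `collections.namedtuple` instead of importing Annif, so `Document` here is a local stand-in, and the example URI and label are invented.

```python
from collections import namedtuple

# Same field order as documented: text (0), uris (1), labels (2).
Document = namedtuple("Document", ["text", "uris", "labels"])

doc = Document(
    text="A short text about cats.",
    uris={"http://example.org/cats"},
    labels={"cats"},
)

# Fields can be read by name, by position, or by unpacking.
text, uris, labels = doc
first_field = doc[0]
```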

class annif.corpus.types.DocumentCorpus

Bases: object

Abstract base class for document corpora

abstract property documents

Iterate through the document corpus, yielding Document objects.

is_empty()

Check if there are no documents to iterate.

set_subject_index(subject_index)

Set a subject index for looking up labels that are necessary for conversion

class annif.corpus.types.Subject(uri, label, notation, text)

Bases: tuple

label

Alias for field number 1

notation

Alias for field number 2

text

Alias for field number 3

uri

Alias for field number 0

class annif.corpus.types.SubjectCorpus

Bases: object

Abstract base class for subject corpora

abstract property subjects

Iterate through the subject corpus, yielding Subject objects.

Module contents

Annif corpus operations

class annif.corpus.CombinedCorpus(corpora)

Bases: annif.corpus.types.DocumentCorpus

Class for combining multiple corpora so they behave like a single corpus

property documents

Iterate through the document corpus, yielding Document objects.

set_subject_index(subject_index)

Set a subject index for looking up labels that are necessary for conversion

class annif.corpus.Document(text, uris, labels)

Bases: tuple

labels

Alias for field number 2

text

Alias for field number 0

uris

Alias for field number 1

class annif.corpus.DocumentDirectory(path, require_subjects=False)

Bases: annif.corpus.types.DocumentCorpus

A directory of files as a full text document corpus

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentFile(path)

Bases: annif.corpus.types.DocumentCorpus

A TSV file as a corpus of documents with subjects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.DocumentList(documents)

Bases: annif.corpus.types.DocumentCorpus

A document corpus based on a list or other iterable of Document objects

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.LimitingDocumentCorpus(corpus, docs_limit)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but limits the number of documents to a given limit

property documents

Iterate through the document corpus, yielding Document objects.

class annif.corpus.Subject(uri, label, notation, text)

Bases: tuple

label

Alias for field number 1

notation

Alias for field number 2

text

Alias for field number 3

uri

Alias for field number 0

class annif.corpus.SubjectFileSKOS(path, language)

Bases: annif.corpus.types.SubjectCorpus

A subject corpus that uses SKOS files

property concepts

get_concept_labels(concept, label_types, language)

static is_rdf_file(path)

return True if the path looks like an RDF file that can be loaded as SKOS

save_skos(path, language)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

Iterate through the subject corpus, yielding Subject objects.

class annif.corpus.SubjectFileTSV(path)

Bases: object

A subject vocabulary stored in a TSV file.

save_skos(path, language)

Save the contents of the subject vocabulary into a SKOS/Turtle file with the given path name.

property subjects

class annif.corpus.SubjectIndex(corpus=None)

Bases: object

An index that remembers the associations between integer subject IDs and their URIs and labels.

property active

return a list of (subject_id, uri, label, notation) tuples of all subjects that are not deprecated

append(uri, label, notation)

by_label(label)

return the subject index of a subject by its label

by_uri(uri, warnings=True)

return the subject index of a subject by its URI, or None if not found. If warnings=True, log a warning message if the URI cannot be found.

contains_uri(uri)

deprecated_ids()

return indices of deprecated subjects

labels_to_uris(labels)

return a list of URIs corresponding to the given labels; unknown labels are ignored

classmethod load(path)

Load a subject index from a TSV file and return it.

save(path)

Save this subject index into a file.

uris_to_labels(uris)

return a list of labels corresponding to the given URIs; unknown URIs are ignored

class annif.corpus.SubjectSet(subj_data=None)

Bases: object

Represents a set of subjects for a document.

as_vector(subject_index, destination=None, warnings=True)

Return the hits as a one-dimensional NumPy array in sklearn multilabel indicator format, using a subject index as the source of subjects. Use destination array if given (not None), otherwise create and return a new one. If warnings=True, log warnings for unknown URIs.

classmethod from_string(subj_data)

has_uris()

returns True if the URIs for all subjects are known

class annif.corpus.TransformingDocumentCorpus(corpus, transform_fn)

Bases: annif.corpus.types.DocumentCorpus

A document corpus that wraps another document corpus but transforms the documents using a given transform function

property documents

Iterate through the document corpus, yielding Document objects.