Corpus Readers¶
After a corpus has been imported into the library, users will want to access the data through a CorpusReader
object.
The CorpusReader
API follows the NLTK CorpusReader API paradigm.
It offers a way for users to access the documents, paragraphs, sentences, and words of all the available documents in a corpus, or a specified collection of documents.
Not every corpus will support every method, e.g. a corpus of inscriptions may not support paragraphs via a para
method but the corpus provider should try to provide the interfaces that they can.
Reading a Corpus¶
Use the get_corpus
method in the readers module.
In [1]: from cltk.corpus.readers import get_corpus_reader
In [2]: latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')
In [3]: len(list(latin_corpus.docs()))
Out[3]: 2141
In [4]: len(list(latin_corpus.paras()))
Out[4]: 212130
In [5]: len(list(latin_corpus.sents()))
Out[5]: 1038668
In [6]: len(list(latin_corpus.words()))
Out[6]: 16455728
Adding a Corpus to the CLTK Reader¶
Modify the cltk.corpus.readers module, updating SUPPORTED_CORPORA
, adding your language and the specific corpus name.
In the get_corpus_reader
method implement the checks and mappings to return a NLTK compliant CorpusReader
API object.
Providing Metadata for Corpus Filtration¶
If you’re adding a Corpus to CLTK, please also consider providing a genre mapping if you corpus is large or is easily segmented into genres. Consider creating a file containing mappings of categories to directories and files, e.g.:
In [1]: from cltk.corpus.latin.latin_library_corpus_types import corpus_directories_by_type
In [2]: corpus_directories_by_type.keys()
Out [2]: dict_keys(['republican', 'augustan', 'early_silver', 'late_silver', 'old', 'christian', 'medieval', 'renaissance', 'neo_latin', 'misc', 'early'])
In [3]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type
In [4]: corpus_directories_by_type.values()[:3]
Out [4]: [['./caesar', './lucretius', './nepos', './cicero'], ['./livy', './ovid', './horace', './vergil', './hyginus']]
In [5]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type
In [6]: list(corpus_texts_by_type.values())[:2]
Out [6]: [['sall.1.txt', 'sall.2.txt', 'sall.cotta.txt', 'sall.ep1.txt', 'sall.ep2.txt', 'sall.frag.txt', 'sall.invectiva.txt', 'sall.lep.txt', 'sall.macer.txt', 'sall.mithr.txt', 'sall.phil.txt', 'sall.pomp.txt', 'varro.frag.txt', 'varro.ll10.txt', 'varro.ll5.txt', 'varro.ll6.txt', 'varro.ll7.txt', 'varro.ll8.txt', 'varro.ll9.txt', 'varro.rr1.txt', 'varro.rr2.txt', 'varro.rr3.txt', 'sulpicia.txt'], ['resgestae.txt', 'resgestae1.txt', 'manilius1.txt', 'manilius2.txt', 'manilius3.txt', 'manilius4.txt', 'manilius5.txt', 'catullus.txt', 'vitruvius1.txt', 'vitruvius10.txt', 'vitruvius2.txt', 'vitruvius3.txt', 'vitruvius4.txt', 'vitruvius5.txt', 'vitruvius6.txt', 'vitruvius7.txt', 'vitruvius8.txt', 'vitruvius9.txt', 'propertius1.txt', 'tibullus1.txt', 'tibullus2.txt', 'tibullus3.txt']]
The mapping is a dictionary of genre types or periods, and the values are lists of files or directories for each type.
Helper Methods for Corpus Filtration¶
Users will typically construct a CorpusReader
by selecting category types of directories or files.
The assemble_corpus
method allows users to take a CorpusReader and filter the files used provide the data for the reader.
In [1]: from cltk.corpus.readers import assemble_corpus, get_corpus_reader
In [2]: from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type, corpus_directories_by_type
In [3]: latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')
In [4]: filtered_reader, fileids, catgories = assemble_corpus(latin_corpus, types_requested=['republican', 'augustan'], type_dirs=corpus_directories_by_type,
... type_files=corpus_texts_by_type)
In [5]: len(list(filtered_reader.docs()))
Out [5]: 510
In [6]: categories
Out [6]: {'republican', 'augustan'}
In [7]: len(fileids)
Out [7]: 510