corppa

This repository is research software developed as part of the Ends of Prosody project, which is associated with the Princeton Prosody Archive (PPA). This software is particularly focused on research and work related to PPA full-text and page image corpora.

Documentation for this package is available at https://princeton-cdh.github.io/corppa/.

WARNING: This code is primarily for internal team use. Specific portions that may be useful are included in the Ends of Prosody utilities documentation, which was created for participants of the Ends of Prosody conference.

For early experimental work on this project that is not included in the corppa Python package, see https://github.com/Princeton-CDH/ppa-nlp-archive

Basic Usage

Installation

Use pip to install corppa as a Python package directly from GitHub. To install a specific version, add a branch or tag name after the repository URL, e.g. @develop or @0.1.

pip install git+https://github.com/Princeton-CDH/corppa.git#egg=corppa
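As a minimal sketch of how the branch or tag suffix fits into the install target, the Python snippet below builds the pip requirement string for a given ref and optionally runs pip with it. The function names build_install_spec and install_corppa are hypothetical helpers written for this illustration and are not part of corppa; the repository URL and the @develop / @0.1 examples come from the text above.

```python
import subprocess
import sys

# Repository URL taken from the install command in this README.
REPO_URL = "https://github.com/Princeton-CDH/corppa.git"


def build_install_spec(ref=None):
    """Return a pip requirement string for corppa.

    Hypothetical helper (not part of corppa). When `ref` is given
    (a branch or tag name such as "develop" or "0.1"), it is appended
    to the repository URL as "@ref".
    """
    pinned = f"@{ref}" if ref else ""
    return f"git+{REPO_URL}{pinned}#egg=corppa"


def install_corppa(ref=None):
    """Install corppa into the current interpreter's environment.

    Hypothetical helper; requires network access. Uses
    `python -m pip` so that the package lands in the same
    environment as the running interpreter.
    """
    subprocess.run(
        [sys.executable, "-m", "pip", "install", build_install_spec(ref)],
        check=True,
    )


if __name__ == "__main__":
    # Print the install targets without installing anything.
    print(build_install_spec())
    print(build_install_spec("develop"))
```

Calling install_corppa("develop") would be equivalent to running pip install with the @develop form of the URL on the command line.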

Scripts

Installing corppa currently provides access to two command line scripts:

corppa-filter: For filtering a PPA page-level corpus. (Corresponds to corppa.utils.filter.py)
corppa-ocr: For generating OCR text for images using the Google Vision API. (Corresponds to corppa.ocr.gvision_ocr.py, requires optional dependencies)
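One way to confirm that the two scripts listed above are available after installation is to look them up on the PATH. The sketch below uses only the standard library; find_corppa_scripts is a hypothetical helper for illustration, not part of corppa, and it makes no assumptions about the scripts' command line options.

```python
import shutil

# Script names as listed in this README.
SCRIPT_NAMES = ("corppa-filter", "corppa-ocr")


def find_corppa_scripts(names=SCRIPT_NAMES):
    """Map each script name to its resolved path, or None if it is not on PATH.

    Hypothetical helper (not part of corppa).
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_corppa_scripts().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A script reported as not found usually means corppa was installed into a different environment than the one currently active.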

License

This project is licensed under the Apache 2.0 License.

Experimental Scripts

Experimental scripts associated with corppa are located within the scripts directory. See this directory’s README for more detail.
