corppa
This repository is research software developed as part of the Ends of Prosody project, which is associated with the Princeton Prosody Archive (PPA). The software focuses on research and work related to PPA full-text and page image corpora.
Documentation for this package is available at https://princeton-cdh.github.io/corppa/.
WARNING: This code is primarily for internal team use. Specific portions that may be useful are included in the Ends of Prosody utilities documentation, which was created for participants of the Ends of Prosody conference.
For early experimental work on this project that is not included in the corppa Python package, see https://github.com/Princeton-CDH/ppa-nlp-archive.
Basic Usage
Installation
Use pip to install as a Python package directly from GitHub. Use a branch or tag name, e.g. @develop or @0.1, if you need to install a specific version.
pip install git+https://github.com/Princeton-CDH/corppa.git#egg=corppa
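As a rough illustration of where the branch or tag name goes, the sketch below builds the pip requirement string with an optional ref placed between the repository URL and the #egg fragment. The helper names corppa_requirement and install_corppa are hypothetical and are not part of corppa; only the URL and the @develop and @0.1 refs come from the text above.

```python
import subprocess
import sys

# Repository URL taken from the install command above.
REPO_URL = "git+https://github.com/Princeton-CDH/corppa.git"


def corppa_requirement(ref=None):
    """Build the pip requirement string, optionally pinned to a branch or tag.

    Hypothetical helper for illustration; not provided by corppa.
    """
    pinned = f"{REPO_URL}@{ref}" if ref else REPO_URL
    return f"{pinned}#egg=corppa"


def install_corppa(ref=None):
    """Run pip for the current interpreter (needs network access and git)."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", corppa_requirement(ref)],
        check=True,
    )


# Show the requirement string for the develop branch without installing.
print(corppa_requirement("develop"))
# prints git+https://github.com/Princeton-CDH/corppa.git@develop#egg=corppa
```

Calling install_corppa("0.1") would pass the tag-pinned string to pip in the same way as typing the command by hand.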
Scripts
Installing corppa currently provides access to two command line scripts:
corppa-filter: For filtering a PPA page-level corpus. (Corresponds to corppa.utils.filter.py)
corppa-ocr: For generating OCR text for images using Google Vision API. (Corresponds to corppa.ocr.gvision_ocr.py, requires optional dependencies)
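One way to confirm that the two command line scripts are reachable after installation is to look them up on the PATH. This is a minimal sketch using only the standard library; the function name locate_scripts is hypothetical, and the check says nothing about whether the optional OCR dependencies are installed.

```python
import shutil

# Script names as listed above.
SCRIPTS = ("corppa-filter", "corppa-ocr")


def locate_scripts(names=SCRIPTS):
    """Map each script name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report where each script was found, if anywhere.
for name, path in locate_scripts().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a script is reported as not found, the environment in which corppa was installed may not be the active one.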
License
This project is licensed under the Apache 2.0 License.
(c) 2025 Trustees of Princeton University. Permission granted for non-commercial distribution online under a standard Open Source license.
Experimental Scripts
Experimental scripts associated with corppa are located within the scripts directory.
See this directory’s README for more detail.