Code Documentation¶
OCR¶
This script OCRs images using the Google Vision API.
- corppa.ocr.gvision_ocr.ocr_image_via_gvision(gvision_client: google_vision.ImageAnnotatorClient, input_image: Path, out_txt: Path, out_json: Path) → None ¶
Perform OCR for the input image using the Google Cloud Vision API via the provided client. The plaintext output and JSON response of the OCR call are written to the out_txt and out_json paths respectively.
- corppa.ocr.gvision_ocr.ocr_images(in_dir: Path, out_dir: Path, exts: Iterable[str], ocr_limit: int = 0, show_progress: bool = True) → dict[str, int] ¶
OCR images in in_dir with extension exts to out_dir. If ocr_limit > 0, stop after OCRing ocr_limit images.
Returns a map structure reporting the number of images OCR’d and skipped.
- corppa.ocr.gvision_ocr.ocr_volumes(vol_ids: list[str], in_dir: Path, out_dir: Path, exts: Iterable[str], ocr_limit: int = 0, show_progress: bool = True) → None ¶
OCR images for volumes vol_ids with extension exts to out_dir. Assumes in_dir follows the PPA directory conventions (see corppa.utils.path_utils for more details). If ocr_limit > 0, stop after OCRing ocr_limit images.
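For illustration, a minimal sketch of calling these functions from Python, based on the signatures above (directory paths, extensions, and the volume id are placeholders):
from pathlib import Path
from corppa.ocr.gvision_ocr import ocr_images, ocr_volumes

# OCR up to 10 matching images found under the input directory,
# writing plaintext and JSON output to the output directory
counts = ocr_images(Path("page_images"), Path("ocr_output"), exts=[".jpg", ".png"], ocr_limit=10)
print(counts)  # dict reporting images OCR'd and skipped; exact keys not shown here

# OCR all images for specific volumes; assumes the input directory
# follows the PPA directory conventions
ocr_volumes(["placeholder-vol-id"], Path("ppa_images"), Path("ocr_output"), exts=[".jpg"])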
Collate Texts¶
Script to turn directories containing multiple text files into a single JSON
file containing text contents of all files with page numbers based
on text filenames. (Page number logic is currently Gale-specific).
Note: This was used to create work-level text corpora files after running OCR.
Example usage:
python collate_txt.py top-level-input-dir top-level-output-dir
Utils¶
Filter Utility¶
Utility for filtering PPA full-text corpus to work with a subset of pages.
- Currently supports the following types of filtering:
List of PPA work ids (as a text file, one id per line)
CSV file specifying work pages by digital page number (one page per line)
Filtering by key-value pair for either inclusion or exclusion
These filtering options can be combined, generally as a logical AND. Pages filtered by work ids or page numbers will be further filtered by the key-value logic. In cases where both work- and page-level filtering occurs, works not specified in the page filtering are included in full. Works that are specified in both will be limited to the pages specified in page-level filtering.
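As a rough illustration of how these filters combine (not the utility's actual implementation; the work_id and order field names for page records are assumptions):
def keep_page(page, work_ids=None, page_filter=None, include=None, exclude=None):
    # page: dict for a single page; work_ids: set of PPA work ids;
    # page_filter: dict mapping work id -> set of digital page numbers;
    # include / exclude: dicts of key-value pairs
    wid = page["work_id"]
    if work_ids is not None or page_filter is not None:
        in_works = work_ids is not None and wid in work_ids
        in_pages = page_filter is not None and wid in page_filter
        if not (in_works or in_pages):
            return False
        # works in the page-level filter are limited to the listed pages;
        # works only in the id list are included in full
        if in_pages and page["order"] not in page_filter[wid]:
            return False
    # key-value filtering is applied on top (logical AND)
    if include and any(page.get(k) != v for k, v in include.items()):
        return False
    if exclude and any(page.get(k) == v for k, v in exclude.items()):
        return False
    return True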
Filter methods can be run via command-line or python code. Filtering takes a jsonl file
(compressed or not) as input, and will produce a jsonl file (compressed or not) as output.
The input and output filenames can use any extension supported by orjsonl, with or without compression (e.g. .jsonl, .jsonl.gz, .jsonl.bz2).
Example command line usages:
corppa-filter path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --idfile my_ids.txt
corppa-filter path/to/ppa_pages.jsonl output/ppa_subset_pages.jsonl --pg-file pages.csv --include key=value
Path Utilities¶
General-purpose methods for working with paths, PPA identifiers, and directories
- corppa.utils.path_utils.decode_htid(encoded_htid: str) → str ¶
Return original HathiTrust volume identifier from encoded version:
[library id].[encoded volume id]
Specifically, the volume-portion of the id undergoes the following character replacements:
"+" --> ":", "=" --> "/", "," --> "."
- corppa.utils.path_utils.encode_htid(htid: str) → str ¶
Returns the “clean” version of a HathiTrust volume identifier with the form:
[library id].[volume id]
Specifically, the volume-portion of the id undergoes the following character replacements:
":" --> "+", "/" --> "=", "." --> ","
- corppa.utils.path_utils.find_relative_paths(base_dir: Path, exts: Iterable[str], follow_symlinks: bool = True, group_by_dir: bool = False) → Iterator[Path] | Iterator[tuple[Path, list[Path]]] ¶
This method finds files anywhere under the specified base directory that match any of the specified file extensions (case insensitive), and returns a generator of path objects with a path relative to the base directory. File extensions should include the leading period, i.e. [".jpg", ".tiff"] rather than ["jpg", "tiff"].
For example, given a base directory a/b/c/images, an extension list of .jpg, and files nested at different levels in the hierarchy (a/b/c/images/alpha.jpg, a/b/c/images/d/beta.jpg):
a/b/c/images
|-- alpha.jpg
+-- d
    |-- beta.jpg
The result will include the two items alpha.jpg and d/beta.jpg.
When group_by_dir is True, resulting files will be returned grouped by the parent directory. The return result is a tuple of a single pathlib.Path object for the directory and a list of pathlib.Path objects for the files in that directory that match the specified extensions. Given a hierarchy like this:
images/vol-a/
|-- alpha.jpg
|-- beta.jpg
the method would return (vol-a, [alpha.jpg, beta.jpg]).
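A small usage sketch based on the example hierarchies above:
from pathlib import Path
from corppa.utils.path_utils import find_relative_paths

# iterate over matching files, relative to the base directory
for relpath in find_relative_paths(Path("a/b/c/images"), [".jpg"]):
    print(relpath)  # alpha.jpg, then d/beta.jpg

# group matching files by their parent directory
for dir_path, files in find_relative_paths(Path("images"), [".jpg"], group_by_dir=True):
    print(dir_path, files)  # e.g. vol-a [alpha.jpg, beta.jpg]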
- corppa.utils.path_utils.get_image_relpath(work_id: str, page_num: int) → Path ¶
Get the (relative) image path for the specified PPA work page.
- corppa.utils.path_utils.get_page_number(pagefile: Path) → str ¶
Extract and return the page number from the filename for page-level content (e.g., image or text). Returns the page number as a string with leading zeros. (Note: logic is currently specific to Gale/ECCO file naming conventions.)
- corppa.utils.path_utils.get_ppa_source(vol_id: str) → str ¶
For a given volume id, return the corresponding source. Assumes:
Gale volume ids begin with "CW0" or "CB0"
EEBO-TCP volume ids begin with "A"
HathiTrust volume ids contain a "."
- corppa.utils.path_utils.get_stub_dir(source: str, vol_id: str) → Path ¶
Returns the stub directory path (pathlib.Path) for the specified volume (vol_id).
For Gale, the path is formed from every third number (excluding the leading 0) of the volume identifier.
Ex. CB0127060085 --> 100
For HathiTrust, we use the Stubbytree directory specification created by HTRC. The path is composed of two directories: (1) the library portion of the volume identifier and (2) every third character of the encoded volume identifier.
Ex. mdp.39015003633594 --> mdp/31039
- corppa.utils.path_utils.get_vol_dir(vol_id: str) → Path ¶
Returns the volume directory (pathlib.Path) for the specified volume (vol_id).
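A usage sketch tying these together, reusing the documented examples (the exact source names returned by get_ppa_source, and the precise form of the volume directory, are assumptions):
from corppa.utils.path_utils import get_ppa_source, get_stub_dir, get_vol_dir

get_ppa_source("CB0127060085")                    # presumably "Gale"
get_stub_dir("Gale", "CB0127060085")              # Path("100"), per the example above
get_stub_dir("HathiTrust", "mdp.39015003633594")  # Path("mdp/31039"), per the example above
get_vol_dir("mdp.39015003633594")                 # volume-level directory under the stub directory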
Generate PPA Page Set¶
Utility for generating a PPA page set.
This script takes three inputs: (1) an input CSV, (2) an output CSV, and (3) the size of the page set.
- The input CSV file must have the following fields:
work_id: PPA work id
page_start: Starting index for page range being considered for this work
page_end: Ending index for page range being considered for this work
poetry_pages: Comma-separated list of page numbers containing poetry
- The pages are selected as follows (sketched in code after the example usage below):
First, all pages with poetry are selected
Then, all remaining pages are chosen randomly (proportionately by work)
- The resulting output CSV file has the following fields:
work_id: PPA work id
page_num: Digital page number
Example usage:
python generate_page_set.py input.csv output.csv 300
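The selection step described above could look roughly like this (an illustrative sketch, not the script's implementation; it assumes the input CSV rows have already been parsed into dicts with integer page numbers):
import random

def select_pages(works, set_size):
    # 1. take every page listed as containing poetry
    selected = [(w["work_id"], p) for w in works for p in w["poetry_pages"]]
    remaining = set_size - len(selected)
    # 2. fill the rest randomly, proportionately by work, from each work's page range
    pools = {
        w["work_id"]: [p for p in range(w["page_start"], w["page_end"] + 1)
                       if p not in w["poetry_pages"]]
        for w in works
    }
    total = sum(len(pool) for pool in pools.values())
    for wid, pool in pools.items():
        share = max(0, round(remaining * len(pool) / total)) if total else 0
        selected.extend((wid, p) for p in random.sample(pool, min(share, len(pool))))
    return selected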
Add Image (Relative) Paths¶
This script adds image relative paths to the PPA full-text corpus. Optionally, a file extension (e.g., .jpg) can be provided to be used for all relative image paths instead of their source-level defaults.
Example usage:
python add_image_relpaths.py ppa_corpus.jsonl ppa_with_images.jsonl
python add_image_relpaths.py ppa_corpus.jsonl ppa_with_images.jsonl --ext=.jpg
Build Text Corpus¶
Script for building a text corpus file (JSONL) from a directory of texts.
This script converts each text (.txt) file within an input directory (including nested files), compiling them into a single output text corpus where each record corresponds to a single text file with the following fields:
id: The name of the file (without prefix)
text: The text of the file (assumes UTF-8 formatting)
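A single output record might look roughly like this (the filename and text are illustrative):
{"id": "page_0001", "text": "Full text of page_0001.txt ..."}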
Note that the output file can also be written in any compressed form supported by orjsonl. If no suffix is provided, .jsonl will be used.
Example usage:
python build_text_corpus.py input_dir out_corpus
python build_text_corpus.py input_dir out_corpus.jsonl
python build_text_corpus.py input_dir out_corpus.jsonl.gz
Annotation¶
Data Preparation¶
Preliminary Page Set Creation¶
This hard-coded script was used to create a PPA page set for our preliminary annotation efforts. Note: this script may not work.
This script takes three inputs:
A directory for a PPA text corpus
Our poetry testset file (CSV) that included manual identifications of pages containing poetry for several works in the PPA
An output JSONL
The script then produces an output (JSONL) file containing page-level metadata for every page of the PPA works covered in our poetry testset. This page-level metadata includes an image path, whether the page is known to contain poetry, and some work-level information.
Example usage:
python create_pageset.py ppa_corpus_dir poetry_testset.csv out.jsonl
Add Metadata¶
This script is used to prepare the page-level PPA corpus data for use in Prodigy annotation. It adds work-level metadata (title, author, year) in the locations that Prodigy requires for display, and allows adjusting image paths for display from the Prodigy interface. It assumes the input page corpus has already been annotated with image paths via an image_path attribute using corppa.utils.add_image_relpaths.
Example usage:
python add_metadata.py ppa_with_images.jsonl ppa_metadata.csv out.jsonl
Annotation Recipes¶
This module provides custom recipes for Prodigy annotation. These were created with page-level annotation in mind and assume a page is associated with both text and an image. Each recipe displays a page’s image and text side-by-side.
- Recipes:
annotate_page_text: Annotate a page's text.
annotate_text_and_image: Annotate both a page's text and image side-by-side.
review_page_spans: Review existing page-level text annotations to produce a final, adjudicated set of annotations.
Referenced images must be served out independently for display; the image URL prefix should be specified when initializing the recipe.
Example use:
prodigy annotate_page_text poetry_spans poetry_pages.jsonl --label POETRY,PROSODY -F annotation_recipes.py --image-prefix http://localhost:8000/
prodigy annotate_text_and_image poetry_text_image poetry_pages.jsonl -l POETRY -F annotation_recipes.py --image-prefix ../ppa-web-images -FM
prodigy review_page_spans adjudicate poetry_spans -l POETRY -F annotation_recipes.py --image-prefix ../ppa-web-images -FM --sessions alice,bob
Command Recipes¶
This module contains custom command recipes for Prodigy.
Recipes:
ppa-task-progress: Report the current progress for a PPA annotation task at the page and annotator level.
Example use:
prodigy ppa-task-progress task_id -F command_recipes.py
Process Adjudication Data¶
This script processes the adjudication data produced by Prodigy for our poetry detection task into two outputs:
A JSONL file that compiles the annotation data into page-level records; each record contains some page-level metadata and the compiled list of poetry excerpts (if any) determined in the adjudication process.
A CSV file containing excerpt-level data per line.
Note that the first file explicitly includes information on the pages where no poetry was identified, while the second conveys this only implicitly through absence and requires external knowledge of which pages were covered in the annotation rounds. So, the former is particularly useful for the evaluation process, while the latter is better suited for building a final excerpt dataset.
Example command line usage:
python process_adjudication_data.py prodigy_data.jsonl adj_pages.jsonl adj_excerpts.csv
Poetry Detection¶
Scripts¶
refmatcha¶
🎶🍵 matcha matcha poem / This script is gon / na find your poems / matcha matcha poem 🎶🍵
refmatcha identifies poem excerpts by matching against a local collection of reference poems. It takes in a CSV of unidentified excerpts and outputs a CSV of labeled excerpts for those excerpts it is able to identify. By default, the output file is created in the same directory as the input with the same name plus _matches; i.e., given an input file round1_excerpts.csv, refmatcha will output identified excerpts to round1_excerpts_matches.csv. To override this, specify an output filename with --output or -o.
Setup:
Download and extract poetry-ref-data.tar.bz2 from /tigerdata/cdh/prosody/poetry-detection. You should extract it in the same directory where you plan to run this script. The script will compile reference content into full-text and metadata parquet files on the first run; to force recompilation, rename or remove the parquet files.
Example usage:
refmatcha round1_excerpts.csv
refmatcha round1_excerpts.csv --output round1_matches.csv
Merge excerpts¶
This script merges labeled and unlabeled poem excerpts, combining notes for any merged excerpts, and merging duplicate poem identifications in simple cases.
It takes two or more input files of excerpt data (labeled or unlabeled) in CSV format, merges any excerpts that can be combined, and outputs a CSV with the updated excerpt data. All excerpts in the input data files are preserved in the output, whether they were merged with any other records or not. This means that in most cases, the output will likely be a mix of labeled and unlabeled excerpts.
Merging logic is as follows (a rough sketch in code follows this list):
Excerpts are grouped on the combination of page id and excerpt id, and then merged if all reference fields match exactly, or where reference fields are present in one excerpt and unset in the other.
If the same excerpt has different labels (different poem_id values), both labeled excerpts will be included in the output
If the same excerpt has duplicate labels (i.e., the same poem_id from two different identification methods), they will be merged into a single labeled excerpt; the identification_methods in the resulting labeled excerpt will be the union of methods in the merged excerpts
When merging excerpts where both records have notes, notes content will be combined.
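A minimal sketch of the per-pair merge decision described above (illustrative only, not the script's implementation; the exact set of reference fields is an assumption):
def try_merge(a, b, ref_fields=("poem_id", "ref_span_start", "ref_span_end", "ref_span_text")):
    # excerpts a and b are dicts already grouped on (page id, excerpt id)
    for field in ref_fields:
        if a.get(field) and b.get(field) and a[field] != b[field]:
            return None  # different identifications: keep both excerpts separately
    merged = dict(a)
    # fill in reference fields that are set in one excerpt and unset in the other
    merged.update({k: v for k, v in b.items() if v and not merged.get(k)})
    # union of identification methods from the merged excerpts
    merged["identification_methods"] = sorted(
        set(a.get("identification_methods", ())) | set(b.get("identification_methods", ()))
    )
    # combine notes when both excerpts have them
    merged["notes"] = "\n".join(n for n in (a.get("notes"), b.get("notes")) if n)
    return merged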
Example usage:
./src/corppa/poetry_detection/merge_excerpts.py adjudication_excerpts.csv labeled_excerpts.csv -o merged_excerpts.csv
Limitations:
Labeled excerpts with the same poem_id but different reference data will not be merged; supporting multiple identification methods that output span information will likely require more sophisticated merge logic
CSV input and output only (JSONL may be added in future)
Notes are currently merged only with the first matching excerpt; if an unlabeled excerpt with notes has multiple labels, only the first match will have combined notes