Ends of Prosody: corppa utilities

Documentation of corppa functionality for working with PPA text and image data that may be useful to Ends of Prosody participants.

Filter Utility

The corppa-filter command is a utility for filtering a PPA page-level text corpus. The PPA page-level text corpus is shared as a json lines (.jsonl) file, which may or may not be compressed (e.g., .jsonl.gz). It’s often useful to filter the full corpus to a subset of pages for a specific task.

Currently, corppa-filter supports the following types of filtering:

  • A list of PPA work ids (as a text file, id-per-line)

  • A CSV file specifying work pages (by digital page number)(csv, page-per-line)

  • A key-value pair for either inclusion or exclusion

These filtering options can be combined, generally as a logical AND. Pages filtered by work ids or page numbers will be further filtered by the key-value logic. In cases where both work- and page-level filtering occurs, works not specified in teh page filtered are included in full. Works that are specified in both will be limited to the pages specified in page-level filtering.

NOTE: PPA work identifiers are based on source identifiers, i.e., the identifier from the original source (HathiTrust, Gale/ECCO, EEBO-TCP). In most cases the work identifier and the source identifier are the same, but if you are working with any excerpted content the work id is NOT the same as the source identifier. Excerpt ids are based on the combination of source identifier and the first original page included in the excerpt. In some cases PPA contains multiple excerpts from the same source, so this provides guaranteed unique work ids.

Examples

To create a subset corpus with all pages for a set of specific volumes, create a text file with a list of PPA work identifiers, one id per line, and then run corppa-filter with the input file, desired output file, and path to id file.

corppa-filter ppa_pages.jsonl my_subset.jsonl --idfile my_ids.txt

To create a subset of specific pages from specific volumes, create a CSV file that includes fields work_id and page_num, and pass that to the filter script with the –pg-file option:

corppa-filter ppa_pages.jsonl my_subset.jsonl --pg_file my_work_pages.csv

You can filter a page corpus to exclude or include pages based on exact-matches for attributes included in the jsonl data. For example, to get all pages with the original page number roman numeral ‘i’:

corppa-filter ppa_pages.jsonl i_pages.jsonl --include label=i

Filters can also be combined; for example, to get the original page 10 for every volume from a list, you could specify a list of ids and the –include filter:

corppa-filter ppa_pages.jsonl my_subset_page10.jsonl --idfile my_ids.txt --include label=10

PPA Identifier Utilities

Here are some utility functions that may be useful when working with PPA work and volume identifiers.

corppa.utils.path_utils.get_volume_id(work_id: str) str

Extract volume id from PPA work id

  • For full works, volume ids and work ids are the same.

  • For excerpts, the work id is composed of the prefix followed by “-p” and the starting page of the excerpt.

corppa.utils.path_utils.get_ppa_source(vol_id: str) str

For a given volume id, return the corresponding source. Assume: * Gale volume ids begin with "CW0" or "CB0" * EEBO-TCP volume ids begin with A * Hathitrust volume ids contain a "."

corppa.utils.path_utils.encode_htid(htid: str) str

Returns the “clean” version of a HathiTrust volume identifier with the form:

[library id].[volume id]

Specifically, the volume-portion of the id undergoes the following character replacements:

":" --> "+", "/" --> "=", "." --> ","
corppa.utils.path_utils.decode_htid(encoded_htid: str) str

Return original HathiTrust volume identifier from encoded version:

[library id].[encoded volume id]

Specifically, the volume-portion of the id undergoes the following character replacements:

"+" --> ":", "=" --> "/", "," --> "."

Image Path Utilities

Here are some utility functions that may be useful when working with the Gale/ECCO page images.

corppa.utils.path_utils.get_vol_dir(vol_id: str) Path

Returns the volume directory (pathlib.Path) for the specified volume (vol_id)

corppa.utils.path_utils.get_stub_dir(source: str, vol_id: str) Path

Returns the stub directory path (pathlib.Path) for the specified volume (vol_id)

For Gale, the path is formed from every third number (excluding the leading 0) of the volume identifier.

Ex. CB0127060085 --> 100

For HathiTrust, we use the Stubbytree directory specification created by HTRC. The path is composed of two directories: (1) the library portion of the volume identifier and (2) every third character of the encoded volume identifier.

Ex. mdp.39015003633594 --> mdp/31039
corppa.utils.path_utils.get_page_number(pagefile: Path) str

Extract and return the page number from the filename for page-level content (e.g., image or text). Returns the page number as a string with leading zeros. (Note: logic is currently specific to Gale/ECCO file naming conventions.)