Code Documentation

Archive

Django app for archival materials included in PPA.

ppa.archive.NO_COLLECTION_LABEL = 'Uncategorized'

label to use for items that are not in a collection

Admin

class ppa.archive.admin.ClusterAdmin(model, admin_site)[source]
get_queryset(request)[source]

Return a QuerySet of all model instances that can be edited by the admin site. This is used by changelist_view.

works(obj)[source]

Custom property to display number of works in a cluster and link to a filtered view of the digitized works list.

class ppa.archive.admin.CollectionAdmin(model, admin_site)[source]
class ppa.archive.admin.DigitizedWorkAdmin(*args, **kwargs)[source]
add_works_to_collection(request, queryset)[source]

Bulk add a queryset of ppa.archive.DigitizedWork to a ppa.archive.Collection.

get_readonly_fields(request, obj=None)[source]

Determine read only fields based on item source, to prevent editing of HathiTrust fields that should not be changed.

get_urls()[source]

Add url for import admin form

list_collections(obj)[source]

Return a list of ppa.archive.models.Collection object names as a comma separated list to populate a change_list column.

resource_class

alias of DigitizedWorkResource

save_model(request, obj, form, change)[source]

Note any fields in the protected list that have been changed in the admin and preserve in database.

Ensure reindex is called when admin form is saved

source id as an html link to source record, when source url is available

suppress_works(request, queryset)[source]

Set status to suppressed for every item in the queryset that is not already suppressed.

class ppa.archive.admin.DigitizedWorkInline(parent_model, admin_site)[source]
model

alias of DigitizedWork

class ppa.archive.admin.DigitizedWorkResource(**kwargs)[source]
get_queryset()[source]

Returns a queryset of all objects for this model. Override this if you want to limit the returned queryset.

Models

class ppa.archive.models.Cluster(*args, **kwargs)[source]

A model to collect groups of works such as reprints or editions that should be collapsed in the main archive search and accessible together.

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.archive.models.Collection(*args, **kwargs)[source]

A collection of ppa.archive.models.DigitizedWork instances.

exception DoesNotExist
exception MultipleObjectsReturned
description

a RichText description of the collection

exclude

flag to indicate collections to be excluded by default in public search

name

the name of the collection

property name_changed

check if name has been changed (only works on current instance)

static stats()[source]

Collection counts and date ranges, based on what is in Solr. Returns a dictionary where the keys are collection names and the values are dictionaries with count and dates.

class ppa.archive.models.DigitizedWork(*args, **kwargs)[source]

Record to manage digitized works included in PPA and store their basic metadata.

exception DoesNotExist
exception MultipleObjectsReturned
static add_from_hathi(htid, bib_api=None, update=False, log_msg_src=None, user=None)[source]

Add or update a HathiTrust work in the database. Retrieves bibliographic data from Hathi api, retrieves or creates a DigitizedWork record, and populates the metadata if this is a new record, if the Hathi metadata has changed, or if update is requested. Creates admin log entry to document record creation or update.

Raises ppa.archive.hathi.HathiItemNotFound for invalid id.

Returns the new or updated DigitizedWork.

Parameters:
  • htid – HathiTrust record identifier

  • bib_api – optional HathiBibliographicAPI instance, to allow for shared sessions in scripts

  • update – update bibliographic metadata even if the hathitrust record is not newer than the local database record (default: False)

  • log_msg_src – source of the change to be included in log entry messages (optional). Will be used as “Created/updated [log_msg_src]”.

  • user – optional user responsible for the change, to be associated with LogEntry record

added

date added to the archive

book_journal

book or journal title for excerpt or article

clean()[source]

Add custom validation to trigger a save error in the admin if someone tries to unsuppress a record that has been suppressed (not yet supported).

clean_fields(exclude=None)[source]

Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.

cluster

optional cluster for aggregating works

collections

collections that this work is part of

compare_protected_fields(db_obj)[source]

Compare protected fields in a ppa.archive.models.DigitizedWork instance and return those that are changed.

Parameters:

db_obj (object) – Database instance of a DigitizedWork.

count_pages(ptree_client=None)[source]

Count the number of pages for a digitized work. If pages are specified for an excerpt or article, page count is determined based on the number of pages in the combined ranges. Otherwise, page count is based on the number of files in the zipfile within the pairtree content (Hathi-specific). Raises pairtree.storage_exceptions.ObjectNotFoundException if the data is not found in the pairtree storage. Returns page count found; updates the page_count attribute on the current instance, but does NOT save the object.
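The excerpt case can be illustrated with a small sketch that counts the pages covered by a combined page range. This is a stand-in under stated assumptions: the helper name count_range_pages and the comma-and-hyphen range syntax (e.g. "10-15, 20") are hypothetical and are not taken from the PPA source, which may parse ranges differently.

```python
def count_range_pages(page_range):
    """Count distinct pages in a range string such as "10-15, 20".

    Hypothetical helper for illustration only; not part of ppa.archive.
    """
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            # skip empty segments, e.g. from a trailing comma
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            # ranges are inclusive at both ends
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # using a set means overlapping ranges are only counted once
    return len(pages)


excerpt_page_count = count_range_pages("10-15, 20")
```

Collecting pages in a set is one way to make sure overlapping ranges do not inflate the count.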

display_title()[source]

admin display title to allow displaying title but sorting on sort_title

enumcron

enumeration/chronology (hathi-specific; contains volume or version)

first_page()[source]

Number of the first page in range, if this is an excerpt (first of original page range, not digital)

first_page_digital()[source]

Number of the first page in range (digital pages / page index), if this is an excerpt.

Returns:

first page number for digital page range; None if no page range

Return type:

int, None

first_page_original()[source]

Number of the first page in range (original page numbering) if this is an excerpt

Returns:

first page number for original page range; None if no page range

Return type:

str, None

get_absolute_url()[source]

Return object’s url for ppa.archive.views.DigitizedWorkDetailView

get_metadata(metadata_format)[source]

Get metadata for this item in the specified format. Currently only supports marc.

Source-specific label for link on public item detail view.

property has_fulltext

Checks if an item has full text (i.e., items from HathiTrust, Gale, or EEBO-TCP).

hathi

ppa.archive.hathi.HathiObject for HathiTrust records, for working with data in HathiTrust pairtree data structure.

index_chunk_size = 2000

number of items to index at once when indexing a large number of items

property index_cluster_id

Convenience function to get a string representation of the cluster (or self if no cluster). Reduces redundancy elsewhere.

index_data()[source]

data for indexing in Solr

index_id()[source]

use source id + first page in range (if any) as solr identifier

classmethod index_item_type()[source]

override index item type label to just work

is_public()[source]

admin display field indicating if record is public or suppressed

property is_suppressed

Item has been suppressed (based on status).

item_type

type of record, whether excerpt, article, or full; defaults to full

classmethod items_to_index()[source]

Queryset of works for indexing everything; excludes suppressed works.

metadata_from_marc(marc_record, populate=True)[source]

Get metadata from MARC record and return a dictionary of the data. When populate is True, calls populate_fields to set values.

notes

internal team notes, not displayed on the public facing site

page_count

number of pages in the work (or page range, for an excerpt)

populate_fields(field_data)[source]

Conditionally update fields as protected by flags using Hathi bibdata information.

Parameters:

field_data (dict) – A dictionary of fields updated from a ppa.archive.hathi.HathiBibliographicRecord instance.

populate_from_bibdata(bibdata)[source]

Update record fields based on Hathi bibdata information. Full record is required in order to set all fields

Parameters:

bibdata – bibliographic data returned from HathiTrust as instance of ppa.archive.hathi.HathiBibliographicRecord

classmethod prep_index_chunk(chunk)[source]

Optional method for any additional processing on chunks of items being indexed. Intended to allow adding prefetching on a chunk when iterating on Django QuerySets; since indexing uses Iterator, prefetching configured in items_to_index is ignored.

printed_by_re = '^(Printed)?( and )?(Pub(.|lished|lisht)?)?( and sold)? (by|for|at)( the)? ?'

regular expression for cleaning preliminary text from publisher names
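As a rough illustration of what this pattern removes, the sketch below applies it with re.sub. The pattern is copied from the value documented above; the helper name clean_publisher is hypothetical, and applying the pattern without any regex flags is an assumption, since the documentation does not say how the source uses it.

```python
import re

# Pattern copied from DigitizedWork.printed_by_re as documented above.
PRINTED_BY_RE = (
    r"^(Printed)?( and )?(Pub(.|lished|lisht)?)?"
    r"( and sold)? (by|for|at)( the)? ?"
)


def clean_publisher(publisher):
    """Strip preliminary text such as "Printed by" from a publisher name.

    Hypothetical helper; assumes the pattern is applied without flags.
    """
    return re.sub(PRINTED_BY_RE, "", publisher)


cleaned = clean_publisher("Printed by J. Smith")
```

Names that do not begin with one of the preliminary phrases are returned unchanged, because the pattern is anchored at the start of the string.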

protected_fields

ProtectedWorkField instance to indicate metadata fields that should be preserved from bulk updates because they have been modified in Django admin.

pub_place

place of publication

public_notes

public notes field for this work

publisher

publisher

record_id

record id; for Hathi materials, used for different copies of the same work or for different editions/volumes of a work

remove_from_index()[source]

Remove the current work and associated pages from Solr index

save(*args, **kwargs)[source]

Save the current instance. Override this in a subclass if you want to control the saving process.

The ‘force_insert’ and ‘force_update’ parameters can be used to insist that the “save” must be an SQL insert or update (or equivalent for non-SQL backends), respectively. Normally, they should not be set.

sort_title

sort title: title without leading non-sort characters, from marc

source

source of the record, HathiTrust or elsewhere

source_id

source identifier; hathi id for HathiTrust materials

source_url

source url where the original can be accessed

status

status of record; currently choices are public or suppressed

subtitle

subtitle of the work; using TextField to allow for long titles

title

title of the work; using TextField to allow for long titles

updated

date of last modification of the local record

class ppa.archive.models.DigitizedWorkQuerySet(model=None, query=None, using=None, hints=None)[source]
by_first_page_orig(start_page)[source]

find records based on first page in original page range

ppa.archive.models.NO_COLLECTION_LABEL = 'Uncategorized'

label to use for items that are not in a collection

class ppa.archive.models.Page[source]

Indexable for pages to make page data available for indexing with parasolr index manage command.

index_chunk_size = 2000

number of items to index at once when indexing a large number of items

classmethod index_item_type()[source]

index item type for parasolr indexing script

classmethod items_to_index()[source]

Return a generator of page data to be indexed, with data for pages for each work returned by Page.page_index_data()

classmethod page_index_data(digwork, gale_record=None)[source]

Get page content for the specified digitized work from Hathi pairtree and return data to be indexed in solr.

Takes an optional Gale item record as returned by the Gale API, to avoid loading API content twice. (Used on import)

classmethod total_to_index(source=None)[source]

Calculate the total number of pages to be indexed by aggregating page count of items to index in the database.

class ppa.archive.models.ProtectedWorkField(verbose_name=None, name=None, **kwargs)[source]

PositiveSmallIntegerField subclass that returns a ProtectedWorkFieldFlags object and stores as integer.

from_db_value(value, expression, connection)[source]

Always return an instance of ProtectedWorkFieldFlags

get_internal_type()[source]

Preserve type as PositiveSmallIntegerField

get_prep_value(value)[source]

Perform preliminary non-db specific value checks and conversions.

to_python(value)[source]

Always return an instance of ProtectedWorkFieldFlags

class ppa.archive.models.ProtectedWorkFieldFlags(*args, **kwargs)[source]

flags.Flags instance to indicate which DigitizedWork fields should be protected if edited in the admin.

author = <ProtectedWorkFieldFlags.author bits=0x0010 data=UNDEFINED>

author

classmethod deconstruct()[source]

Give Django information needed to make ProtectedWorkFieldFlags.no_flags default in migration.

enumcron = <ProtectedWorkFieldFlags.enumcron bits=0x0008 data=UNDEFINED>

enumcron

pub_date = <ProtectedWorkFieldFlags.pub_date bits=0x0080 data=UNDEFINED>

publication date

pub_place = <ProtectedWorkFieldFlags.pub_place bits=0x0020 data=UNDEFINED>

place of publication

publisher = <ProtectedWorkFieldFlags.publisher bits=0x0040 data=UNDEFINED>

publisher

sort_title = <ProtectedWorkFieldFlags.sort_title bits=0x0004 data=UNDEFINED>

sort title

subtitle = <ProtectedWorkFieldFlags.subtitle bits=0x0002 data=UNDEFINED>

subtitle

title = <ProtectedWorkFieldFlags.title bits=0x0001 data=UNDEFINED>

title
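The bit values listed above suggest how a set of protected fields can be stored as a single integer. The following sketch uses the standard library enum.IntFlag as a stand-in; the real class is based on flags.Flags, so the class name ProtectedFlagsSketch and the exact API shown here are assumptions for illustration only.

```python
from enum import IntFlag


class ProtectedFlagsSketch(IntFlag):
    """Stand-in for ProtectedWorkFieldFlags using the stdlib enum module."""

    # bit values copied from the documented flag attributes
    title = 0x0001
    subtitle = 0x0002
    sort_title = 0x0004
    enumcron = 0x0008
    author = 0x0010
    pub_place = 0x0020
    publisher = 0x0040
    pub_date = 0x0080


# Mark title and publisher as protected; the combination is a plain integer
protected = ProtectedFlagsSketch.title | ProtectedFlagsSketch.publisher
stored = int(protected)

# Restore the flags from the stored integer and check individual fields
restored = ProtectedFlagsSketch(stored)
title_protected = ProtectedFlagsSketch.title in restored
author_protected = ProtectedFlagsSketch.author in restored
```

Because each field occupies its own bit, any combination of the eight flags fits in a small positive integer, which is consistent with ProtectedWorkField being a PositiveSmallIntegerField subclass.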

class ppa.archive.models.SignalHandlers[source]

Signal handlers for indexing DigitizedWork records when Collection or Cluster records are saved or deleted.

static cluster_delete(sender, instance, **kwargs)[source]

signal handler for cluster delete; clear associated digitized works and reindex

static cluster_save(sender, instance, **kwargs)[source]

signal handler for cluster save; reindex pages for associated digitized works

static collection_delete(sender, instance, **kwargs)[source]

signal handler for collection delete; clear associated digitized works and reindex

static collection_save(sender, instance, **kwargs)[source]

signal handler for collection save; reindex associated digitized works

static handle_digwork_cluster_change(sender, instance, **kwargs)[source]

when a DigitizedWork is saved, reindex pages if cluster id has changed

class ppa.archive.models.TrackChangesModel(*args, **kwargs)[source]

Model mixin that keeps a copy of initial data in order to check if fields have been changed. Change detection only works on the current instance of an object.

has_changed(field)[source]

check if a field has been changed

initial_value(field)[source]

return the initial value for a field

save(*args, **kwargs)[source]

Saves data and resets the copy of initial data.
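The change-tracking idea can be sketched in plain Python, without Django. The class below is a hypothetical illustration (TrackChangesSketch is not part of ppa.archive); it only mirrors the documented method names and the behavior that saving resets the stored initial data.

```python
class TrackChangesSketch:
    """Plain-Python illustration of change tracking (not a Django model)."""

    def __init__(self, **fields):
        for name, value in fields.items():
            setattr(self, name, value)
        # keep a copy of the initial data for later comparison
        self._initial = dict(fields)

    def has_changed(self, field):
        """Check if a field differs from its initial value."""
        return getattr(self, field) != self._initial[field]

    def initial_value(self, field):
        """Return the initial value for a field."""
        return self._initial[field]

    def save(self):
        """Reset the copy of initial data (a real model would also write to the database)."""
        self._initial = {name: getattr(self, name) for name in self._initial}


collection = TrackChangesSketch(name="Original")
collection.name = "Renamed"
changed_before_save = collection.has_changed("name")
initial_before_save = collection.initial_value("name")
collection.save()
changed_after_save = collection.has_changed("name")
```

Since the initial data lives on the instance, a second instance loaded separately would not see the change, which matches the note that detection only works on the current instance.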

ppa.archive.models.validate_page_range(value)[source]

Ensure page range can be parsed as an integer span

Views

class ppa.archive.views.AddToCollection(**kwargs)[source]

View to bulk add a queryset of ppa.archive.models.DigitizedWork to a set of ppa.archive.models.Collection instances.

form_class

alias of AddToCollectionForm

get_context_data(**kwargs)[source]

Get the context for this view.

get_queryset(*args, **kwargs)[source]

Return a queryset filtered by id, or empty list if no ids

get_success_url()[source]

Redirect to the ppa.archive.models.DigitizedWork change_list in the Django admin with pagination and filters preserved. Expects ppa.archive.admin.add_works_to_collection() to have set ‘collection-add-filters’ as a dict in the request’s session.

model

alias of DigitizedWork

post(request, *args, **kwargs)[source]

Add ppa.archive.models.DigitizedWork instances passed in form data to selected instances of ppa.archive.models.Collection, then return to change_list view.

Expects a list of DigitizedWork ids to be set in the request session.

class ppa.archive.views.DigitizedWorkByRecordId(**kwargs)[source]

Redirect from DigitizedWork record id to detail view when possible. If there is only one record found, redirect. If multiple are found, 404.

get_redirect_url(*args, **kwargs)[source]

Return the URL redirect to. Keyword arguments from the URL pattern match generating the redirect request are provided as kwargs to this method.

class ppa.archive.views.DigitizedWorkDetailView(**kwargs)[source]

Display details for a single digitized work. If a work has been suppressed, returns a 410 Gone response.

ajax_template_name = 'archive/snippets/results_within_list.html'

name of the template to use for ajax request

form_class

alias of SearchWithinWorkForm

get(*args, **kwargs)[source]

Handle get request, with redirect logic if redirect url is set for a digitized work id converted to a single excerpt.

get_context_data(**kwargs)[source]

Insert the single object into the context dict.

get_queryset()[source]

Return the QuerySet that will be used to look up the object.

This method is called by the default implementation of get_object() and may not be called if get_object() is overridden.

get_solr_lastmodified_filters()[source]

Get filters for last modified Solr query. By default returns solr_lastmodified_filters.

get_template_names()[source]

Return ajax_template_name if this is an ajax request; otherwise return default template name.

model

alias of DigitizedWork

class ppa.archive.views.DigitizedWorkListView(**kwargs)[source]

Search and browse digitized works. Based on Solr index of works and pages.

ajax_template_name = 'archive/snippets/results_list.html'

name of the template to use for ajax request

form_class

alias of SearchForm

get_context_data(**kwargs)[source]

Get the context for this view.

get_pages(solrq)[source]

If there is a keyword search, query Solr for matching pages with text highlighting. NOTE: This has to be done as a separate query because Solr doesn’t support highlighting on collapsed items.

get_queryset(**kwargs)[source]

Return the list of items for this view.

The return value must be an iterable and may be an instance of QuerySet in which case QuerySet specific behavior will be enabled.

meta_description = 'The Princeton Prosody Archive is a full-text\n    searchable database of thousands of historical documents about the\n    study of language and the study of poetry.'

page description for metadata/preview

meta_title = 'Princeton Prosody Archive'

title for metadata / preview

model

alias of DigitizedWork

class ppa.archive.views.ImportView(**kwargs)[source]

Admin view to import new records from sources that support import (HathiTrust, Gale) by providing a list of ids.

form_class

alias of ImportForm

form_valid(form)[source]

If the form is valid, redirect to the supplied URL.

get_context_data(*args, **kwargs)[source]

Insert the form into the context dict.

import_records(importer)[source]

Import records based on values submitted in the form

class ppa.archive.views.OpenSearchDescriptionView(**kwargs)[source]

Basic open search description for searching the archive via browser or other tools.

Forms

class ppa.archive.forms.AddToCollectionForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]

Form to select a set of ppa.archive.models.Collection to which to bulk add a queryset of ppa.archive.models.DigitizedWork

property media

Return all media required to render the widgets on this form.

class ppa.archive.forms.CheckboxSelectMultipleWithDisabled(attrs=None, choices=())[source]

Subclass of django.forms.CheckboxSelectMultiple with option to mark a choice as disabled.

class ppa.archive.forms.ChoiceLabel(label, disabled=False)[source]

Custom choice label that can be used to set an option as disabled without resulting in extra choices when normalized.

class ppa.archive.forms.FacetChoiceField(*args, **kwargs)[source]

Add CheckboxSelectMultiple field with facets taken from solr query

valid_value(value)[source]

Check to see if the provided value is a valid choice.

widget

alias of CheckboxSelectMultiple

class ppa.archive.forms.ImportForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]

Form to import records from sources that support import.

get_source_ids()[source]

Get list of ids from valid form input. Splits on newlines, strips whitespace, and ignores empty lines.
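The parsing described here can be sketched as a small standalone function. The name split_source_ids and the sample identifiers are hypothetical; the sketch only reproduces the documented behavior (split on newlines, strip whitespace, ignore empty lines) and is not the form's actual implementation.

```python
def split_source_ids(raw_text):
    """Split pasted text into a list of ids, one per non-empty line.

    Hypothetical helper mirroring the documented behavior of get_source_ids.
    """
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


# made-up identifiers, used only to show the whitespace handling
source_ids = split_source_ids("abc.123\n\n   def.456  \n")
```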

property media

Return all media required to render the widgets on this form.

class ppa.archive.forms.ModelMultipleChoiceFieldWithEmpty(queryset, **kwargs)[source]

Extend django.forms.ModelMultipleChoiceField to add an option for an unset or empty choice (i.e. no relationship in a many-to-many relationship such as collection membership).

clean(value)[source]

Extend clean to use default validation on all values but the empty id.

class ppa.archive.forms.RadioSelectWithDisabled(attrs=None, choices=())[source]

Subclass of django.forms.RadioSelect with option to mark a choice as disabled.

class ppa.archive.forms.RangeField(*args, **kwargs)[source]
compress(data_list)[source]

Return a single value for the given list of values. The values can be assumed to be valid.

For example, if this MultiValueField was instantiated with fields=(DateField(), TimeField()), this might return a datetime object created by combining the date and time in data_list.

widget

alias of RangeWidget

class ppa.archive.forms.RangeWidget(*args, **kwargs)[source]

date range widget, for two numeric inputs

decompress(value)[source]

Return a list of decompressed values for the given compressed value. The given value can be assumed to be valid, but not necessarily non-empty.

property media

Media for a multiwidget is the combination of all media of the subwidgets.

sep = '-'

separator string when splitting out values in decompress
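A minimal sketch of how a compressed range value might be split on the separator follows. The function name decompress_range, the integer conversion, and the handling of empty values are assumptions for illustration; the widget's actual decompress method may differ.

```python
SEP = "-"  # same separator value as RangeWidget.sep


def decompress_range(value):
    """Split a value like "1800-1850" into a [start, end] list.

    Hypothetical sketch; a missing side of the range is returned as None.
    """
    if value:
        return [int(part) if part else None for part in value.split(SEP)]
    # no value at all: both inputs are empty
    return [None, None]


date_range = decompress_range("1800-1850")
```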

class ppa.archive.forms.SearchForm(data=None, *args, **kwargs)[source]

Simple search form for digitized works.

QUESTION_POPUP_TEXT = '\n    Boolean search within a field is supported. Operators must be capitalized (AND, OR).\n    Use quotes for exact phrase.\n    '

help text to be shown with the form (appears when you hover over the question mark icon)

clean_author()[source]

Clean keyword search query term; converts any typographic quotes to straight quotes

clean_query()[source]

Clean keyword search query term; converts any typographic quotes to straight quotes

clean_title()[source]

Clean keyword search query term; converts any typographic quotes to straight quotes
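The quote conversion performed by these clean methods can be sketched with str.translate. The helper name straighten_quotes and the exact set of characters handled are assumptions; the form may cover more or fewer typographic variants.

```python
# Map typographic (curly) quotes to their straight equivalents
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text):
    """Replace typographic quotes with straight quotes (hypothetical helper)."""
    return text.translate(QUOTE_MAP)


query = straighten_quotes("\u201cblank verse\u201d")
```

Normalizing quotes matters for search because the form's help text says to use quotes for exact phrases, and text pasted from a word processor often contains the typographic variants.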

cluster

hidden input to track cluster id, for searching within reprint/editions

static defaults()[source]

Default values when initializing the form. Sort by title, pre-select collections based on the exclude property.

get_solr_sort_field(sort=None)[source]

Set solr sort fields for the query based on sort and query strings. If sort field is not specified, will use sort in the cleaned data in the current form. If sort is not specified and valid form data is not available, will raise an AttributeError.

Returns:

solr sort field

has_keyword_query(data)[source]

check if any of the keyword search fields have search terms

property media

Return all media required to render the widgets on this form.

pub_date_minmax()[source]

Get minimum and maximum values for DigitizedWork publication dates in the database. Used to set placeholder values for the form input and to generate the Solr facet range query. Value is cached to avoid repeatedly calculating it.

Returns:

tuple of min, max

set_choices_from_facets(facets)[source]

Set choices on field from a dictionary of facets

class ppa.archive.forms.SearchWithinWorkForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]

Form to search for occurrences of a string within a particular instance of a digitized work.

QUESTION_POPUP_TEXT = '\n    Boolean search is supported. Operators must be capitalized (AND, OR).\n    '

help text to be shown with the form (appears when you hover over the question mark icon)

property media

Return all media required to render the widgets on this form.

class ppa.archive.forms.SelectDisabledMixin[source]

Mixin for django.forms.RadioSelect or django.forms.CheckboxSelect classes to set an option as disabled. To disable, the widget’s choice label option should be passed in as a dictionary with disabled set to True:

{'label': 'option', 'disabled': True}.
create_option(name, value, label, selected, index, subindex=None, attrs=None)[source]

Override option creation to optionally disable specified values

class ppa.archive.forms.SelectWithDisabled(attrs=None, choices=())[source]

Subclass of django.forms.Select with option to mark a choice as disabled.

Solr

class ppa.archive.solr.ArchiveSearchQuerySet(solr=None)[source]

add keyword search

query_opts()[source]

Extend default query options method to combine work and keyword search options based on what filters are present.

within_cluster(cluster_id)[source]

Search within a group of reprints/editions

work_filter(*args, **kwargs)[source]

Add filters to the work query

search works by title

class ppa.archive.solr.PageSearchQuerySet(solr: SolrClient | None = None)[source]
field_aliases = {'cluster_id': 'cluster_id_s', 'group_id': 'group_id_s', 'id': 'id', 'image_url': 'image_url_s', 'label': 'label', 'order': 'order', 'score': 'score', 'source_id': 'source_id', 'title': 'title'}

map of application-specific, readable field names to actual solr fields (i.e. if using dynamic field types)

Hathi

Utilities for working with HathiTrust materials and APIs.

class ppa.archive.hathi.HathiBaseAPI[source]

Base client class for HathiTrust APIs

api_root = ''

base api URL for all requests

class ppa.archive.hathi.HathiBibliographicAPI[source]

Wrapper for HathiTrust Bibliographic API.

https://www.hathitrust.org/bib_api

api_root = 'http://catalog.hathitrust.org/api'

base api URL for all requests

brief_record(id_type, id_value)[source]

Get brief record by id type and value.

Returns:

HathiBibliographicRecord

Raises:

HathiItemNotFound

record(id_type, id_value)[source]

Get full record by id type and value.

Returns:

HathiBibliographicRecord

Raises:

HathiItemNotFound

class ppa.archive.hathi.HathiBibliographicRecord(data)[source]

Representation of a HathiTrust bibliographic record.

copy_details(htid)[source]

Details for a specific copy identified by hathi id

copy_last_updated(htid)[source]

Return last update date for a specific copy identified by hathi id. Returns as datetime.date

marcxml

Record marcxml if included (full records only), as an instance of pymarc.Record

property pub_dates

list of available publication dates

property title

First title (standard title)

exception ppa.archive.hathi.HathiItemForbidden[source]

Permission denied to access item in data API

exception ppa.archive.hathi.HathiItemNotFound[source]

Item not found in bibliographic or data API

class ppa.archive.hathi.HathiObject(hathi_id)[source]

An object for working with a HathiTrust item with data in a locally configured pairtree datastore.

delete_pairtree_data()[source]

Delete pairtree object from the pairtree datastore.

mets_xml() MinimalMETS[source]

load METS xml file from pairtree and initialize as an instance of MinimalMETS

Return type:

MinimalMETS

Raises:

storage_exceptions.ObjectNotFoundException if the object is not found in pairtree storage

Raises:

storage_exceptions.PartNotFoundException if the mets.xml file is not found in pairtree storage for this object

metsfile_path(ptree_client=None)[source]

path to mets xml file within the hathi contents for this work

page_data()[source]

Return a generator of page content for this HathiTrust work based on pairtree and METS data, for indexing pages in Solr.

pairtree_client()[source]

Initialize a pairtree client for the pairtree datastore this object belongs to, based on its HathiTrust record id.

pairtree_object(ptree_client=None, create=False)[source]

get a pairtree object for this record

Parameters:

ptree_client – optional pairtree_client.PairtreeStorageClient if one has already been initialized, to avoid repeated initialization (currently used in hathi_import manage command)

zipfile_path(ptree_client=None)[source]

path to zipfile within the hathi contents for this work

class ppa.archive.hathi.METSFile(node=None, context=None, **kwargs)[source]

File location information within a METS document.

id = <eulxml.xmlmap.fields.StringField>

xml identifier

location = <eulxml.xmlmap.fields.StringField>

<METS:file SIZE="1" ID="TXT00000001" MIMETYPE="text/plain" CREATED="2016-06-24T09:04:15Z" CHECKSUM="68b329da9893e34099c7d8ad5cb9c940" SEQ="00000001" CHECKSUMTYPE="MD5">

sequence = <eulxml.xmlmap.fields.StringField>

sequence attribute

class ppa.archive.hathi.MinimalMETS(node=None, context=None, **kwargs)[source]

Minimal XmlObject for METS that maps only what is needed to support page indexing for ppa.

structmap_pages = <eulxml.xmlmap.fields.NodeListField>

list of struct map pages as StructMapPage

class ppa.archive.hathi.StructMapPage(node=None, context=None, **kwargs)[source]

Single logical page within a METS StructMap

display_label

page display label; use order label if present; otherwise use order

label = <eulxml.xmlmap.fields.StringField>

page label

order = <eulxml.xmlmap.fields.IntegerField>

page order

orderlabel = <eulxml.xmlmap.fields.StringField>

order label

text_file

METSFile corresponding to the text file pointer for this page

text_file_id = <eulxml.xmlmap.fields.StringField>
<METS:div ORDER="1" LABEL="FRONT_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page">

<METS:fptr FILEID="HTML00000001"/> <METS:fptr FILEID="TXT00000001"/> <METS:fptr FILEID="IMG00000001"/>

<METS:file SIZE="1003" ID="HTML00000496" MIMETYPE="text/html" CREATED="2017-03-20T10:40:21Z" CHECKSUM="f0a326c10b2a6dc9ae5e3ede261c9897" SEQ="00000496" CHECKSUMTYPE="MD5">

text_file_location

location for the text file

Gale

class ppa.archive.gale.GaleAPI[source]

Minimal Gale API client with functionality needed for PPA import.

Requires GALE_API_USERNAME configured in Django settings. Automatically uses the configured username to retrieve an API key when needed, and has logic to refresh the API key when it expires (30 minutes).

If TECHNICAL_CONTACT is configured in Django settings, it will be included in request headers when making API calls.

Implemented as a singleton; instantiating the class will return the same shared instance every time.

property api_key

Property for current api key. Uses get_api_key() to request a new one when needed.

api_root = 'https://api.gale.com/api'

base URL for all API requests

get_api_key()[source]

Get a new API key to use for requests in the next 30 minutes.

get_item(item_id)[source]

Get the full record for a single item

get_item_pages(item_id, gale_record=None)[source]

Return a generator of page content for the specified digitized work from the Gale API. Takes an optional gale_record parameter (item record as returned by Gale API), to avoid making an extra API call if data is already available.

instance = None

shared singleton instance; populated on first instantiation

refresh_api_key()[source]

clear cached api key and request a new one
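The singleton behavior described for GaleAPI can be sketched with a class-level instance attribute and a custom __new__. This is a generic illustration of the pattern under stated assumptions: the class name SingletonClientSketch is hypothetical, and the actual GaleAPI implementation may set up its shared instance differently.

```python
class SingletonClientSketch:
    """Generic singleton illustration (not the actual GaleAPI class)."""

    # shared singleton instance; populated on first instantiation
    instance = None

    def __new__(cls):
        if cls.instance is None:
            # create the shared instance only once
            cls.instance = super().__new__(cls)
        return cls.instance


first = SingletonClientSketch()
second = SingletonClientSketch()
same_object = first is second
```

Sharing one instance means that state such as a cached API key is reused across every place the client is instantiated.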

exception ppa.archive.gale.GaleAPIError[source]

Base exception class for Gale API errors

exception ppa.archive.gale.GaleItemForbidden[source]

Permission denied to access item in Gale API

exception ppa.archive.gale.GaleItemNotFound[source]

Item not found in Gale API

exception ppa.archive.gale.GaleUnauthorized[source]

Permission not authorized for Gale API access

exception ppa.archive.gale.MARCRecordNotFound[source]

record not found in local MARC record storage

ppa.archive.gale.get_local_ocr(item_id)[source]

Load local OCR page text for the specified Gale volume, if available. This requires a base directory (specified by GALE_LOCAL_OCR) to be configured and assumes the following organization:

  • Volume-level directories are organized in stub directories that correspond to every third number (e.g., CW0128905397 –> 193). So, a Gale volume’s OCR data is located in the following directory: GALE_LOCAL_OCR / stub_dir / item_id.json

  • Page text is stored as a JSON dictionary with keys based on Gale page numbers, which is a 4-digit string (e.g., “0004”).

Raises a FileNotFoundError if the local OCR page text does not exist.
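
The directory layout described above can be sketched as follows. The helper names `local_ocr_path` and `load_local_ocr` are hypothetical, and the slicing rule (every third character starting at index 3) is an assumption inferred from the single documented example; it is not taken from the ppa source.

```python
import json
import tempfile
from pathlib import Path


def local_ocr_path(base_dir, item_id):
    # Assumption: the stub is every third character starting at index 3,
    # which maps "CW0128905397" to "193" as in the documented example.
    stub_dir = item_id[3::3]
    return Path(base_dir) / stub_dir / f"{item_id}.json"


def load_local_ocr(base_dir, item_id):
    # open() raises FileNotFoundError when the file does not exist.
    with open(local_ocr_path(base_dir, item_id), encoding="utf-8") as ocr_file:
        return json.load(ocr_file)


# Usage: write a one-page sample file into a temporary base directory,
# then read it back using the same path logic.
with tempfile.TemporaryDirectory() as base_dir:
    ocr_path = local_ocr_path(base_dir, "CW0128905397")
    ocr_path.parent.mkdir(parents=True)
    ocr_path.write_text(json.dumps({"0004": "page text"}), encoding="utf-8")
    pages = load_local_ocr(base_dir, "CW0128905397")
```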

ppa.archive.gale.get_marc_record(marc_id)[source]

get a marc record from the pairtree storage by Gale ESTC id

ppa.archive.gale.get_marc_storage()[source]

return pairtree storage for marc records

Util

Manage Commands

Hathi Import

hathi_import is a custom manage command for bulk import of HathiTrust materials into the local database for management. It does not index into Solr for search and browse; use the index script for that after import.

This script expects a local copy of dataset files in pairtree format retrieved by rsync. (Note that pairtree data must include a pairtree version file to be valid.)

Contents are inspected from the configured HATHI_DATA path; DigitizedWork records are created or updated based on identifiers found and metadata retrieved from the HathiTrust Bibliographic API. Page content is only reflected in the database via a total page count per work (but page contents will be indexed in Solr via index script).

By default, existing records are updated only when the Hathi record has changed or when requested via the --update script option.

Supports importing specific items by hathi id, but the pairtree content for the items still must exist at the configured path.

Example usage:

# import everything with defaults
python manage.py hathi_import
# import specific items
python manage.py hathi_import htid1 htid2 htid3
# re-import and update records
python manage.py hathi_import --update
# display progressbar to show status and ETA
python manage.py hathi_import -v 0 --progress
class ppa.archive.management.commands.hathi_import.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Import HathiTrust digitized items into PPA to be managed and searched

add_arguments(parser)[source]

Entry point for subclassed commands to add custom arguments.

count_hathi_ids()[source]

count items in the pairtree structure without loading all into memory at once.

get_hathi_ids()[source]

Generator of hathi ids from previously rsynced hathitrust data, based on the configured HATHI_DATA path in settings.

handle(*args, **kwargs)[source]

The actual logic of the command. Subclasses must implement this method.

import_digitizedwork(htid)[source]

Import a single work into the database. Retrieves bibliographic data from Hathi api. If the record already exists in the database, it is only updated if the hathi record has changed or if an update is requested by the user. Creates admin log entry for record creation or record update. Returns None if there is an error retrieving bibliographic data or no update is needed; otherwise, returns the DigitizedWork.

initialize_pairtrees()[source]

Initialize pairtree storage clients for each subdirectory in the configured HATHI_DATA path.

v_normal = 1

normal verbosity level

Hathi Excerpt

hathi_excerpt is a custom manage command to convert existing HathiTrust items into excerpts or articles. It takes a CSV file with information about the items to excerpt. It does handle multiple excerpts for the same source id, as long as that source id is present in the database and data is available in the HathiTrust pairtree data.

The CSV must include the following fields:
  • Item Type

  • Volume ID

  • Title

  • Sort Title

  • Book/Journal Title

  • Digital Page Range

  • Collection

  • Record ID

If the CSV includes these optional fields, they will be used:
  • Author

  • Publication Date

  • Publication Place

  • Publisher

  • Enumcron

  • Original Page Range

  • Notes

  • Public Notes

Updated and added records are automatically indexed in Solr.
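
Before running the command, it can help to check a spreadsheet against the required fields listed above. The following is a minimal sketch using only the standard library; the function name `missing_required_fields` is hypothetical and is not part of the hathi_excerpt command.

```python
import csv
import io

# Required column names, as listed in the documentation above.
REQUIRED_FIELDS = [
    "Item Type",
    "Volume ID",
    "Title",
    "Sort Title",
    "Book/Journal Title",
    "Digital Page Range",
    "Collection",
    "Record ID",
]


def missing_required_fields(csv_file):
    """Return required column names absent from the CSV header row."""
    reader = csv.DictReader(csv_file)
    present = reader.fieldnames or []
    return [field for field in REQUIRED_FIELDS if field not in present]


# Usage with an in-memory CSV that only has three of the columns.
sample = io.StringIO("Item Type,Volume ID,Title\nExcerpt,abc.123,An Essay\n")
print(missing_required_fields(sample))
```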

class ppa.archive.management.commands.hathi_excerpt.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Convert existing HathiTrust full works into excerpts

add_arguments(parser)[source]

Entry point for subclassed commands to add custom arguments.

excerpt(row)[source]

Process a row of the spreadsheet, and either convert an existing full work to an excerpt or create a new excerpt.

handle(*args, **kwargs)[source]

The actual logic of the command. Subclasses must implement this method.

load_collections()[source]

load collections from the database and create a lookup based on collection names

load_csv(path)[source]

Load a CSV file with digworks to be excerpted.

log_action(digwork, created=True)[source]

Create a log entry to document excerpting or creating the record. Message and action flag are determined by created boolean.

setup()[source]

Run common setup steps for running the script or testing

v_normal = 1

normal verbosity level

Hathi rsync

class ppa.archive.management.commands.hathi_rsync.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Update HathiTrust pairtree data via rsync

add_arguments(parser)[source]

Entry point for subclassed commands to add custom arguments.

handle(*args, **kwargs)[source]

The actual logic of the command. Subclasses must implement this method.

v_normal = 1

normal verbosity level

Gale Import

gale_import is a custom manage command for bulk import of Gale materials into the local database for management. It takes either a list of Gale item ids or a path to a CSV file.

Items are imported into the database for management and also indexed into Solr as part of this import script (both works and pages).

Example usage:

# import from a csv file
python manage.py gale_import -c path/to/import.csv
# import specific items
python manage.py gale_import galeid1 galeid2 galeid3

When using a CSV file for import, it must include an ID field; it may also include NOTES (any contents will be imported into private notes), and fields to indicate collection membership to be set on import. These are the supported collection abbreviations:

  • OB: Original Bibliography

  • LIT: Literary

  • MUS: Music

  • TYP: Typographically Unique

  • LING: Linguistic

  • DIC: Dictionaries

  • WL: Word Lists
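
Reading collection membership from such a CSV might look like the sketch below. It assumes each abbreviation is a column name and that any non-empty value in that column marks membership; both are assumptions for illustration, and `row_collections` is a hypothetical helper rather than part of the command.

```python
import csv
import io

# Supported collection abbreviations, as listed above.
COLLECTION_CODES = {
    "OB": "Original Bibliography",
    "LIT": "Literary",
    "MUS": "Music",
    "TYP": "Typographically Unique",
    "LING": "Linguistic",
    "DIC": "Dictionaries",
    "WL": "Word Lists",
}


def row_collections(row):
    """Return collection names flagged in one CSV row."""
    # Assumption: any non-empty value in a code column signals membership.
    return [
        name
        for code, name in COLLECTION_CODES.items()
        if (row.get(code) or "").strip()
    ]


# Usage with an in-memory CSV containing the ID and NOTES fields.
csv_text = "ID,NOTES,OB,LIT,MUS\nCW0128905397,check date,x,,x\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
```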

class ppa.archive.management.commands.gale_import.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Import Gale content into PPA for management and search

add_arguments(parser)[source]

Entry point for subclassed commands to add custom arguments.

handle(*args, **kwargs)[source]

The actual logic of the command. Subclasses must implement this method.

import_record(gale_id, **kwargs)[source]

Import a single work into the database. Retrieves record data from Gale API.

load_collections()[source]

Load Collection records from the database and create a lookup based on the codes used in the spreadsheet.

load_csv(path)[source]

Load a CSV file with items to be imported.

v_normal = 1

normal verbosity level

Generate Corpus

generate_textcorpus is a custom manage command to generate a plain text corpus from Solr. It should be run after content has been indexed into Solr via the index manage command.

The full text corpus is generated from Solr; it does not include content for suppressed works or their pages (note that this depends on Solr content being current).

Examples:

  • Expected use:

    python manage.py generate_textcorpus

  • Specify a path:

    python manage.py generate_textcorpus --path ~/ppa_solr_corpus

  • Dry run (do not create any files or folders):

    python manage.py generate_textcorpus --dry-run

  • Partial run (save only N rows, for testing):

    python manage.py generate_textcorpus --doc-limit 100

  • Cron-style run (no progress bar, but logs):

    python manage.py generate_textcorpus --no-progress --verbosity 2

Notes:

  • Default path is ppa_corpus_{timestamp} in the current working directory

  • Default batch size is 10,000, meaning 10,000 records are pulled from solr at a time. Usage testing revealed that this default iterates over the collection the quickest.
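
The batch size note can be illustrated with a generic batching loop. This is a sketch only: `iter_in_batches` and `fake_fetch` are hypothetical names, and the stand-in fetcher replaces a real Solr query, so it does not reflect how the command itself talks to Solr.

```python
def iter_in_batches(fetch_batch, total, batch_size=10_000):
    """Yield records one at a time, requesting them batch_size at a time."""
    for start in range(0, total, batch_size):
        rows = min(batch_size, total - start)
        yield from fetch_batch(start, rows)


# Stand-in for a Solr query; records each (start, rows) request made.
calls = []


def fake_fetch(start, rows):
    calls.append((start, rows))
    return range(start, start + rows)


records = list(iter_in_batches(fake_fetch, total=25_000))
```

With 25,000 records and the default batch size, the sketch makes three requests, the last one for the remaining 5,000 records.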

class ppa.archive.management.commands.generate_textcorpus.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Custom manage command to generate a text corpus from text indexed in Solr.

add_arguments(parser)[source]

Configure additional CLI arguments

handle(*args, **options)[source]

The actual logic of the command. Subclasses must implement this method.

iter_pages()[source]

Yield results from iter_solr() with item_type=page

iter_solr(item_type='page')[source]

Returns a generator of Solr documents for the requested item_type (page or work).

iter_works()[source]

Yield results from iter_solr() with item_type=work

save_metadata()[source]

Save the work-level metadata as a json file

save_pages()[source]

Save the page-level data as a jsonl file

set_params(*args, **options)[source]

Run the command, generating metadata.jsonl and pages.jsonl

v_normal = 1

normal verbosity level

ppa.archive.management.commands.generate_textcorpus.nowstr()[source]

helper method to generate timestamp for use in output filename

EEBO-TCP Import

eebo_import is a custom manage command for bulk import of EEBO-TCP materials into the local database for management. It takes a path to a CSV file and requires that the path to EEBO data is configured in Django settings.

Items are imported into the database for management and also indexed into Solr as part of this import script (both works and pages).

Example usage:

python manage.py eebo_import path/to/eebo_works.csv
class ppa.archive.management.commands.eebo_import.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Import EEBO-TCP content into PPA for management and search

add_arguments(parser)[source]

Entry point for subclassed commands to add custom arguments.

handle(*args, **kwargs)[source]

The actual logic of the command. Subclasses must implement this method.

load_csv(path)[source]

Load a CSV file with items to be imported.

v_normal = 1

normal verbosity level

Common

Django app for common functionality that doesn’t have an obvious home

Admin

class ppa.common.admin.LocalUserAdmin(model, admin_site)[source]

Extends django.contrib.auth.admin.UserAdmin to provide additional detail for user administration.

group_names(obj)[source]

Custom property to display group membership.

Views

class ppa.common.views.AjaxTemplateMixin(**kwargs)[source]

View mixin to use a different template when responding to an ajax request.

ajax_template_name = None

name of the template to use for ajax request

get_template_names()[source]

Return ajax_template_name if this is an ajax request; otherwise return default template name.
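
The template selection can be sketched in plain Python, without Django. The function `choose_template`, the template names, and the assumption that an ajax request is identified by an `X-Requested-With: XMLHttpRequest` header are illustrative. The real method works on a view instance, and a Django `get_template_names()` normally returns a list rather than a single name.

```python
def choose_template(headers, template_name, ajax_template_name=None):
    """Pick the ajax template for ajax requests, else the default one."""
    # Assumption: ajax requests send X-Requested-With: XMLHttpRequest.
    is_ajax = headers.get("X-Requested-With") == "XMLHttpRequest"
    if is_ajax and ajax_template_name:
        return ajax_template_name
    return template_name


# Usage with hypothetical template names.
ajax_choice = choose_template(
    {"X-Requested-With": "XMLHttpRequest"},
    "archive/list.html",
    "archive/snippets/results.html",
)
default_choice = choose_template(
    {}, "archive/list.html", "archive/snippets/results.html"
)
```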

vary_headers = ['X-Requested-With']

vary on X-Requested-With to avoid browsers caching and displaying ajax response for the non-ajax response

class ppa.common.views.VaryOnHeadersMixin(**kwargs)[source]

View mixin to set Vary header - class-based view equivalent to django.views.decorators.vary.vary_on_headers(), adapted from winthrop-django.

Define vary_headers with the list of headers.

dispatch(request, *args, **kwargs)[source]

Wrap default dispatch method to patch headers on the response.

Pages

Management and display of home page and other content pages. Includes custom page models for collections page, contributor page, and a person snippet for contributors and authors.

Models

ppa.pages.models.ALT_TEXT_HELP = 'Alternative text for visually impaired users to\nbriefly communicate the intended message of the image in this context.'

help text for image alternative text

class ppa.pages.models.BodyContentBlock(*args, **kwargs)[source]

Common set of content blocks to be used on both content pages and editorial pages

class ppa.pages.models.CollectionPage(*args, **kwargs)[source]

Collection list page, with editable text content

exception DoesNotExist
exception MultipleObjectsReturned
get_context(request)[source]

Add collections and collection stats to template context

class ppa.pages.models.ContentPage(*args, **kwargs)[source]

Basic content page model.

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.ContributorPage(*args, **kwargs)[source]

Project contributor and advisory board page.

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.HomePage(*args, **kwargs)[source]

wagtail.models.Page model for PPA home page

exception DoesNotExist
exception MultipleObjectsReturned
get_context(request)[source]

Add collections with stats and previews for content pages to template context.

class ppa.pages.models.ImageWithCaption(*args, **kwargs)[source]

StructBlock for an image with a formatted caption, so caption can be context-specific. Also allows images to be floated right, left, or take up the width of the page.

class ppa.pages.models.LinkableSectionBlock(*args, **kwargs)[source]

StructBlock for a rich text block and an associated title that will render as an <h2>. Creates an anchor (<a>) so that the section can be directly linked to using a url fragment.

clean(value)[source]

Validate value and return a cleaned version of it, or throw a ValidationError if validation fails. The thrown ValidationError instance will subsequently be passed to render() to display the error message; the ValidationError must therefore include all detail necessary to perform that rendering, such as identifying the specific child block(s) with errors, in the case of nested blocks. (It is suggested that you use the ‘params’ attribute for this; using error_list / error_dict is unreliable because Django tends to hack around with these when nested.)

class ppa.pages.models.PagePreviewDescriptionMixin(*args, **kwargs)[source]

Page mixin with logic for page preview content. Adds an optional richtext description field, and methods to get description and plain-text description, for use in previews on the site and plain-text metadata previews.

allowed_tags = ['p', 'li', 'strong', 'b', 'acronym', 'abbr', 'ul', 'ol', 'em', 'i', 'code', 'blockquote']

allowed tags for bleach html stripping in description

get_description()[source]

Get formatted description for preview. Uses description field if there is content, otherwise uses the beginning of the body content.

get_plaintext_description()[source]

Get plain-text description for use in metadata. Uses search_description field if set; otherwise uses the result of get_description() with tags stripped.

max_length = 250

maximum length for description to be displayed

class ppa.pages.models.Person(*args, **kwargs)[source]

Common model for a person, currently used to document authorship for instances of ppa.editorial.models.EditorialPage.

exception DoesNotExist
exception MultipleObjectsReturned
description

description (affiliation, etc.)

name

the display name of an individual

photo

Optional profile image to be associated with a person

project_role

project role

project_years

project years

url

identifying URI for a person (VIAF, ORCID iD, personal website, etc.)

class ppa.pages.models.SVGImageBlock(*args, **kwargs)[source]

StructBlock for an SVG image with alternative text and optional formatted caption. Separate from CaptionedImageBlock because Wagtail image handling does not work with SVG.

Embed Finders

Custom EmbedFinder implementations for embedding content in wagtail pages.

class ppa.pages.embed_finders.GlitchEmbedFinder[source]

Custom oembed finder built to embed Glitch apps in wagtail pages.

To support embedding, the glitch app should include a file named embed.json, available directly under the top level url, with oembed content:

{
  "title": "title",
  "author_name": "author",
  "provider_name": "Glitch",
  "type": "rich",
  "thumbnail_url": "URL to thumbnail image",
  "width": xx,
  "height": xx
}

If the request for an embed.json file fails, no content will be embedded.

Any urls that cannot automatically be made relative by embed code (i.e. data files loaded by javascript code) should use absolute URLs, or they will not resolve when embedded.

accept(url)[source]

Accept a url if it includes .glitch.me

find_embed(url, max_width=None)[source]

Retrieve embed.json and requested url and return content for embedding it on the site.
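
The url check and the location of embed.json can be sketched with the standard library. These are hypothetical module-level helpers, not the finder's actual methods, and the example app name `my-app` is made up; the sketch only covers url handling and does not fetch anything.

```python
from urllib.parse import urljoin


def accept(url):
    # Accept a url if it includes .glitch.me
    return ".glitch.me" in url


def embed_json_url(url):
    # embed.json is expected directly under the top-level url,
    # so resolve it against the site root rather than the page path.
    return urljoin(url, "/embed.json")


# Usage with a made-up Glitch app url.
print(embed_json_url("https://my-app.glitch.me/some/page"))
```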

Manage Commands

Setup Site Pages

setup_site_pages is a custom manage command to install a default set of pages and menus for the Wagtail CMS. It is designed not to touch other content.

Example usage:

python manage.py setup_site_pages
class ppa.pages.management.commands.setup_site_pages.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]

Setup initial wagtail site and pages needed for PPA navigation

create_wagtail_site(root_page)[source]

Create a wagtail site object from the current default Django site.

handle(*args, **kwargs)[source]

The actual logic of the command. Subclasses must implement this method.

v_normal = 1

normal verbosity level

Editorial

Management and display of editorial content. Includes custom page models for an editorial list page and editorial content pages, structured roughly like a scholarly blog.

Models

class ppa.editorial.models.EditorialIndexPage(*args, **kwargs)[source]

Editorial index page; list recent editorial articles.

exception DoesNotExist
exception MultipleObjectsReturned
get_context(request)[source]

Add published editorial posts to template context, most recent first

route(request, path_components)[source]

Customize editorial page routing to serve editorial pages by year/month/slug.
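
Interpreting year/month/slug path components might look like the sketch below. The function `parse_editorial_path` is hypothetical, and the requirement of a 4-digit year and a 2-digit month is an assumption; the actual routing logic may accept other forms.

```python
import re

# Assumption: 4-digit year, 2-digit month, then a slug of word
# characters and hyphens.
PATH_PATTERN = re.compile(r"(\d{4})/(\d{2})/([\w-]+)", re.ASCII)


def parse_editorial_path(path_components):
    """Return (year, month, slug), or None if the path does not match."""
    match = PATH_PATTERN.fullmatch("/".join(path_components))
    if match is None:
        return None
    year, month, slug = match.groups()
    return int(year), int(month), slug


# Usage with a made-up post path.
print(parse_editorial_path(["2019", "03", "my-post"]))
```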

class ppa.editorial.models.EditorialPage(*args, **kwargs)[source]

Editorial page, for scholarly, educational, or other essay-like content related to the site

exception DoesNotExist
exception MultipleObjectsReturned
set_url_path(parent)[source]

Generate the url_path field based on this page’s slug, first publication date, and the specified parent page. Adapted from default logic to include publication date. (Parent is passed in for previewing unsaved pages)

Unapi

Django app for unAPI service

Views

class ppa.unapi.views.UnAPIView(**kwargs)[source]

Simple unAPI service endpoint. With no parameters or only id, provides a list of available metadata formats. If id and format are specified, returns the metadata for the specified item in the requested format.

See archived unAPI website for more details. https://web.archive.org/web/20140331070802/http://unapi.info/specs/

content_type = 'application/xml'

default content type, when serving format information

file_extension = {'marc': 'mrc'}

file extension for metadata formats, as a convenience to set download filename extension

formats = {'marc': {'type': 'application/marc'}}

available metadata formats

get(*args, **kwargs)[source]

Override get to check if id and format are specified; if they are, return the requested metadata. Otherwise, falls back to normal template view behavior and displays format information.
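
The branching described above can be sketched as a plain function. `unapi_response` is a hypothetical helper, the returned dictionaries are illustrative, and the choice to fall back to format information for an unrecognized format is an assumption; the `formats`, `file_extension`, `content_type`, and `template_name` values mirror the class attributes documented in this section.

```python
FORMATS = {"marc": {"type": "application/marc"}}
FILE_EXTENSION = {"marc": "mrc"}


def unapi_response(params):
    """Decide between serving metadata and listing available formats."""
    item_id = params.get("id")
    data_format = params.get("format")
    if item_id and data_format in FORMATS:
        # Both id and a known format: serve metadata as a download.
        filename = f"{item_id}.{FILE_EXTENSION[data_format]}"
        return {
            "content_type": FORMATS[data_format]["type"],
            "filename": filename,
        }
    # No parameters, only id, or (by assumption) an unknown format:
    # fall back to the format information template.
    return {"content_type": "application/xml", "template": "unapi/formats.xml"}


# Usage with a made-up item id.
metadata = unapi_response({"id": "item1", "format": "marc"})
listing = unapi_response({"id": "item1"})
```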

get_context_data(*args, **kwargs)[source]

pass formats and id to template context

get_metadata(item_id, data_format)[source]

get item and requested metadata

template_name = 'unapi/formats.xml'

template for format information