Code Documentation

Archive

Django app for archival materials included in PPA.

ppa.archive.NO_COLLECTION_LABEL = 'Uncategorized'

label to use for items that are not in a collection

Admin

class ppa.archive.admin.CollectionAdmin(model, admin_site)[source]
class ppa.archive.admin.DigitizedWorkAdmin(model, admin_site)[source]
bulk_add_collection(request, queryset)[source]

Bulk add a queryset of ppa.archive.models.DigitizedWork to a ppa.archive.models.Collection.

get_readonly_fields(request, obj=None)[source]

Determine read only fields based on item source, to prevent editing of HathiTrust fields that should not be changed.

list_collections(obj)[source]

Return the names of ppa.archive.models.Collection objects as a comma-separated list, to populate a change_list column.

save_model(request, obj, form, change)[source]

Note any fields in the protected list that have been changed in the admin and preserve them in the database.

Ensure reindex is called when the admin form is saved.

Link to source record

Models

class ppa.archive.models.Collection(*args, **kwargs)[source]

A collection of ppa.archive.models.DigitizedWork instances.

exception DoesNotExist
exception MultipleObjectsReturned
description

a RichText description of the collection

exclude

flag to indicate collections to be excluded by default in public search

get_usage()

Returns a queryset of pages that link to a particular object

name

the name of the collection

property name_changed

check if name has been changed (only works on current instance)

static stats()[source]

Collection counts and date ranges, based on what is in Solr. Returns a dictionary where the keys are collection names and the values are dictionaries with count and dates.

class ppa.archive.models.CollectionSignalHandlers[source]

Signal handlers for indexing DigitizedWork records when Collection records are saved or deleted.

static delete(sender, instance, **kwargs)[source]

signal handler for collection delete; clear associated digitized works and reindex

static save(sender, instance, **kwargs)[source]

signal handler for collection save; reindex associated digitized works

class ppa.archive.models.DigitizedWork(*args, **kwargs)[source]

Record to manage digitized works included in PPA and store their basic metadata.

exception DoesNotExist
exception MultipleObjectsReturned
static add_from_hathi(htid, bib_api=None, update=False, get_data=False, log_msg_src=None, user=None)[source]

Add or update a HathiTrust work in the database. Retrieves bibliographic data from Hathi api, retrieves or creates a DigitizedWork record, and populates the metadata if this is a new record, if the Hathi metadata has changed, or if update is requested. Creates admin log entry to document record creation or update. If get_data is specified, will retrieve structure and aggregate data from Hathi Data API and add it to the local pairtree datastore.

Raises ppa.archive.hathi.HathiItemNotFound for invalid id.

Returns the new or updated DigitizedWork.

Parameters
  • htid – HathiTrust record identifier

  • bib_api – optional HathiBibliographicAPI instance, to allow for shared sessions in scripts

  • update – update bibliographic metadata even if the HathiTrust record is not newer than the local database record (default: False)

  • get_data – retrieve content data from Data API; for new records only (default: False)

  • log_msg_src – source of the change to be included in log entry messages (optional). Will be used as “Created/updated [log_msg_src]”.

  • user – optional user responsible for the change, to be associated with LogEntry record
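
The conditions under which metadata is populated can be restated as a small sketch. The helper name should_populate and its arguments are hypothetical and not part of ppa.archive; the sketch also assumes that "the Hathi metadata has changed" is detected by comparing last-update dates, which the update parameter description suggests but does not confirm.

```python
from datetime import date


def should_populate(is_new, hathi_updated, local_updated, update=False):
    """Hypothetical helper: decide whether record metadata should be
    (re)populated, following the three conditions described above."""
    # new records and explicit update requests always populate
    if is_new or update:
        return True
    # assumption: "changed" means the HathiTrust copy is newer than ours
    return hathi_updated > local_updated
```

For example, an existing record whose local copy is newer than the HathiTrust copy would only be repopulated when update=True.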

added

date added to the archive

clean()[source]

Add custom validation to trigger a save error in the admin if someone tries to unsuppress a record that has been suppressed (not yet supported).

collections

collections that this work is part of

compare_protected_fields(db_obj)[source]

Compare protected fields in a ppa.archive.models.DigitizedWork instance and return those that are changed.

Parameters

db_obj (object) – Database instance of a DigitizedWork.

count_pages(ptree_client=None)[source]

Count the number of pages for a digitized work based on the number of files in the zipfile within the pairtree content. Raises pairtree.storage_exceptions.ObjectNotFoundException if the data is not found in the pairtree storage. Returns page count found; saves the object if the count changes.

display_title()[source]

admin display title to allow displaying title but sorting on sort_title

enumcron

enumeration/chronology (hathi-specific)

get_absolute_url()[source]

Return object’s url for ppa.archive.views.DigitizedWorkDetailView

get_hathi_data()[source]

Use Data API to fetch zipfile and mets and add them to the local pairtree. Intended for use with newly added HathiTrust records not imported from local pairtree data.

Raises HathiItemNotFound for an invalid id and HathiItemForbidden for a valid record that the configured Data API credentials are not permitted to access.

get_metadata(metadata_format)[source]

Get metadata for this item in the specified format. Currently only supports marc.

property has_fulltext

Checks if an item has full text (currently only items from HathiTrust).

hathi

ppa.archive.hathi.HathiObject for HathiTrust records, for working with data in HathiTrust pairtree data structure.

index_data()[source]

data for indexing in Solr

index_id()[source]

source id is used as solr identifier

is_public()[source]

admin display field indicating if record is public or suppressed

property is_suppressed

Item has been suppressed (based on status).

notes

internal team notes, not displayed on the public facing site

page_count

number of pages in the work

page_index_data()[source]

Get page content for this work from Hathi pairtree and return data to be indexed in solr.

populate_fields(field_data)[source]

Conditionally update fields as protected by flags using Hathi bibdata information.

Parameters

field_data (dict) – A dictionary of fields updated from a ppa.archive.hathi.HathiBibliographicRecord instance.

populate_from_bibdata(bibdata)[source]

Update record fields based on Hathi bibdata information. Full record is required in order to set all fields

Parameters

bibdata – bibliographic data returned from HathiTrust as instance of ppa.archive.hathi.HathiBibliographicRecord

printed_by_re = '^(Printed)?( and )?(Pub(.|lished|lisht)?)?( and sold)? (by|for|at)( the)? ?'

regular expression for cleaning preliminary text from publisher names
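
As a rough illustration of what this pattern removes, the sketch below applies it with re.sub. The pattern is copied from printed_by_re above; the helper name clean_publisher is hypothetical, and whether the project applies extra flags (such as case-insensitive matching) is not stated here, so none are assumed.

```python
import re

# pattern copied verbatim from DigitizedWork.printed_by_re
printed_by_re = '^(Printed)?( and )?(Pub(.|lished|lisht)?)?( and sold)? (by|for|at)( the)? ?'


def clean_publisher(name):
    """Hypothetical helper: strip leading text such as 'Printed by'."""
    return re.sub(printed_by_re, '', name)
```

Names without a matching prefix are returned unchanged, because the pattern is anchored at the start of the string and requires one of "by", "for", or "at".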

protected_fields

ProtectedWorkField instance to indicate metadata fields that should be preserved from bulk updates because they have been modified in Django admin.

pub_place

place of publication

public_notes

public notes field for this work

publisher

publisher

record_id

record id; for Hathi materials, used for different copies of the same work or for different editions/volumes of a work

save(*args, **kwargs)[source]

Save data and reset the copy of initial data.

sort_title

sort title: title without leading non-sort characters, from marc

source

source of the record, HathiTrust or elsewhere

source_id

source identifier; hathi id for HathiTrust materials

source_url

source url where the original can be accessed

status

status of record; currently choices are public or suppressed

subtitle

subtitle of the work; using TextField to allow for long titles

title

title of the work; using TextField to allow for long titles

updated

date of last modification of the local record

ppa.archive.models.NO_COLLECTION_LABEL = 'Uncategorized'

label to use for items that are not in a collection

class ppa.archive.models.ProtectedWorkField(verbose_name=None, name=None, **kwargs)[source]

PositiveSmallIntegerField subclass that returns a ProtectedWorkFieldFlags object and stores as integer.

from_db_value(value, expression, connection, context)[source]

Always return an instance of ProtectedWorkFieldFlags

get_prep_value(value)[source]

Perform preliminary non-db specific value checks and conversions.

to_python(value)[source]

Always return an instance of ProtectedWorkFieldFlags

class ppa.archive.models.ProtectedWorkFieldFlags[source]

flags.Flags instance to indicate which DigitizedWork fields should be protected if edited in the admin.

author = None

author

classmethod deconstruct()[source]

Give Django information needed to make ProtectedWorkFieldFlags.no_flags default in migration.

enumcron = None

enumcron

pub_date = None

publication date

pub_place = None

place of publication

publisher = None

publisher

sort_title = None

sort title

subtitle = None

subtitle

title = None

title
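
The idea of storing a set of protected fields as a single integer can be sketched with the standard library's enum.IntFlag. This is a stand-in, not the flags.Flags class the project uses; the class name FieldFlags, the helper names, and the specific bit values are assumptions chosen for illustration.

```python
from enum import IntFlag


class FieldFlags(IntFlag):
    """Standard-library stand-in for ProtectedWorkFieldFlags."""
    no_flags = 0
    # bit values below are illustrative assumptions
    title = 1
    subtitle = 2
    sort_title = 4
    enumcron = 8
    author = 16
    pub_place = 32
    publisher = 64
    pub_date = 128


def to_db(flags):
    """Store a combination of flags as a plain integer."""
    return int(flags)


def from_db(value):
    """Always return a flags object, even for an empty database value."""
    return FieldFlags(value or 0)
```

Because every combination fits in eight bits, a small positive integer column is enough to hold the value.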

class ppa.archive.models.TrackChangesModel(*args, **kwargs)[source]

Model mixin that keeps a copy of initial data in order to check if fields have been changed. Change detection only works on the current instance of an object.

has_changed(field)[source]

check if a field has been changed

initial_value(field)[source]

return the initial value for a field

save(*args, **kwargs)[source]

Save data and reset the copy of initial data.
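
The change-tracking behavior described for this mixin can be sketched in plain Python. The class below is not a Django model and is not the project's implementation; it is a minimal illustration, under that assumption, of keeping a copy of initial values and resetting it on save.

```python
class TrackChanges:
    """Plain-Python sketch of the change-tracking idea."""

    def __init__(self, **fields):
        self.__dict__.update(fields)
        # keep a copy of the initial data for later comparison
        self._initial = dict(fields)

    def initial_value(self, field):
        """Return the value the field had when loaded or last saved."""
        return self._initial.get(field)

    def has_changed(self, field):
        """Check whether the current value differs from the initial one."""
        return getattr(self, field) != self._initial.get(field)

    def save(self):
        # a real model would write to the database here;
        # afterwards the current values become the new baseline
        self._initial = {key: getattr(self, key) for key in self._initial}
```

This also shows why detection only works on the current instance: the baseline lives on the object itself, so a second copy of the same record has its own baseline.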

Views

class ppa.archive.views.AddFromHathiView(**kwargs)[source]

Admin view to add new HathiTrust records by providing a list of ids.

form_class

alias of ppa.archive.forms.AddFromHathiForm

form_valid(form)[source]

If the form is valid, redirect to the supplied URL.

get_context_data(*args, **kwargs)[source]

Insert the form into the context dict.

class ppa.archive.views.AddToCollection(**kwargs)[source]

View to bulk add a queryset of ppa.archive.models.DigitizedWork to a set of ppa.archive.models.Collection instances.

Restricted to staff users via staff_member_required on url.

form_class

alias of ppa.archive.forms.AddToCollectionForm

get_queryset(*args, **kwargs)[source]

Return a queryset filtered by id, or empty list if no ids

get_success_url()[source]

Redirect to the ppa.archive.models.DigitizedWork change_list in the Django admin with pagination and filters preserved. Expects ppa.archive.admin.bulk_add_collection() to have set ‘collection-add-filters’ as a dict in the request’s session.

model

alias of ppa.archive.models.DigitizedWork

post(request, *args, **kwargs)[source]

Add ppa.archive.models.DigitizedWork instances passed in form data to selected instances of ppa.archive.models.Collection, then return to change_list view.

Expects a list of DigitizedWork ids to be set in the request session.

class ppa.archive.views.DigitizedWorkByRecordId(**kwargs)[source]

Redirect from DigitizedWork record id to detail view when possible. If there is only one record found, redirect. If multiple are found, 404.

get_redirect_url(*args, **kwargs)[source]

Return the URL to redirect to. Keyword arguments from the URL pattern match generating the redirect request are provided as kwargs to this method.

class ppa.archive.views.DigitizedWorkCSV(**kwargs)[source]

Export of digitized work details as CSV download.

get(*args, **kwargs)[source]

Return CSV file on GET request.

get_csv_filename()[source]

Return the CSV file name based on the current datetime.

Returns

the filename for the CSV to be generated

Return type

str

get_data()[source]

Get data for the CSV.

Returns

rows for CSV columns

Return type

tuple

model

alias of ppa.archive.models.DigitizedWork

render_to_csv(data)[source]

Render the CSV as an HTTP response.

Return type

django.http.HttpResponse
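
The three steps above (build a filename, gather rows, render CSV) can be sketched with the standard csv module. The function names, the filename pattern, and the column names are hypothetical; the real view returns a django.http.HttpResponse, while this sketch returns the CSV text as a string so it runs without Django.

```python
import csv
import io
from datetime import datetime


def csv_filename(now=None):
    """Hypothetical filename pattern based on the current datetime."""
    now = now or datetime.now()
    return 'works-%s.csv' % now.strftime('%Y%m%dT%H%M%S')


def render_rows(header, rows):
    """Render a header and an iterable of row tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

In a Django view, the same writer could be pointed at the response object instead of a StringIO buffer, since csv.writer accepts any object with a write method.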

class ppa.archive.views.DigitizedWorkDetailView(**kwargs)[source]

Display details for a single digitized work. If a work has been suppressed, returns a 410 Gone response.

form_class

alias of ppa.archive.forms.SearchWithinWorkForm

get_context_data(**kwargs)[source]

Insert the single object into the context dict.

get_template_names()[source]

Return ajax_template_name if this is an ajax request; otherwise return default template name.

last_modified()[source]

get last index modification from Solr, as it will be more current than object last modified.

model

alias of ppa.archive.models.DigitizedWork

class ppa.archive.views.DigitizedWorkListView(**kwargs)[source]

Search and browse digitized works. Based on Solr index of works and pages.

form_class

alias of ppa.archive.forms.SearchForm

get_context_data(**kwargs)[source]

Get the context for this view.

get_page_highlights(page_groups)[source]

If there is a keyword search, query Solr for matching pages with text highlighting. Note that this has to be done as a separate query because Solr doesn’t support highlighting on collapsed items.

get_queryset(**kwargs)[source]

Return the list of items for this view.

The return value must be an iterable and may be an instance of QuerySet in which case QuerySet specific behavior will be enabled.

last_modified()[source]

override last modified logic to work with Solr

meta_description = 'The Princeton Prosody Archive is a full-text\n searchable database of thousands of historical documents about the\n study of language and the study of poetry.'

page description for metadata/preview

meta_title = 'Princeton Prosody Archive'

title for metadata / preview

model

alias of ppa.archive.models.DigitizedWork

class ppa.archive.views.OpenSearchDescriptionView(**kwargs)[source]

Basic open search description for searching the archive via browser or other tools.

Forms

class ppa.archive.forms.AddFromHathiForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]

Form to input HathiTrust IDs for items to be added.

get_hathi_ids()[source]

Get list of ids from valid form input. Splits on newlines, strips whitespace, and ignores empty lines.
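
The parsing described above is simple enough to sketch directly. The function name split_ids is hypothetical and stands in for the form method; the ids in the usage note are made-up placeholders.

```python
def split_ids(text):
    """Split on newlines, strip whitespace, and ignore empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For input pasted from a spreadsheet, such as "mdp.123" followed by a blank line and an indented "uc1.456", this returns the two ids without surrounding whitespace.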

class ppa.archive.forms.AddToCollectionForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]

Form to select a set of ppa.archive.models.Collection instances to which to bulk add a queryset of ppa.archive.models.DigitizedWork.

class ppa.archive.forms.CheckboxSelectMultipleWithDisabled(attrs=None, choices=())[source]

Subclass of django.forms.CheckboxSelectMultiple with option to mark a choice as disabled.

class ppa.archive.forms.FacetChoiceField(*args, **kwargs)[source]

Add CheckboxSelectMultiple field with facets taken from solr query

valid_value(value)[source]

Check to see if the provided value is a valid choice

class ppa.archive.forms.ModelMultipleChoiceFieldWithEmpty(queryset, required=True, widget=None, label=None, initial=None, help_text='', *args, **kwargs)[source]

Extend django.forms.ModelMultipleChoiceField to add an option for an unset or empty choice (i.e. no relationship in a many-to-many relationship such as collection membership).

clean(value)[source]

Extend clean to use default validation on all values but the empty id.

class ppa.archive.forms.RadioSelectWithDisabled(attrs=None, choices=())[source]

Subclass of django.forms.RadioSelect with option to mark a choice as disabled.

class ppa.archive.forms.RangeField(*args, **kwargs)[source]
compress(data_list)[source]

Returns a single value for the given list of values. The values can be assumed to be valid.

For example, if this MultiValueField was instantiated with fields=(DateField(), TimeField()), this might return a datetime object created by combining the date and time in data_list.

widget

alias of RangeWidget

class ppa.archive.forms.RangeWidget(*args, **kwargs)[source]

date range widget, for two numeric inputs

decompress(value)[source]

Returns a list of decompressed values for the given compressed value. The given value can be assumed to be valid, but not necessarily non-empty.

sep = '-'

separator string when splitting out values in decompress
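
A minimal sketch of splitting a compressed range value on the separator follows. It assumes the compressed value is a string such as "1800-1900" and that a missing end is represented as None; neither detail is confirmed by the description above, and the function is a standalone stand-in for the widget method.

```python
SEP = '-'  # same separator as RangeWidget.sep


def decompress(value):
    """Split a compressed range into [start, end]; empty parts become None."""
    if not value:
        return [None, None]
    start, _, end = value.partition(SEP)
    return [start or None, end or None]
```

Using partition rather than split keeps the result at exactly two items even when the separator is missing.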

class ppa.archive.forms.SearchForm(data=None, *args, **kwargs)[source]

Simple search form for digitized works.

QUESTION_POPUP_TEXT = '\n Boolean search within a field is supported. Operators must be capitalized (AND, OR).\n '

help text to be shown with the form (appears when you hover over the question mark icon)

static defaults()[source]

Default values when initializing the form. Sort by title, pre-select collections based on the exclude property.

get_solr_sort_field(sort)[source]

Set solr sort fields for the query based on sort and query strings.

Returns

solr sort field

has_keyword_query(data)[source]

check if any of the keyword search fields have search terms

pub_date_minmax()[source]

Get minimum and maximum values for DigitizedWork publication dates in the database. Used to set placeholder values for the form input and to generate the Solr facet range query. Value is cached to avoid repeatedly calculating it.

Returns

tuple of min, max

set_choices_from_facets(facets)[source]

Set choices on field from a dictionary of facets

class ppa.archive.forms.SearchWithinWorkForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]

Form to search for occurrences of a string within a particular instance of a digitized work.

QUESTION_POPUP_TEXT = '\n Boolean search is supported. Operators must be capitalized (AND, OR).\n '

help text to be shown with the form (appears when you hover over the question mark icon)

class ppa.archive.forms.SelectDisabledMixin[source]

Mixin for django.forms.RadioSelect or django.forms.CheckboxSelectMultiple classes to set an option as disabled. To disable, the widget’s choice label option should be passed in as a dictionary with disabled set to True:

{'label': 'option', 'disabled': True}.
class ppa.archive.forms.SelectWithDisabled(attrs=None, choices=())[source]

Subclass of django.forms.Select with option to mark a choice as disabled.

Solr

class ppa.archive.solr.CoreAdmin[source]

Solr Core Admin API wrapper

reload(core=None)[source]

Reload an existing Solr core, e.g. so that schema changes take effect. If core is not specified, uses the configured project collection/core.

class ppa.archive.solr.Indexable[source]

Mixin for objects that are indexed in Solr. Subclasses must implement index_id and index_data methods.

Subclasses may include an index_depends_on property which is used by identify_index_dependencies() to determine index dependencies on related objects, including many-to-many relationships. This property should be structured like this:

index_depends_on = {
    'attr_name': {      # string name of the attribute on this model
        'save': handle_attr_save,  # signal handler for post_save on this model
        'delete': handle_attr_delete,   # signal handler for pre_delete on this model
    }
}

If the attribute is a many-to-many field, indexing will also be triggered when the relationship changes (a signal handler will listen for models.signals.m2m_changed on the through model). Signal handler methods for save and delete are optional.

classmethod identify_index_dependencies()[source]

Identify and set lists of index dependencies for the subclass of Indexable.

index(params=None)[source]

Index the current object in Solr. Allows passing in parameters, e.g. to set a commitWithin value.

index_chunk_size = 150

number of items to index at once when indexing a large number of items

index_data()[source]

should return a dictionary of data for indexing in Solr

index_id()[source]

the value that is used as the Solr id for this object

classmethod index_items(items, params=None, progbar=None)[source]

Indexable class method to index multiple items at once. Takes a list, queryset, or generator of Indexable items or dictionaries. Items are indexed in chunks, based on Indexable.index_chunk_size. Takes an optional progressbar object to update when indexing items in chunks. Returns a count of the number of items indexed.
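
Chunked indexing of a list, queryset, or generator can be sketched with itertools.islice. The names chunks and index_in_chunks are hypothetical, and the send callback stands in for whatever call submits documents to Solr; the progress bar handling is omitted.

```python
from itertools import islice


def chunks(items, size=150):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def index_in_chunks(items, send, size=150):
    """Send items in chunks and return the number of items sent."""
    count = 0
    for chunk in chunks(items, size):
        send(chunk)  # stand-in for the Solr index request
        count += len(chunk)
    return count
```

Working from an iterator means a generator is consumed lazily, so a large set of items never has to be held in memory all at once.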

remove_from_index(params=None)[source]

Remove the current object from Solr by identifier using index_id()

class ppa.archive.solr.PagedSolrQuery(query_opts=None)[source]

A Solr query object that wraps a SolrClient query in a way that allows search results to be paginated by django paginator.

count()[source]

Total number of results in the query

facet_ranges

Return Solr range facets, with counts converted from a list of start date and count to an OrderedDict.
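
The conversion described here can be sketched as follows, assuming Solr returns range facet counts as a flat list that alternates start value and count (for example ["1800", 12, "1850", 30]). The function name pair_counts is hypothetical.

```python
from collections import OrderedDict


def pair_counts(counts):
    """Convert [start, count, start, count, ...] to an OrderedDict."""
    # even positions are range starts, odd positions are their counts
    return OrderedDict(zip(counts[::2], counts[1::2]))
```

The OrderedDict preserves the order in which Solr returned the ranges, which matters when the result is used to draw a histogram.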

get_expanded()[source]

get the expanded results from a collapsed query

get_facets()[source]

Wrap SolrClient.SolrResponse.get_facets() to get query facets as a dict of dicts.

get_highlighting()[source]

get highlighting results from the response

get_json()[source]

Return query response as JSON data, to allow full access to anything included in Solr data.

get_results()[source]

Return results of the Solr query.

Returns

docs as a list of dictionaries.

raw_response

Return the raw Solr result to provide access to return sections not exposed by SolrClient

set_limits(start, stop)[source]

Return a subsection of the results, to support slicing.

class ppa.archive.solr.SolrSchema[source]

Solr Schema object. Includes project schema configuration and methods to update configured Solr instance.

fields = [{'name': 'source_id', 'type': 'string', 'required': False}, {'name': 'content', 'type': 'text_en', 'required': False}, {'name': 'item_type', 'type': 'string', 'required': False}, {'name': 'title', 'type': 'text_en', 'required': False}, {'name': 'subtitle', 'type': 'text_en', 'required': False}, {'name': 'sort_title', 'type': 'string_i', 'required': False}, {'name': 'enumcron', 'type': 'string', 'required': False}, {'name': 'author', 'type': 'text_en', 'required': False}, {'name': 'pub_date', 'type': 'int', 'required': False}, {'name': 'pub_place', 'type': 'text_en', 'required': False}, {'name': 'publisher', 'type': 'text_en', 'required': False}, {'name': 'source_url', 'type': 'string', 'required': False}, {'name': 'order', 'type': 'string', 'required': False}, {'name': 'collections', 'type': 'text_en', 'required': False, 'multiValued': True}, {'name': 'notes', 'type': 'text_en', 'required': False}, {'name': 'label', 'type': 'text_en', 'required': False}, {'name': 'tags', 'type': 'string', 'required': False, 'multiValued': True}, {'name': 'author_exact', 'type': 'string', 'required': False}, {'name': 'collections_exact', 'type': 'string', 'required': False, 'multiValued': True}, {'name': 'title_nostem', 'type': 'text_nostem', 'required': False}, {'name': 'subtitle_nostem', 'type': 'text_nostem', 'required': False}, {'name': 'last_modified', 'type': 'date', 'default': 'NOW'}]

solr schema field definitions

solr_schema_field_types()[source]

Dictionary of currently configured Solr schema fields

solr_schema_fields()[source]

List of currently configured Solr schema fields

text_fields = []

fields to be copied into general purpose text field for searching

update_solr_schema()[source]

Update the configured solr instance schema to match the configured fields. Returns a tuple with the number of fields created and updated.

ppa.archive.solr.get_solr_connection()[source]

Initialize a Solr connection using project settings

Hathi

Utilities for working with HathiTrust materials and APIs.

class ppa.archive.hathi.HathiBaseAPI[source]

Base client class for HathiTrust APIs

api_root = ''

base api URL for all requests

class ppa.archive.hathi.HathiBibliographicAPI[source]

Wrapper for HathiTrust Bibliographic API.

https://www.hathitrust.org/bib_api

brief_record(id_type, id_value)[source]

Get brief record by id type and value.

Returns

HathiBibliographicRecord

Raises

HathiItemNotFound

record(id_type, id_value)[source]

Get full record by id type and value.

Returns

HathiBibliographicRecord

Raises

HathiItemNotFound

class ppa.archive.hathi.HathiBibliographicRecord(data)[source]

Representation of a HathiTrust bibliographic record.

copy_details(htid)[source]

Details for a specific copy identified by hathi id

copy_last_updated(htid)[source]

Return the last update date for a specific copy identified by hathi id, as a datetime.date.

marcxml

Record marcxml if included (full records only), as an instance of pymarc.Record

property pub_dates

list of available publication dates

property title

First title (standard title)

class ppa.archive.hathi.HathiDataAPI[source]

Wrapper for HathiTrust Data API. Pulls OAuth credentials from Django settings.

get_aggregate(htid)[source]

Get aggregate data package for a HathiTrust record by hathi id.

get_structure(htid, fmt='xml')[source]

Get structure information for a HathiTrust record by hathi id.

exception ppa.archive.hathi.HathiItemForbidden[source]

Permission denied to access item in data API

exception ppa.archive.hathi.HathiItemNotFound[source]

Item not found in bibliographic or data API

class ppa.archive.hathi.HathiObject(hathi_id)[source]

An object for working with a HathiTrust item with data in a locally configured pairtree datastore.

content_dir

content directory for this work within the appropriate pairtree

delete_pairtree_data()[source]

Delete pairtree object from the pairtree datastore.

metsfile_path(ptree_client=None)[source]

path to mets xml file within the hathi contents for this work

pairtree_client()[source]

Initialize a pairtree client for the pairtree datastore this object belongs to, based on its Hathi prefix id.

pairtree_id

pairtree identifier (second portion of source id)

pairtree_object(ptree_client=None, create=False)[source]

get a pairtree object for this record

Parameters

ptree_client – optional pairtree_client.PairtreeStorageClient if one has already been initialized, to avoid repeated initialization (currently used in hathi_import manage command)

pairtree_prefix

pairtree prefix (first portion of the hathi id, short-form identifier for owning institution)
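
The relationship between a hathi id, pairtree_prefix, and pairtree_id can be sketched as a split on the first period. The function name split_htid is hypothetical, and the ids in the usage note are made-up examples in the general shape of HathiTrust identifiers.

```python
def split_htid(htid):
    """Split a hathi id into (pairtree prefix, pairtree id)."""
    # split only on the first period; the second portion may contain more
    prefix, pairtree_id = htid.split('.', 1)
    return prefix, pairtree_id
```

For an id like "uc2.ark:/13960/t4mk66f1d", limiting the split to one keeps any later periods or slashes inside the pairtree id.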

zipfile_path(ptree_client=None)[source]

path to zipfile within the hathi contents for this work

class ppa.archive.hathi.METSFile(node=None, context=None, **kwargs)[source]

File location information within a METS document.

id = <eulxml.xmlmap.fields.StringField>

xml identifier

location = <eulxml.xmlmap.fields.StringField>

<METS:file SIZE="1" ID="TXT00000001" MIMETYPE="text/plain" CREATED="2016-06-24T09:04:15Z" CHECKSUM="68b329da9893e34099c7d8ad5cb9c940" SEQ="00000001" CHECKSUMTYPE="MD5">

sequence = <eulxml.xmlmap.fields.StringField>

sequence attribute

class ppa.archive.hathi.MinimalMETS(node=None, context=None, **kwargs)[source]

Minimal XmlObject for METS that maps only what is needed to support page indexing for ppa.

structmap_pages = <eulxml.xmlmap.fields.NodeListField>

list of struct map pages as StructMapPage

class ppa.archive.hathi.StructMapPage(node=None, context=None, **kwargs)[source]

Single logical page within a METS StructMap

property display_label

page display label; use order label if present; otherwise use order

label = <eulxml.xmlmap.fields.StringField>

page label

order = <eulxml.xmlmap.fields.IntegerField>

page order

orderlabel = <eulxml.xmlmap.fields.StringField>

order label

property text_file

METSFile corresponding to the text file pointer for this page

text_file_id = <eulxml.xmlmap.fields.StringField>

<METS:div ORDER="1" LABEL="FRONT_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page">
    <METS:fptr FILEID="HTML00000001"/>
    <METS:fptr FILEID="TXT00000001"/>
    <METS:fptr FILEID="IMG00000001"/>

<METS:file SIZE="1003" ID="HTML00000496" MIMETYPE="text/html" CREATED="2017-03-20T10:40:21Z" CHECKSUM="f0a326c10b2a6dc9ae5e3ede261c9897" SEQ="00000496" CHECKSUMTYPE="MD5">

property text_file_location

location for the text file

Util

Utility code related to ppa.archive

class ppa.archive.util.HathiImporter(htids)[source]

Logic for creating new DigitizedWork records from HathiTrust. For use in views and manage commands.

SKIPPED = 2

status - skipped because already in the database

SUCCESS = 1

status - successfully imported record

add_items(log_msg_src=None, user=None)[source]

Add new items from HathiTrust.

Parameters

log_msg_src – optional source of change to be included in log entry message

filter_existing_ids()[source]

Check for any ids that are in the database so they can be skipped for import. Populates existing_ids with an OrderedDict of htid -> id for ids already in the database and filters htids.

Parameters

htids – list of HathiTrust Identifiers (corresponding to source_id)
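
The filtering step can be sketched without a database. Here the known argument is a plain dictionary of source_id to database id, standing in for a query against DigitizedWork; the function name filter_existing and the ids in the usage note are hypothetical.

```python
from collections import OrderedDict


def filter_existing(htids, known):
    """Return (existing_ids, remaining_htids) for the requested ids.

    `known` maps source_id -> database id and stands in for a query.
    """
    existing = OrderedDict(
        (htid, known[htid]) for htid in htids if htid in known
    )
    remaining = [htid for htid in htids if htid not in existing]
    return existing, remaining
```

Given requested ids ["a.1", "b.2", "c.3"] and a known mapping containing only "b.2", the result pairs "b.2" with its database id and leaves "a.1" and "c.3" to be imported.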

get_status_message(status)[source]

Get a readable status message for a given status

index()[source]

Index newly imported content, both metadata and full text.

output_results()[source]

Provide human-readable report of results for each id that was processed.

status_message = {1: 'Success', 2: 'Skipped; already in the database', <class 'ppa.archive.hathi.HathiItemNotFound'>: 'Error loading record; check that id is valid.', <class 'ppa.archive.hathi.HathiItemForbidden'>: 'Permission denied to download data.', <class 'json.decoder.JSONDecodeError'>: 'HathiTrust catalog temporarily unavailable (malformed response).'}

human-readable message to display for result status
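
Because the mapping above is keyed by both integer status codes and exception classes, a lookup has to handle both cases. The sketch below shows one way this could work; it is an assumption about the approach rather than the project's code, and ItemNotFound is a local stand-in for ppa.archive.hathi.HathiItemNotFound.

```python
SUCCESS = 1
SKIPPED = 2


class ItemNotFound(Exception):
    """Stand-in for ppa.archive.hathi.HathiItemNotFound."""


status_message = {
    SUCCESS: 'Success',
    SKIPPED: 'Skipped; already in the database',
    ItemNotFound: 'Error loading record; check that id is valid.',
}


def get_status_message(status):
    """Look up a message by status code, or by class for an exception."""
    key = type(status) if isinstance(status, Exception) else status
    return status_message[key]
```

Storing the raised exception as the status for a failed id lets a single mapping cover both successful and failed results.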

Pages

Management and display of home page and other content pages. Includes custom page models for collections page, contributor page, and a person snippet for contributors and authors.

Models

ppa.pages.models.ALT_TEXT_HELP = 'Alternative text for visually impaired users to\nbriefly communicate the intended message of the image in this context.'

help text for image alternative text

class ppa.pages.models.BodyContentBlock(local_blocks=None, **kwargs)[source]

Common set of content blocks to be used on both content pages and editorial pages

class ppa.pages.models.CollectionPage(*args, **kwargs)[source]

Collection list page, with editable text content

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.ContentPage(*args, **kwargs)[source]

Basic content page model.

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.ContributorPage(*args, **kwargs)[source]

Project contributor and advisory board page.

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.HomePage(*args, **kwargs)[source]

wagtail.core.models.Page model for PPA home page

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.ImageWithCaption(local_blocks=None, **kwargs)[source]

StructBlock for an image with a formatted caption, so caption can be context-specific. Also allows images to be floated right, left, or take up the width of the page.

class ppa.pages.models.LinkableSectionBlock(local_blocks=None, **kwargs)[source]

StructBlock for a rich text block and an associated title that will render as an <h2>. Creates an anchor (<a>) so that the section can be directly linked to using a url fragment.

clean(value)[source]

Validate value and return a cleaned version of it, or throw a ValidationError if validation fails. The thrown ValidationError instance will subsequently be passed to render() to display the error message; the ValidationError must therefore include all detail necessary to perform that rendering, such as identifying the specific child block(s) with errors, in the case of nested blocks. (It is suggested that you use the ‘params’ attribute for this; using error_list / error_dict is unreliable because Django tends to hack around with these when nested.)

class ppa.pages.models.PagePreviewDescriptionMixin(*args, **kwargs)[source]

Page mixin with logic for page preview content. Adds an optional richtext description field, and methods to get description and plain-text description, for use in previews on the site and plain-text metadata previews.

allowed_tags = ['acronym', 'b', 'p', 'strong', 'abbr', 'code', 'li', 'blockquote', 'em', 'i', 'ol', 'ul']

allowed tags for bleach html stripping in description

get_description()[source]

Get formatted description for preview. Uses description field if there is content, otherwise uses the beginning of the body content.

get_plaintext_description()[source]

Get plain-text description for use in metadata. Uses search_description field if set; otherwise uses the result of get_description() with tags stripped.

max_length = 250

maximum length for description to be displayed
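
The fallback described for get_plaintext_description() can be sketched in plain Python. This is only an illustration under stated assumptions: the function names below are hypothetical, and the standard library HTMLParser stands in for the tag stripping that the real mixin performs with bleach.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes only, discarding every tag (stand-in for bleach)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def plaintext_description(search_description, formatted_description):
    """Hypothetical helper: prefer search_description when it has content;
    otherwise fall back to the formatted description with tags stripped."""
    if search_description and search_description.strip():
        return search_description
    return strip_tags(formatted_description)
```

Here formatted_description plays the role of the value returned by get_description().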

class ppa.pages.models.Person(*args, **kwargs)[source]

Common model for a person, currently used to document authorship for instances of ppa.editorial.models.EditorialPage.

exception DoesNotExist
exception MultipleObjectsReturned
description

description (affiliation, etc.)

get_usage()

Returns a queryset of pages that link to a particular object

name

the display name of an individual

photo

Optional profile image to be associated with a person

project_role

project role

url

identifying URI for a person (VIAF, ORCID iD, personal website, etc.)

class ppa.pages.models.SVGImageBlock(local_blocks=None, **kwargs)[source]

StructBlock for an SVG image with alternative text and optional formatted caption. Separate from CaptionedImageBlock because Wagtail image handling does not work with SVG.

Embed Finders

Custom EmbedFinder implementations for embedding content in wagtail pages.

class ppa.pages.embed_finders.GlitchEmbedFinder[source]

Custom oembed finder built to embed Glitch apps in wagtail pages.

To support embedding, the glitch app should include a file named embed.json, available directly under the top level url, with oembed content:

{
  "title": "title",
  "author_name": "author",
  "provider_name": "Glitch",
  "type": "rich",
  "thumbnail_url": "URL to thumbnail image",
  "width": xx,
  "height": xx
}

If the request for an embed.json file fails, no content will be embedded.

Any urls that cannot automatically be made relative by embed code (i.e. data files loaded by javascript code) should use absolute URLs, or they will not resolve when embedded.

accept(url)[source]

Accept a url if it includes .glitch.me

find_embed(url, max_width=None)[source]

Retrieve embed.json and requested url and return content for embedding it on the site.
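
A rough sketch of the accept and find_embed steps, written as standalone functions rather than a Wagtail finder class. Everything here is an assumption made for illustration: the function names, the injected fetch callable, and returning None on failure are not taken from the PPA source.

```python
import json
from urllib.parse import urljoin, urlparse


def accepts(url):
    """Mirror the documented rule: accept a url if it includes .glitch.me."""
    return ".glitch.me" in url


def embed_json_url(url):
    """Locate embed.json directly under the top level url of the app."""
    parsed = urlparse(url)
    return urljoin(f"{parsed.scheme}://{parsed.netloc}/", "embed.json")


def build_embed(url, fetch):
    """Combine embed.json metadata with the content of the requested url.

    fetch is a hypothetical callable (url -> text) that raises OSError
    when a request fails; if anything fails, nothing is embedded.
    """
    try:
        embed = json.loads(fetch(embed_json_url(url)))
        embed["html"] = fetch(url)
    except (OSError, ValueError):
        return None
    return embed


# Usage with an in-memory stand-in for HTTP requests (hypothetical urls)
PAGES = {
    "https://demo.glitch.me/embed.json": '{"title": "Demo", "type": "rich"}',
    "https://demo.glitch.me/": "<div>app</div>",
}


def fake_fetch(url):
    if url not in PAGES:
        raise OSError(f"cannot fetch {url}")
    return PAGES[url]


result = build_embed("https://demo.glitch.me/", fake_fetch)
```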

Editorial

Management and display of editorial content. Includes custom page models for an editorial list page and editorial content pages, structured roughly like a scholarly blog.

Models

class ppa.editorial.models.EditorialIndexPage(*args, **kwargs)[source]

Editorial index page; list recent editorial articles.

exception DoesNotExist
exception MultipleObjectsReturned
route(request, path_components)[source]

Customize editorial page routing to serve editorial pages by year/month/slug.
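
As a loose illustration of year/month/slug routing, the sketch below splits a list of path components into its parts. The function name and the validation rules are assumptions; the actual route method works with Wagtail's request and page objects.

```python
def split_editorial_path(path_components):
    """Hypothetical helper: interpret ['2018', '07', 'my-essay', ...] as
    (year, month, slug, remaining components); return None otherwise."""
    if len(path_components) >= 3:
        year, month, slug = path_components[:3]
        if year.isdigit() and month.isdigit():
            return int(year), int(month), slug, path_components[3:]
    return None
```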

class ppa.editorial.models.EditorialPage(*args, **kwargs)[source]

Editorial page, for scholarly, educational, or other essay-like content related to the site

exception DoesNotExist
exception MultipleObjectsReturned
set_url_path(parent)[source]

Generate the url_path field based on this page’s slug, first publication date, and the specified parent page. Adapted from default logic to include publication date. (Parent is passed in for previewing unsaved pages)
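
A minimal sketch of a url_path that includes the publication date, assuming a year/month/slug layout with a zero-padded month and a parent path that ends in a slash. The function name and the sample paths are hypothetical.

```python
from datetime import date


def editorial_url_path(parent_url_path, first_published, slug):
    """Build parent path + year/month + slug, with a trailing slash."""
    return f"{parent_url_path}{first_published:%Y/%m}/{slug}/"


# Example values are illustrative only
example = editorial_url_path("/home/editorial/", date(2018, 7, 2), "my-essay")
```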

Manage Commands

Solr schema

solr_schema is a custom manage command to update the schema for the configured Solr instance. Reports on the number of fields that are added or updated, and any that are out of date and were removed.

Actual logic implemented in ppa.archive.solr.SolrSchema.update_solr_schema()

Example usage:

python manage.py solr_schema

Hathi Import

hathi_import is a custom manage command for bulk import of HathiTrust materials into the local database for management. It does not index into Solr for search and browse; use the index script for that after import.

This script expects a local copy of dataset files in pairtree format retrieved by rsync. (Note that pairtree data must include pairtree version file to be valid.)

Contents are inspected from the configured HATHI_DATA path; DigitizedWork records are created or updated based on identifiers found and metadata retrieved from the HathiTrust Bibliographic API. Page content is only reflected in the database via a total page count per work (but page contents will be indexed in Solr via index script).

By default, existing records are updated only when the Hathi record has changed or when an update is requested via the --update script option.

Supports importing specific items by hathi id, but the pairtree content for the items still must exist at the configured path.

Example usage:

# import everything with defaults
python manage.py hathi_import
# import specific items
python manage.py hathi_import htid1 htid2 htid3
# re-import and update records
python manage.py hathi_import --update
# display progressbar to show status and ETA
python manage.py hathi_import -v 0 --progress

Index

index is a custom manage command to index PPA digitized work and page content into Solr. It should be run after content has been imported to the project database via the hathi_import manage command.

Page content will be indexed from a local copy of dataset files under the configured HATHI_DATA path in pairtree format retrieved by rsync. (Note that pairtree data must include pairtree version file to be valid.)

By default, indexes _all_ content, both digitized works and pages. You can optionally specify works or pages only, or index specific items by hathi id. If you request a specific id, it must exist in the database and the pairtree content for the items still must exist at the configured path.

A progress bar will be displayed by default if there are more than 5 items to process. This can be suppressed via script options.

You may optionally request the index or part of the index to be cleared before indexing, for use when index data has changed sufficiently that previous versions need to be removed.

Example usage:

# index everything
python manage.py index
# index specific items
python manage.py index htid1 htid2 htid3
# index works only (skip pages)
python manage.py index -i works
python manage.py index --works
# index pages only (skip works)
python manage.py index -i pages
python manage.py index --pages
# suppress progressbar
python manage.py index --no-progress
# clear everything, then index everything
python manage.py index --clear all
# clear works only, then index works
python manage.py index --clear works --index works
# clear everything, index nothing
python manage.py index --clear all --index none

Generate Corpus

generate_corpus is a custom manage command to generate and serialize a Gensim corpus from Solr. It should be run after content has been indexed into Solr via the index manage command.

The corpus is serialized in the Matrix Market format (https://math.nist.gov/MatrixMarket/formats.html) with a .mm extension, using the Gensim topic modelling library (https://radimrehurek.com/gensim). Typically, an index corresponding to the .mm file is also saved by Gensim, with a .mm.index extension.

A dictionary corresponding to token IDs is also saved by default, using a .mm.dict extension. By default, this is a pickled Gensim Dictionary object. If the --dictionary-as-text flag is specified, then the dictionary is saved as a utf8-encoded and newline-separated file, where line number N contains the token with token_id N-1. Saving the dictionary can be skipped by using the --no-dictionary option.
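
The text dictionary layout (line N holds the token with token_id N-1) can be read back with a few lines of Python. This is a sketch; the function name is hypothetical and the sample tokens are made up.

```python
import io


def load_text_dictionary(lines):
    """Map token_id -> token; enumerate() starts at 0, so the first
    line (line number 1) gets token_id 0."""
    return {token_id: line.rstrip("\n") for token_id, line in enumerate(lines)}


# In practice lines would come from open(path, encoding="utf-8");
# an in-memory file with made-up tokens is used here.
tokens = load_text_dictionary(io.StringIO("alpha\nbeta\ngamma\n"))
```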

Additional document-level metadata found in the Solr index is also saved by default, with a .mm.metadata extension. This is a csv file with a header row and one row per unique document found in the Solr index. Saving the metadata can be skipped by using the --no-metadata option.
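
Because the metadata file is a csv with a header row, it can be read with the standard csv module. In this sketch the function name, the column names, and the sample row are all hypothetical; the real columns depend on what the command exports.

```python
import csv
import io


def read_metadata(fileobj):
    """Return one dict per document row, keyed by the header row."""
    return list(csv.DictReader(fileobj))


# Hypothetical columns and values, for illustration only
sample = io.StringIO("work_id,title\nabc.123,An Example Title\n")
rows = read_metadata(sample)
```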

By default, all documents found in the Solr index are serialized. This can be controlled using --doc-limit, which sets the maximum number of documents to serialize. This is especially useful for development, or for sanity-testing your Solr installation.

For corpus generation, the following pre-processing options are available via the --preprocess flag:

# Lower-cases words
'lower'
# Strips HTML tags
'strip_tags'
# Strips punctuation
'strip_punctuation'
# Collapses multiple whitespaces into one
'strip_multiple_whitespaces'
# Strips numeric characters
'strip_numeric'
# Removes stopwords - Note that the set of default stopwords used by Gensim
# is from Stone, Denis, Kwantes (2010).
'remove_stopwords'
# Strip short words. The lower limit on word length is 3.
'strip_short'
# Use Porter Stemmer for word-normalization.
'stem_text'

IMPORTANT - NO preprocessing filters are applied by default, but you will typically at least want to use lower. Multiple preprocessing filters can be applied (in order) by specifying multiple --preprocess flags.
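
To illustrate what applying filters in order means, here is a small sketch that chains filters by name. The three filter functions are simplified stand-ins written for this example, not Gensim's implementations, and the preprocess helper is hypothetical.

```python
import re


def lower(text):
    """Stand-in for the 'lower' filter."""
    return text.lower()


def strip_tags(text):
    """Stand-in for 'strip_tags': drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def strip_multiple_whitespaces(text):
    """Stand-in for 'strip_multiple_whitespaces'."""
    return re.sub(r"\s+", " ", text)


FILTERS = {
    "lower": lower,
    "strip_tags": strip_tags,
    "strip_multiple_whitespaces": strip_multiple_whitespaces,
}


def preprocess(text, filter_names):
    """Apply each named filter in the order given, like repeated flags."""
    for name in filter_names:
        text = FILTERS[name](text)
    return text


cleaned = preprocess(
    "<p>Hello   World</p>",
    ["lower", "strip_tags", "strip_multiple_whitespaces"],
)
```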

Example usage:

# Save all files to the 'data' folder, with bare-minimum preprocessing
python manage.py generate_corpus --path data --preprocess lower \
    --preprocess strip_tags

# Restrict corpus to 1000 documents
python manage.py generate_corpus --path data --doc-limit 1000 \
    --preprocess lower --preprocess strip_tags

# Don't generate dictionary; don't generate metadata
python manage.py generate_corpus --path data --doc-limit 1000 \
    --preprocess lower --no-dictionary --no-metadata

Common

Django app for common functionality that doesn’t have an obvious home

Admin

class ppa.common.admin.LocalUserAdmin(model, admin_site)[source]

Extends django.contrib.auth.admin.UserAdmin to provide additional detail for user administration.

group_names(obj)[source]

Custom property to display group membership.

Views

class ppa.common.views.AjaxTemplateMixin(**kwargs)[source]

View mixin to use a different template when responding to an ajax request.

ajax_template_name = None

name of the template to use for ajax request

get_template_names()[source]

Return ajax_template_name if this is an ajax request; otherwise return default template name.

vary_headers = ['X-Requested-With']

vary on X-Requested-With so that browsers do not cache the ajax response and display it in place of the non-ajax response
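
The template choice can be sketched as a plain function. It assumes an ajax request is identified by the conventional X-Requested-With: XMLHttpRequest header; the function name, the template names, and the fallback when no ajax template is set are illustrative assumptions rather than the mixin's actual code.

```python
def choose_template(headers, default_template, ajax_template=None):
    """Return the ajax template for ajax requests, else the default."""
    is_ajax = headers.get("X-Requested-With") == "XMLHttpRequest"
    if is_ajax and ajax_template:
        return ajax_template
    # Assumption: fall back to the default when no ajax template is set
    return default_template


# Hypothetical template names
full = choose_template({}, "list.html", "snippets/results.html")
partial = choose_template(
    {"X-Requested-With": "XMLHttpRequest"}, "list.html", "snippets/results.html"
)
```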

class ppa.common.views.LastModifiedListMixin(**kwargs)[source]

Variant of LastModifiedMixin for use on a list view

class ppa.common.views.LastModifiedMixin(**kwargs)[source]

View mixin to add last modified headers

static solr_timestamp_to_datetime(solr_time)[source]

Convert Solr timestamp (isoformat that may or may not include microseconds) to datetime.datetime
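
One way to handle timestamps with or without microseconds is to try two strptime formats in turn. This is a sketch with a hypothetical function name, not the project's implementation, and it returns a naive datetime without timezone information.

```python
from datetime import datetime

# Formats with and without fractional seconds; the trailing Z is literal
SOLR_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def parse_solr_timestamp(solr_time):
    """Parse a Solr ISO-style timestamp into datetime.datetime."""
    for fmt in SOLR_FORMATS:
        try:
            return datetime.strptime(solr_time, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized Solr timestamp: {solr_time!r}")
```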

class ppa.common.views.VaryOnHeadersMixin(**kwargs)[source]

View mixin to set Vary header - class-based view equivalent to django.views.decorators.vary.vary_on_headers(), adapted from winthrop-django.

Define vary_headers with the list of headers.

dispatch(request, *args, **kwargs)[source]

Wrap default dispatch method to patch headers on the response.
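
Patching the Vary header amounts to merging new header names into any existing value without duplicating them. The function below is a simplified stand-in with a hypothetical name; in a Django view this job would normally be left to Django's own cache utilities.

```python
def merge_vary(existing, new_headers):
    """Append header names to a Vary value, skipping case-insensitive
    duplicates and preserving the existing order."""
    current = [name.strip() for name in existing.split(",") if name.strip()]
    seen = {name.lower() for name in current}
    for name in new_headers:
        if name.lower() not in seen:
            current.append(name)
            seen.add(name.lower())
    return ", ".join(current)
```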

Pages

Management and display of home page and other content pages. Includes custom page models for collections page, contributor page, and a person snippet for contributors and authors.

Models

ppa.pages.models.ALT_TEXT_HELP = 'Alternative text for visually impaired users to\nbriefly communicate the intended message of the image in this context.'

help text for image alternative text

class ppa.pages.models.BodyContentBlock(local_blocks=None, **kwargs)[source]

Common set of content blocks to be used on both content pages and editorial pages

class ppa.pages.models.CollectionPage(*args, **kwargs)[source]

Collection list page, with editable text content

exception DoesNotExist
exception MultipleObjectsReturned
class ppa.pages.models.ContentPage(*args, **kwargs)[source]

Basic content page model.

exception DoesNotExist
exception MultipleObjectsReturned

Manage Commands

Setup Site Pages

setup_site_pages is a custom manage command to install a default set of pages and menus for the Wagtail CMS. It is designed not to touch other content.

Example usage:

python manage.py setup_site_pages

Unapi

Django app for unAPI service

Views

class ppa.unapi.views.UnAPIView(**kwargs)[source]

Simple unAPI service endpoint. With no parameters or only id, provides a list of available metadata formats. If id and format are specified, returns the metadata for the specified item in the requested format.

See the archived unAPI website for more details: https://web.archive.org/web/20140331070802/http://unapi.info/specs/

content_type = 'application/xml'

default content type, when serving format information

file_extension = {'marc': 'mrc'}

file extension for metadata formats, as a convenience to set download filename extension

formats = {'marc': {'type': 'application/marc'}}

available metadata formats

get(*args, **kwargs)[source]

Override get to check if id and format are specified; if they are, return the requested metadata. Otherwise, falls back to normal template view behavior and displays format information.

get_context_data(*args, **kwargs)[source]

pass formats and id to template context

get_metadata(item_id, data_format)[source]

get item and requested metadata

template_name = 'unapi/formats.xml'

template for format information
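
The branching described for get() can be sketched as a function over the request parameters. The plan_response name, the returned dictionary, and the download filename pattern are assumptions for illustration; the two lookup tables copy the formats and file_extension values documented above.

```python
FORMATS = {"marc": {"type": "application/marc"}}
FILE_EXTENSION = {"marc": "mrc"}


def plan_response(params):
    """Decide between serving metadata and listing available formats."""
    item_id = params.get("id")
    data_format = params.get("format")
    if item_id and data_format:
        # Unknown formats are not handled in this sketch (KeyError)
        return {
            "action": "metadata",
            "content_type": FORMATS[data_format]["type"],
            # Assumed filename pattern: item id plus format extension
            "filename": f"{item_id}.{FILE_EXTENSION[data_format]}",
        }
    # No parameters, or only id: list the available formats as XML
    return {"action": "formats", "content_type": "application/xml", "id": item_id}
```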