Code Documentation¶
Archive¶
Django app for archival materials included in PPA.
- ppa.archive.NO_COLLECTION_LABEL = 'Uncategorized'¶
label to use for items that are not in a collection
Admin¶
- class ppa.archive.admin.ClusterAdmin(model, admin_site)[source]¶
- class ppa.archive.admin.DigitizedWorkAdmin(*args, **kwargs)[source]¶
- add_works_to_collection(request, queryset)[source]¶
Bulk add a queryset of
ppa.archive.DigitizedWork
to a ppa.archive.Collection
.
- get_readonly_fields(request, obj=None)[source]¶
Determine read only fields based on item source, to prevent editing of HathiTrust fields that should not be changed.
- list_collections(obj)[source]¶
Return the names of ppa.archive.models.Collection objects as a comma-separated list to populate a change_list column.
- resource_class¶
alias of
DigitizedWorkResource
- save_model(request, obj, form, change)[source]¶
Note any fields in the protected list that have been changed in the admin and preserve in database.
Ensure reindex is called when the admin form is saved.
- class ppa.archive.admin.DigitizedWorkInline(parent_model, admin_site)[source]¶
- model¶
alias of
DigitizedWork
Models¶
- class ppa.archive.models.Cluster(*args, **kwargs)[source]¶
A model to collect groups of works such as reprints or editions that should be collapsed in the main archive search and accessible together.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class ppa.archive.models.Collection(*args, **kwargs)[source]¶
A collection of
ppa.archive.models.DigitizedWork
instances.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- description¶
a RichText description of the collection
- exclude¶
flag to indicate collections to be excluded by default in public search
- name¶
the name of the collection
- property name_changed¶
check if name has been changed (only works on current instance)
- class ppa.archive.models.DigitizedWork(*args, **kwargs)[source]¶
Record to manage digitized works included in PPA and store their basic metadata.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- static add_from_hathi(htid, bib_api=None, update=False, log_msg_src=None, user=None)[source]¶
Add or update a HathiTrust work in the database. Retrieves bibliographic data from Hathi api, retrieves or creates a
DigitizedWork
record, and populates the metadata if this is a new record, if the Hathi metadata has changed, or if update is requested. Creates an admin log entry to document record creation or update.
Raises
ppa.archive.hathi.HathiItemNotFound
for an invalid id.
Returns the new or updated
DigitizedWork
.
- Parameters:
htid – HathiTrust record identifier
bib_api – optional
HathiBibliographicAPI
instance, to allow for shared sessions in scripts
update – update bibliographic metadata even if the HathiTrust record is not newer than the local database record (default: False)
log_msg_src – source of the change, to be included in log entry messages (optional). Will be used as “Created/updated [log_msg_src]”.
user – optional user responsible for the change, to be associated with
LogEntry
record
- added¶
date added to the archive
- book_journal¶
book or journal title for excerpt or article
- clean()[source]¶
Add custom validation to trigger a save error in the admin if someone tries to unsuppress a record that has been suppressed (not yet supported).
- clean_fields(exclude=None)[source]¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- cluster¶
optional cluster for aggregating works
- collections¶
collections that this work is part of
- compare_protected_fields(db_obj)[source]¶
Compare protected fields in a
ppa.archive.models.DigitizedWork
instance and return those that are changed.
- Parameters:
db_obj (object) – Database instance of a
DigitizedWork
.
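The comparison described for compare_protected_fields can be illustrated with a minimal sketch. The field list, the helper name changed_protected_fields, and the sample objects below are hypothetical stand-ins, not the PPA implementation:

```python
from types import SimpleNamespace

# hypothetical list of protected field names, for illustration only
PROTECTED_FIELDS = ["title", "subtitle", "author", "publisher"]


def changed_protected_fields(current, db_obj, fields=PROTECTED_FIELDS):
    """Return names of fields whose values differ between two objects."""
    return [
        name for name in fields
        if getattr(current, name) != getattr(db_obj, name)
    ]


# stand-ins for the database copy and the instance edited in the admin
saved = SimpleNamespace(title="Prosody", subtitle="", author="Smith", publisher="Longman")
edited = SimpleNamespace(title="Prosody", subtitle="", author="Smith, J.", publisher="Longman")
changed = changed_protected_fields(edited, saved)
```

Here only the author value differs, so it is the only name returned.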
- count_pages(ptree_client=None)[source]¶
Count the number of pages for a digitized work. If pages are specified for an excerpt or article, page count is determined based on the number of pages in the combined ranges. Otherwise, page count is based on the number of files in the zipfile within the pairtree content (Hathi-specific). Raises
pairtree.storage_exceptions.ObjectNotFoundException
if the data is not found in the pairtree storage. Returns page count found; updates the page_count attribute on the current instance, but does NOT save the object.
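As a rough sketch of counting pages across combined ranges, assume a page range is written as comma-separated spans such as "10-15, 20". That format and the helper names parse_page_ranges and count_range_pages are assumptions for illustration, not PPA's API:

```python
def parse_page_ranges(page_range):
    """Parse a string like "10-15, 20" into inclusive (start, end) tuples."""
    spans = []
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        # a single page such as "20" is treated as the span 20-20
        spans.append((int(start), int(end or start)))
    return spans


def count_range_pages(page_range):
    """Total number of pages across all spans (spans assumed not to overlap)."""
    return sum(end - start + 1 for start, end in parse_page_ranges(page_range))
```

Under the same assumption, the first page of a range would be the start of the first span, i.e. parse_page_ranges(value)[0][0].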
- enumcron¶
enumeration/chronology (hathi-specific; contains volume or version)
- first_page()[source]¶
Number of the first page in range, if this is an excerpt (first of original page range, not digital)
- first_page_digital()[source]¶
Number of the first page in range (digital pages / page index), if this is an excerpt.
- Returns:
first page number for digital page range; None if no page range
- Return type:
int, None
- first_page_original()[source]¶
Number of the first page in range (original page numbering) if this is an excerpt
- Returns:
first page number for original page range; None if no page range
- Return type:
str, None
- get_absolute_url()[source]¶
Return object’s url for
ppa.archive.views.DigitizedWorkDetailView
- get_metadata(metadata_format)[source]¶
Get metadata for this item in the specified format. Currently only supports marc.
- property has_fulltext¶
Checks if an item has full text (i.e., items from HathiTrust, Gale, or EEBO-TCP).
- hathi¶
ppa.archive.hathi.HathiObject
for HathiTrust records, for working with data in HathiTrust pairtree data structure.
- index_chunk_size = 2000¶
number of items to index at once when indexing a large number of items
- property index_cluster_id¶
Convenience function to get a string representation of the cluster (or self if no cluster). Reduces redundancy elsewhere.
- item_type¶
type of record, whether excerpt, article, or full; defaults to full
- classmethod items_to_index()[source]¶
Queryset of works for indexing everything; excludes suppressed works.
- metadata_from_marc(marc_record, populate=True)[source]¶
Get metadata from MARC record and return a dictionary of the data. When populate is True, calls populate_fields to set values.
- notes¶
internal team notes, not displayed on the public facing site
- page_count¶
number of pages in the work (or page range, for an excerpt)
- populate_fields(field_data)[source]¶
Conditionally update fields as protected by flags using Hathi bibdata information.
- Parameters:
field_data (dict) – A dictionary of fields updated from a
ppa.archive.hathi.HathiBibliographicRecord
instance.
- populate_from_bibdata(bibdata)[source]¶
Update record fields based on Hathi bibdata information. Full record is required in order to set all fields
- Parameters:
bibdata – bibliographic data returned from HathiTrust as instance of
ppa.archive.hathi.HathiBibliographicRecord
- classmethod prep_index_chunk(chunk)[source]¶
Optional method for any additional processing on chunks of items being indexed. Intended to allow adding prefetching on a chunk when iterating on Django QuerySets; since indexing uses Iterator, prefetching configured in items_to_index is ignored.
- printed_by_re = '^(Printed)?( and )?(Pub(.|lished|lisht)?)?( and sold)? (by|for|at)( the)? ?'¶
regular expression for cleaning preliminary text from publisher names
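A short sketch of how this pattern could be applied with the standard re module. The pattern is copied from the attribute above; the helper name clean_publisher is hypothetical, and no regex flags are used because the source does not state any:

```python
import re

# pattern copied from the printed_by_re attribute documented above
PRINTED_BY_RE = r"^(Printed)?( and )?(Pub(.|lished|lisht)?)?( and sold)? (by|for|at)( the)? ?"


def clean_publisher(publisher):
    """Strip a leading "Printed by", "Published for the", etc. from a name."""
    return re.sub(PRINTED_BY_RE, "", publisher)
```

Names that do not start with one of these prefixes are returned unchanged, since the pattern is anchored at the start of the string.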
- protected_fields¶
ProtectedWorkField
instance to indicate metadata fields that should be preserved from bulk updates because they have been modified in Django admin.
- pub_place¶
place of publication
- public_notes¶
public notes field for this work
- publisher¶
publisher
- record_id¶
record id; for Hathi materials, used for different copies of the same work or for different editions/volumes of a work
- save(*args, **kwargs)[source]¶
Save the current instance. Override this in a subclass if you want to control the saving process.
The ‘force_insert’ and ‘force_update’ parameters can be used to insist that the “save” must be an SQL insert or update (or equivalent for non-SQL backends), respectively. Normally, they should not be set.
- sort_title¶
sort title: title without leading non-sort characters, from marc
- source¶
source of the record, HathiTrust or elsewhere
- source_id¶
source identifier; hathi id for HathiTrust materials
- source_url¶
source url where the original can be accessed
- status¶
status of record; currently choices are public or suppressed
- subtitle¶
subtitle of the work; using TextField to allow for long titles
- title¶
title of the work; using TextField to allow for long titles
- updated¶
date of last modification of the local record
- class ppa.archive.models.DigitizedWorkQuerySet(model=None, query=None, using=None, hints=None)[source]¶
- ppa.archive.models.NO_COLLECTION_LABEL = 'Uncategorized'¶
label to use for items that are not in a collection
- class ppa.archive.models.Page[source]¶
Indexable for pages to make page data available for indexing with parasolr index manage command.
- index_chunk_size = 2000¶
number of items to index at once when indexing a large number of items
- classmethod items_to_index()[source]¶
Return a generator of page data to be indexed, with data for pages for each work returned by
Page.page_index_data()
- class ppa.archive.models.ProtectedWorkField(verbose_name=None, name=None, **kwargs)[source]¶
PositiveSmallIntegerField subclass that returns a
ProtectedWorkFieldFlags
object and stores it as an integer.
- from_db_value(value, expression, connection)[source]¶
Always return an instance of
ProtectedWorkFieldFlags
- to_python(value)[source]¶
Always return an instance of
ProtectedWorkFieldFlags
- class ppa.archive.models.ProtectedWorkFieldFlags(*args, **kwargs)[source]¶
flags.Flags
instance to indicate which
DigitizedWork
fields should be protected if edited in the admin.
- author = <ProtectedWorkFieldFlags.author bits=0x0010 data=UNDEFINED>¶
author
- classmethod deconstruct()[source]¶
Give Django information needed to make
ProtectedWorkFieldFlags.no_flags
default in migration.
- enumcron = <ProtectedWorkFieldFlags.enumcron bits=0x0008 data=UNDEFINED>¶
enumcron
- pub_date = <ProtectedWorkFieldFlags.pub_date bits=0x0080 data=UNDEFINED>¶
publication date
- pub_place = <ProtectedWorkFieldFlags.pub_place bits=0x0020 data=UNDEFINED>¶
place of publication
- publisher = <ProtectedWorkFieldFlags.publisher bits=0x0040 data=UNDEFINED>¶
publisher
- sort_title = <ProtectedWorkFieldFlags.sort_title bits=0x0004 data=UNDEFINED>¶
sort title
- subtitle = <ProtectedWorkFieldFlags.subtitle bits=0x0002 data=UNDEFINED>¶
subtitle
- title = <ProtectedWorkFieldFlags.title bits=0x0001 data=UNDEFINED>¶
title
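To illustrate how bit flags like these can be combined and stored as a single integer, here is a sketch using the standard library's enum.IntFlag as a stand-in. The documented class is based on flags.Flags; the class name ProtectedFields is hypothetical, and only the bit values are taken from the listing above:

```python
from enum import IntFlag


class ProtectedFields(IntFlag):
    """Stand-in for ProtectedWorkFieldFlags, using the bit values listed above."""

    title = 0x0001
    subtitle = 0x0002
    sort_title = 0x0004
    enumcron = 0x0008
    author = 0x0010
    pub_place = 0x0020
    publisher = 0x0040
    pub_date = 0x0080


# combine flags for fields that were edited in the admin
protected = ProtectedFields.title | ProtectedFields.author

# stored in the database as a plain integer...
stored = int(protected)

# ...and converted back to a flags object when loaded
loaded = ProtectedFields(stored)
```

Membership can then be checked with the in operator, for example ProtectedFields.author in loaded.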
- class ppa.archive.models.SignalHandlers[source]¶
Signal handlers for indexing
DigitizedWork
records when
Collection
or
Cluster
records are saved or deleted.
- static cluster_delete(sender, instance, **kwargs)[source]¶
signal handler for cluster delete; clear associated digitized works and reindex
- static cluster_save(sender, instance, **kwargs)[source]¶
signal handler for cluster save; reindex pages for associated digitized works
- static collection_delete(sender, instance, **kwargs)[source]¶
signal handler for collection delete; clear associated digitized works and reindex
- static collection_save(sender, instance, **kwargs)[source]¶
signal handler for collection save; reindex associated digitized works
- static handle_digwork_cluster_change(sender, instance, **kwargs)[source]¶
when a
DigitizedWork
is saved, reindex pages if cluster id has changed
Views¶
- class ppa.archive.views.AddToCollection(**kwargs)[source]¶
View to bulk add a queryset of
ppa.archive.models.DigitizedWork
to a set of
ppa.archive.models.Collection
instances.
- form_class¶
alias of
AddToCollectionForm
- get_success_url()[source]¶
Redirect to the
ppa.archive.models.DigitizedWork
change_list in the Django admin with pagination and filters preserved. Expects
ppa.archive.admin.add_works_to_collection()
to have set ‘collection-add-filters’ as a dict in the request’s session.
- model¶
alias of
DigitizedWork
- post(request, *args, **kwargs)[source]¶
Add
ppa.archive.models.DigitizedWork
instances passed in form data to selected instances of
ppa.archive.models.Collection
, then return to change_list view.
Expects a list of DigitizedWork ids to be set in the request session.
- class ppa.archive.views.DigitizedWorkByRecordId(**kwargs)[source]¶
Redirect from DigitizedWork record id to detail view when possible. If there is only one record found, redirect. If multiple are found, 404.
- class ppa.archive.views.DigitizedWorkDetailView(**kwargs)[source]¶
Display details for a single digitized work. If a work has been suppressed, returns a 410 Gone response.
- ajax_template_name = 'archive/snippets/results_within_list.html'¶
name of the template to use for ajax request
- form_class¶
alias of
SearchWithinWorkForm
- get(*args, **kwargs)[source]¶
Handle get request, with redirect logic if redirect url is set for a digitized work id converted to a single excerpt.
- get_queryset()[source]¶
Return the QuerySet that will be used to look up the object.
This method is called by the default implementation of get_object() and may not be called if get_object() is overridden.
- get_solr_lastmodified_filters()[source]¶
Get filters for last modified Solr query. By default returns
solr_lastmodified_filters
.
- get_template_names()[source]¶
Return
ajax_template_name
if this is an ajax request; otherwise return default template name.
- model¶
alias of
DigitizedWork
- class ppa.archive.views.DigitizedWorkListView(**kwargs)[source]¶
Search and browse digitized works. Based on Solr index of works and pages.
- ajax_template_name = 'archive/snippets/results_list.html'¶
name of the template to use for ajax request
- form_class¶
alias of
SearchForm
- get_pages(solrq)[source]¶
If there is a keyword search, query Solr for matching pages with text highlighting. NOTE: This has to be done as a separate query because Solr doesn’t support highlighting on collapsed items.
- get_queryset(**kwargs)[source]¶
Return the list of items for this view.
The return value must be an iterable and may be an instance of QuerySet in which case QuerySet specific behavior will be enabled.
- meta_description = 'The Princeton Prosody Archive is a full-text\n searchable database of thousands of historical documents about the\n study of language and the study of poetry.'¶
page description for metadata/preview
- meta_title = 'Princeton Prosody Archive'¶
title for metadata / preview
- model¶
alias of
DigitizedWork
- class ppa.archive.views.ImportView(**kwargs)[source]¶
Admin view to import new records from sources that support import (HathiTrust, Gale) by providing a list of ids.
- form_class¶
alias of
ImportForm
Forms¶
- class ppa.archive.forms.AddToCollectionForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]¶
Form to select a set of ppa.archive.models.Collection instances to which to bulk add a queryset of ppa.archive.models.DigitizedWork instances.
- property media¶
Return all media required to render the widgets on this form.
- class ppa.archive.forms.CheckboxSelectMultipleWithDisabled(attrs=None, choices=())[source]¶
Subclass of
django.forms.CheckboxSelectMultiple
with option to mark a choice as disabled.
- class ppa.archive.forms.ChoiceLabel(label, disabled=False)[source]¶
Custom choice label that can be used to set an option as disabled without resulting in extra choices when normalized.
- class ppa.archive.forms.FacetChoiceField(*args, **kwargs)[source]¶
Add CheckboxSelectMultiple field with facets taken from solr query
- widget¶
alias of
CheckboxSelectMultiple
- class ppa.archive.forms.ImportForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]¶
Form to import records from sources that support import.
- get_source_ids()[source]¶
Get list of ids from valid form input. Splits on newlines, strips whitespace, and ignores empty lines.
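The parsing described for get_source_ids (split on newlines, strip whitespace, ignore empty lines) can be sketched as a standalone function. The name split_source_ids is hypothetical, and the real method works on cleaned form data rather than a raw string:

```python
def split_source_ids(raw_input):
    """Split text on newlines, strip whitespace, and skip empty lines."""
    return [line.strip() for line in raw_input.splitlines() if line.strip()]
```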
- property media¶
Return all media required to render the widgets on this form.
- class ppa.archive.forms.ModelMultipleChoiceFieldWithEmpty(queryset, **kwargs)[source]¶
Extend
django.forms.ModelMultipleChoiceField
to add an option for an unset or empty choice (i.e. no relationship in a many-to-many relationship such as collection membership).
- class ppa.archive.forms.RadioSelectWithDisabled(attrs=None, choices=())[source]¶
Subclass of
django.forms.RadioSelect
with option to mark a choice as disabled.
- class ppa.archive.forms.RangeField(*args, **kwargs)[source]¶
- compress(data_list)[source]¶
Return a single value for the given list of values. The values can be assumed to be valid.
For example, if this MultiValueField was instantiated with fields=(DateField(), TimeField()), this might return a datetime object created by combining the date and time in data_list.
- widget¶
alias of
RangeWidget
- class ppa.archive.forms.RangeWidget(*args, **kwargs)[source]¶
date range widget, for two numeric inputs
- decompress(value)[source]¶
Return a list of decompressed values for the given compressed value. The given value can be assumed to be valid, but not necessarily non-empty.
- property media¶
Media for a multiwidget is the combination of all media of the subwidgets.
- sep = '-'¶
separator string when splitting out values in decompress
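As an illustration of splitting a compressed range value on the separator, the following sketch assumes values look like "1800-1850" and that a missing side should become None. The function name split_range and that handling of empty sides are assumptions, not the documented behavior of decompress:

```python
SEP = "-"  # same separator value as RangeWidget.sep


def split_range(value, sep=SEP):
    """Split "1800-1850" into [1800, 1850]; a missing side becomes None."""
    if not value:
        return [None, None]
    start, _, end = value.partition(sep)
    return [int(part) if part else None for part in (start, end)]
```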
- class ppa.archive.forms.SearchForm(data=None, *args, **kwargs)[source]¶
Simple search form for digitized works.
- QUESTION_POPUP_TEXT = '\n Boolean search within a field is supported. Operators must be capitalized (AND, OR).\n Use quotes for exact phrase.\n '¶
help text to be shown with the form (appears when you hover over the question mark icon)
- clean_author()[source]¶
Clean keyword search query term; converts any typographic quotes to straight quotes
- clean_query()[source]¶
Clean keyword search query term; converts any typographic quotes to straight quotes
- clean_title()[source]¶
Clean keyword search query term; converts any typographic quotes to straight quotes
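The quote conversion done by these clean methods might look like the sketch below. The helper name straighten_quotes is hypothetical, and handling both double and single typographic quotes is an assumption:

```python
# map curly double and single quotes to their straight equivalents
QUOTE_MAP = str.maketrans({
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
})


def straighten_quotes(text):
    """Replace typographic quotes so exact-phrase searches work as expected."""
    return text.translate(QUOTE_MAP)
```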
- cluster¶
hidden input to track cluster id, for searching within reprint/editions
- static defaults()[source]¶
Default values when initializing the form. Sort by title, pre-select collections based on the exclude property.
- get_solr_sort_field(sort=None)[source]¶
Set solr sort fields for the query based on sort and query strings. If sort field is not specified, will use sort in the cleaned data in the current form. If sort is not specified and valid form data is not available, will raise an
AttributeError
.
- Returns:
solr sort field
- property media¶
Return all media required to render the widgets on this form.
- pub_date_minmax()[source]¶
Get minimum and maximum values for
DigitizedWork
publication dates in the database. Used to set placeholder values for the form input and to generate the Solr facet range query. Value is cached to avoid repeatedly calculating it.
- Returns:
tuple of min, max
- class ppa.archive.forms.SearchWithinWorkForm(data=None, files=None, auto_id='id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False, field_order=None, use_required_attribute=None, renderer=None)[source]¶
Form to search for occurrences of a string within a particular instance of a digitized work.
- QUESTION_POPUP_TEXT = '\n Boolean search is supported. Operators must be capitalized (AND, OR).\n '¶
help text to be shown with the form (appears when you hover over the question mark icon)
- property media¶
Return all media required to render the widgets on this form.
- class ppa.archive.forms.SelectDisabledMixin[source]¶
Mixin for
django.forms.RadioSelect
or
django.forms.CheckboxSelect
classes to set an option as disabled. To disable, the widget’s choice label option should be passed in as a dictionary with disabled set to True: {'label': 'option', 'disabled': True}.
- class ppa.archive.forms.SelectWithDisabled(attrs=None, choices=())[source]¶
Subclass of
django.forms.Select
with option to mark a choice as disabled.
Solr¶
- class ppa.archive.solr.ArchiveSearchQuerySet(solr=None)[source]¶
- class ppa.archive.solr.PageSearchQuerySet(solr: SolrClient | None = None)[source]¶
- field_aliases = {'cluster_id': 'cluster_id_s', 'group_id': 'group_id_s', 'id': 'id', 'image_url': 'image_url_s', 'label': 'label', 'order': 'order', 'score': 'score', 'source_id': 'source_id', 'title': 'title'}¶
map of application-specific, readable field names to actual solr fields (i.e. if using dynamic field types)
Hathi¶
Utilities for working with HathiTrust materials and APIs.
- class ppa.archive.hathi.HathiBaseAPI[source]¶
Base client class for HathiTrust APIs
- api_root = ''¶
base api URL for all requests
- class ppa.archive.hathi.HathiBibliographicAPI[source]¶
Wrapper for HathiTrust Bibliographic API.
https://www.hathitrust.org/bib_api
- api_root = 'http://catalog.hathitrust.org/api'¶
base api URL for all requests
- class ppa.archive.hathi.HathiBibliographicRecord(data)[source]¶
Representation of a HathiTrust bibliographic record.
- copy_last_updated(htid)[source]¶
Return last update date for a specific copy identified by hathi id. Returns as
datetime.date
- marcxml¶
Record marcxml if included (full records only), as an instance of
pymarc.Record
- property pub_dates¶
list of available publication dates
- property title¶
First title (standard title)
- exception ppa.archive.hathi.HathiItemForbidden[source]¶
Permission denied to access item in data API
- class ppa.archive.hathi.HathiObject(hathi_id)[source]¶
An object for working with a HathiTrust item with data in a locally configured pairtree datastore.
- mets_xml() MinimalMETS [source]¶
load METS xml file from pairtree and initialize as an instance of
MinimalMETS
- Return type:
MinimalMETS
- Raises:
storage_exceptions.ObjectNotFoundException
if the object is not found in pairtree storage
- Raises:
storage_exceptions.PartNotFoundException
if the mets.xml file is not found in pairtree storage for this object
- metsfile_path(ptree_client=None)[source]¶
path to mets xml file within the hathi contents for this work
- page_data()[source]¶
Return a generator of page content for this HathiTrust work based on pairtree and METS data, for indexing pages in Solr.
- pairtree_client()[source]¶
Initialize a pairtree client for the pairtree datastore this object belongs to, based on its HathiTrust record id.
- class ppa.archive.hathi.METSFile(node=None, context=None, **kwargs)[source]¶
File location information within a METS document.
- id = <eulxml.xmlmap.fields.StringField>¶
xml identifier
- location = <eulxml.xmlmap.fields.StringField>¶
<METS:file SIZE="1" ID="TXT00000001" MIMETYPE="text/plain" CREATED="2016-06-24T09:04:15Z" CHECKSUM="68b329da9893e34099c7d8ad5cb9c940" SEQ="00000001" CHECKSUMTYPE="MD5">
- sequence = <eulxml.xmlmap.fields.StringField>¶
sequence attribute
- class ppa.archive.hathi.MinimalMETS(node=None, context=None, **kwargs)[source]¶
Minimal
XmlObject
for METS that maps only what is needed to support page indexing for
ppa
.
- structmap_pages = <eulxml.xmlmap.fields.NodeListField>¶
list of struct map pages as
StructMapPage
- class ppa.archive.hathi.StructMapPage(node=None, context=None, **kwargs)[source]¶
Single logical page within a METS StructMap
- display_label¶
page display label; use order label if present; otherwise use order
- label = <eulxml.xmlmap.fields.StringField>¶
page label
- order = <eulxml.xmlmap.fields.IntegerField>¶
page order
- orderlabel = <eulxml.xmlmap.fields.StringField>¶
order label
- text_file¶
METSFile
corresponding to the text file pointer for this page
- text_file_id = <eulxml.xmlmap.fields.StringField>¶
- <METS:div ORDER="1" LABEL="FRONT_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page">
<METS:fptr FILEID="HTML00000001"/> <METS:fptr FILEID="TXT00000001"/> <METS:fptr FILEID="IMG00000001"/>
- <METS:file SIZE="1003" ID="HTML00000496" MIMETYPE="text/html" CREATED="2017-03-20T10:40:21Z"
CHECKSUM="f0a326c10b2a6dc9ae5e3ede261c9897" SEQ="00000496" CHECKSUMTYPE="MD5">
- text_file_location¶
location for the text file
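The display label rule and the text file pointer described for StructMapPage can be sketched with the standard library's xml.etree.ElementTree in place of eulxml. The first page div is taken from the example above; the second div, the function name page_info, and the rule of picking the fptr whose id starts with TXT are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

METS_NS = {"METS": "http://www.loc.gov/METS/"}

SAMPLE = """<METS:structMap xmlns:METS="http://www.loc.gov/METS/">
  <METS:div ORDER="1" LABEL="FRONT_COVER, IMAGE_ON_PAGE, IMPLICIT_PAGE_NUMBER" TYPE="page">
    <METS:fptr FILEID="HTML00000001"/>
    <METS:fptr FILEID="TXT00000001"/>
    <METS:fptr FILEID="IMG00000001"/>
  </METS:div>
  <METS:div ORDER="2" ORDERLABEL="ii" TYPE="page">
    <METS:fptr FILEID="TXT00000002"/>
  </METS:div>
</METS:structMap>"""


def page_info(div):
    """Return (display label, text file id) for one page div."""
    # prefer the order label when present; otherwise fall back to order
    display_label = div.get("ORDERLABEL") or div.get("ORDER")
    # assumption: the text file pointer is the fptr whose id starts with TXT
    text_ids = [
        fptr.get("FILEID")
        for fptr in div.findall("METS:fptr", METS_NS)
        if fptr.get("FILEID", "").startswith("TXT")
    ]
    return display_label, (text_ids[0] if text_ids else None)


root = ET.fromstring(SAMPLE)
pages = [page_info(div) for div in root.findall("METS:div", METS_NS)]
```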
Gale¶
- class ppa.archive.gale.GaleAPI[source]¶
Minimal Gale API client with functionality needed for PPA import.
Requires GALE_API_USERNAME configured in Django settings. Automatically uses the configured username to retrieve an API key when needed, and has logic to refresh the API key when it expires (30 minutes).
If TECHNICAL_CONTACT is configured in Django settings, it will be included in request headers when making API calls.
Implemented as a singleton; instantiating the class will return the same shared instance every time.
- property api_key¶
Property for current api key. Uses
get_api_key()
to request a new one when needed.
- api_root = 'https://api.gale.com/api'¶
base URL for all API requests
- get_item_pages(item_id, gale_record=None)[source]¶
Return a generator of page content for the specified digitized work from the Gale API. Takes an optional gale_record parameter (item record as returned by Gale API), to avoid making an extra API call if data is already available.
- instance = None¶
shared singleton instance; populated on first instantiation
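One common way to get the singleton behavior described for GaleAPI is to override __new__ and cache the instance on the class. This is a generic sketch with a hypothetical class name, not the actual GaleAPI code:

```python
class SharedClient:
    """Hypothetical class that hands back one shared instance."""

    instance = None  # populated on first instantiation

    def __new__(cls):
        # create the object only once; later calls reuse it
        if cls.instance is None:
            cls.instance = super().__new__(cls)
        return cls.instance
```

With this pattern, any state held on the instance (such as a session or a cached API key) is shared by every caller that instantiates the class.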
- exception ppa.archive.gale.MARCRecordNotFound[source]¶
record not found in local MARC record storage
- ppa.archive.gale.get_local_ocr(item_id)[source]¶
Load local OCR page text for the specified Gale volume, if available. This requires a base directory (specified by GALE_LOCAL_OCR) to be configured and assumes the following organization:
Volume-level directories are organized in stub directories that correspond to every third number (e.g., CW0128905397 -> 193). So, a Gale volume’s OCR data is located in the following directory: GALE_LOCAL_OCR / stub_dir / item_id.json
Page text is stored as a JSON dictionary with keys based on Gale page numbers, which is a 4-digit string (e.g., “0004”).
Raises a FileNotFoundError if the local OCR page text does not exist.
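The directory layout above can be sketched as follows. The helper names stub_dir and load_local_ocr are hypothetical, the base directory is passed in rather than read from Django settings, and the slice used for the stub directory is an assumption chosen because it reproduces the documented example:

```python
import json
import tempfile
from pathlib import Path


def stub_dir(item_id):
    """Every third character of the id, e.g. CW0128905397 -> 193."""
    # assumption: counting starts at the fourth character, which
    # reproduces the documented example
    return item_id[3::3]


def load_local_ocr(base_dir, item_id):
    """Load page text from base_dir / stub_dir / item_id.json."""
    ocr_path = Path(base_dir) / stub_dir(item_id) / f"{item_id}.json"
    # open() raises FileNotFoundError if the file does not exist
    with ocr_path.open(encoding="utf-8") as ocr_file:
        return json.load(ocr_file)


# demonstrate with a temporary directory standing in for GALE_LOCAL_OCR
with tempfile.TemporaryDirectory() as base_dir:
    volume_dir = Path(base_dir) / stub_dir("CW0128905397")
    volume_dir.mkdir()
    (volume_dir / "CW0128905397.json").write_text(
        json.dumps({"0004": "page text"}), encoding="utf-8"
    )
    pages = load_local_ocr(base_dir, "CW0128905397")
```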
Util¶
Manage Commands¶
Hathi Import¶
hathi_import is a custom manage command for bulk import of HathiTrust materials into the local database for management. It does not index into Solr for search and browse; use the index script for that after import.
This script expects a local copy of dataset files in pairtree format retrieved by rsync. (Note that pairtree data must include pairtree version file to be valid.)
Contents are inspected from the configured HATHI_DATA path;
DigitizedWork
records are created or updated
based on identifiers found and metadata retrieved from the HathiTrust
Bibliographic API. Page content is only reflected in the database
via a total page count per work (but page contents will be indexed in
Solr via index script).
By default, existing records are updated only when the Hathi record has changed or if requested via the --update script option.
Supports importing specific items by hathi id, but the pairtree content for the items still must exist at the configured path.
Example usage:
# import everything with defaults
python manage.py hathi_import
# import specific items
python manage.py hathi_import htid1 htid2 htid3
# re-import and update records
python manage.py hathi_import --update
# display progressbar to show status and ETA
python manage.py hathi_import -v 0 --progress
- class ppa.archive.management.commands.hathi_import.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Import HathiTrust digitized items into PPA to be managed and searched
- count_hathi_ids()[source]¶
count items in the pairtree structure without loading all into memory at once.
- get_hathi_ids()[source]¶
Generator of hathi ids from previously rsynced hathitrust data, based on the configured HATHI_DATA path in settings.
- handle(*args, **kwargs)[source]¶
The actual logic of the command. Subclasses must implement this method.
- import_digitizedwork(htid)[source]¶
Import a single work into the database. Retrieves bibliographic data from Hathi api. If the record already exists in the database, it is only updated if the hathi record has changed or if an update is requested by the user. Creates admin log entry for record creation or record update. Returns None if there is an error retrieving bibliographic data or no update is needed; otherwise, returns the
DigitizedWork
.
- initialize_pairtrees()[source]¶
Initialize pairtree storage clients for each subdirectory in the configured HATHI_DATA path.
- v_normal = 1¶
normal verbosity level
Hathi Excerpt¶
hathi_excerpt is a custom manage command to convert existing HathiTrust items into excerpts or articles. It takes a CSV file with information about the items to excerpt. It does handle multiple excerpts for the same source id, as long as that source id is present in the database and data is available in the HathiTrust pairtree data.
- The CSV must include the following fields:
Item Type
Volume ID
Title
Sort Title
Book/Journal Title
Digital Page Range
Collection
Record ID
- If the CSV includes these optional fields, they will be used:
Author
Publication Date
Publication Place
Publisher
Enumcron
Original Page Range
Notes
Public Notes
Updated and added records are automatically indexed in Solr.
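Before running the command, a CSV could be checked against the required fields listed above with a short script like this one. The field names come from the list; the helper name missing_fields is hypothetical and is not part of the hathi_excerpt command:

```python
import csv
import io

REQUIRED_FIELDS = [
    "Item Type",
    "Volume ID",
    "Title",
    "Sort Title",
    "Book/Journal Title",
    "Digital Page Range",
    "Collection",
    "Record ID",
]


def missing_fields(csv_file):
    """Return required field names absent from the CSV header row."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [field for field in REQUIRED_FIELDS if field not in header]


# a header that only has the first three required columns
partial = io.StringIO("Item Type,Volume ID,Title\n")
missing = missing_fields(partial)
```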
- class ppa.archive.management.commands.hathi_excerpt.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Convert existing HathiTrust full works into excerpts
- excerpt(row)[source]¶
Process a row of the spreadsheet, and either convert an existing full work to an excerpt or create a new excerpt.
- handle(*args, **kwargs)[source]¶
The actual logic of the command. Subclasses must implement this method.
- load_collections()[source]¶
load collections from the database and create a lookup based on collection names
- log_action(digwork, created=True)[source]¶
Create a log entry to document excerpting or creating the record. Message and action flag are determined by created boolean.
- v_normal = 1¶
normal verbosity level
Hathi rsync¶
- class ppa.archive.management.commands.hathi_rsync.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Update HathiTrust pairtree data via rsync
- handle(*args, **kwargs)[source]¶
The actual logic of the command. Subclasses must implement this method.
- v_normal = 1¶
normal verbosity level
Gale Import¶
gale_import is a custom manage command for bulk import of Gale materials into the local database for management. It takes either a list of Gale item ids or a path to a CSV file.
Items are imported into the database for management and also indexed into Solr as part of this import script (both works and pages).
Example usage:
# import from a csv file
python manage.py gale_import -c path/to/import.csv
# import specific items
python manage.py gale_import galeid1 galeid2 galeid3
When using a CSV file for import, it must include an ID field; it may also include NOTES (any contents will be imported into private notes), and fields to indicate collection membership to be set on import. These are the supported collection abbreviations:
OB: Original Bibliography
LIT: Literary
MUS: Music
TYP: Typographically Unique
LING: Linguistic
DIC: Dictionaries
WL: Word Lists
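A sketch of reading collection membership from an import CSV using the abbreviations above. It assumes each abbreviation is a column and that any non-empty cell means the item belongs to that collection; that convention, the sample row, and the helper name row_collections are assumptions, not documented behavior:

```python
import csv
import io

COLLECTION_CODES = {
    "OB": "Original Bibliography",
    "LIT": "Literary",
    "MUS": "Music",
    "TYP": "Typographically Unique",
    "LING": "Linguistic",
    "DIC": "Dictionaries",
    "WL": "Word Lists",
}


def row_collections(row):
    """Collection names for every abbreviation column with a non-empty value."""
    return [
        name
        for code, name in COLLECTION_CODES.items()
        if (row.get(code) or "").strip()
    ]


sample = io.StringIO("ID,NOTES,LIT,MUS\nCW0128905397,check scan,x,\n")
rows = list(csv.DictReader(sample))
collections = row_collections(rows[0])
```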
- class ppa.archive.management.commands.gale_import.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Import Gale content into PPA for management and search
- handle(*args, **kwargs)[source]¶
The actual logic of the command. Subclasses must implement this method.
- import_record(gale_id, **kwargs)[source]¶
Import a single work into the database. Retrieves record data from Gale API.
- load_collections()[source]¶
Load
Collection
records from the database and create a lookup based on the codes used in the spreadsheet.
- v_normal = 1¶
normal verbosity level
Generate Corpus¶
generate_textcorpus is a custom manage command to generate a plain text corpus from Solr. It should be run after content has been indexed into Solr via the index manage command.
The full text corpus is generated from Solr; it does not include content for suppressed works or their pages (note that this depends on Solr content being current).
Examples:
- Expected use:
python manage.py generate_textcorpus
- Specify a path:
python manage.py generate_textcorpus --path ~/ppa_solr_corpus
- Dry run (do not create any files or folders):
python manage.py generate_textcorpus --dry-run
- Partial run (save only N rows, for testing):
python manage.py generate_textcorpus --doc-limit 100
- Cron-style run (no progress bar, but logs):
python manage.py generate_textcorpus --no-progress --verbosity 2
Notes:
Default path is ppa_corpus_{timestamp} in the current working directory
Default batch size is 10,000, meaning 10,000 records are pulled from Solr at a time. Usage testing showed that this default iterates over the collection fastest.
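To illustrate what a batch size of 10,000 means in practice, here is a generic sketch of consuming a record stream in fixed-size batches. It is not the command's implementation, which pulls records from Solr; iter_batches is a hypothetical helper and a plain range stands in for the Solr results.

```python
from itertools import islice


def iter_batches(records, batch_size=10_000):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 25,000 stand-in records are consumed as two full batches and one partial.
batch_sizes = [len(batch) for batch in iter_batches(range(25_000))]
```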
- class ppa.archive.management.commands.generate_textcorpus.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Custom manage command to generate a text corpus from text indexed in Solr.
- handle(*args, **options)[source]¶
The actual logic of the command. Subclasses must implement this method.
- iter_pages()[source]¶
Yield results from iter_solr() with item_type=page
- iter_solr(item_type='page')[source]¶
Returns a generator of Solr documents for the requested item_type (page or work).
- iter_works()[source]¶
Yield results from iter_solr() with item_type=work
- v_normal = 1¶
normal verbosity level
EEBO-TCP Import¶
eebo_import is a custom manage command for bulk import of EEBO-TCP materials into the local database for management. It takes a path to a CSV file and requires that the path to EEBO data is configured in Django settings.
Items are imported into the database for management and also indexed into Solr as part of this import script (both works and pages).
Example usage:
python manage.py eebo_import path/to/eebo_works.csv
- class ppa.archive.management.commands.eebo_import.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Import EEBO-TCP content into PPA for management and search
- handle(*args, **kwargs)[source]¶
The actual logic of the command. Subclasses must implement this method.
- v_normal = 1¶
normal verbosity level
Common¶
Django app for common functionality that doesn’t have an obvious home
Admin¶
Views¶
- class ppa.common.views.AjaxTemplateMixin(**kwargs)[source]¶
View mixin to use a different template when responding to an ajax request.
- ajax_template_name = None¶
name of the template to use for ajax request
- get_template_names()[source]¶
Return ajax_template_name if this is an ajax request; otherwise return default template name.
- vary_headers = ['X-Requested-With']¶
vary on X-Requested-With to avoid browsers caching and displaying ajax response for the non-ajax response
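The template selection described above can be sketched without Django as follows. This is an illustration only: AjaxTemplateSketch, its constructor, is_ajax, and both template paths are hypothetical, and the real mixin works with Django's request object rather than a plain headers dict. The check relies on the convention that XMLHttpRequest-based libraries send X-Requested-With: XMLHttpRequest.

```python
class AjaxTemplateSketch:
    """Framework-free stand-in for the mixin's template selection."""

    template_name = "archive/list.html"  # hypothetical default template
    ajax_template_name = "archive/snippets/results.html"  # hypothetical

    def __init__(self, headers):
        # Stand-in for the headers of an incoming request.
        self.headers = headers

    def is_ajax(self):
        # Conventional check for requests made via XMLHttpRequest.
        return self.headers.get("X-Requested-With") == "XMLHttpRequest"

    def get_template_names(self):
        # Use the ajax template for ajax requests, the default otherwise.
        if self.is_ajax():
            return [self.ajax_template_name]
        return [self.template_name]


ajax_view = AjaxTemplateSketch({"X-Requested-With": "XMLHttpRequest"})
plain_view = AjaxTemplateSketch({})
```

Because the same URL returns different content depending on this header, the response needs to vary on it, which is what vary_headers records.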
Pages¶
Management and display of home page and other content pages. Includes custom page models for collections page, contributor page, and a person snippet for contributors and authors.
Models¶
- ppa.pages.models.ALT_TEXT_HELP = 'Alternative text for visually impaired users to\nbriefly communicate the intended message of the image in this context.'¶
help text for image alternative text
- class ppa.pages.models.BodyContentBlock(*args, **kwargs)[source]¶
Common set of content blocks to be used on both content pages and editorial pages
- class ppa.pages.models.CollectionPage(*args, **kwargs)[source]¶
Collection list page, with editable text content
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class ppa.pages.models.ContentPage(*args, **kwargs)[source]¶
Basic content page model.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class ppa.pages.models.ContributorPage(*args, **kwargs)[source]¶
Project contributor and advisory board page.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class ppa.pages.models.HomePage(*args, **kwargs)[source]¶
wagtail.models.Page model for PPA home page
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class ppa.pages.models.ImageWithCaption(*args, **kwargs)[source]¶
StructBlock for an image with a formatted caption, so caption can be context-specific. Also allows images to be floated right, left, or take up the width of the page.
- class ppa.pages.models.LinkableSectionBlock(*args, **kwargs)[source]¶
StructBlock for a rich text block and an associated title that will render as an <h2>. Creates an anchor (<a>) so that the section can be directly linked to using a url fragment.
- clean(value)[source]¶
Validate value and return a cleaned version of it, or throw a ValidationError if validation fails. The thrown ValidationError instance will subsequently be passed to render() to display the error message; the ValidationError must therefore include all detail necessary to perform that rendering, such as identifying the specific child block(s) with errors, in the case of nested blocks. (It is suggested that you use the ‘params’ attribute for this; using error_list / error_dict is unreliable because Django tends to hack around with these when nested.)
- class ppa.pages.models.PagePreviewDescriptionMixin(*args, **kwargs)[source]¶
Page mixin with logic for page preview content. Adds an optional richtext description field, and methods to get description and plain-text description, for use in previews on the site and plain-text metadata previews.
- allowed_tags = ['p', 'li', 'strong', 'b', 'acronym', 'abbr', 'ul', 'ol', 'em', 'i', 'code', 'blockquote']¶
allowed tags for bleach html stripping in description
- get_description()[source]¶
Get formatted description for preview. Uses description field if there is content, otherwise uses the beginning of the body content.
- get_plaintext_description()[source]¶
Get plain-text description for use in metadata. Uses search_description field if set; otherwise uses the result of get_description() with tags stripped.
- max_length = 250¶
maximum length for description to be displayed
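The fallback described for get_plaintext_description() can be sketched as below. This is a standalone approximation under stated assumptions: plaintext_description and _TextExtractor are hypothetical names, and the standard library's HTMLParser stands in for whatever tag stripping the mixin actually performs. Truncation to max_length is left out because the documentation does not describe how it is applied.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content and discard all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plaintext_description(search_description, html_description):
    """Prefer search_description; otherwise strip tags from the description."""
    if search_description:
        return search_description
    parser = _TextExtractor()
    parser.feed(html_description)
    parser.close()
    return "".join(parser.chunks).strip()


stripped = plaintext_description("", "<p>A <em>short</em> summary.</p>")
explicit = plaintext_description("Custom text", "<p>ignored</p>")
```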
- class ppa.pages.models.Person(*args, **kwargs)[source]¶
Common model for a person, currently used to document authorship for instances of ppa.editorial.models.EditorialPage.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- description¶
description (affiliation, etc.)
- name¶
the display name of an individual
- photo¶
Optional profile image to be associated with a person
- project_role¶
project role
- project_years¶
project years
- url¶
identifying URI for a person (VIAF, ORCID iD, personal website, etc.)
Embed Finders¶
Custom EmbedFinder implementations for embedding content in wagtail pages.
- class ppa.pages.embed_finders.GlitchEmbedFinder[source]¶
Custom oembed finder built to embed Glitch apps in wagtail pages.
To support embedding, the glitch app should include a file named embed.json, available directly under the top level url, with oembed content:
{
    "title": "title",
    "author_name": "author",
    "provider_name": "Glitch",
    "type": "rich",
    "thumbnail_url": "URL to thumbnail image",
    "width": xx,
    "height": xx
}
If the request for an embed.json file fails, no content will be embedded.
Any urls that cannot automatically be made relative by embed code (i.e. data files loaded by javascript code) should use absolute URLs, or they will not resolve when embedded.
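The lookup described above, where embed.json sits directly under the app's top-level URL and a failed request means nothing is embedded, might look roughly like this. It is a sketch rather than the finder's code: embed_json_url, parse_embed, and the REQUIRED_KEYS set are hypothetical, the example URL is a placeholder, and the network request itself is omitted so the block stays self-contained.

```python
import json
from urllib.parse import urljoin

# Assumption: these keys from the documented example are treated as required.
REQUIRED_KEYS = {"title", "author_name", "provider_name", "type", "width", "height"}


def embed_json_url(app_url):
    """Return the URL of embed.json directly under the app's top-level URL."""
    return urljoin(app_url.rstrip("/") + "/", "embed.json")


def parse_embed(raw_json):
    """Return the oembed dict, or None if the content is unusable."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return None
    return data


url = embed_json_url("https://example.glitch.me")
valid = parse_embed(
    '{"title": "Demo", "author_name": "A", "provider_name": "Glitch",'
    ' "type": "rich", "width": 600, "height": 400}'
)
invalid = parse_embed("not json")
```

Returning None for unusable content mirrors the documented behavior that a failed embed.json request results in no embedded content.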
Manage Commands¶
Setup Site Pages¶
setup_site_pages is a custom manage command to install a default set of pages and menus for the Wagtail CMS. It is designed not to touch other content.
Example usage:
python manage.py setup_site_pages
- class ppa.pages.management.commands.setup_site_pages.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Setup initial wagtail site and pages needed for PPA navigation
- create_wagtail_site(root_page)[source]¶
Create a wagtail site object from the current default Django site.
- handle(*args, **kwargs)[source]¶
The actual logic of the command. Subclasses must implement this method.
- v_normal = 1¶
normal verbosity level
Editorial¶
Management and display of editorial content. Includes custom page models for an editorial list page and editorial content pages, structured roughly like a scholarly blog.
Models¶
- class ppa.editorial.models.EditorialIndexPage(*args, **kwargs)[source]¶
Editorial index page; list recent editorial articles.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
Unapi¶
Django app for unAPI service
Views¶
- class ppa.unapi.views.UnAPIView(**kwargs)[source]¶
Simple unAPI service endpoint. With no parameters or only id, provides a list of available metadata formats. If id and format are specified, returns the metadata for the specified item in the requested format.
See the archived unAPI website for more details: https://web.archive.org/web/20140331070802/http://unapi.info/specs/
- content_type = 'application/xml'¶
default content type, when serving format information
- file_extension = {'marc': 'mrc'}¶
file extension for metadata formats, as a convenience to set download filename extension
- formats = {'marc': {'type': 'application/marc'}}¶
available metadata formats
- get(*args, **kwargs)[source]¶
Override get to check if id and format are specified; if they are, return the requested metadata. Otherwise, falls back to normal template view behavior and displays format information.
- template_name = 'unapi/formats.xml'¶
template for format information
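The dispatch rule described for UnAPIView can be summarized in a few lines of plain Python. This is a sketch of the decision only, not the view: unapi_dispatch and the shape of its return value are hypothetical, the download filename pattern is an assumption, and raising ValueError for an unknown format is a choice made for this sketch since the documentation does not say how the view handles that case.

```python
# Values copied from the class attributes documented above.
FORMATS = {"marc": {"type": "application/marc"}}
FILE_EXTENSION = {"marc": "mrc"}
DEFAULT_CONTENT_TYPE = "application/xml"


def unapi_dispatch(item_id=None, fmt=None):
    """Decide whether to serve metadata or the list of available formats."""
    if item_id and fmt:
        if fmt not in FORMATS:
            # Sketch-only choice; the real view's behavior is not documented.
            raise ValueError(f"unsupported format: {fmt}")
        return {
            "action": "metadata",
            "content_type": FORMATS[fmt]["type"],
            # Assumed filename pattern: item id plus the format's extension.
            "filename": f"{item_id}.{FILE_EXTENSION[fmt]}",
        }
    # No parameters, or id only: list the available formats.
    return {"action": "formats", "content_type": DEFAULT_CONTENT_TYPE}


no_params = unapi_dispatch()
id_only = unapi_dispatch(item_id="abc123")
full = unapi_dispatch(item_id="abc123", fmt="marc")
```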