Documents and Fragments¶
The geniza.corpus application is the heart of this project. The most important models are Document and Fragment, with a number of supporting models to track the source of the fragment, document type, languages and scripts used in a document, etc.
models¶
- class geniza.corpus.models.Collection(*args, **kwargs)[source]¶
Collection or library that holds Geniza fragments
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- property full_name¶
attempt to combine library and collection name into a human readable format
- class geniza.corpus.models.CollectionManager(*args, **kwargs)[source]¶
Custom manager for Collection with natural key lookup
- class geniza.corpus.models.Dating(*args, **kwargs)[source]¶
An inferred date for a document.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- property standard_date_display¶
Standard date in human-readable format for document details pages
- class geniza.corpus.models.Document(*args, **kwargs)[source]¶
A unified document such as a letter or legal document that appears on one or more fragments.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- admin_thumbnails()[source]¶
generate html for thumbnails of all iiif images, for image reordering UI in admin
- all_secondary_languages()[source]¶
comma delimited string of all secondary languages for this document
- attribution()[source]¶
Generate a tuple of three attribution components for use in IIIF manifests or wherever images/transcriptions need attribution.
- property available_digital_content¶
Helper method for the ITT viewer to collect all available panels into a list
- clean_fields(exclude=None)¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- property collection¶
collection (abbreviation) for associated fragments
- property collections¶
collection objects for associated fragments
- dating_range()[source]¶
Return the start and end of the document’s possible date range, as PartialDate objects, including standardized document dates and inferred Datings, if any exist.
- property default_translation¶
The first translation footnote that is in the current language, or the first translation footnote ordered alphabetically by source if one is not available in the current language.
- digital_editions()[source]¶
All footnotes for this document where the document relation includes digital edition.
- digital_footnotes()[source]¶
All footnotes for this document where the document relation includes digital edition or digital translation.
- digital_translations()[source]¶
All footnotes for this document where the document relation includes digital translation.
- property formatted_citation¶
a formatted citation for display at the bottom of Document detail pages
- property fragment_historical_shelfmarks¶
Property to display set of all historical shelfmarks on the document
- fragments_other_docs()[source]¶
List of other documents that are on the same fragment(s) as this document (does not include suppressed documents). Returns a list of Document objects.
- classmethod from_manifest_uri(uri)[source]¶
Given a manifest URI (as used in transcription annotations), find a Document matching its pgpid
- get_deferred_fields()¶
Return a set containing names of deferred fields for this instance.
- has_digital_content()[source]¶
Helper method for the ITT viewer on the public front-end to determine whether a document has any images, digital editions, or digital translations.
- has_translation()[source]¶
Helper method to determine if document has a translation.
- Returns:
Whether document has translation
- Return type:
bool
- iiif_images(filter_side=False, with_placeholders=False)[source]¶
Dict of IIIF images and labels for images of the Document’s Fragments, keyed on canvas.
- Parameters:
filter_side – if TextBlocks have side info, filter images by side (default: False)
with_placeholders – if there are digital editions with canvases missing images, include placeholder images for each additional canvas (default: False)
- classmethod items_to_index()[source]¶
Custom logic for finding items to be indexed when indexing in bulk.
- list_thumbnail()[source]¶
generate html for thumbnail of first image, for display in related documents lists
- merge_with(merge_docs, rationale, user=None)[source]¶
Merge the specified documents into this one. Combines all metadata into this document, adds the merged documents into list of old PGP IDs, and creates a log entry documenting the merge, including the rationale.
- classmethod prep_index_chunk(chunk)[source]¶
Prefetch related information when indexing in chunks (modifies queryset chunk in place)
- property primary_lang_code¶
Primary language code for this document, when there is only one primary language set and it has an ISO code available. Returns None if unset or unavailable.
- property primary_script¶
Primary script for this document, if shared across all primary languages.
- refresh_from_db(using=None, fields=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
List of other documents with any of the same shelfmarks as this document; does not include suppressed documents. Queries Solr and returns a list of dict objects.
- save(*args, **kwargs)[source]¶
Save the current instance. Override this in a subclass if you want to control the saving process.
The ‘force_insert’ and ‘force_update’ parameters can be used to insist that the “save” must be an SQL insert or update (or equivalent for non-SQL backends), respectively. Normally, they should not be set.
- property shelfmark¶
shelfmarks for associated fragments
- property shelfmark_display¶
Label for this document; by default, based on the combined shelfmarks from all certain associated fragments; uses shelfmark_override if set
- solr_dating_range()[source]¶
Return the document’s dating range, including inferred, as a Solr date range.
- status¶
status of record; currently choices are public or suppressed
- property title¶
Short title for identifying the document, e.g. via search.
- class geniza.corpus.models.DocumentEventRelation(*args, **kwargs)[source]¶
A relationship between a document and an event
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class geniza.corpus.models.DocumentSignalHandlers[source]¶
Signal handlers for indexing Document records when related records are saved or deleted.
reindex all associated documents when related data is changed
reindex associated documents when a related object is deleted
reindex associated documents when a related object is saved
- class geniza.corpus.models.DocumentType(*args, **kwargs)[source]¶
Controlled vocabulary of document types.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- clean_fields(exclude=None)¶
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- get_deferred_fields()¶
Return a set containing names of deferred fields for this instance.
- class property objects_by_label¶
A dict of object instances keyed on English display label, used for search form and search results, which should be based on Solr facet and query responses (indexed in English).
- refresh_from_db(using=None, fields=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- class geniza.corpus.models.DocumentTypeManager(*args, **kwargs)[source]¶
Custom manager for DocumentType with natural key lookup
- class geniza.corpus.models.Fragment(*args, **kwargs)[source]¶
A single fragment or multifragment held by a particular library or archive.
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- static admin_thumbnails(images, labels, canvases=[], selected=[])[source]¶
Convenience method for generating IIIF thumbnails HTML from lists of images and labels; separated for reuse in Document
- property attribution¶
Generate an attribution for this fragment
- clean()[source]¶
Custom validation and cleaning; currently only clean_iiif_url()
- iiif_images()[source]¶
IIIF image URLs for this fragment. Returns a list of IIIFImageClient and corresponding list of labels, or None if this fragment has no IIIF URL associated.
- property iiif_provenance¶
Generate a provenance statement for this fragment from IIIF
- class geniza.corpus.models.FragmentManager(*args, **kwargs)[source]¶
Custom manager for Fragment with natural key lookup
- class geniza.corpus.models.LanguageScript(*args, **kwargs)[source]¶
Combination language and script
- exception DoesNotExist¶
- exception MultipleObjectsReturned¶
- class geniza.corpus.models.LanguageScriptManager(*args, **kwargs)[source]¶
Custom manager for LanguageScript with natural key lookup
- class geniza.corpus.models.PermalinkMixin[source]¶
Mixin to generate a permalink for Django model objects by removing language code from the object’s absolute URL.
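The idea of dropping the language code from a URL can be illustrated with a minimal sketch. It assumes the absolute URL path starts with a language-code segment such as /en/; the function name strip_language_prefix and the list of language codes are hypothetical and are not taken from the project's code.

```python
def strip_language_prefix(path, language_codes=("en", "he", "ar")):
    """Remove a leading language-code segment from a URL path, if present.

    Hypothetical helper for illustration; the language codes are assumed.
    """
    # "/en/documents/1234/" splits into ["", "en", "documents", "1234", ""]
    parts = path.split("/")
    if len(parts) > 1 and parts[1] in language_codes:
        # drop the language segment; the rest of the path is unchanged
        parts.pop(1)
    return "/".join(parts)


print(strip_language_prefix("/en/documents/1234/"))
```

A path without a recognized language segment is returned unchanged, so the helper is safe to apply to URLs that are already permalinks.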
- class geniza.corpus.models.TagSignalHandlers[source]¶
Signal handlers for taggit.Tag records.
dates¶
- class geniza.corpus.dates.Calendar[source]¶
Codes for supported calendars
- ANNO_MUNDI = 'am'¶
Anno Mundi calendar (Hebrew)
- HIJRI = 'h'¶
Hijri calendar (Islamic)
- KHARAJI = 'k'¶
Kharaji calendar
- SELEUCID = 's'¶
Seleucid calendar
- SELEUCID_OFFSET = 3449¶
offset for Seleucid calendar: Anno Mundi - 3449
- can_convert = ['am', 'h', 's']¶
calendars that can be converted to Julian/Gregorian
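The Seleucid offset above amounts to simple arithmetic: if a Seleucid year equals the Anno Mundi year minus 3449, then the Anno Mundi year is the Seleucid year plus 3449. The sketch below shows only that step; the function name seleucid_to_anno_mundi is hypothetical, and the actual conversion logic in the module may differ.

```python
SELEUCID_OFFSET = 3449  # value documented for Calendar.SELEUCID_OFFSET


def seleucid_to_anno_mundi(seleucid_year):
    """Convert a Seleucid year to the equivalent Anno Mundi year.

    Hypothetical helper: Seleucid = Anno Mundi - 3449,
    so Anno Mundi = Seleucid + 3449.
    """
    return seleucid_year + SELEUCID_OFFSET


print(seleucid_to_anno_mundi(1400))  # 4849
```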
- class geniza.corpus.dates.DocumentDateMixin(*args, **kwargs)[source]¶
Mixin for document date fields (original and standardized), and related logic for displaying, converting, and validating dates.
- clean()[source]¶
Require doc_date_original and doc_date_calendar to be set if either one is present.
- property document_date¶
Generate formatted display of combined original and standardized dates
- property end_date¶
Return the end date of the document’s standardized date or date range, if set.
- property original_date¶
Generate formatted display for the document’s original/historical date
- property parsed_date¶
Parse standard date (if set) and return as dictionary of start/end PartialDate objects
- standardize_date(update=False)[source]¶
Convert the document’s original date to a standardized date, if possible. If update is requested, will store the converted value on doc_date_standard
- property start_date¶
Return the start date of the document’s standardized date or date range, if set.
- class geniza.corpus.dates.PartialDate(str)[source]¶
Simple partial date object to handle parsing and display of dates in the format YYYY, YYYY-MM, or YYYY-MM-DD. Display format is based on known precision of year, month, or day.
- display_format = {'day': 'DATE_FORMAT', 'month': 'F Y', 'year': 'Y'}¶
public display format based on date precision
- static get_date_range(old_range, new_range)[source]¶
Compute the union (widest possible date range) between two PartialDate ranges.
- iso_format = {'day': '%Y-%m-%d', 'month': '%Y-%m', 'year': '%Y'}¶
ISO format based on date precision
- isoformat(mode='min', fmt='precision')[source]¶
Display partial date in ISO format. By default, will display YYYY, YYYY-MM, or YYYY-MM-DD according to known precision. If min or max is requested, will display YYYY-MM-DD for earliest or latest date based on known precision.
- Parameters:
mode – how to fill in unknowns: min, or max (default: min)
fmt – format: precision (default), isoformat, or numeric
- num_fmt = '%Y%m%d'¶
numeric format for indexing and sorting
- numeric_format(mode='min')[source]¶
Date in numeric format for sorting; max or min for unknowns. See isoformat() for more details.
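The behavior described for PartialDate (precision-aware parsing, min/max filling of unknown parts, and the union of two ranges) can be sketched with the standard library. This is an illustration under stated assumptions, not the class's implementation: parse_partial, fill_unknowns, and widest_range are hypothetical names, and the sketch ignores details such as missing range endpoints.

```python
import calendar
import datetime


def parse_partial(value):
    """Split 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' into (year, month, day, precision)."""
    parts = [int(p) for p in value.split("-")]
    precision = ("year", "month", "day")[len(parts) - 1]
    # pad unknown month/day with None
    year, month, day = parts + [None] * (3 - len(parts))
    return year, month, day, precision


def fill_unknowns(value, mode="min"):
    """Return a full date, filling unknown parts with the earliest
    (mode='min') or latest (mode='max') possible values."""
    year, month, day, _ = parse_partial(value)
    if month is None:
        month = 1 if mode == "min" else 12
    if day is None:
        # the last day depends on the month and on leap years
        day = 1 if mode == "min" else calendar.monthrange(year, month)[1]
    return datetime.date(year, month, day)


def widest_range(old_range, new_range):
    """Union of two (start, end) ranges: earliest start and latest end."""
    return (min(old_range[0], new_range[0]), max(old_range[1], new_range[1]))


print(fill_unknowns("1150-02", mode="max").isoformat())  # 1150-02-28
```

Filling unknowns before comparing is what makes dates of different precision sortable: a year-only date spans from its first day to its last.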
- geniza.corpus.dates.calendar_converter = {'am': <module 'convertdate.hebrew'>, 'h': <module 'convertdate.islamic'>, 's': <module 'convertdate.hebrew'>}¶
mapping between supported calendars and corresponding convertdate module
- geniza.corpus.dates.convert_hebrew_date(historic_date)[source]¶
Convert a date in the Hebrew Anno Mundi calendar to the Julian or Gregorian calendar
- geniza.corpus.dates.convert_islamic_date(historic_date)[source]¶
Convert a date in the Islamic Hijri calendar to the Julian or Gregorian calendar
- geniza.corpus.dates.convert_seleucid_date(historic_date)[source]¶
Convert a date in the Greek Seleucid calendar to the Julian or Gregorian calendar
- geniza.corpus.dates.display_date_range(earliest, latest)[source]¶
Display a date range or single date in ISO format
- geniza.corpus.dates.get_calendar_date(converter, year, month=None, day=None, mode=None)[source]¶
Convert a date from a supported calendar and return as a datetime.date or tuple of dates for a date range, when the conversion is ambiguous. Takes year and optional month and day.
- geniza.corpus.dates.get_calendar_month(convertdate_module, month)[source]¶
Convert month name to month number for the specified calendar.
- Parameters:
convertdate_module – convertdate calendar module to use
month – string month name
- Return int:
month number
- geniza.corpus.dates.get_hebrew_month(month_name)[source]¶
Convert Hebrew month name to month number. Supports local month name aliases for alternate spellings.
- geniza.corpus.dates.get_islamic_month(month_name)[source]¶
Convert Islamic month name to month number; works with or without accents, and supports local month-name overrides.
- geniza.corpus.dates.re_original_date = re.compile('(?:(?P<weekday>\\w+day),? )?(?:(?P<day>\\d+) )?(?:(?P<month>[^\\d]+( I{1,2})?) )?(?P<year>\\d{3,4})')¶
regular expression for extracting information from original date string
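A short example of how this pattern splits an original date string into its parts. The pattern is copied from the documented value above; the input strings are made up for illustration, and the use of match() rather than another matching method is an assumption about how the pattern is applied.

```python
import re

# pattern copied from the documented value of re_original_date
re_original_date = re.compile(
    r"(?:(?P<weekday>\w+day),? )?(?:(?P<day>\d+) )?"
    r"(?:(?P<month>[^\d]+( I{1,2})?) )?(?P<year>\d{3,4})"
)

# illustrative input; weekday, day, and month are all optional
match = re_original_date.match("Thursday, 12 Sivan 4795")
print(match.groupdict())

# a bare year also matches; the optional groups are None
year_only = re_original_date.match("4795")
print(year_only.group("year"), year_only.group("month"))
```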
metadata export¶
- class geniza.corpus.metadata_export.AdminFragmentExporter(queryset=None, progress=False)[source]¶
Admin fragment export variant; adds notes, review, and admin url fields.
- get_export_data_dict(fragment)[source]¶
A given Exporter class (DocumentExporter, FootnoteExporter, etc) must implement this function. It ought to return a dictionary of exported information for a given object.
- Parameters:
obj (object) – Model object (document, footnote, etc)
- Raises:
NotImplementedError – This method must be implemented by subclasses
- class geniza.corpus.metadata_export.DocumentExporter(queryset=None, progress=False)[source]¶
A subclass of geniza.common.metadata_export.Exporter that exports information relating to Document. Extends get_queryset() and get_export_data_dict().
- class geniza.corpus.metadata_export.FragmentExporter(queryset=None, progress=False)[source]¶
A subclass of geniza.common.metadata_export.Exporter that exports information relating to Fragment.
- get_export_data_dict(fragment)[source]¶
A given Exporter class (DocumentExporter, FootnoteExporter, etc) must implement this function. It ought to return a dictionary of exported information for a given object.
- Parameters:
obj (object) – Model object (document, footnote, etc)
- Raises:
NotImplementedError – This method must be implemented by subclasses
- class geniza.corpus.metadata_export.PublicDocumentExporter(queryset=None, progress=False)[source]¶
Public version of the document exporter. It can e.g. modify the get_queryset to ensure it deals with public documents.
views¶
- class geniza.corpus.views.DocumentAnnotationListView(**kwargs)[source]¶
Generate a IIIF Annotation List for a document to make transcription content available for inclusion in local IIIF manifest.
- get(request, *args, **kwargs)[source]¶
handle GET request: construct and return JSON annotation list
- viewname = 'corpus-uris:document-annotations'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.DocumentDetailBase(**kwargs)[source]¶
View mixin to handle lastmodified and redirects for documents with old PGPIDs. Overrides get request in the case of a 404, looking for any records with passed PGPID in old_pgpids, and if found, redirects to that document with current PGPID.
- class geniza.corpus.views.DocumentDetailView(**kwargs)[source]¶
public display of a single Document
- viewname = 'corpus:document'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.DocumentManifestView(**kwargs)[source]¶
Generate a IIIF Presentation manifest for a document, incorporating available canvases and attaching transcription content via annotation.
- viewname = 'corpus-uris:document-manifest'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.DocumentScholarshipView(**kwargs)[source]¶
List of Footnote references for a single Document
- get_queryset(*args, **kwargs)[source]¶
Prefetch footnotes, and don’t show the page if there are none.
- viewname = 'corpus:document-scholarship'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.DocumentSearchView(**kwargs)[source]¶
- dispatch(request, *args, **kwargs)[source]¶
Wrap the dispatch method to add a last modified header if one is available, then return a conditional response.
- form_class¶
alias of DocumentSearchForm
- get_applied_filter_labels(form, field, filters)[source]¶
return a list of objects with field/value pairs, and translated labels, one for each applied filter
- get_boolfield_label(form, fieldname)[source]¶
Return a label dict for a boolean field (works differently than other fields)
- get_context_data(**kwargs)[source]¶
extend context data to add page metadata, highlighting, and update form with facets
- get_paginate_by(queryset)[source]¶
Try to get pagination from GET request query, if there is none fallback to the original.
- get_solr_sort(sort_option, exclude_inferred=False)[source]¶
Return solr sort field for user-selected sort option; generates random sort field using solr random dynamic field; otherwise uses solr sort field from solr_sort
- last_modified()[source]¶
override last modified from solr mixin to not return a value when sorting by random
- solr_lastmodified_filters = {'item_type_s': 'document'}¶
solr query filter for getting last modified date
- class geniza.corpus.views.DocumentTranscribeView(**kwargs)[source]¶
View for the Transcription/Translation Editor page that uses annotorious-tahqiq
- get_context_data(**kwargs)[source]¶
Pass annotation configuration and TinyMCE API key to page context
- viewname = 'corpus:document-transcribe'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.DocumentTranscriptionText(**kwargs)[source]¶
Return transcription as plain text for download
- viewname = 'corpus:document-transcription-text'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.RelatedDocumentView(**kwargs)[source]¶
List of Document objects that are related to a specific Document (e.g., by occurring on the same shelfmark).
- viewname = 'corpus:related-documents'¶
bound name of this view, for use in generating absolute url for redirect
- class geniza.corpus.views.SolrDateRangeMixin[source]¶
Mixin for solr-based views with start and end date fields to get the full range of dates across the solr queryset.
- class geniza.corpus.views.TagMerge(**kwargs)[source]¶
Class-based view for merging tags, closely adapted from DocumentMerge.
- form_class¶
alias of TagMergeForm
- geniza.corpus.views.old_pgp_edition(editions)[source]¶
output footnote and source information in a format similar to old pgp metadata editor/editions.
- geniza.corpus.views.old_pgp_tabulate_data(queryset)[source]¶
Takes a Document queryset and yields rows of data for serialization as csv in pgp_metadata_for_old_site()
manage commands¶
- class geniza.corpus.management.commands.add_fragment_urls.Command(*args, **options)[source]¶
Takes a CSV of shelfmarks and view URLs and/or IIIF URLs, and updates the corresponding Fragment records in the database with those URLs. Expects CSV headers ‘shelfmark’ and one or both of ‘url’ and ‘iiif_url’.
- add_fragment_urls(row)[source]¶
add view and iiif urls to fragment and save if a match is found for the shelfmark
- handle(*args, **options)[source]¶
The actual logic of the command. Subclasses must implement this method.
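The CSV layout expected by add_fragment_urls can be checked before any records are touched. The sketch below validates the documented headers with the standard csv module; the function name read_fragment_url_rows, the error messages, and the sample shelfmark and URL are hypothetical and not taken from the command itself.

```python
import csv
import io


def read_fragment_url_rows(csv_text):
    """Return rows from a CSV that has a 'shelfmark' column and at least
    one of 'url' and 'iiif_url' (hypothetical helper for illustration)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = set(reader.fieldnames or [])
    if "shelfmark" not in headers:
        raise ValueError("CSV must include a 'shelfmark' column")
    if not headers & {"url", "iiif_url"}:
        raise ValueError("CSV must include 'url' and/or 'iiif_url'")
    return list(reader)


# placeholder shelfmark and URL, for illustration only
sample_urls = "shelfmark,iiif_url\nSample MS 1,https://example.com/iiif/manifest\n"
print(read_fragment_url_rows(sample_urls))
```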
Importing IIIF manifests to be cached in the database.
- class geniza.corpus.management.commands.import_manifests.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Import IIIF manifests into the local database.
Script to consolidate redundant or duplicate document records. The script has two modes:
Report mode looks for merge candidates based on duplicate shelfmark combinations, document type, and descriptions. To generate a report of potential merges and actions to be taken:
python manage.py merge_documents report
Merge mode takes a CSV file in the same format generated by the report and merges documents as specified. There should be one row for each document that is part of any group of documents to be merged. Required fields are:
group id: unique identifier for each set of documents to be merged
action: must be MERGE to merge documents; if not, rows will be ignored
pgpid: document PGPID
role: “primary” for the main document in each group
Example use:
python manage.py merge_documents merge join-documents.csv
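The required fields listed above suggest how a merge CSV is read: rows are collected per group, and rows whose action is not MERGE are skipped. The sketch below illustrates that with the standard library only. The exact header spellings ("group id", "action", "pgpid", "role") follow the field list above but are an assumption about the file format, and group_merge_rows is a hypothetical name.

```python
import csv
import io
from collections import defaultdict


def group_merge_rows(csv_text):
    """Group CSV rows by 'group id', keeping only rows whose action is MERGE."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["action"] != "MERGE":
            continue  # rows with any other action are ignored
        groups[row["group id"]].append(row)
    return dict(groups)


# made-up sample data: group 1 is merged, group 2 is skipped
sample = """group id,action,pgpid,role
1,MERGE,100,primary
1,MERGE,101,
2,SKIP,200,primary
"""
groups = group_merge_rows(sample)
print({gid: [r["pgpid"] for r in rows] for gid, rows in groups.items()})
```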
- class geniza.corpus.management.commands.merge_documents.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Merge documents that are variations of the same joins, based on shelfmark, document type, and description
- get_merge_candidates()[source]¶
identify merge candidates from the database. Looks for documents associated with multiple fragments, and then groups documents by combination of sorted shelfmarks and document type. Returns a dictionary of candidates. Key is sorted shelfmark + type, value is list of documents in that group.
- group_merge_candidates(joins)[source]¶
process candidates identified in get_merge_candidates() to determine which ones should be merged
- handle(*args, **options)[source]¶
The actual logic of the command. Subclasses must implement this method.
- merge_group(group_id, group)[source]¶
Takes a group identifier and a list of dicts for the group; there should be one primary record (role = primary) and one or more non-primary records. Each entry should have a pgpid; the status from the primary record will be used as merge rationale. Will find and merge the documents if possible.
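The grouping described for get_merge_candidates() (documents on multiple fragments, keyed by sorted shelfmarks plus document type) can be sketched with plain dictionaries. This is only an illustration: the real method works on database records, and the dictionary keys, the key format, and the name candidate_groups are hypothetical.

```python
from collections import defaultdict


def candidate_groups(documents):
    """Group documents on multiple fragments by sorted shelfmarks + type.

    Each document is a dict with 'pgpid', 'shelfmarks', and 'type' keys
    (an assumed shape, for illustration).
    """
    groups = defaultdict(list)
    for doc in documents:
        if len(doc["shelfmarks"]) < 2:
            continue  # only documents associated with multiple fragments
        # sorting makes "A + B" and "B + A" fall into the same group
        key = " + ".join(sorted(doc["shelfmarks"])) + " / " + doc["type"]
        groups[key].append(doc["pgpid"])
    return dict(groups)


docs = [
    {"pgpid": 1, "shelfmarks": ["MS B", "MS A"], "type": "Letter"},
    {"pgpid": 2, "shelfmarks": ["MS A", "MS B"], "type": "Letter"},
    {"pgpid": 3, "shelfmarks": ["MS A"], "type": "Letter"},
]
print(candidate_groups(docs))
```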
- class geniza.corpus.management.commands.convert_dates.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
Report on or update historical date conversions for current data
- clean_standard_dates()[source]¶
Find documents with standardized dates that are set but don’t match the validation pattern and correct the ones that can be fixed.
- handle(*args, **options)[source]¶
The actual logic of the command. Subclasses must implement this method.
- class geniza.corpus.management.commands.generate_fixtures.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
- class geniza.corpus.management.commands.export_metadata.Command(stdout=None, stderr=None, no_color=False, force_color=False)[source]¶
- class geniza.corpus.management.commands.export_metadata.MetadataExportRepo(local_path=None, remote_url=None, print_func=None, progress=True)[source]¶
Utility class with functionality for generating metadata exports and committing to git.
- default_commit_msg = 'Automated metadata export from PGP'¶
default commit message
- get_commit_message(modifying_users=None, msg=None)[source]¶
Construct a commit message. Uses the default commit with optional addendum specified by msg parameter, constructs a co-author commit if there are any modifying users, and combines with the base commit message.
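A rough sketch of how such a message could be assembled, assuming co-authors are recorded with the Co-authored-by trailer convention used by git hosting services. The function name build_commit_message, the way the addendum is joined to the default message, and the (name, email) input shape are all assumptions; the actual method may format things differently.

```python
DEFAULT_COMMIT_MSG = "Automated metadata export from PGP"


def build_commit_message(coauthors=None, msg=None):
    """Combine the default message, an optional addendum, and co-author trailers."""
    lines = [DEFAULT_COMMIT_MSG]
    if msg:
        # hypothetical addendum format: appended after a colon
        lines[0] = f"{DEFAULT_COMMIT_MSG}: {msg}"
    if coauthors:
        lines.append("")  # blank line separates the summary from trailers
        for name, email in coauthors:
            lines.append(f"Co-authored-by: {name} <{email}>")
    return "\n".join(lines)


print(build_commit_message(coauthors=[("A. Editor", "a.editor@example.com")]))
```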