Technical Design Document

This serves as a living document to record the current design decisions for this project. The expectation is not a comprehensive listing of issues or tasks, but rather a recording of the key structural and process elements, as well as primary deliverables, that will guide development.

The primary audience for this document is internal to the development team.

Note: This document was originally written for planning. It will be updated during development, but it may be out of date.


remarx

Project Title: Citing Marx: Die Neue Zeit, 1891-1918

RSE Lead: Rebecca Koeser

RSE Team Members: Laure Thompson, Hao Tan, Mary Naydan, Jeri Wieringa

Faculty PI: Edward Baring

Development Start Date: 2025-08-04

Target Release Date: Fall 2025


Statement of Purpose

Primary goals for the software

The primary goal of this software tool is to provide a simple user interface that enables researchers to upload two sets of German-language texts to be compared and to download a dataset of similar passages found between them.

The primary use case for this tool is to identify direct quotes of Karl Marx's Manifest der Kommunistischen Partei (Communist Manifesto) and the first volume of Das Kapital (Kapital) within a subset of Die Neue Zeit (DNZ) articles.

Task Identification and Definition

| Task | Definition |
| --- | --- |
| Quotation Identification | Determine which passages of text quote a source text and identify the passages that are quoted |

Out of Scope

  • Non-German quote detection
  • Cross-lingual quote detection is out of scope for this phase, but the tool will be built with an eye towards future expansion/support of other languages
  • Paraphrase / indirect quote detection
  • Attribution, citation, and named-reference detection

Requirements

Functional Requirements

What the software must do

  • Reads plaintext files corresponding to original (i.e., source of quotes) and reuse (i.e., where quotations occur) texts
  • Builds a quotation corpus that operates at a sentence-level granularity

Non-Functional Requirements

How the software must behave

  • Assumes input content is written in German (will be built with an eye towards supporting other languages in future, but not in this version)
  • Is optimized for the DNZ-Marx use case
  • Can be adapted to other monolingual and cross-lingual settings, but with no quality guarantees
  • Compatible with any German original (e.g., other texts written by Marx) and reuse texts (e.g., earlier/later issues of DNZ, other journals)
  • Uses sentence embeddings as a core data representation
  • For sentence embedding construction, uses a preexisting model available through sentence-transformers or huggingface
  • Developed and tested on Python 3.12
  • If this software is run locally, it should be compatible with macOS, Linux, and Windows operating systems, and with both x86 and ARM architectures.

System Architecture

High-level description of the system and its components

System Architecture

The system is composed of two separate software programs: the sentence corpus builder and the core pipeline.

The sentence corpus builder is a program for creating the input CSV files for the core pipeline. It takes as input some number of texts and outputs a CSV file containing sentence-level text and metadata for these input texts. This program will have custom solutions for the input formats of the primary use case (TEI-XML, ALTO-XML), as well as plaintext for general use.

The core pipeline has two key inputs: reuse and original sentence corpora. Each of these inputs is one or more CSV files where each row corresponds to a sentence from a reuse or original text (i.e., as produced by the sentence corpus builder). Ultimately, the system will output a CSV file containing the identified quotations, which will be called the quotes corpus.

The core pipeline can be broken down into two key components:

  1. Sentence-Level Quote Detection
  2. Quote Compilation

Sentence-Level Quote Detection. This component takes an original and reuse sentence corpus as input and outputs a corpus of sentence-pairs corresponding to quotes.

Quote Compilation. In this step, the quote sentence pairs from the last step are refined into our final quotes corpus. Since quotes can be more than one sentence long, the primary goal of this component is to merge quote sentence-pairs that correspond to the same multi-sentence quote.

  • Minimum functionality: merge passages with sequential pairs of sentences from both corpora

Infrastructure and Deployment

Where the software will run; includes all different environments expected for development, staging/testing, and production, and any additional resources needed or expected (e.g., new VMs, OnDemand developer access, HPC, TigerData, etc).

Environments:

  • Development: RSE local machines where possible; della as needed
  • Staging: della
    • To ensure software works on a common environment, we plan to use della as our staging environment for acceptance testing & quality assurance testing
  • Production: researcher machines (via installable python package, using uv and uv run if possible) or della via a shared Open OnDemand app (to be determined based on scale of data and compute requirements)

Resources:

  • TigerData project partition
  • Della access
  • OnDemand developer access

Component Details

Description of each component and its functionality

Sentence Corpus Builder

This component is a python program that extracts sentences from some number of input texts and compiles them into a single sentence-level corpus CSV file. It can be broken into two core stages: text extraction and sentence segmentation.

Text extraction will be customized to each of the supported input types:

  • MEGA digital TEI-XML files
  • Communist Manifesto TXT files as produced by a custom, one-time script converting HTML files to plaintext
  • Transcription ALTO-XML files as produced by the research team’s custom text transcription and segmentation pipeline
  • Plaintext files (TXT)

Each of these solutions will involve extracting text from the corresponding file type so that it can subsequently be passed to sentence segmentation.

Sentence segmentation will rely on an off-the-shelf NLP package that supports sentence segmentation for German. We expect to use stanza for this purpose, but will have the research team review segmentation quality to determine whether we need to select a different tool. This step expects the input texts to be in German and may perform poorly for other languages.
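
As an illustration, a minimal sketch of sentence segmentation with stanza's German pipeline (the sample text is invented and the choice of stanza is provisional, per the note above):

```python
import stanza

# One-time model download, then build a German pipeline with tokenization only
# (stanza performs sentence segmentation as part of tokenization).
stanza.download("de", processors="tokenize")
nlp = stanza.Pipeline("de", processors="tokenize")

text = "Dies ist ein Beispielsatz. Hier folgt ein zweiter Satz."
doc = nlp(text)
sentences = [sentence.text for sentence in doc.sentences]
print(sentences)
```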

This program will require unique input filenames to ensure sentences can be linked back to their original input file.

This component must also provide some linking mechanism for the passages of the output file to be linked back to their original document locations (e.g., starting character indices, line numbers). This component may also extract additional metadata such as article title or page number.

Sentence-Level Quote Detection

This component is a python program that identifies quote sentence-pairs for a given pair of input reuse and original sentence corpora. It involves two core steps. First, sentence embeddings will be constructed for the sentences in each corpus; these embeddings will be generated by a pretrained model available via SBERT or HuggingFace. Then, likely quotes will be identified using embedding similarity via an off-the-shelf approximate nearest neighbors library. Finally, the program will output a quote sentence-pair corpus CSV file.

This component expects the sentences to be in German, but other multilingual or monolingual methods could easily be swapped in. In development, we expect to save the sentence embeddings as a (pickled) numpy binary file. The final pipeline may save these embeddings as an additional intermediate file.
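
A minimal sketch of this step, assuming a multilingual sentence-transformers model (the model name below is a placeholder, not a final selection) and using scikit-learn's exact cosine nearest neighbors as a stand-in for whichever approximate nearest neighbors library is ultimately chosen:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Placeholder model; the actual pretrained model will be selected during development.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

original_sentences = ["Beispielsatz aus dem Originaltext."]
reuse_sentences = ["Ein Beispielsatz aus dem Originaltext, leicht veraendert zitiert."]

original_emb = model.encode(original_sentences, normalize_embeddings=True)
reuse_emb = model.encode(reuse_sentences, normalize_embeddings=True)

# During development, embeddings can be saved as numpy binary files for reuse.
np.save("original_embeddings.npy", original_emb)

# Stand-in for an approximate nearest neighbors index: exact cosine neighbors.
index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(original_emb)
distances, indices = index.kneighbors(reuse_emb)

threshold = 0.85  # placeholder cutoff, to be tuned against the evaluation dataset
for reuse_idx, (dist, orig_idx) in enumerate(zip(distances[:, 0], indices[:, 0])):
    score = 1 - dist  # cosine similarity
    if score >= threshold:
        print(f"reuse sentence {reuse_idx} matches original sentence {orig_idx} ({score:.3f})")
```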

Quote Compilation

This component is a python program that will transform the quote sentence-pairs corpus into a more general quote corpus. Sentence-pairs will be merged into a single multi-sentence quote when they are sequential or near-sequential. Two sentence pairs are sequential when both the original and reuse sentences of one pair directly precede the original and reuse sentences, respectively, of the other pair.

Additionally, if the quote sentence-pair corpus does not include original and reuse sentence metadata, then this metadata will be linked in via the sentence corpora produced by the Sentence Corpus Builder component.

The initial implementation will be based on sequential sentences in both corpora; this will be refined based on project team testing and feedback, as time and priorities allow. Potential revisions include skipped sentences and alternate ordering within some context window (e.g., for quotations that re-order content from the same paragraph).
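
A simplified sketch of the minimum merging behavior, using the field names from the Quote Sentence Pairs data design and treating pairs as mergeable only when both sentence indices advance by exactly one; this is illustrative, not the final implementation:

```python
import pandas as pd


def merge_sequential_pairs(pairs: pd.DataFrame) -> list[dict]:
    """Merge quote sentence-pairs into multi-sentence quotes.

    Two pairs are merged when both the original and the reuse sentence
    indices increase by exactly one from one pair to the next.
    """
    pairs = pairs.sort_values(["reuse_file", "reuse_sent_index"]).reset_index(drop=True)
    quotes: list[dict] = []
    current: dict | None = None

    for row in pairs.itertuples():
        if (
            current is not None
            and row.reuse_file == current["reuse_file"]
            and row.original_file == current["original_file"]
            and row.reuse_sent_index == current["reuse_end"] + 1
            and row.original_sent_index == current["original_end"] + 1
        ):
            # Sequential in both corpora: extend the current quote.
            current["reuse_text"] += " " + row.reuse_text
            current["original_text"] += " " + row.original_text
            current["reuse_end"] = row.reuse_sent_index
            current["original_end"] = row.original_sent_index
            current["scores"].append(row.match_score)
        else:
            if current is not None:
                quotes.append(current)
            current = {
                "id": row.reuse_id,  # starting sentence id
                "reuse_file": row.reuse_file,
                "original_file": row.original_file,
                "reuse_text": row.reuse_text,
                "original_text": row.original_text,
                "reuse_end": row.reuse_sent_index,
                "original_end": row.original_sent_index,
                "scores": [row.match_score],
            }
    if current is not None:
        quotes.append(current)

    # The match_score of a merged quote is the mean of its sentence-level scores.
    for quote in quotes:
        quote["match_score"] = sum(quote["scores"]) / len(quote["scores"])
        del quote["scores"]
    return quotes
```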

Application Interface

This component is a Marimo notebook designed to run in application mode, which will provide a graphical user interface to the remarx software. This interface will allow users to run both the sentence corpus builder and the core pipeline. This notebook will allow users to select and upload original and reuse texts and sentence corpora. The notebook will contain minimal code and primarily call methods within the remarx package. Users will have an option to save or download the output to a file location of their choice.
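
An illustrative fragment of what the upload portion of the notebook might look like (cell structure simplified; the real notebook will follow marimo's generated file layout and call methods in the remarx package):

```python
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo

    # File pickers for input texts / sentence corpora; uploaded file names and
    # contents are available through each element's .value attribute.
    original_upload = mo.ui.file(label="Original texts or sentence corpus")
    reuse_upload = mo.ui.file(label="Reuse texts or sentence corpus")
    mo.vstack([original_upload, reuse_upload])
    return original_upload, reuse_upload


if __name__ == "__main__":
    app.run()
```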

Depending on whether this notebook is run locally or on della via an Open OnDemand app, the pipeline’s intermediate files will be stored on either the user’s local machine or in a scratch folder on della. The software will provide a simple way to configure paths for downloaded and generated models, with sensible defaults for common scenarios.
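
A minimal sketch of how path configuration might work; the environment variable name and default location are assumptions for illustration:

```python
import os
from pathlib import Path


def get_model_cache_dir() -> Path:
    """Directory for downloaded and generated models.

    Defaults to a per-user cache directory; can be overridden (e.g., to point
    at scratch space on della) via a hypothetical REMARX_MODEL_DIR variable.
    """
    default = Path.home() / ".cache" / "remarx"
    path = Path(os.environ.get("REMARX_MODEL_DIR", default))
    path.mkdir(parents=True, exist_ok=True)
    return path
```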

The initial version of this application will not automatically clean up files generated by the processing, but will report locations and sizes (e.g., to allow manual cleanup). Future refinements may include clean up on user request or automated cleanup, depending on project team feedback.

POST: Final Citation Corpus Compilation

This component is outside of the core software pipeline. This will link the identified quotes to any additional required metadata from the original texts (Kapital and Communist Manifesto) and the reuse texts (DNZ). This could include the following metadata:

  • Both: Starting page number
  • Both: Content type (e.g., text, footnote)
  • Original: Marx title
  • DNZ: Journal name
  • DNZ: Volume issue
  • DNZ: Year of issue
  • DNZ: Article title
  • DNZ: Article author (optional, where possible)

We aim for data compilation to be done without custom development effort, and will look for a sufficient solution for this phase of the project, which will empower research team members to review and work with the data prior to export for publication. We have requested a no-code database interface from PUL (#princeton_ansible/6373); other alternatives are Google Sheets, Excel template, or AirTable.

Data

Data Design

Description of the data structures used by the software, with expected or required fields

Original/Reuse Sentence Corpus

A CSV file with each row corresponding to a single sentence in either an original or reuse text. This corpus may include additional data (e.g., document metadata for citation); this will be ignored by the system but preserved and passed through in the generated corpus.

| Field Name | Description | Type | Required / Optional | Reason |
| --- | --- | --- | --- | --- |
| id | Unique identifier for each sentence | String | Required | For tracking, reference |
| text | Sentence text content | String | Required | For quote detection |
| file | Corresponding document filename | String | Required | For tracking, reference, metadata linking |
| sent_index | Sentence-level index within document (0-based indexing) | Integer | Required | For identifying sequential sentences for quote compilation |
| section_type | What text section the sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging |
| multi_page | Indicates whether the sentence spans multiple pages | Boolean | Required | For reference and debugging |

The sentence corpora produced by the Sentence Corpus Builder program must include one or more fields that link the sentence back to its location within the corresponding input document. This will vary by the input format (e.g., MEGAdigital: page and line number; plaintext: starting character index).
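
For illustration, a hypothetical excerpt of a sentence corpus CSV (all values are invented; the page and line columns stand in for the location-linking fields that a MEGAdigital input might produce):

```
id,text,file,sent_index,section_type,multi_page,page,line
kapital_main_0000,"Beispielsatz aus dem Haupttext.",kapital_main.txt,0,text,False,17,3
kapital_main_0001,"Ein zweiter Satz, der sich über zwei Seiten erstreckt.",kapital_main.txt,1,text,True,17,41
```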

Quote Sentence Pairs

A CSV file with each row corresponding to an original-reuse sentence pair that has been identified as a quote.

| Field Name | Description | Type | Required / Optional | Reason |
| --- | --- | --- | --- | --- |
| match_score | Some match quality score | Float | Required | For development, evaluation |
| reuse_id | ID of the reuse text sentence | String | Required | For tracking, reference |
| reuse_text | Text of the reuse sentence | String | Required | For reference |
| reuse_file | Reuse document filename | String | Required | For tracking, reference |
| reuse_sent_index | Sentence-level index within reuse document | Integer | Required | For identifying sequential sentences for quote compilation |
| original_id | ID of the original sentence | String | Required | For tracking, reference, metadata linking |
| original_text | Text of the original sentence | String | Required | For reference |
| original_file | Original document filename | String | Required | For tracking, reference, metadata linking |
| original_sent_index | Sentence-level index within original document | Integer | Required | For identifying sequential sentences for quote compilation |

Additional metadata for the corresponding reuse and original sentences will be included depending on the contents of the input sentence corpora. Each corpus's additional metadata fields will follow its sentence index field (i.e., reuse_sent_index, original_sent_index).

Quotes Corpus

A CSV file with each row corresponding to a text passage containing a quotation. These passages operate at sentence-level granularity; in other words, a passage corresponds to one or more sentences.

| Field Name | Description | Type | Required / Optional | Reason |
| --- | --- | --- | --- | --- |
| id | Unique identifier for quote. Starting sentence id for multi-sentence passages. | String | Required | For tracking, reference |
| match_score | Some match quality score. For multi-sentence passages this will be the mean of the sentence-level scores. | Float | Required | For development, evaluation, reference |
| reuse_id | ID of the reuse sentence | String | Required | For tracking, reference |
| reuse_doc | Filename of reuse document | String | Required | For tracking, reference |
| reuse_text | Text of the quote as it appears in reuse document | String | Required | For reference, evaluation |
| original_id | ID of the original sentence | String | Required | For tracking, reference |
| original_doc | Filename of original document | String | Required | For tracking, reference |
| original_text | Text of the quote as it appears in original document | String | Required | For reference, evaluation |
| original_section_type | What text section the original sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging |
| reuse_section_type | What text section the reuse sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging |

Any additional metadata included with the input sentences will be passed through and prefixed based on the input corpus (reuse/original).

Final Citation Corpus

This CSV file is a tailored version of the Quotes Corpus produced by the core software pipeline where relevant page- and document-level metadata has been added.

| Field Name | Description | Type | Required / Optional |
| --- | --- | --- | --- |
| quote_id | Unique identifier for quote | String | Required |
| marx_quote | Text of direct quote from Marx text | String | Required |
| article_quote | Text of direct quote from the article text | String | Required |
| marx_title | Title of Marx text | String | Required |
| marx_page | Starting page of quote in Marx text | String | Required |
| article_title | Title of article | String | Required |
| article_author | Name of article’s author | String | Optional |
| journal | Title of journal | String | Required |
| volume | Journal volume | String | Required |
| issue | Journal issue | String | Required |
| year | Year of article/journal publication | String | Required |
| certainty | A score (0-1) indicating quote certainty | Float | Required |

Note: the article and marx naming convention could be altered to something more general. The research team should confirm the desired form and field names and check whether any additional fields should be added (e.g., page vs. footnote).

Work Plan for Additional Data Work (if needed)

Describe the data work that needs to be done and include proposed deadlines.

Kapital Texts

The text of Kapital needs to be extracted from the MEGA TEI XML file. To properly handle sentences spanning page boundaries, the main text and footnotes will be split into separate text files.
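
A rough sketch of this extraction, assuming footnotes are encoded as tei:note elements inside the TEI body; the MEGAdigital markup may require different XPaths and additional whitespace handling:

```python
from pathlib import Path

from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_main_and_footnotes(tei_path: Path) -> tuple[str, str]:
    """Split a TEI document into main text and footnote text."""
    tree = etree.parse(str(tei_path))
    body = tree.find(".//tei:body", TEI_NS)

    # Collect footnote text, then remove the notes so they do not
    # interrupt sentences in the main text.
    footnotes = []
    for note in body.findall(".//tei:note", TEI_NS):
        footnotes.append(" ".join(note.itertext()))
        note.getparent().remove(note)

    main_text = " ".join(body.itertext())
    return main_text, "\n".join(footnotes)
```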

Communist Manifesto Text

The text of the Communist Manifesto may need to be transformed into plaintext files. The scope of this effort is dependent on how the text is being acquired.

DNZ Text

The text of DNZ articles must be extracted from the ALTO-XML transcription files. For simplicity, each article should be saved as a separate text file. If an article has footnotes, the footnotes will be split into a separate text file.
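
A rough sketch of plain-text extraction from ALTO-XML; this ignores the research team's custom segment labels (e.g., for footnotes), which the real extraction will need in order to split articles and footnotes:

```python
from pathlib import Path

from lxml import etree


def extract_alto_text(alto_path: Path) -> str:
    """Extract plain text from an ALTO-XML transcription file.

    ALTO stores recognized text in String elements (CONTENT attribute)
    grouped into TextLine elements; the namespace URI varies by ALTO
    version, so match on local element names with a namespace wildcard.
    """
    tree = etree.parse(str(alto_path))
    lines = []
    for text_line in tree.iter("{*}TextLine"):
        words = [string.get("CONTENT", "") for string in text_line.iter("{*}String")]
        lines.append(" ".join(words))
    return "\n".join(lines)
```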

Evaluation Dataset(s)

A dataset of identified direct quote annotations from Kapital and/or the Communist Manifesto must be constructed. This will be used as the evaluation reference during development to measure the performance of the software pipeline, and to provide an estimate for the expected scale of the dataset.

Derivative versions of the text reuse data will be created as needed to compare passages identified as text reuse with sentence level matches.

Data Management Plan

Datasets

Describe input and output data formats

| Name | Description of how generated | Data type | Format | Stability |
| --- | --- | --- | --- | --- |
| Kapital XML | MEGAdigital’s MEGA II/5 digital copy. This text corresponds to the first edition of the first volume of Das Kapital (Berlin: Dietz-Verlag 1983) | Input | TEI XML | |
| Communist Manifesto text | For this phase, we will use content from this online edition | Input | HTML | |
| DNZ transcriptions | Transcriptions of DNZ volumes with specialized text segment labels. These were created using eScriptorium and YALTAI. | Input | ALTO XML | |
| Kapital sentences | Corpora of sentences from Kapital. At least two: main text, footnotes. | Input | CSV | |
| Communist Manifesto sentences | Corpora of sentences from the Communist Manifesto. | Input | CSV | |
| DNZ article sentences | Corpora of sentences from DNZ articles. Articles may correspond to multiple corpora. | Input | CSV | |
| Kapital embeddings | Sentence embeddings for Kapital. For development only. | Intermediate | .npy | |
| Communist Manifesto embeddings | Sentence embeddings for the Communist Manifesto. For development. | Intermediate | .npy | |
| DNZ embeddings | Sentence embeddings for DNZ articles. For development. | Intermediate | .npy | |
| Quote Sentence Pairs | Identified quote sentence-pairs. Primarily for development. | Intermediate | CSV | |
| Quotes Corpus | Identified quotes corpus. | Intermediate | CSV | |
| Final Citation Corpus | The final corpus of identified citations. | Output | CSV | |

Access

Where will data be stored; back up plan?

GitHub. If the data is small enough, a public GitHub repository will be used to store the final dataset.

TigerData. Copies of all datasets will be stored in TigerData. Assuming GitHub is an option, this will be a copy of the GitHub data repo. If the data is too large, TigerData will be the primary data source for development. Directories should be named meaningfully and include a brief readme documenting contents and how data was generated or acquired. We will use GitHub issues for light-weight review of data structuring and documentation.

GoogleDrive. The initial text inputs will be stored in the project’s GoogleDrive folder. Any intermediate files that should be reviewed by the research team will also be uploaded to this folder.

Archiving and Publication

What is the plan for publication or archiving of the data? What license will be applied?

The final dataset will be published on Zenodo and/or the Princeton Data Commons depending on the terms of use / copyright status of the data. If possible, an additional copy will be published on GitHub.

Q: What license will be applied for the data?

Interface Functionality

Description of user-facing functionality directly accessible through the user interface and its components

Where possible, there will be two types of interfaces for two different types of users.

Notebook Application

The primary user interface is a marimo notebook. This interface will provide a graphical interface for selecting and uploading input text files and downloading the system’s output corpus file.

Command Line

Technical users and the development team will have an alternative interface where the software can be run directly from the terminal.

Testing Plan

Unit Testing

Unit testing will be used throughout the development process. There must be unit tests for all python source code excluding marimo notebooks. The target code coverage for python source code is 95%; the target coverage for python test code is 100%.

Staging

During development, della will be used as a staging environment, as needed. This ensures a consistent shared environment, as opposed to each developer’s local environment.

Ongoing Acceptance Testing

Acceptance testing will be conducted throughout development. Each user-facing feature must have a user story and must undergo acceptance testing before an issue is considered complete. Acceptance testing should normally be done on features that have been implemented, unit tested, gone through code review, and merged to the develop branch. Acceptance testing is expected to occur every iteration where possible and otherwise as soon as possible so that software updates can be released each iteration.

Application (i.e., notebook) testing will ideally run on the target environment. We will use molab initially if helpful while getting started with development. Then, depending on resource requirements, testing will shift either to the tester’s local machine or to della via Open OnDemand.

Development Conventions

Software Releases

The goal is to complete a software release each iteration, by either the project team meeting or the retro. This high frequency of releases will make the changelog more meaningful. The RSEs responsible for creating and reviewing each release will be assigned evenly across the team. Ideally, no RSE will be responsible for two consecutive releases.

The tool Repo Review should be checked before making a release.

Software Publishing

The software’s python source code should be packaged and published on PyPI iteratively, ideally with each release. We will add a GitHub workflow to automate publishing, triggered with each software release.

Python Development

Package Management

uv will be used for python package management. Any configuration settings will be added to the pyproject.toml.

Type Annotations

All python source code (excluding notebooks and test code) must use type annotations. This will be enforced by Ruff’s linter.

Formatting

We will use Ruff’s formatter to enforce a consistent code format. The formatter will be applied via the pre-commit hooks and verified by the GitHub Actions formatting check described below.

Linting

We will use Ruff’s linter using the following rule sets:

| Rule(s) | Name | Reason |
| --- | --- | --- |
| F | pyflakes | Ruff default |
| E4, E7, E9 | pycodestyle subset | Ruff default; subset compatible with Ruff’s formatter |
| I | isort | Import sorting / organization |
| ANN | flake8-annotations | Checks type annotations |
| PTH | flake8-use-pathlib | Ensure pathlib usage instead of os.path |
| B | flake8-bugbear | Flags likely bugs and design problems |
| D | pydocstyle subset | Checks docstrings |
| PERF | perflint | Checks for some performance anti-patterns |
| SIM | flake8-simplify | Looks for code simplification |
| C4 | flake8-comprehensions | Checks comprehension patterns |
| RUF | ruff-specific rules | Potentially helpful? |
| NPY | numpy-specific rules | For checking numpy usage |
| UP | pyupgrade | Automatically update syntax to newer form |

Notebook Development

We will use marimo for any notebook development. The application notebook will be included in the source code package and made available as a package command-line executable. If needed, analysis notebooks will be organized in a top-level “notebooks” folder. Any meaningful code (i.e., methods) for all notebooks should be located within the python source code, not the notebook.

Notebooks will be used both as the application interface and as needed for data analysis for development.

The Marimo app interface notebook will be tested if possible; data analysis notebooks will be excluded from code coverage.

Documentation

Documentation must be updated before a pull request is merged to the development branch. Documentation will be generated using mkdocs. The linter prettier will be used for linting all documentation markdown files.

We will implement automated documentation checks on pull requests to the develop and main branches (documentation coverage, if supported by mkdocs; otherwise, confirming that documentation has been updated).

Precommit Hooks

We will include precommit hooks for the following actions:

  • Run Ruff’s linter and formatter
  • Run prettier linter on markdown files
  • Run codespell to check for typos within any text

GitHub Actions

Our repository will include the following GitHub Actions:

  • Check all unit tests pass
  • 100% code coverage for test code
  • 95% coverage for source code
  • Ruff formatting check: to prevent the occasional commit with improper formatting
  • Check mkdocs documentation coverage
  • Check change log has been updated
  • Python package publication on PyPI (triggered by new release on GitHub)

Final Acceptance Criteria

Define the requirements the deliverable must meet to be considered complete and accepted. These criteria should be testable.

Functionality

  • Using the evaluation dataset as a reference, the software must identify at least 90% of the expected quotations (i.e., 90% recall; see the sketch below).
    • Tolerance for false positives will be determined in conversation with the research team after reviewing preliminary results. If necessary, we will decrease the recall threshold to avoid too many false positives in the final dataset.
  • Faculty collaborator can successfully run software without assistance.
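
A minimal sketch of the recall check referenced above, using invented quote identifiers for illustration only:

```python
def recall(identified: set[str], expected: set[str]) -> float:
    """Fraction of expected quotations that were identified by the pipeline."""
    return len(identified & expected) / len(expected)


# Invented identifiers for illustration only.
expected_quotes = {"q1", "q2", "q3", "q4", "q5"}
identified_quotes = {"q1", "q2", "q3", "q5", "q9"}  # q9 would be a false positive

print(recall(identified_quotes, expected_quotes))  # 0.8, below the 0.9 target
```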

Sign Offs

Reviewed by: Jeri Wieringa

Signatures / Date

References

https://zenodo.org/records/14861082