Technical Design Document¶
This serves as a living document to record the current design decisions for this project. The expectation is not a comprehensive listing of issues or tasks, but rather a recording of the key structural and process elements, as well as primary deliverables, that will guide development.
The primary audience for this document is internal to the development team.
Note: This document was originally written for planning. It will be updated during development, but it may be out of date.
remarx¶
Project Title: Citing Marx: Die Neue Zeit, 1891-1918
RSE Lead: Rebecca Koeser
RSE Team Members: Laure Thompson, Hao Tan, Mary Naydan, Jeri Wieringa
Faculty PI: Edward Baring
Development Start Date: 2025-08-04
Target Release Date: Fall 2025
Statement of Purpose¶
Primary goals for the software
The primary goal of this software tool is to provide a simple user interface that enables researchers to upload two sets of German-language texts for comparison and to download a dataset of similar passages identified between them.
The primary use case for this tool is to identify direct quotes of Karl Marx's Manifest der Kommunistischen Partei (Communist Manifesto) and the first volume of Das Kapital (Kapital) within a subset of Die Neue Zeit (DNZ) articles.
Task Identification and Definition¶
| Task | Definition | 
|---|---|
| Quotation Identification | Determine which passages of text quote a source text and identify the passages that are quoted | 
Out of Scope¶
- Non-German quote detection
    - Cross-lingual quote detection is out of scope for this phase, but the tool will be built with an eye towards future expansion/support of other languages
- Paraphrase / indirect quote detection
- Attribution, citation, and named-reference detection
 
Requirements¶
Functional Requirements¶
What the software must do
- Reads plaintext files corresponding to original (i.e., source of quotes) and reuse (i.e., where quotations occur) texts
 - Builds a quotation corpus that operates at a sentence-level granularity
 
Non-Functional Requirements¶
How the software must behave
- Assumes input content is written in German (will be built with an eye towards supporting other languages in future, but not in this version)
 - Is optimized for the DNZ-Marx use case
 - Can be adapted to other monolingual and crosslingual settings, but with no quality guarantees
 - Compatible with any German original (e.g., other texts written by Marx) and reuse texts (e.g., earlier/later issues of DNZ, other journals)
 - Uses sentence embeddings as a core data representation
 - For sentence embedding construction, uses a preexisting model available through sentence-transformers or huggingface
 - Developed and tested on Python 3.12
 - If run locally, the software should be compatible with macOS, Linux, and Windows operating systems, and with both x86 and ARM architectures.
 
System Architecture¶
High-level description of the system and its components
The system is composed of two separate software programs: the sentence corpus builder and the core pipeline.
The sentence corpus builder is a program for creating the input CSV files for the core pipeline. It takes as input some number of texts and outputs a CSV file containing sentence-level text and metadata for those texts. This program will have custom extraction for the input formats of the primary use case (TEI-XML, ALTO-XML), as well as plaintext for general use.
The core pipeline has two key inputs: reuse and original sentence corpora. Each of these inputs is one or more CSV files in which each row corresponds to a sentence from a reuse or original text (i.e., as produced by the sentence corpus builder). Ultimately, the system will output a CSV file containing the identified quotations, referred to as the quotes corpus.
The core pipeline can be broken down into two key components:
- Sentence-Level Quote Detection
 - Quote Compilation
 
Sentence-Level Quote Detection. This component takes an original and reuse sentence corpus as input and outputs a corpus of sentence-pairs corresponding to quotes.
Quote Compilation. In this step, the quote sentence pairs from the last step are refined into our final quotes corpus. Since quotes can be more than one sentence long, the primary goal of this component is to merge quote sentence-pairs that correspond to the same multi-sentence quote.
- Minimum functionality: merge passages with sequential pairs of sentences from both corpora
 
Infrastructure and Deployment¶
Where the software will run; includes all different environments expected for development, staging/testing, and production, and any additional resources needed or expected (e.g., new VMs, OnDemand developer access, HPC, TigerData, etc).
Environments:
- Development: RSE local machines where possible; della as needed
- Staging: della
    - To ensure the software works on a common environment, we plan to use della as our staging environment for acceptance testing and quality assurance testing
- Production: researcher machines (via installable python package, using uv and uv run if possible) or della via a shared Open OnDemand app (to be determined based on scale of data and compute requirements)
 
Resources:
- TigerData project partition
 - Della access
 - OnDemand developer access
 
Component Details¶
Description of each component and its functionality
Sentence Corpus Builder¶
This component is a python program that extracts sentences from some number of input texts and compiles them into a single sentence-level corpus CSV file. It can be broken into two core stages: text extraction and sentence segmentation.
Text extraction will be customized to each of the supported input types:
- MEGA digital TEI-XML files
 - Communist Manifesto TXT files as produced by a custom, one-time script converting HTML files to plaintext
 - Transcription ALTO-XML files as produced by the research team’s custom text transcription and segmentation pipeline
 - Plaintext files (TXT)
 
Each of these solutions will involve extracting the text from the corresponding file type that can be subsequently passed to sentence segmentation.
Sentence segmentation will rely on an off-the-shelf NLP package that supports sentence segmentation for German. We expect to use stanza for this purpose, but will have the research team review segmentation quality to determine whether we need to select a different tool. This step expects the input texts to be in German and may perform poorly for other languages.
This program will require unique input filenames to ensure sentences can be linked back to their original input file.
This component must also provide a mechanism for linking the sentences in the output file back to their original document locations (e.g., starting character indices, line numbers). This component may also extract additional metadata such as article title or page number.
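As a rough illustration of the segmentation and corpus-writing step, the sketch below assumes stanza for German sentence segmentation; the function name, field set, and char_start linking field are illustrative rather than final.

```python
import csv
from pathlib import Path

import stanza

# Requires the German models: stanza.download("de") must have been run once.
nlp = stanza.Pipeline(lang="de", processors="tokenize")


def build_sentence_corpus(text_files: list[Path], output_csv: Path) -> None:
    """Segment each input text into sentences and write a sentence-level corpus CSV."""
    with output_csv.open("w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(
            out, fieldnames=["id", "text", "file", "sent_index", "char_start"]
        )
        writer.writeheader()
        for path in text_files:
            doc = nlp(path.read_text(encoding="utf-8"))
            for i, sentence in enumerate(doc.sentences):
                writer.writerow(
                    {
                        # Unique id combining the (unique) filename and sentence index
                        "id": f"{path.stem}_{i:04d}",
                        "text": sentence.text,
                        "file": path.name,
                        "sent_index": i,
                        # Link back to the document location (plaintext case)
                        "char_start": sentence.tokens[0].start_char,
                    }
                )
```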
Sentence-Level Quote Detection¶
This component is a python program that identifies quote sentence-pairs for an input reuse and original sentence corpus. It involves two core steps. First, sentence embeddings will be constructed for the sentences in each corpus; these embeddings will be generated by a pretrained model available via SBERT or HuggingFace. Then, likely quotes will be identified using embedding similarity via an off-the-shelf approximate nearest neighbors library. Finally, the program will output a quote sentence-pair corpus CSV file.
This component expects the sentences to be in German, but other multilingual or monolingual methods could easily be swapped in. In development, we expect to save the sentence embeddings as a (pickled) numpy binary file. The final pipeline may save these embeddings as an additional intermediate file.
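A minimal sketch of this approach is shown below, assuming sentence-transformers for embeddings and FAISS as one possible nearest-neighbor library; the model name, the exact (rather than approximate) index, and the similarity threshold are placeholders, not decisions.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


def detect_quote_pairs(
    original_sentences: list[str],
    reuse_sentences: list[str],
    threshold: float = 0.85,  # illustrative cutoff, to be tuned against the evaluation set
) -> list[tuple[int, int, float]]:
    """Return (reuse_index, original_index, score) tuples for likely quote pairs."""
    # Placeholder multilingual model; the actual model will be chosen during development.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Normalized embeddings so that inner product equals cosine similarity.
    original_emb = model.encode(original_sentences, normalize_embeddings=True).astype("float32")
    reuse_emb = model.encode(reuse_sentences, normalize_embeddings=True).astype("float32")

    # Save embeddings for inspection and reuse during development.
    np.save("original_embeddings.npy", original_emb)
    np.save("reuse_embeddings.npy", reuse_emb)

    # Simple exact index, standing in for an approximate nearest-neighbor index.
    index = faiss.IndexFlatIP(original_emb.shape[1])
    index.add(original_emb)

    # For each reuse sentence, retrieve its single nearest original sentence.
    scores, neighbors = index.search(reuse_emb, 1)

    return [
        (i, int(neighbors[i, 0]), float(scores[i, 0]))
        for i in range(len(reuse_sentences))
        if scores[i, 0] >= threshold
    ]
```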
Quote Compilation¶
This component is a python program that will transform the quote sentence-pairs corpus into a more general quote corpus. Sentence-pairs will be merged into a single multi-sentence quote when they are sequential or near-sequential. Two sentence pairs are sequential when both the original and reuse sentences of one pair directly precede the original and reuse sentences, respectively, of the other pair.
Additionally, if the quote sentence-pair corpus does not include original and reuse sentence metadata, then this metadata will be linked in from the sentence corpora produced by the Sentence Corpus Builder component.
The initial implementation will be based on sequential sentences in both corpora; this will be refined based on project team testing and feedback, as time and priorities allow. Potential revisions include skipped sentences and alternate ordering within some context window (e.g., for quotations that re-order content from the same paragraph).
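A minimal sketch of the sequential-pair merge, using the field names from the quote sentence-pair corpus described below; the function and grouping details are illustrative.

```python
def merge_sequential_pairs(pairs: list[dict]) -> list[list[dict]]:
    """Group quote sentence-pairs into multi-sentence quotes.

    Two pairs belong to the same quote when both the original and the reuse
    sentence indices of one pair directly precede those of the next pair.
    """
    # Work document-by-document so indices from different files never mix.
    pairs = sorted(
        pairs,
        key=lambda p: (p["reuse_file"], p["original_file"], p["reuse_sent_index"]),
    )
    quotes: list[list[dict]] = []
    for pair in pairs:
        last = quotes[-1][-1] if quotes else None
        if (
            last is not None
            and pair["reuse_file"] == last["reuse_file"]
            and pair["original_file"] == last["original_file"]
            and pair["reuse_sent_index"] == last["reuse_sent_index"] + 1
            and pair["original_sent_index"] == last["original_sent_index"] + 1
        ):
            quotes[-1].append(pair)  # extend the current multi-sentence quote
        else:
            quotes.append([pair])  # start a new quote
    return quotes
```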
Application Interface¶
This component is a Marimo notebook designed to run in application mode, which will provide a graphical user interface to the remarx software.
This interface will allow users to run both the sentence corpus builder and the core pipeline.
This notebook will allow users to select and upload original and reuse texts and sentence corpora.
The notebook will contain minimal code and primarily call methods within the remarx package.
Users will have an option to save or download the output to a file location of their choice.
Depending on whether this notebook is run locally or on della via an Open OnDemand app, the pipeline’s intermediate files will be stored on either the user’s local machine or in a scratch folder on della. The software will provide a simple way to configure paths for downloaded and generated models, with sensible defaults for common scenarios.
The initial version of this application will not automatically clean up files generated by the processing, but will report locations and sizes (e.g., to allow manual cleanup). Future refinements may include clean up on user request or automated cleanup, depending on project team feedback.
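A minimal sketch of the notebook structure, assuming marimo's file-upload widget; the remarx entry point referenced in the comments is hypothetical, since the package API is not yet defined.

```python
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _(mo):
    # File-upload widgets for the two sentence corpora.
    original_upload = mo.ui.file(label="Original sentence corpus (CSV)")
    reuse_upload = mo.ui.file(label="Reuse sentence corpus (CSV)")
    mo.vstack([original_upload, reuse_upload])
    return original_upload, reuse_upload


@app.cell
def _(mo, original_upload, reuse_upload):
    # The heavy lifting stays in the remarx package; a hypothetical call might be:
    #   quotes = remarx.run_pipeline(original_upload.contents(), reuse_upload.contents())
    # Results and a save/download option would be displayed here.
    mo.md("Quote detection results will appear here.")
    return


if __name__ == "__main__":
    app.run()
```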
POST: Final Citation Corpus Compilation¶
This component is outside of the core software pipeline. This will link the identified quotes to any additional required metadata from the original texts (Kapital and Communist Manifesto) and the reuse texts (DNZ). This could include the following metadata:
- Both: Starting page number
 - Both: Content type (e.g., text, footnote)
 - Original: Marx title
 - DNZ: Journal name
 - DNZ: Volume issue
 - DNZ: Year of issue
 - DNZ: Article title
 - DNZ: Article author (optional, where possible)
 
We aim for data compilation to be done without custom development effort, and will look for a sufficient solution for this phase of the project, which will empower research team members to review and work with the data prior to export for publication. We have requested a no-code database interface from PUL (#princeton_ansible/6373); other alternatives are Google Sheets, Excel template, or AirTable.
Data¶
Data Design¶
Description of the data structures used by the software, with expected or required fields
Original/Reuse Sentence Corpus¶
A CSV file with each row corresponding to a single sentence in either an original or reuse text. This corpus may include additional data (e.g., document metadata for citation); this will be ignored by the system but preserved and passed through in the generated corpus.
| Field Name | Description | Type | Required / Optional | Reason | 
|---|---|---|---|---|
| id | Unique identifier for each sentence | String | Required | For tracking, reference | 
| text | Sentence text content | String | Required | For quote detection | 
| file | Corresponding document filename | String | Required | For tracking, reference, metadata linking | 
| sent_index | Sentence-level index within document (0-based indexing) | Integer | Required | For identifying sequential sentences for quote compilation | 
| section_type | What text section the sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging | 
| multi_page | Indicates whether the sentence spans multiple pages. | Boolean | Required | For reference and debugging | 
The sentence corpora produced by the Sentence Corpus Builder program must include one or more fields that link the sentence back to its location within the corresponding input document. This will vary by the input format (e.g., MEGAdigital: page and line number; plaintext: starting character index).
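For illustration only, two hypothetical rows showing how these linking fields might differ by input format (the text snippets are truncated and the page/line/char_start values are made up):

```python
# Sentence row from a MEGAdigital TEI-XML input, linked by page and line number.
mega_row = {
    "id": "kapital_band1_0412",
    "text": "Die Ware ist zunächst ein äusserer Gegenstand ...",
    "file": "kapital_band1.xml",
    "sent_index": 412,
    "page": "17",
    "line": "3",
}

# Sentence row from a plaintext input, linked by starting character index.
plaintext_row = {
    "id": "manifest_0001",
    "text": "Ein Gespenst geht um in Europa ...",
    "file": "manifest.txt",
    "sent_index": 1,
    "char_start": 520,
}
```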
Quote Sentence Pairs¶
A CSV file with each row corresponding to an original-reuse sentence pair that has been identified as a quote.
| Field Name | Description | Type | Required / Optional | Reason | 
|---|---|---|---|---|
| match_score | Some match quality score | Float | Required | For development, evaluation | 
| reuse_id | ID of the reuse text sentence | String | Required | For tracking, reference | 
| reuse_text | Text of the reuse sentence | String | Required | For reference | 
| reuse_file | Reuse document filename | String | Required | For tracking, reference | 
| reuse_sent_index | Sentence-level index within reuse document | Integer | Required | For identifying sequential sentences for quote compilation | 
| original_id | ID of the original sentence | String | Required | For tracking, reference, metadata linking | 
| original_text | Text of the original sentence | String | Required | For reference | 
| original_file | Original document filename | String | Required | For tracking, reference, metadata linking | 
| original_sent_index | Sentence-level index within original document | Integer | Required | For identifying sequential sentences for quote compilation | 
Additional metadata for the corresponding reuse and original sentences will be included depending on the contents of the input sentence corpora.
Each corpus's additional metadata fields will follow its sentence index field (i.e., reuse_sent_index, original_sent_index).
Quotes Corpus¶
A CSV file with each row corresponding to a text passage containing a quotation. These passages operate at sentence-level granularity; in other words, a passage corresponds to one or more sentences.
| Field Name | Description | Type | Required / Optional | Reason | 
|---|---|---|---|---|
| id | Unique identifier for quote. Starting sentence id for multi-sentence passages. | String | Required | For tracking, reference | 
| match_score | Some match quality score. For multi-sentence passages this will be the mean of the sentence-level scores. | Float | Required | For development, evaluation, reference. | 
| reuse_id | ID of the reuse sentence | String | Required | For tracking, reference | 
| reuse_doc | Filename of reuse document | String | Required | For tracking, reference | 
| reuse_text | Text of the quote as it appears in reuse document | String | Required | For reference, evaluation | 
| original_id | ID of the original sentence | String | Required | For tracking, reference | 
| original_doc | Filename of original document | String | Required | For tracking, reference | 
| original_text | Text of the quote as it appears in original document | String | Required | For reference, evaluation | 
| original_section_type | What text section the original sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging | 
| reuse_section_type | What text section the reuse sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging | 
Any additional metadata included with the input sentences will be passed through and prefixed based on the input corpus (reuse/original).
Final Citation Corpus¶
This CSV file is a tailored version of the Quotes Corpus produced by the core software pipeline where relevant page- and document-level metadata has been added.
| Field Name | Description | Type | Required / Optional | 
|---|---|---|---|
| quote_id | Unique identifier for quote | String | Required | 
| marx_quote | Text of direct quote from Marx text | String | Required | 
| article_quote | Text of direct quote from the article text | String | Required | 
| marx_title | Title of Marx text | String | Required | 
| marx_page | Starting page of quote in Marx text | String | Required | 
| article_title | Title of article | String | Required | 
| article_author | Name of article’s author | String | Optional | 
| journal | Title of journal | String | Required | 
| volume | Journal volume | String | Required | 
| issue | Journal Issue | String | Required | 
| year | Year of article/journal publication | String | Required | 
| certainty | A score (0-1) indicating quote certainty | Float | Required | 
Note: the article and marx naming conventions could be changed to something more general. The research team should confirm the desired format and field names, and whether any additional fields should be added (e.g., page vs. footnote).
Work Plan for Additional Data Work (if needed)¶
Describe the data work that needs to be done and include proposed deadlines.
Kapital Texts¶
The text of Kapital needs to be extracted from the MEGA TEI XML file. To properly handle sentences that span page boundaries, the main text and footnotes will be split into separate text files.
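A minimal sketch of this split, assuming footnotes are encoded as TEI note elements; the actual MEGAdigital markup will need to be confirmed before implementation.

```python
from pathlib import Path

from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def split_tei_text(tei_path: Path, out_dir: Path) -> None:
    """Write main text and footnotes from a TEI file into separate plaintext files."""
    tree = etree.parse(str(tei_path))
    root = tree.getroot()

    # Assumes footnotes are encoded as <note> elements inside the text;
    # the exact MEGAdigital encoding must be verified.
    notes = root.findall(".//tei:text//tei:note", namespaces=TEI_NS)
    footnote_text = "\n".join(" ".join(n.itertext()).strip() for n in notes)

    # Remove the notes, then serialize the remaining body text.
    for note in notes:
        note.getparent().remove(note)
    body = root.find(".//tei:text//tei:body", namespaces=TEI_NS)
    main_text = " ".join(body.itertext()) if body is not None else ""

    (out_dir / f"{tei_path.stem}_text.txt").write_text(main_text, encoding="utf-8")
    (out_dir / f"{tei_path.stem}_footnotes.txt").write_text(footnote_text, encoding="utf-8")
```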
Communist Manifesto Text¶
The text of the Communist Manifesto may need to be transformed into plaintext files. The scope of this effort is dependent on how the text is being acquired.
DNZ Text¶
The text of DNZ articles must be extracted from the ALTO-XML transcription files. For simplicity, each article should be saved as a separate text file. If an article has footnotes, they will be split into an additional text file.
Evaluation Dataset(s)¶
A dataset of identified direct quote annotations from Kapital and/or the Communist Manifesto must be constructed. This will be used as the evaluation reference during development to measure the performance of the software pipeline, and to provide an estimate for the expected scale of the dataset.
Derivative versions of the text reuse data will be created as needed to compare passages identified as text reuse with sentence level matches.
Data Management Plan¶
Datasets¶
Describe input and output data formats
| Name | Description of how generated | Data type | Format | Stability | 
|---|---|---|---|---|
| Kapital XML | MEGAdigital’s MEGA II/5 digital copy. This text corresponds to the first edition of the first volume of Das Kapital (Berlin: Dietz-Verlag 1983) | Input | TEI XML | |
| Communist Manifesto text | For this phase, we will use content from this online edition | Input | HTML | |
| DNZ transcriptions | Transcriptions of DNZ volumes with specialized text segment labels. These were created using eScriptorium and YALTAI. | Input | ALTO XML | |
| Kapital sentences | Corpora of sentences from Kapital. At least two: main text, footnotes. | Input | CSV | |
| Communist Manifesto sentences | Corpora of sentences from the Communist Manifesto. | Input | CSV | |
| DNZ article sentences | Corpora of sentences from DNZ articles. Articles may correspond to multiple corpora. | Input | CSV | |
| Kapital embeddings | Sentence embeddings for Kapital. For development only. | Intermediate | .npy | |
| Communist Manifesto embeddings | Sentence embeddings for Communist Manifesto. For development. | Intermediate | .npy | |
| DNZ embeddings | Sentence embeddings for DNZ articles. For development. | Intermediate | .npy | |
| Quote Sentence Pairs | Identified quote sentence-pairs. Primarily for development. | Intermediate | CSV | |
| Quotes Corpus | Identified quotes corpus. | Intermediate | CSV | |
| Final Citation Corpus | The final corpus of identified citations. | Output | CSV | | 
Access¶
Where will data be stored; back up plan?
GitHub. If the data is small enough, a public GitHub repository will be used to store the final dataset.
TigerData. Copies of all datasets will be stored in TigerData. Assuming GitHub is an option, this will be a copy of the GitHub data repo. If the data is too large, TigerData will be the primary data source for development. Directories should be named meaningfully and include a brief readme documenting contents and how data was generated or acquired. We will use GitHub issues for light-weight review of data structuring and documentation.
GoogleDrive. The initial text inputs will be stored in the project’s GoogleDrive folder. Any intermediate files that should be reviewed by the research team will also be uploaded to this folder.
Archiving and Publication¶
What is the plan for publication or archiving of the data? What license will be applied?
The final dataset will be published on Zenodo and/or the Princeton Data Commons depending on the terms of use / copyright status of the data. If possible, an additional copy will be published on GitHub.
Q: What license will be applied for the data?
Interface Functionality¶
Description of user-facing functionality directly accessible through the user interface and its components
Where possible, there will be two types of interfaces for two different types of users.
Notebook Application¶
The primary user interface is a marimo notebook. This interface will provide a graphical interface for selecting and uploading input text files and downloading the system’s output corpus file.
Command Line¶
Technical users and the development team will have an alternative interface where the software can be run directly from the terminal.
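As a sketch of what such an entry point might look like (the subcommand names and options below are hypothetical, not part of the current design):

```python
import argparse
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(prog="remarx")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical subcommand for the sentence corpus builder.
    build = subparsers.add_parser("build-corpus", help="Build a sentence corpus CSV")
    build.add_argument("inputs", nargs="+", type=Path, help="Input text files")
    build.add_argument("--output", type=Path, required=True, help="Output CSV path")

    # Hypothetical subcommand for the core pipeline.
    detect = subparsers.add_parser("detect", help="Identify quotes between corpora")
    detect.add_argument("--original", type=Path, required=True, nargs="+")
    detect.add_argument("--reuse", type=Path, required=True, nargs="+")
    detect.add_argument("--output", type=Path, required=True)

    args = parser.parse_args()
    # Dispatch to the corresponding remarx functions (not implemented in this sketch).
    print(f"Running {args.command} ...")


if __name__ == "__main__":
    main()
```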
Testing Plan¶
Unit Testing¶
Unit testing will be used throughout the development process. There must be unit tests for all python source code excluding marimo notebooks. The target code coverage for python source code is 95%; the target coverage for python test code is 100%.
Staging¶
During development, della will be used for a staging environment, as needed. This ensures a consistent shared environment as opposed to each developer’s local environment.
Ongoing Acceptance Testing¶
Acceptance testing will be conducted throughout development. Each user-facing feature must have a user story and must undergo acceptance testing before an issue is considered complete. Acceptance testing should normally be done on features that have been implemented, unit tested, gone through code review, and merged to the develop branch. Acceptance testing is expected to occur every iteration where possible and otherwise as soon as possible so that software updates can be released each iteration.
Application (i.e., notebook) testing will ideally run on the target environment. We will use molab initially, if helpful, while getting started with development. Then, depending on resource requirements, testing will shift either to the tester’s local machine or to della via Open OnDemand.
Development Conventions¶
Software Releases¶
The goal is to complete a software release each iteration, by either the project team meeting or the retro. This high frequency of releases will make the changelog more meaningful. The RSEs responsible for creating and reviewing each release will be assigned evenly across the team. Ideally, no RSE will be responsible for two consecutive releases.
The tool Repo Review should be checked before making a release.
Software Publishing¶
The software’s python source code should be packaged and published on PyPI iteratively, ideally with each release. We will add a GitHub workflow to automate publishing, triggered by each software release.
Python Development¶
Package Management¶
uv will be used for python package management. Any configuration settings will be added to the pyproject.toml.
Type Annotations¶
All python source code (excluding notebooks and test code) must use type annotations. This will be enforced by Ruff’s linter.
Formatting¶
We will use Ruff’s formatter to enforce a consistent code format. The formatter will be applied via pre-commit hooks, with a formatting check in GitHub Actions to catch improperly formatted commits.
Linting¶
We will use Ruff’s linter using the following rule sets:
| Rule(s) | Name | Reason | 
|---|---|---|
| F | pyflakes | Ruff default | 
| E4, E7, E9 | pycodestyle subset | Ruff default; subset compatible with Ruff’s formatter | 
| I | isort | Import sorting / organization | 
| ANN | flake8-annotations | Checks type annotations | 
| PTH | flake8-use-pathlib | Ensure pathlib usage instead of os.path | 
| B | flake8-bugbear | Flags likely bugs and design problems | 
| D | pydocstyle subset | Checks docstrings | 
| PERF | perflint | Checks for some performance anti-patterns | 
| SIM | flake8-simplify | Looks for code simplification | 
| C4 | flake8-comprehensions | Checks comprehension patterns | 
| RUF | ruff-specific rules | Potentially helpful? | 
| NPY | numpy-specific rules | For checking numpy usage | 
| UP | pyupgrade | Automatically update syntax to newer form | 
Notebook Development¶
We will use marimo for any notebook development. The application notebook will be included in the source code package and made available as a package command-line executable. If needed, analysis notebooks will be organized in a top-level “notebooks” folder. Any meaningful code (i.e., methods) for all notebooks should be located within the python source code, not the notebook.
Notebooks will be used both as the application interface and as needed for data analysis for development.
The Marimo app interface notebook will be tested if possible; data analysis notebooks will be excluded from code coverage.
Documentation¶
Documentation must be updated before a pull request is merged to the development branch. Documentation will be generated using mkdocs. The prettier linter will be used for all documentation markdown files.
We will implement automated documentation checks on pull requests to the develop and main branches (documentation coverage if supported by mkdocs; otherwise, confirmation that documentation has been updated).
Precommit Hooks¶
We will include precommit hooks for the following actions:
- Run Ruff’s linter and formatter
 - Run prettier linter on markdown files
 - Run codespell to check for typos within any text
 
GitHub Actions¶
Our repository will include the following GitHub Actions:
- Check all unit tests pass
 - 100% code coverage for test code
 - 95% coverage for source code
 - Ruff formatting check: to prevent the occasional commit with improper formatting
 - Check mkdocs documentation coverage
 - Check change log has been updated
 - Python package publication on PyPI (triggered by new release on GitHub)
 
Final Acceptance Criteria¶
Define the requirements the deliverable must meet to be considered complete and accepted. These criteria should be testable.
Functionality¶
- Using the evaluation dataset as a reference, the software must identify at least 90% of the expected quotations (i.e., 90% recall; see the recall sketch below).
    - Tolerance for false positives will be determined in conversation with the research team after reviewing preliminary results. If necessary, we will decrease the recall threshold to avoid too many false positives in the final dataset.
- Faculty collaborator can successfully run the software without assistance.
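A minimal sketch of how recall against the evaluation dataset could be computed, assuming the annotations and the pipeline output are both keyed by (original_id, reuse_id) pairs:

```python
def recall(identified: set[tuple[str, str]], expected: set[tuple[str, str]]) -> float:
    """Fraction of expected (original_id, reuse_id) quote pairs that were identified."""
    if not expected:
        return 0.0
    return len(identified & expected) / len(expected)


# Example: if 9 of 10 annotated quotes are found, recall is 0.9, meeting the 90% target.
```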
 
Sign Offs¶
Reviewed by: Jeri Wieringa
Signatures / Date