
Technical Design

Updated January 2026 to reflect the 1.0 release.


Statement of Purpose

The remarx software package provides a simple user interface for finding quotations in German-language content: it generates a dataset of similar passages (i.e., quotations or partial quotations) shared between two sets of texts.

The immediate use case motivating the development of this software was to identify direct quotes of Karl Marx's Manifest der Kommunistischen Partei (Communist Manifesto) and the first volume of Das Kapital (Kapital) within a subset of Die Neue Zeit (DNZ) articles.

The core functionality takes text content from original documents (i.e., the sources of quotes) and reuse documents (i.e., where quotations occur) and builds a sentence- or passage-level quotation corpus. This functionality is primarily accessible through a web-based interface comprising two Marimo notebooks running in application mode; it is also accessible through two command-line scripts. Both interfaces provide entry points to the same functionality in the remarx project. The application is designed to run locally on researcher machines.

Application Architecture

System Architecture

The application has two components: Sentence Corpus Builder and Quote Finder.

Default directory structure

For convenience, the application recommends (but does not require) a default directory structure under a remarx-data directory in the user's home directory. The web interface for the Sentence Corpus Builder prompts saving sentence output files in the default original or reuse locations (with an option to store elsewhere), and the Quote Finder web interface defaults to loading content from the original and reuse corpora folders.

$HOME/remarx-data/
├── corpora/
│   ├── original/
│   └── reuse/
└── quotes/

Sentence embeddings are named after the input sentence corpus file they are generated from, and cache files are created in the same directory as the input.

By default, Quote Finder output is saved to the quotes directory.
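The default layout can be sketched with pathlib; the constant and function names below are illustrative, not the actual remarx API:

```python
from pathlib import Path

# Hypothetical constants mirroring the recommended layout above;
# the actual names inside remarx may differ.
DATA_DIR = Path.home() / "remarx-data"
ORIGINAL_DIR = DATA_DIR / "corpora" / "original"
REUSE_DIR = DATA_DIR / "corpora" / "reuse"
QUOTES_DIR = DATA_DIR / "quotes"


def ensure_default_dirs() -> None:
    """Create the recommended directory tree if it does not already exist."""
    for directory in (ORIGINAL_DIR, REUSE_DIR, QUOTES_DIR):
        directory.mkdir(parents=True, exist_ok=True)
```

Since the structure is recommended rather than required, any such helper would only be a convenience; both interfaces allow storing files elsewhere.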

Sentence Corpus Builder

The Sentence Corpus Builder is used to convert input content into the format needed for the Quote Finder. It takes a single file in a supported format (TEI XML, zipfile of ALTO XML pages, or plain text) and outputs a CSV file containing a sentence-level text corpus generated from the input text.

It is implemented as a Python function that extracts sentences from an input text and generates a corresponding sentence-level corpus CSV file. It consists of the following steps:

  • text extraction
  • sentence segmentation
  • sentence filtering (excluding two-word sentences and "sentences" consisting only of numbers and punctuation)
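The filtering step might look roughly like the following; the exact thresholds and rules in remarx may differ, and `keep_sentence` and `min_words` are hypothetical names:

```python
import string


def keep_sentence(text: str, min_words: int = 3) -> bool:
    """Heuristic sentence filter matching the criteria above.

    Assumption: "two-word" exclusion is modeled as dropping anything
    shorter than min_words words; the real implementation may differ.
    """
    stripped = text.strip()
    # Drop "sentences" consisting only of digits, punctuation, and whitespace
    if all(ch.isdigit() or ch in string.punctuation or ch.isspace() for ch in stripped):
        return False
    # Drop very short fragments
    return len(stripped.split()) >= min_words
```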

Supported text input formats:

  • TEI XML. Yields all body content first, followed by all footnotes, to aid in consolidating sequential sentences
  • Zip file of ALTO-XML files. Uses block-level segmentation tags to determine what content to include and to provide article metadata and page numbers
  • Plain text (TXT)

TEI and ALTO input processing has been customized to suit specific project inputs for this phase (MEGA digital editions, transcriptions generated by customized segmentation and training in eScriptorium). The software may require modification to work well for other inputs.

Sentence segmentation relies on off-the-shelf NLP tools with German support (currently spaCy, which was evaluated in comparison with Stanza). It uses a German language model by default and may perform poorly for other languages.
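Segmentation with spaCy can be sketched as follows. This minimal version uses spaCy's rule-based sentencizer on a blank German pipeline, which avoids downloading model data; the project loads a full German model by default, which segments more accurately than this stand-in:

```python
import spacy

# Blank German pipeline with spaCy's rule-based sentencizer.
# remarx uses a full German language model by default instead.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")


def segment(text: str) -> list[str]:
    """Split text into a list of sentence strings."""
    return [sent.text.strip() for sent in nlp(text).sents]
```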

Sentence Corpus output

A CSV file with each row corresponding to a single sentence in either an original or reuse text. This corpus may include additional data (e.g., document metadata for citation); such data is ignored by the system but preserved and passed through in the generated corpus.

| Field | Description | Type | Required? |
| --- | --- | --- | --- |
| sent_id | Unique identifier for each sentence | String | Yes |
| file | Source document filename ¹ | String | Yes |
| sent_index | Sentence-level index within document (0-based indexing) | Integer | Yes |
| text | Sentence text content | String | Yes |
| section_type | Text section the sentence comes from (e.g., text vs. footnote) | String | Optional; TEI/ALTO only ² |
| page_number | Page of the source text where the sentence begins | String | Optional; TEI/ALTO only ² |
| title | Title of the article the sentence comes from, when available | String | Optional; ALTO only ² |
| author | Author of the article the sentence comes from, when available | String | Optional; ALTO only ² |
| page_file | Page file document within the ALTO zip file | String | Optional; ALTO only |
| line_number | Line number of the source text where the sentence begins | String | Optional; TEI only |

  1. For ALTO XML, the file field is the zip file that contains a set of ALTO page XML files.
  2. Depends on ALTO custom segmentation; may not always be present.
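Writing a corpus with the required fields plus passthrough metadata can be sketched with the standard library's csv module; `write_sentence_corpus` is a hypothetical name, not the actual remarx function:

```python
import csv
from pathlib import Path

# The four required columns from the table above; any extra metadata
# columns present in the rows are passed through after them.
REQUIRED_FIELDS = ["sent_id", "file", "sent_index", "text"]


def write_sentence_corpus(rows: list[dict], out_path: Path) -> None:
    """Write a sentence-level corpus CSV with passthrough metadata columns."""
    extra = [key for key in rows[0] if key not in REQUIRED_FIELDS]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS + extra)
        writer.writeheader()
        writer.writerows(rows)
```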

Quote Finder

The Quote Finder requires two sets of inputs: original and reuse sentence corpora. Each input is a CSV file as generated by the Sentence Corpus Builder, where each row corresponds to a sentence from a reuse or original text. Currently, the system allows selecting multiple original corpora and one reuse corpus. The quote finder generates a CSV file with identified quotations found when comparing the reuse and original corpora.

Sentence-Level Quote Detection

This component is a Python program that identifies quote sentence-pairs for a given set of original and reuse sentence corpora. First, sentence embeddings are generated for each sentence in the input corpora, using a pretrained model available via SBERT or Hugging Face. Then, likely quotes are identified via embedding similarity using an off-the-shelf approximate nearest neighbors library (Voyager, from Spotify), and candidate pairs are filtered by a cutoff score. The result is a quote sentence-pair corpus CSV file (described below).

The default embedding model is multilingual, but it has not been tested extensively against multilingual content. Other models could be added in the future.

The most compute-intensive part of this process is computing sentence embeddings, which may take some time depending on researcher hardware. For convenience, sentence embeddings are cached as a pickled NumPy binary file adjacent to the input sentence corpus file; embeddings are regenerated when the input sentence corpus file is newer than the embeddings file.
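The caching behavior described above might be sketched as follows; the cache file naming and function names are illustrative, not the actual remarx implementation:

```python
import numpy as np
from pathlib import Path


def cached_embeddings(corpus_path: Path, compute) -> np.ndarray:
    """Load embeddings cached next to the corpus file, recomputing
    only when the corpus is newer than the cache."""
    cache_path = corpus_path.with_suffix(".embeddings.npy")  # hypothetical name
    if cache_path.exists() and cache_path.stat().st_mtime >= corpus_path.stat().st_mtime:
        return np.load(cache_path)
    embeddings = compute(corpus_path)  # e.g. a sentence-transformers encode step
    np.save(cache_path, embeddings)
    return embeddings
```

Comparing modification times keeps the check cheap; a content hash would be more robust against clock skew but costs a full read of the corpus.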

Quote Consolidation

This is an optional Python method that consolidates the quote sentence-pair corpus into passages where possible. Sentence pairs that are sequential in both the original and reuse source texts are merged into a single multi-sentence quote, with a field indicating the number of sentences consolidated. Future versions may refine the consolidation logic to support more cases, such as skipped sentences or slight reordering, e.g., for quotations that re-order source content.
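A simplified sketch of the consolidation rule follows; the actual implementation also averages match scores across the group, and the field names here assume the output table below:

```python
def consolidate(pairs: list[dict]) -> list[dict]:
    """Merge sentence pairs that are sequential in BOTH the reuse and
    original texts into multi-sentence quotes (simplified sketch)."""
    pairs = sorted(pairs, key=lambda p: (p["reuse_sent_index"], p["original_sent_index"]))
    merged: list[dict] = []
    for pair in pairs:
        prev = merged[-1] if merged else None
        if (prev is not None
                and pair["reuse_sent_index"] == prev["reuse_sent_index"] + prev["num_sentences"]
                and pair["original_sent_index"] == prev["original_sent_index"] + prev["num_sentences"]):
            # Sequential in both corpora: extend the previous group
            prev["reuse_text"] += " " + pair["reuse_text"]
            prev["original_text"] += " " + pair["original_text"]
            prev["num_sentences"] += 1
        else:
            merged.append({**pair, "num_sentences": 1})
    return merged
```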

Quote Finder output

A CSV file with each row corresponding to a text passage containing a quotation. These passages operate at sentence granularity; in other words, a passage corresponds to one or more sentences.

| Field | Description | Type | Required? |
| --- | --- | --- | --- |
| match_score | Match quality score. For multi-sentence passages, the average of the scores within the group. | Float | Yes |
| reuse_id | ID of the reuse sentence | String | Yes |
| reuse_text | Text of the quote as it appears in the reuse document | String | Yes |
| reuse_file | Filename of the reuse document | String | Yes |
| reuse_sent_index | Reuse sentence index | Integer | Yes |
| reuse_... | Other reuse fields included in the input sentence corpus | Any | No |
| original_id | ID of the original sentence | String | Yes |
| original_text | Text of the quote as it appears in the original document | String | Yes |
| original_file | Filename of the original document | String | Yes |
| original_sent_index | Original sentence index | Integer | Yes |
| original_... | Other original fields included in the input sentence corpus | Any | No |
| num_sentences | Number of sentences included | Integer | Consolidated quotes only |

Any fields included with the input sentences will be passed through and prefixed based on the input corpus (either reuse_ or original_). All reuse fields precede all original fields.
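The passthrough-and-prefix behavior might be sketched as below. This is a simplification: the core identifier fields in the table above follow their own naming (e.g. reuse_id rather than reuse_sent_id), and `build_output_row` is a hypothetical name:

```python
def build_output_row(match_score: float, reuse: dict, original: dict) -> dict:
    """Combine a reuse and an original sentence into one output row,
    prefixing passthrough fields; reuse fields precede original fields."""
    row = {"match_score": match_score}
    row.update({f"reuse_{key}": value for key, value in reuse.items()})
    row.update({f"original_{key}": value for key, value in original.items()})
    return row
```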

Notes

Out of Scope

For this phase of the project, the following features are out of scope:

  • Non-German quote detection
  • Cross-lingual quote detection
  • Paraphrase / indirect quote detection
  • Attribution, citation, and named-reference detection

The tool is built with an eye towards future expansion to support other languages, and preliminary results do include some cross-lingual quotes in the output, but cross-lingual detection has not been tested thoroughly, and sentence corpus creation currently assumes German-language input.

Assumptions

  • Assumes input content is written in German (support for other languages will be added in future versions)
  • Is optimized for the DNZ-Marx use case
  • Can be adapted to other monolingual and cross-lingual settings, but with no quality guarantees
  • Compatible with any German original (e.g., other texts written by Marx) and reuse texts (e.g., earlier/later issues of DNZ, other journals)
  • Uses sentence embeddings as a core data representation
  • For sentence embedding construction, uses a pre-existing model available through sentence-transformers or huggingface
  • Developed and tested on Python 3.12
  • Can be installed and run on researcher machines (as an installable Python package, with support and documentation for use with uv and uv run)
  • Is OS- and architecture-agnostic (compatible with macOS, Linux, and Windows; runs on both x86 and ARM architectures)
  • Compiling a research dataset from the identified quotes can be done without custom development effort, using spreadsheets, a no-code database, or similar tools

References

Naydan, Mary, Bennett Nagtegaal, Rebecca Sutton Koeser, and Edward Baring. “CDH Project Charter — Citing Marx 2025.” Center for Digital Humanities at Princeton, February 12, 2025. https://doi.org/10.5281/zenodo.14861082