remarx

remarx: Marx quote identification for the CDH Citing Marx project.

MODULE DESCRIPTION
app

This module contains the UI and utilities for the remarx app

quotation

This module contains libraries for embedding generation and quotation detection.

sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

utils

Utility functions for the remarx package

utils

Utility functions for the remarx package

CLASS DESCRIPTION
CorpusPath

Paths for the default corpus directory structure.

FUNCTION DESCRIPTION
get_default_corpus_path

Return default corpus directories and optionally create them if missing.

get_default_quote_output_path

Return the default quote finder output directory path and optionally create it if missing.

configure_logging

Configure logging for the remarx application.

CorpusPath dataclass

CorpusPath(
    root: Path | None = None, original: Path | None = None, reuse: Path | None = None
)

Paths for the default corpus directory structure.

Populates unspecified directories based on the default data folder. Supports expansion of "~" or "~user" paths.

METHOD DESCRIPTION
__post_init__

Populate unset directories using the default data root, expanding "~" or "~user" values.

ready

Return True if both default corpus directories already exist.

ensure_directories

Create the corpus directories if they do not exist.

__post_init__

__post_init__() -> None

Populate unset directories using the default data root, expanding "~" or "~user" values with pathlib.Path.expanduser() so shell-style root paths are accepted. Callers can override any directory; otherwise the root defaults to DEFAULT_CORPUS_ROOT under remarx-data and the original and reuse directories live as its subfolders.

ready

ready() -> bool

Return True if both default corpus directories already exist.

ensure_directories

ensure_directories() -> None

Create the corpus directories if they do not exist.

get_default_corpus_path

get_default_corpus_path(create: bool = False) -> tuple[bool, CorpusPath]

Return default corpus directories and optionally create them if missing.

get_default_quote_output_path

get_default_quote_output_path(create: bool = False) -> tuple[bool, Path]

Return the default quote finder output directory path and optionally create it if missing.

PARAMETER DESCRIPTION
create

If True, create the directory if it doesn't exist

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
tuple[bool, Path]

Tuple of (ready flag, path to quote output directory)

configure_logging

configure_logging(
    log_destination: Path | TextIO | None = None, log_level: int = INFO
) -> Path | None

Configure logging for the remarx application. Supports logging to any text stream, a specified file, or an auto-generated timestamped file.

PARAMETER DESCRIPTION
log_destination

Where to write logs. Can be one of:

  • None (default): create a timestamped log file in the ./logs/ directory
  • pathlib.Path: write to the specified file path
  • Any text stream (e.g. sys.stdout, sys.stderr, or an io.TextIOBase instance): write to the given stream

TYPE: Path | TextIO | None DEFAULT: None

log_level

Logging level for the remarx logger (defaults to logging.INFO)

TYPE: int DEFAULT: INFO

RETURNS DESCRIPTION
Path | None

Path to the created log file if file logging is used, None if stream logging

app

This module contains the UI and utilities for the remarx app

MODULE DESCRIPTION
corpus_builder

The marimo notebook corresponding to the remarx corpus builder application.

log_viewer

Utilities for rendering remarx logs inside marimo notebooks.

quote_finder

The marimo notebook corresponding to the remarx quote finder application.

utils

Utility methods associated with the remarx app

utils

Utility methods associated with the remarx app

FUNCTION DESCRIPTION
lifespan

Lifespan context manager to open browser when server starts

redirect_root

Redirect root path to corpus-builder, since app currently has no home page

launch_app

Launch the remarx app in the default web browser

get_current_log_file

Get the path to the current log file from the root logger's file handler.

create_header

Create the header for the remarx notebooks

create_temp_input

Context manager to create a temporary file with the file contents and name of a file uploaded

summarize_corpus_selection

Summarize a sentence corpus CSV selected in the UI.

handle_default_corpus_creation

Update default corpus directory state based on the create button.

lifespan async

lifespan(app: FastAPI) -> AsyncGenerator[None, None]

Lifespan context manager to open browser when server starts

redirect_root async

redirect_root() -> RedirectResponse

Redirect root path to corpus-builder, since app currently has no home page

launch_app

launch_app() -> None

Launch the remarx app in the default web browser

get_current_log_file

get_current_log_file() -> Path | None

Get the path to the current log file from the root logger's file handler. Returns None if logging is not configured to a file.

create_header

create_header() -> None

Create the header for the remarx notebooks

create_temp_input

create_temp_input(file_upload: FileUploadResults) -> Generator[Path, None, None]

Context manager to create a temporary file with the file contents and name of a file uploaded to a web browser as returned by marimo.ui.file. This should be used in with statements.

RETURNS DESCRIPTION
Generator[Path, None, None]

Yields the path to the temporary file
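The context-manager pattern can be sketched without marimo by taking the uploaded name and contents directly. create_temp_input_sketch is hypothetical; the real function receives these via a FileUploadResults object:

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Generator

@contextmanager
def create_temp_input_sketch(name: str, contents: bytes) -> Generator[Path, None, None]:
    """Write uploaded bytes to a temp file that keeps the original suffix,
    yield its path, and clean up afterwards."""
    suffix = Path(name).suffix  # keep .txt/.xml/.zip so file-type detection works
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(contents)
        tmp_path = Path(tmp.name)
    try:
        yield tmp_path
    finally:
        tmp_path.unlink(missing_ok=True)

with create_temp_input_sketch("upload.txt", b"Ein Satz.") as path:
    seen_suffix = path.suffix
    seen_text = path.read_text()
```

Preserving the original suffix matters because corpus input classes are selected by file extension.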

summarize_corpus_selection

summarize_corpus_selection(
    selection: _HasPath | str | Path | None,
) -> dict[str, int | str] | None

Summarize a sentence corpus CSV selected in the UI.

Accepts either a file-browser selection (with a path attribute) or a filesystem path and returns corpus statistics that can be displayed in the Marimo table.

handle_default_corpus_creation

handle_default_corpus_creation(
    button: run_button,
    default_dirs_initial: CorpusPath,
    *,
    ready_message: str = ":white_check_mark: Default corpus folders are ready.",
    missing_message: str = ":x: Default corpus folders were not found."
) -> tuple[bool, CorpusPath, str, str]

Update default corpus directory state based on the create button.

RETURNS DESCRIPTION
tuple[bool, CorpusPath, str, str]

Tuple of (ready flag, directories object, status message, callout kind)

sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

MODULE DESCRIPTION
corpus

Functionality for loading and chunking input files for sentence corpus creation.

segment

Provides functionality to break down input text into individual sentences.

segment

Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.

FUNCTION DESCRIPTION
segment_text

Segment a string of text into sentences with character indices.

segment_text

segment_text(text: str, model: str = 'de_core_news_sm') -> list[tuple[int, str]]

Segment a string of text into sentences with character indices.

Automatically downloads the spaCy model on first use if it is not installed.

PARAMETER DESCRIPTION
text

Input text to be segmented into sentences

TYPE: str

model

spaCy model name; defaults to "de_core_news_sm"

TYPE: str DEFAULT: 'de_core_news_sm'

RETURNS DESCRIPTION
list[tuple[int, str]]

List of tuples where each tuple contains (start_char_index, sentence_text)
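Independent of spaCy, the return shape can be illustrated with a naive punctuation-based segmenter. segment_text_sketch is illustrative only; the real function delegates segmentation to a spaCy pipeline:

```python
import re

def segment_text_sketch(text: str) -> list[tuple[int, str]]:
    """Naive stand-in for spaCy segmentation: split after ., ! or ?
    and record each sentence's starting character index."""
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        sentence = match.group().strip()
        if sentence:
            # Index of the first non-space character of the sentence
            start = match.start() + (len(match.group()) - len(match.group().lstrip()))
            sentences.append((start, sentence))
    return sentences

result = segment_text_sketch("Erster Satz. Zweiter Satz!")
```

The character indices let downstream code (e.g. TEI line-number lookup) locate each sentence within its source chunk.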

corpus

Functionality for loading and chunking input files for sentence corpus creation.

MODULE DESCRIPTION
alto_input

Functionality related to parsing ALTO XML content packaged within a zipfile.

base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

create

Preliminary script and method to create sentence corpora from input files.

tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.

text_input

Input class for handling basic text file as input for corpus creation.

CLASS DESCRIPTION
ALTOInput

FileInput implementation for ALTO XML delivered as a zipfile.

FileInput

Base class for file input for sentence corpus creation

TEIDocument

Custom :class:neuxml.xmlmap.XmlObject instance for TEI XML document.

TEIinput

Input class for TEI/XML content. Takes a single input file and yields text content by page.

TextInput

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

ATTRIBUTE DESCRIPTION
TEI_TAG

Convenience access to namespaced TEI tag names

ALTOInput dataclass

ALTOInput(
    input_file: Path, filename_override: str = None, filter_sections: bool = True
)

Bases: FileInput

FileInput implementation for ALTO XML delivered as a zipfile. Iterates through ALTO XML members and yields text blocks with ALTO metadata.

METHOD DESCRIPTION
get_text

Iterate over ALTO XML files contained in the zipfile and return a generator of text content.

update_current_metadata

Update current article metadata.

check_zipfile_path

Check an individual file included in the zip archive to determine if parsing should be attempted.

ATTRIBUTE DESCRIPTION
field_names

List of field names for sentences originating from ALTO XML content.

TYPE: tuple[str, ...]

file_type

Supported file extension for ALTO zipfiles (.zip)

TYPE: str

default_include

Default content sections to include

TYPE: set[str]

filter_sections

Whether to filter text sections by block type

TYPE: bool

field_names class-attribute
field_names: tuple[str, ...] = (
    *(field_names),
    "section_type",
    "title",
    "author",
    "page_number",
    "page_file",
)

List of field names for sentences originating from ALTO XML content.

file_type class-attribute
file_type: str = '.zip'

Supported file extension for ALTO zipfiles (.zip)

default_include class-attribute
default_include: set[str] = {'text', 'footnote', 'Title'}

Default content sections to include

filter_sections class-attribute instance-attribute
filter_sections: bool = True

Whether to filter text sections by block type

get_text
get_text() -> Generator[dict[str, str], None, None]

Iterate over ALTO XML files contained in the zipfile and return a generator of text content.

update_current_metadata
update_current_metadata(blocks: list[TextBlock]) -> None

Update current article metadata.

check_zipfile_path
check_zipfile_path(zip_filepath: ZipInfo, zip_archive: ZipFile) -> None | AltoDocument

Check an individual file included in the zip archive to determine if parsing should be attempted and if it is a valid ALTO XML file. Returns AltoDocument if valid, otherwise None.

FileInput dataclass

FileInput(input_file: Path, filename_override: str = None)

Base class for file input for sentence corpus creation

METHOD DESCRIPTION
include_sentence

Return True if a sentence should be included in the corpus.

get_text

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic units).

get_extra_metadata

Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).

get_sentences

Get sentences for this file, with associated metadata.

subclasses

List of available file input classes.

subclass_by_type

Dictionary of subclass by supported file extension for available input classes.

supported_types

Unique list of supported file extensions for available input classes.

create

Instantiate and return the appropriate input class for the specified input file.

ATTRIBUTE DESCRIPTION
input_file

Reference to input file. Source of content for sentences.

TYPE: Path

filename_override

Optional filename override, e.g. when using temporary files as input

TYPE: str

field_names

List of field names for sentences from text input files.

TYPE: tuple[str, ...]

file_type

Supported file extension; subclasses must define

TYPE: str

file_name

Input file name. Associated with sentences in generated corpus.

TYPE: str

input_file instance-attribute
input_file: Path

Reference to input file. Source of content for sentences.

filename_override class-attribute instance-attribute
filename_override: str = None

Optional filename override, e.g. when using temporary files as input

field_names class-attribute
field_names: tuple[str, ...] = ('sent_id', 'file', 'sent_index', 'text')

List of field names for sentences from text input files.

file_type class-attribute
file_type: str

Supported file extension; subclasses must define

file_name cached property
file_name: str

Input file name. Associated with sentences in generated corpus.

include_sentence
include_sentence(sentence: str) -> bool

Return True if a sentence should be included in the corpus.

Drops sentences that are:

  • punctuation/digits-only (or the letter 'p' alone, e.g. "p.")
  • fewer than min_words tokens (whitespace split)
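A standalone sketch of this filter; the MIN_WORDS value here is an assumption, since the package defines its own min_words threshold:

```python
import string

MIN_WORDS = 2  # assumed threshold; the real value lives in the package

def include_sentence_sketch(sentence: str, min_words: int = MIN_WORDS) -> bool:
    """Sketch of the sentence filter described above."""
    # Drop punctuation/digit-only content, allowing a stray 'p' (as in "p. 12")
    stripped = sentence.translate(
        str.maketrans("", "", string.punctuation + string.digits + string.whitespace)
    )
    if stripped.lower() in ("", "p"):
        return False
    # Drop very short sentences (simple whitespace tokenization)
    return len(sentence.split()) >= min_words

keep = include_sentence_sketch("Das Kapital erscheint 1867.")
drop_page = include_sentence_sketch("p. 123")
```

Filtering page-number fragments like "p. 123" keeps OCR and edition artifacts out of the sentence corpus.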

get_text
get_text() -> Generator[dict[str, str]]

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). Subclasses must implement; no default implementation.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text and any other metadata that applies to this unit of text.

get_extra_metadata
get_extra_metadata(
    chunk_info: dict[str, Any], _char_idx: int, sentence: str
) -> dict[str, Any]

Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).

RETURNS DESCRIPTION
dict[str, Any]

Dictionary of additional metadata fields to include, or empty dict

get_sentences
get_sentences() -> Generator[dict[str, Any]]

Get sentences for this file, with associated metadata.

RETURNS DESCRIPTION
Generator[dict[str, Any]]

Generator of one dictionary per sentence; dictionary always includes: text (text content), file (filename), sent_index (sentence index within the document), and sent_id (sentence id). It may include other metadata, depending on the input file type.

subclasses classmethod
subclasses() -> list[type[Self]]

List of available file input classes.

subclass_by_type classmethod
subclass_by_type() -> dict[str, type[Self]]

Dictionary of subclass by supported file extension for available input classes.

supported_types classmethod
supported_types() -> list[str]

Unique list of supported file extensions for available input classes.

create classmethod
create(input_file: Path, filename_override: str | None = None) -> Self

Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.

RAISES DESCRIPTION
ValueError

if input_file is not a supported type
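The factory pattern described above can be sketched with plain subclass discovery. All class names below are hypothetical stand-ins for FileInput and its registered subclasses:

```python
from pathlib import Path

class FileInputSketch:
    """Base-class sketch: dispatch to a subclass by file extension."""

    file_type: str = ""

    def __init__(self, input_file: Path):
        self.input_file = input_file

    @classmethod
    def subclass_by_type(cls) -> dict[str, type["FileInputSketch"]]:
        # Map each registered subclass's supported extension to the class
        return {sub.file_type: sub for sub in cls.__subclasses__()}

    @classmethod
    def create(cls, input_file: Path) -> "FileInputSketch":
        subclass = cls.subclass_by_type().get(input_file.suffix)
        if subclass is None:
            raise ValueError(f"Unsupported file type: {input_file.suffix}")
        return subclass(input_file)

class TextInputSketch(FileInputSketch):
    file_type = ".txt"

class XMLInputSketch(FileInputSketch):
    file_type = ".xml"

handler = FileInputSketch.create(Path("marx.txt"))
```

This mirrors why the real input classes must be imported in remarx.sentence.corpus.__init__: only subclasses that have been imported are discoverable by the factory.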

TEI_TAG module-attribute

TEI_TAG = TagNames(**{tag: f'{{TEI_NAMESPACE}}{tag}' for tag in (_fields)})

Convenience access to namespaced TEI tag names

TEIDocument

Bases: BaseTEIXmlObject

Custom :class:neuxml.xmlmap.XmlObject instance for TEI XML document. Customized for MEGA TEI XML.

METHOD DESCRIPTION
init_from_file

Class method to initialize a new :class:TEIDocument from a file.

init_from_file classmethod
init_from_file(path: Path) -> Self

Class method to initialize a new :class:TEIDocument from a file.

TEIinput dataclass

TEIinput(input_file: Path, filename_override: str = None)

Bases: FileInput

Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.

METHOD DESCRIPTION
__post_init__

After default initialization, parse the input file as a TEIDocument.

get_text

Get document content as plain text. The document's content is yielded in segments, by page and footnote.

get_line_number

Return the TEI line number for the specified text index and character index.

get_extra_metadata

Calculate extra metadata including line number for a sentence in TEI documents

ATTRIBUTE DESCRIPTION
xml_doc

Parsed XML document; initialized from inherited input_file

TYPE: TEIDocument

field_names

List of field names for sentences from TEI XML input files

TYPE: tuple[str, ...]

file_type

Supported file extension for TEI/XML input

xml_doc class-attribute instance-attribute
xml_doc: TEIDocument = field(init=False)

Parsed XML document; initialized from inherited input_file

field_names class-attribute
field_names: tuple[str, ...] = (
    *(field_names),
    "page_number",
    "section_type",
    "line_number",
)

List of field names for sentences from TEI XML input files

file_type class-attribute instance-attribute
file_type = '.xml'

Supported file extension for TEI/XML input

__post_init__
__post_init__() -> None

After default initialization, parse the input file as a TEIDocument and store it as xml_doc.

get_text
get_text() -> Generator[dict[str, str]]

Get document content as plain text. The document's content is yielded in segments, with each segment a dictionary containing its text content, page number, and section type ("text" or "footnote"). Body text is yielded once per page, while each footnote is yielded individually.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with dictionaries of text content with page number and section_type ("text" or "footnote").

get_line_number
get_line_number(text_index: int, char_index: int) -> int | None

Return the TEI line number for the specified text index and character index. Returns the line number at or before char_index; line number offsets must be populated by get_text(). Returns None if line number cannot be determined.

get_extra_metadata
get_extra_metadata(
    chunk_info: dict[str, Any], char_idx: int, sentence: str
) -> dict[str, Any]

Calculate extra metadata including line number for a sentence in TEI documents based on the character position within the text chunk (page body or footnote).

RETURNS DESCRIPTION
dict[str, Any]

Dictionary with line_number for the sentence or empty dict

TextInput dataclass

TextInput(input_file: Path, filename_override: str = None)

Bases: FileInput

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

METHOD DESCRIPTION
get_text

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic units).

ATTRIBUTE DESCRIPTION
file_type

Supported file extension for text input

file_type class-attribute instance-attribute
file_type = '.txt'

Supported file extension for text input

get_text
get_text() -> Generator[dict[str, str]]

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). Default implementation does no chunking, no additional metadata.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text and any other metadata that applies to this unit of text.

base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

To initialize the appropriate subclass for a supported file type, use FileInput.create().

For a list of supported file types across all registered input classes, use FileInput.supported_types().

Subclasses must define a supported file_type extension and implement the get_text method. For discovery, input classes must be imported in remarx.sentence.corpus.__init__ and included in __all__ to ensure they are found as available input classes.

tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.

text_input

Input class for handling basic text file as input for corpus creation.

quotation

This module contains libraries for embedding generation and quotation detection.

MODULE DESCRIPTION
consolidate

Functionality for consolidating sequential quotes into passages.

embeddings

Library for generating sentence embeddings from pretrained Sentence Transformer models.

find_quotes

Command-line script to identify sentence-level quotation pairs between corpora.

pairs

Library for finding sentence-level quote pairs.

embeddings

Library for generating sentence embeddings from pretrained Sentence Transformer models.

FUNCTION DESCRIPTION
get_cached_embeddings

Get sentence embeddings, with file caching based on source file.

get_sentence_embeddings

Extract embeddings for each sentence using the specified pretrained Sentence Transformers model.

get_cached_embeddings

get_cached_embeddings(
    source_file: Path,
    sentences: list[str],
    model_name: str = DEFAULT_MODEL,
    show_progress_bar: bool = False,
) -> tuple[NDArray, bool]

Get sentence embeddings, with file caching based on source file.

Returns a tuple of embeddings array and a boolean indicating whether the data was loaded from cache.
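The caching pattern can be sketched with a content-keyed pickle file. The cache directory, key scheme, and file format below are assumptions for illustration, not the package's actual cache layout:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Fresh directory per run for this sketch; the real cache location differs
CACHE_DIR = Path(tempfile.mkdtemp())

def get_cached_embeddings_sketch(source_file: Path, sentences, embed):
    """Sketch: recompute embeddings only when no cache entry exists,
    and report whether the cache was used."""
    # Key the cache on source file name plus sentence content, so a
    # changed corpus never reuses stale vectors
    digest = hashlib.sha256(
        (source_file.name + "\x00".join(sentences)).encode()
    ).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes()), True  # loaded from cache
    embeddings = embed(sentences)  # expensive model call in the real package
    cache_file.write_bytes(pickle.dumps(embeddings))
    return embeddings, False

calls = []
fake_embed = lambda sents: (calls.append(1), [[0.0] * 3 for _ in sents])[1]
sents = ["Ein Satz.", "Noch ein Satz."]
first = get_cached_embeddings_sketch(Path("corpus.csv"), sents, fake_embed)
second = get_cached_embeddings_sketch(Path("corpus.csv"), sents, fake_embed)
```

The returned flag matches the documented tuple: callers can log whether the (slow) model was actually invoked.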

get_sentence_embeddings

get_sentence_embeddings(
    sentences: list[str],
    model_name: str = DEFAULT_MODEL,
    show_progress_bar: bool = False,
) -> NDArray

Extract embeddings for each sentence using the specified pretrained Sentence Transformers model (default is paraphrase-multilingual-mpnet-base-v2). Returns a numpy array of the embeddings with shape [# sents, # dims].

PARAMETER DESCRIPTION
sentences

List of sentences to generate embeddings for

TYPE: list[str]

model_name

Name of the pretrained sentence transformer model to use (default: paraphrase-multilingual-mpnet-base-v2)

TYPE: str DEFAULT: DEFAULT_MODEL

RETURNS DESCRIPTION
NDArray

2-dimensional numpy array of normalized sentence embeddings with shape [# sents, # dims]

consolidate

Functionality for consolidating sequential quotes into passages.

FUNCTION DESCRIPTION
identify_sequences

Given a polars dataframe, identify and label rows that are sequential for the specified field.

consolidate_quotes

Consolidate quotes that are sequential in both original and reuse texts.

identify_sequences

identify_sequences(df: DataFrame, field: str, group_field: str) -> DataFrame

Given a polars dataframe, identify and label rows that are sequential for the specified field, within the specified group field. Returns a modified dataframe with the following columns, prefixed by field name:

  • _sequential : boolean indicating whether a row is in a sequence,
  • _group : group identifier; uses field value for first in sequence
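The labelling can be sketched in plain Python over a list of row dictionaries; the real implementation uses polars expressions. Note this sketch keys groups by field value alone, as the column description above implies:

```python
def identify_sequences_sketch(rows, field, group_field):
    """Sketch of the sequence labelling described above."""
    labelled = []
    for i, row in enumerate(rows):
        prev = rows[i - 1] if i else None
        continues = (
            prev is not None
            and prev[group_field] == row[group_field]
            and row[field] == prev[field] + 1
        )
        # Group id: keep the previous group while the run continues,
        # otherwise start a new group labelled with this row's field value
        group = labelled[-1][f"{field}_group"] if continues else row[field]
        labelled.append({**row, f"{field}_group": group})
    # A row counts as sequential when its group holds more than one row
    sizes = {}
    for row in labelled:
        sizes[row[f"{field}_group"]] = sizes.get(row[f"{field}_group"], 0) + 1
    for row in labelled:
        row[f"{field}_sequential"] = sizes[row[f"{field}_group"]] > 1
    return labelled

rows = [
    {"sent_index": 3, "file": "a.txt"},
    {"sent_index": 4, "file": "a.txt"},
    {"sent_index": 9, "file": "a.txt"},
]
result = identify_sequences_sketch(rows, "sent_index", "file")
```

Here sentences 3 and 4 form one sequence (group 3) while sentence 9 stands alone.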

consolidate_quotes

consolidate_quotes(df: DataFrame) -> DataFrame

Consolidate quotes that are sequential in both original and reuse texts. Required fields:

  • reuse_sent_index and original_sent_index must be present for aggregation, and must be numeric
  • reuse_file and original_file must be present to ensure aggregation only happens for sequences within specific input files

If required fields are not present, raises polars.exceptions.ColumnNotFoundError. Raises ValueError when called on an empty dataframe.

Consolidation only occurs when:

  • Sentences are sequential in both reuse and original corpora
  • All sentences within a sequence belong to a single reuse corpus and original corpus (seemingly sequential sentences that span multiple files are not consolidated)

DataFrame is expected to include standard quote pair fields; for consolidated quotes, fields are aggregated as follows:

  • match_score average across the group
  • id and sent_index (both reuse and original): first value in group
  • reuse_text and original_text: combined with whitespace delimiter
  • For all other fields, unique values are combined, delimited by semicolon and space

The returned DataFrame includes a new column num_sentences which documents the number of sentences in a group (1 for unconsolidated quotes).
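The per-group aggregation rules can be sketched for a single group of row dictionaries. Field names follow the description above; the real implementation aggregates with polars:

```python
def consolidate_group_sketch(group: list[dict]) -> dict:
    """Sketch of the per-group aggregation rules described above."""
    first = group[0]
    out = {
        # match_score: average across the group
        "match_score": sum(r["match_score"] for r in group) / len(group),
        # sent_index fields: first value in the group
        "reuse_sent_index": first["reuse_sent_index"],
        "original_sent_index": first["original_sent_index"],
        # text fields: concatenated with a whitespace delimiter
        "reuse_text": " ".join(r["reuse_text"] for r in group),
        "original_text": " ".join(r["original_text"] for r in group),
        # number of sentences consolidated into this row
        "num_sentences": len(group),
    }
    # All other fields: unique values joined with "; "
    for key in first:
        if key not in out:
            uniques = list(dict.fromkeys(str(r[key]) for r in group))
            out[key] = "; ".join(uniques)
    return out

group = [
    {"match_score": 0.4, "reuse_sent_index": 10, "original_sent_index": 5,
     "reuse_text": "Erster Teil.", "original_text": "First part.",
     "reuse_file": "a.csv"},
    {"match_score": 0.6, "reuse_sent_index": 11, "original_sent_index": 6,
     "reuse_text": "Zweiter Teil.", "original_text": "Second part.",
     "reuse_file": "a.csv"},
]
merged = consolidate_group_sketch(group)
```

An unconsolidated quote is simply a group of size one, which yields num_sentences of 1 as documented.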

pairs

Library for finding sentence-level quote pairs.

Note: Currently this script only supports one original and reuse corpus.

FUNCTION DESCRIPTION
build_vector_index

Builds an index for a given set of embeddings with the specified number of trees.

get_sentence_pairs

Given an array of original and reuse sentence embeddings, identify pairs with high similarity (likely quotation).

load_sent_corpus

Takes a sentence corpus file, loads it into a polars DataFrame, and generates sentence embeddings.

compile_quote_pairs

Combine sentence metadata from original and reuse corpora with detected sentence pairs to form quote pairs.

find_quote_pairs

For a set of original sentence corpora and one reuse sentence corpus, finds likely sentence-level quote pairs.

build_vector_index

build_vector_index(embeddings: NDArray) -> Index

Builds an index for a given set of embeddings with the specified number of trees.

get_sentence_pairs

get_sentence_pairs(
    original_vecs: NDArray,
    reuse_vecs: NDArray,
    score_cutoff: float,
    show_progress_bar: bool = False,
) -> DataFrame

Given an array of original and reuse sentence embeddings, identify original-reuse sentence pairs with high similarity (i.e., likely quotation). Returns a polars DataFrame of sentence pairs with the following information:

  • original_index: the index of the original sentence
  • reuse_index: the index of the reuse sentence
  • match_score: the quality of the match

Uses embeddings and a vector index to find the nearest original sentence for each reuse sentence. Sentence pairs are filtered to those pairs with a match score (cosine similarity) above the specified cutoff.
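A brute-force stand-in for the index lookup, showing the nearest-neighbor search and cutoff filter; the real implementation uses a vector index rather than a full similarity matrix:

```python
import numpy as np

def get_sentence_pairs_sketch(
    original_vecs: np.ndarray, reuse_vecs: np.ndarray, score_cutoff: float
) -> list[dict]:
    """Sketch: for each reuse sentence, take the nearest original sentence
    by cosine similarity and keep the pair if it clears the cutoff."""
    # Normalize rows so the dot product equals cosine similarity
    orig = original_vecs / np.linalg.norm(original_vecs, axis=1, keepdims=True)
    reuse = reuse_vecs / np.linalg.norm(reuse_vecs, axis=1, keepdims=True)
    scores = reuse @ orig.T  # shape [# reuse, # original]
    pairs = []
    for reuse_index, row in enumerate(scores):
        original_index = int(np.argmax(row))  # nearest original sentence
        if row[original_index] >= score_cutoff:
            pairs.append({
                "original_index": original_index,
                "reuse_index": reuse_index,
                "match_score": float(row[original_index]),
            })
    return pairs

original = np.array([[1.0, 0.0], [0.0, 1.0]])
reuse = np.array([[0.9, 0.1], [-1.0, 0.0]])
pairs = get_sentence_pairs_sketch(original, reuse, score_cutoff=0.5)
```

Only the first reuse vector survives the cutoff here; the second has no sufficiently similar original sentence.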

load_sent_corpus

load_sent_corpus(
    sentence_corpus: Path, col_pfx: str | None = None, show_progress_bar: bool = False
) -> tuple[DataFrame, NDArray]

Takes a sentence corpus file and loads it into a polars DataFrame, and generates sentence embeddings for the text of each sentence in the corpus. Optionally supports adding a prefix to all column names in the DataFrame.

The resulting dataframe has the same fields as the input corpus, with the following adjustments:

  • a new field index corresponding to the row index
  • the sentence id field sent_id is renamed to id
  • all field names are prefixed if a column prefix is specified

Returns a tuple of the DataFrame and a numpy array of embedding vectors.

compile_quote_pairs

compile_quote_pairs(
    original_corpus: DataFrame, reuse_corpus: DataFrame, detected_pairs: DataFrame
) -> DataFrame

Combine sentence metadata from original and reuse corpora with detected sentence pair identifiers to form quote pairs. The original and reuse corpus dataframes must contain a row index column named original_index and reuse_index respectively. Ideally, these dataframes should be built using load_sent_corpus.

Returns a dataframe with the following fields:

  • match_score: Estimated quality of the match
  • All other fields in order from the reuse corpus except row index
  • All other fields in order from the original corpus except row index

find_quote_pairs

find_quote_pairs(
    original_corpus: list[Path],
    reuse_corpus: Path,
    output_path: Path,
    score_cutoff: float = 0.225,
    consolidate: bool = True,
    show_progress_bar: bool = False,
    benchmark: bool = False,
) -> None

For a set of original sentence corpora and one reuse sentence corpus, finds the likely sentence-level quote pairs, which are saved as a CSV file.

Optional parameters allow configuring the score_cutoff threshold for including quote pairs and toggling consolidation of consecutive sentences (on by default). When benchmark is enabled, summary information is logged reporting corpus sizes and the time taken to generate embeddings and search for pairs.