remarx
¶
remarx: Marx quote identification for the CDH Citing Marx project.
| MODULE | DESCRIPTION |
|---|---|
| `app` | This module contains the UI and utilities for the remarx app. |
| `quotation` | This module contains libraries for embedding generation and quotation detection. |
| `sentence` | This module contains libraries for sentence segmentation and sentence corpus construction. |
| `utils` | Utility functions for the remarx package. |
utils
¶
Utility functions for the remarx package
| CLASS | DESCRIPTION |
|---|---|
| `CorpusPath` | Paths for the default corpus directory structure. |

| FUNCTION | DESCRIPTION |
|---|---|
| `get_default_corpus_path` | Return default corpus directories and optionally create them if missing. |
| `get_default_quote_output_path` | Return the default quote finder output directory path and optionally create it if missing. |
| `configure_logging` | Configure logging for the remarx application. |
CorpusPath
dataclass
¶
Paths for the default corpus directory structure.
Populates unspecified directories based on the default data folder. Supports expansion of "~" or "~user" paths.
| METHOD | DESCRIPTION |
|---|---|
| `__post_init__` | Populate unset directories using the default data root, expanding "~" or "~user" values. |
| `ready` | Return True if both default corpus directories already exist. |
| `ensure_directories` | Create the corpus directories if they do not exist. |
__post_init__
¶
Populate unset directories using the default data root, expanding "~"
or "~user" values with pathlib.Path.expanduser() so shell-style root
paths are accepted. Callers can override any directory; otherwise the
root defaults to DEFAULT_CORPUS_ROOT under remarx-data and the
original and reuse directories live as its subfolders.
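The expansion behavior described above can be sketched with the standard library alone. This is a hypothetical stand-in (class and field names are illustrative, not the remarx implementation) showing how `Path.expanduser()` handles shell-style roots in `__post_init__`:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Hypothetical stand-in for CorpusPath: expands "~" in user-supplied roots
# and derives the original/reuse subfolders when they are not overridden.
@dataclass
class DemoCorpusPath:
    root: Path = Path("~/remarx-data")  # stand-in for DEFAULT_CORPUS_ROOT
    original: Path | None = None
    reuse: Path | None = None

    def __post_init__(self) -> None:
        # expanduser() turns "~" or "~user" into an absolute home-relative path
        self.root = Path(self.root).expanduser()
        if self.original is None:
            self.original = self.root / "original"
        if self.reuse is None:
            self.reuse = self.root / "reuse"

paths = DemoCorpusPath()
print(paths.original)  # e.g. /home/you/remarx-data/original
```

Callers that pass explicit `original` or `reuse` paths bypass the derived defaults, matching the "callers can override any directory" behavior.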
ensure_directories
¶
Create the corpus directories if they do not exist.
get_default_corpus_path
¶
get_default_corpus_path(create: bool = False) -> tuple[bool, CorpusPath]
Return default corpus directories and optionally create them if missing.
get_default_quote_output_path
¶
Return the default quote finder output directory path and optionally create it if missing.
| PARAMETER | DESCRIPTION |
|---|---|
| `create` | If True, create the directory if it doesn't exist |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[bool, Path]` | Tuple of (ready flag, path to quote output directory) |
configure_logging
¶
configure_logging(
    log_destination: Path | TextIO | None = None, log_level: int = INFO
) -> Path | None
Configure logging for the remarx application. Supports logging to any text stream, a specified file, or auto-generated timestamped file.
| PARAMETER | DESCRIPTION |
|---|---|
| `log_destination` | Where to write logs. None (default): create a timestamped log file in the ./logs/ directory. pathlib.Path: write to the specified file path. Any text stream (e.g., sys.stdout, sys.stderr, or another io.TextIOBase): write to the given stream. TYPE: `Path \| TextIO \| None` |
| `log_level` | Logging level for the remarx logger (defaults to logging.INFO). TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `Path \| None` | Path to the created log file if file logging is used; None for stream logging |
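The destination dispatch described above can be sketched with the standard `logging` module. This is a hedged illustration of the same three cases (None, Path, stream); the helper name and logger name are placeholders, not remarx internals:

```python
import io
import logging
from datetime import datetime
from pathlib import Path

# Hypothetical sketch of a configure_logging-style helper: None means a
# timestamped file under ./logs/, a Path means that exact file, and a text
# stream gets a StreamHandler. Returns the log file path or None.
def demo_configure_logging(destination=None, level=logging.INFO):
    logger = logging.getLogger("demo_remarx")
    logger.setLevel(level)
    logger.handlers.clear()
    if destination is None:
        destination = Path("logs") / f"demo_{datetime.now():%Y%m%d_%H%M%S}.log"
    if isinstance(destination, Path):
        destination.parent.mkdir(parents=True, exist_ok=True)
        logger.addHandler(logging.FileHandler(destination))
        return destination  # file logging: report the path
    logger.addHandler(logging.StreamHandler(destination))
    return None  # stream logging: no file path to report

stream = io.StringIO()
demo_configure_logging(stream)
logging.getLogger("demo_remarx").info("hello")
```

The None-vs-Path-vs-stream return contract mirrors the documented `Path | None` return type.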
app
¶
This module contains the UI and utilities for the remarx app
| MODULE | DESCRIPTION |
|---|---|
| `corpus_builder` | The marimo notebook corresponding to the corpus builder. |
| `log_viewer` | Utilities for rendering remarx logs inside marimo notebooks. |
| `quote_finder` | The marimo notebook corresponding to the quote finder. |
| `utils` | Utility methods associated with the remarx app. |
utils
¶
Utility methods associated with the remarx app
| FUNCTION | DESCRIPTION |
|---|---|
| `lifespan` | Lifespan context manager to open the browser when the server starts. |
| `redirect_root` | Redirect the root path to corpus-builder, since the app currently has no home page. |
| `launch_app` | Launch the remarx app in the default web browser. |
| `get_current_log_file` | Get the path to the current log file from the root logger's file handler. |
| `create_header` | Create the header for the remarx notebooks. |
| `create_temp_input` | Context manager to create a temporary file with the contents and name of an uploaded file. |
| `summarize_corpus_selection` | Summarize a sentence corpus CSV selected in the UI. |
| `handle_default_corpus_creation` | Update default corpus directory state based on the create button. |
lifespan
async
¶
Lifespan context manager to open browser when server starts
redirect_root
async
¶
Redirect root path to corpus-builder, since app currently has no home page
get_current_log_file
¶
Get the path to the current log file from the root logger's file handler. Returns None if logging is not configured to a file.
create_temp_input
¶
Context manager to create a temporary file with the file contents and name of a file uploaded to a web browser as returned by marimo.ui.file. This should be used in with statements.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[Path, None, None]` | Yields the path to the temporary file |
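A minimal sketch of this pattern, assuming the uploaded file is available as a name plus raw bytes (as marimo.ui.file exposes); the helper name and signature here are illustrative, not the remarx API:

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Generator

# Hypothetical sketch of create_temp_input: write the uploaded contents to a
# temp directory under the original filename and yield the path; everything
# is cleaned up when the with-block exits.
@contextmanager
def demo_temp_input(name: str, contents: bytes) -> Generator[Path, None, None]:
    with tempfile.TemporaryDirectory() as tmpdir:
        # keep the original filename so extension-based input detection works
        path = Path(tmpdir) / name
        path.write_bytes(contents)
        yield path

with demo_temp_input("essay.txt", b"Ein Satz.") as p:
    print(p.name, p.read_bytes())
```

Preserving the uploaded filename matters because corpus input classes dispatch on the file extension.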
summarize_corpus_selection
¶
summarize_corpus_selection(
    selection: _HasPath | str | Path | None,
) -> dict[str, int | str] | None
Summarize a sentence corpus CSV selected in the UI.
Accepts either a file-browser selection (with a path attribute) or a
filesystem path and returns corpus statistics that can be displayed in the
Marimo table.
handle_default_corpus_creation
¶
handle_default_corpus_creation(
    button: run_button,
    default_dirs_initial: CorpusPath,
    *,
    ready_message: str = ":white_check_mark: Default corpus folders are ready.",
    missing_message: str = ":x: Default corpus folders were not found."
) -> tuple[bool, CorpusPath, str, str]
Update default corpus directory state based on the create button.
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[bool, CorpusPath, str, str]` | Tuple of (ready flag, directories object, status message, callout kind) |
sentence
¶
segment
¶
Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.
| FUNCTION | DESCRIPTION |
|---|---|
| `segment_text` | Segment a string of text into sentences with character indices. |
segment_text
¶
Segment a string of text into sentences with character indices.
Automatically downloads the spaCy model on first use if it is not installed.
| PARAMETER | DESCRIPTION |
|---|---|
| `text` | Input text to be segmented into sentences |
| `model` | spaCy model name; defaults to "de_core_news_sm" |

| RETURNS | DESCRIPTION |
|---|---|
| `list[tuple[int, str]]` | List of tuples where each tuple contains (start_char_index, sentence_text) |
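For illustration of the return shape only: remarx segments with a spaCy model, but a naive regex splitter shows what `(start_char_index, sentence_text)` tuples look like. The function below is a toy stand-in, not the remarx algorithm:

```python
import re

# Illustrative only: a crude splitter on sentence-final punctuation that
# returns (start_char_index, sentence_text) tuples like segment_text does.
def naive_segment_text(text: str) -> list[tuple[int, str]]:
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        sentence = match.group().strip()
        if sentence:
            # start index of the first non-space character of the sentence
            leading = len(match.group()) - len(match.group().lstrip())
            sentences.append((match.start() + leading, sentence))
    return sentences

print(naive_segment_text("Erster Satz. Zweiter Satz!"))
```

The character indices let downstream code (e.g. TEI line-number lookup) map each sentence back into the source text.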
corpus
¶
Functionality for loading and chunking input files for sentence corpus creation.
| MODULE | DESCRIPTION |
|---|---|
| `alto_input` | Functionality related to parsing ALTO XML content packaged within a zipfile. |
| `base_input` | Base file input class with common functionality. Provides a factory method for initializing known input classes. |
| `create` | Preliminary script and method to create sentence corpora from input files. |
| `tei_input` | Functionality related to parsing MEGA TEI/XML content. |
| `text_input` | Input class for handling basic text files as input for corpus creation. |
| CLASS | DESCRIPTION |
|---|---|
| `ALTOInput` | FileInput implementation for ALTO XML delivered as a zipfile. |
| `FileInput` | Base class for file input for sentence corpus creation. |
| `TEIDocument` | Custom `neuxml.xmlmap.XmlObject` instance for TEI XML documents. |
| `TEIinput` | Input class for TEI/XML content. Takes a single input file and yields text content by page. |
| `TextInput` | Basic text file input handling for sentence corpus creation. Takes a single text input file. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `TEI_TAG` | Convenience access to namespaced TEI tag names |
ALTOInput
dataclass
¶
Bases: FileInput
FileInput implementation for ALTO XML delivered as a zipfile. Iterates through ALTO XML members and yields text blocks with ALTO metadata.
| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Iterate over ALTO XML files contained in the zipfile and return a generator of text content. |
| `update_current_metadata` | Update current article metadata. |
| `check_zipfile_path` | Check an individual file included in the zip archive to determine if parsing should be attempted. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `field_names` | List of field names for sentences originating from ALTO XML content. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension for ALTO zipfiles (.zip) |
| `default_include` | Default content sections to include |
| `filter_sections` | Whether to filter text sections by block type |
field_names
class-attribute
¶
field_names: tuple[str, ...] = (
    *(field_names),
    "section_type",
    "title",
    "author",
    "page_number",
    "page_file",
)
List of field names for sentences originating from ALTO XML content.
file_type
class-attribute
¶
Supported file extension for ALTO zipfiles (.zip)
default_include
class-attribute
¶
Default content sections to include
filter_sections
class-attribute
instance-attribute
¶
Whether to filter text sections by block type
get_text
¶
Iterate over ALTO XML files contained in the zipfile and return a generator of text content.
update_current_metadata
¶
Update current article metadata.
check_zipfile_path
¶
Check an individual file included in the zip archive to determine if parsing should be attempted and if it is a valid ALTO XML file. Returns AltoDocument if valid, otherwise None.
FileInput
dataclass
¶
Base class for file input for sentence corpus creation
| METHOD | DESCRIPTION |
|---|---|
| `include_sentence` | Return True if a sentence should be included in the corpus. |
| `get_text` | Get plain-text contents for this input file with any desired chunking. |
| `get_extra_metadata` | Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number). |
| `get_sentences` | Get sentences for this file, with associated metadata. |
| `subclasses` | List of available file input classes. |
| `subclass_by_type` | Dictionary of subclass by supported file extension for available input classes. |
| `supported_types` | Unique list of supported file extensions for available input classes. |
| `create` | Instantiate and return the appropriate input class for the specified input file. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `input_file` | Reference to input file. Source of content for sentences. |
| `filename_override` | Optional filename override, e.g. when using temporary files as input |
| `field_names` | List of field names for sentences from text input files. |
| `file_type` | Supported file extension; subclasses must define |
| `file_name` | Input file name. Associated with sentences in generated corpus. |
input_file
instance-attribute
¶
Reference to input file. Source of content for sentences.
filename_override
class-attribute
instance-attribute
¶
Optional filename override, e.g. when using temporary files as input
field_names
class-attribute
¶
List of field names for sentences from text input files.
file_name
cached
property
¶
Input file name. Associated with sentences in generated corpus.
include_sentence
¶
Return True if a sentence should be included in the corpus.
Drops sentences that are:
- Punctuation/digits-only (or the letter 'p' alone, e.g. p.)
- Fewer than min_words tokens (whitespace split)
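The two drop rules above can be sketched in plain Python. This is an assumption-laden illustration (the `min_words` default and the exact character checks are placeholders, not the remarx implementation):

```python
# Sketch of the include_sentence filtering rules; MIN_WORDS is a placeholder.
MIN_WORDS = 2

def demo_include_sentence(sentence: str, min_words: int = MIN_WORDS) -> bool:
    stripped = sentence.strip()
    # drop sentences that are punctuation/digits-only, or the letter "p"
    # alone (page references like "p. 42")
    letters = [c for c in stripped if c.isalpha()]
    if not letters or letters == ["p"]:
        return False
    # drop sentences with fewer than min_words whitespace-separated tokens
    return len(stripped.split()) >= min_words

print(demo_include_sentence("p. 42"))  # page reference: excluded
print(demo_include_sentence("Das Kapital erschien 1867."))
```

Filtering at corpus-build time keeps page numbers and stray OCR fragments from polluting the embedding index later.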
get_text
¶
Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). Subclasses must implement; no default implementation.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
get_extra_metadata
¶
Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).
| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary of additional metadata fields to include, or an empty dict |
get_sentences
¶
Get sentences for this file, with associated metadata.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, Any]]` | Generator of one dictionary per sentence, including text and associated metadata fields. |
subclass_by_type
classmethod
¶
Dictionary of subclass by supported file extension for available input classes.
supported_types
classmethod
¶
Unique list of supported file extensions for available input classes.
create
classmethod
¶
Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | if input_file is not a supported type |
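The factory-by-extension pattern described for `create` can be sketched with a small class hierarchy. Class names echo the docs, but the bodies are placeholders, not the remarx code:

```python
from __future__ import annotations

from pathlib import Path

# Minimal sketch of the FileInput factory: map supported file extensions to
# subclasses and dispatch on the input file's suffix.
class DemoFileInput:
    file_type = ""

    def __init__(self, input_file: Path, filename_override: str | None = None):
        self.input_file = input_file
        self.filename_override = filename_override

    @classmethod
    def subclass_by_type(cls) -> dict[str, type[DemoFileInput]]:
        # registry built from subclasses, keyed by supported extension
        return {sub.file_type: sub for sub in cls.__subclasses__()}

    @classmethod
    def supported_types(cls) -> list[str]:
        return sorted(cls.subclass_by_type())

    @classmethod
    def create(cls, input_file: Path, filename_override=None) -> DemoFileInput:
        subclass = cls.subclass_by_type().get(input_file.suffix)
        if subclass is None:
            raise ValueError(f"{input_file} is not a supported type")
        return subclass(input_file, filename_override)

class DemoTextInput(DemoFileInput):
    file_type = ".txt"

class DemoTEIInput(DemoFileInput):
    file_type = ".xml"

print(type(DemoFileInput.create(Path("essay.txt"))).__name__)
```

This mirrors why the docs say input classes must be imported in `remarx.sentence.corpus.__init__`: a subclass that is never imported never appears in the registry.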
TEI_TAG
module-attribute
¶
Convenience access to namespaced TEI tag names
TEIDocument
¶
Bases: BaseTEIXmlObject
Custom `neuxml.xmlmap.XmlObject` instance for a TEI XML document.
Customized for MEGA TEI XML.
| METHOD | DESCRIPTION |
|---|---|
| `init_from_file` | Class method to initialize a new TEIDocument from a file. |
init_from_file
classmethod
¶
Class method to initialize a new `TEIDocument` from a file.
TEIinput
dataclass
¶
Bases: FileInput
Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.
| METHOD | DESCRIPTION |
|---|---|
| `__post_init__` | After default initialization, parse the input file as a TEIDocument. |
| `get_text` | Get document content as plain text; the content is yielded in segments. |
| `get_line_number` | Return the TEI line number for the specified text index and character index. |
| `get_extra_metadata` | Calculate extra metadata, including line number, for a sentence in TEI documents. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `xml_doc` | Parsed XML document; initialized from inherited input_file. TYPE: `TEIDocument` |
| `field_names` | List of field names for sentences from TEI XML input files. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension for TEI/XML input |
xml_doc
class-attribute
instance-attribute
¶
xml_doc: TEIDocument = field(init=False)
Parsed XML document; initialized from inherited input_file
field_names
class-attribute
¶
field_names: tuple[str, ...] = (
    *(field_names),
    "page_number",
    "section_type",
    "line_number",
)
List of field names for sentences from TEI XML input files
file_type
class-attribute
instance-attribute
¶
Supported file extension for TEI/XML input
__post_init__
¶
After default initialization, parse the input file as a TEIDocument and store it as xml_doc.
get_text
¶
Get document content as plain text. The document's content is yielded in segments, with each segment corresponding to a dictionary containing its text content, page number, and section type ("text" or "footnote"). Body text is yielded once per page, while each footnote is yielded individually.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with dictionaries of text content with page number and section_type ("text" or "footnote"). |
get_line_number
¶
Return the TEI line number for the specified text index and
character index. Returns the line number at or before char_index;
line number offsets must be populated by get_text().
Returns None if line number cannot be determined.
get_extra_metadata
¶
Calculate extra metadata including line number for a sentence in TEI documents based on the character position within the text chunk (page body or footnote).
| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary with line_number for the sentence, or an empty dict |
TextInput
dataclass
¶
Bases: FileInput
Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.
| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Get plain-text contents for this file with any desired chunking. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `file_type` | Supported file extension for text input |
file_type
class-attribute
instance-attribute
¶
Supported file extension for text input
get_text
¶
Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). Default implementation does no chunking, no additional metadata.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
base_input
¶
Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.
To initialize the appropriate subclass for a supported file type, use FileInput.create().
For a list of supported file types across all registered input classes, use FileInput.supported_types().
Subclasses must define a supported file_type extension and implement
the get_text method. For discovery, input classes must be imported in
remarx.sentence.corpus.__init__ and included in __all__ to ensure
they are found as available input classes.
tei_input
¶
Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.
text_input
¶
Input class for handling a basic text file as input for corpus creation.
quotation
¶
This module contains libraries for embedding generation and quotation detection.
| MODULE | DESCRIPTION |
|---|---|
| `consolidate` | Functionality for consolidating sequential quotes into passages. |
| `embeddings` | Library for generating sentence embeddings from pretrained Sentence Transformer models. |
| `find_quotes` | Command-line script to identify sentence-level quotation pairs between corpora. |
| `pairs` | Library for finding sentence-level quote pairs. |
embeddings
¶
Library for generating sentence embeddings from pretrained Sentence Transformer models.
| FUNCTION | DESCRIPTION |
|---|---|
| `get_cached_embeddings` | Get sentence embeddings, with file caching based on source file. |
| `get_sentence_embeddings` | Extract embeddings for each sentence using the specified pretrained Sentence Transformers model. |
get_cached_embeddings
¶
get_cached_embeddings(
    source_file: Path,
    sentences: list[str],
    model_name: str = DEFAULT_MODEL,
    show_progress_bar: bool = False,
) -> tuple[NDArray, bool]
Get sentence embeddings, with file caching based on source file.
Returns a tuple of embeddings array and a boolean indicating whether the data was loaded from cache.
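The caching idea can be sketched under stated assumptions: cache computed vectors next to the source file, keyed on a digest of the sentences, and reuse them when nothing changed. remarx's actual cache format and invalidation strategy may differ; the embedding function here is a deliberately fake stand-in:

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of file-based embedding caching: recompute only when
# the sentence list's digest no longer matches the cached one.
def demo_cached_embeddings(source_file: Path, sentences, compute):
    digest = hashlib.sha256("\n".join(sentences).encode()).hexdigest()
    cache_file = source_file.with_suffix(".embeddings.json")
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if cached["digest"] == digest:
            return cached["vectors"], True  # loaded from cache
    vectors = [compute(s) for s in sentences]
    cache_file.write_text(json.dumps({"digest": digest, "vectors": vectors}))
    return vectors, False  # freshly computed

# toy "embedding": character count and word count stand in for model output
def fake_embed(s):
    return [len(s), len(s.split())]

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "corpus.csv"
    src.write_text("...")
    _, from_cache = demo_cached_embeddings(src, ["a b", "c"], fake_embed)
    print(from_cache)
    _, from_cache = demo_cached_embeddings(src, ["a b", "c"], fake_embed)
    print(from_cache)
```

The boolean in the returned tuple mirrors the documented cache-hit flag.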
get_sentence_embeddings
¶
get_sentence_embeddings(
    sentences: list[str],
    model_name: str = DEFAULT_MODEL,
    show_progress_bar: bool = False,
) -> NDArray
Extract embeddings for each sentence using the specified pretrained Sentence Transformers model (default is paraphrase-multilingual-mpnet-base-v2). Returns a numpy array of the embeddings with shape [# sents, # dims].
| PARAMETER | DESCRIPTION |
|---|---|
| `sentences` | List of sentences to generate embeddings for. TYPE: `list[str]` |
| `model_name` | Name of the pretrained sentence transformer model to use (default: paraphrase-multilingual-mpnet-base-v2). TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `NDArray` | 2-dimensional numpy array of normalized sentence embeddings with shape [# sents, # dims] |
consolidate
¶
Functionality for consolidating sequential quotes into passages.
| FUNCTION | DESCRIPTION |
|---|---|
identify_sequences |
Given a polars dataframe, identify and label rows that are sequential |
consolidate_quotes |
Consolidate quotes that are sequential in both original and reuse texts. |
identify_sequences
¶
Given a polars dataframe, identify and label rows that are sequential for the specified field, within the specified group field. Returns a modified dataframe with the following columns, prefixed by the field name:
- `_sequential`: boolean indicating whether a row is in a sequence
- `_group`: group identifier; uses the field value for the first row in a sequence
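The run-labeling logic can be illustrated in pure Python (the real function operates on a polars DataFrame): a row is in a sequence when its index value is adjacent to a neighbor's, and each row carries the first value of its run as the group identifier. The function below is a sketch, not the remarx implementation:

```python
# Sketch of identify_sequences on a plain list of index values: returns
# (in_sequence, group_id) per row, where group_id is the first value of
# the row's run of consecutive values.
def demo_identify_sequences(indices: list[int]) -> list[tuple[bool, int]]:
    labeled = []
    group_start = None
    for pos, value in enumerate(indices):
        prev_adjacent = pos > 0 and indices[pos - 1] == value - 1
        next_adjacent = pos + 1 < len(indices) and indices[pos + 1] == value + 1
        in_sequence = prev_adjacent or next_adjacent
        # start a new group unless we continue the previous run
        if not (prev_adjacent and group_start is not None):
            group_start = value
        labeled.append((in_sequence, group_start))
    return labeled

print(demo_identify_sequences([3, 4, 5, 9]))
```

Rows sharing a group identifier are the candidates that `consolidate_quotes` later merges into a single passage.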
consolidate_quotes
¶
Consolidate quotes that are sequential in both original and reuse texts. Required fields:
- `reuse_sent_index` and `original_sent_index` must be present for aggregation, and must be numeric
- `reuse_file` and `original_file` must be present to ensure aggregation only happens for sequences within specific input files

If required fields are not present, raises polars.exceptions.ColumnNotFoundError. Raises ValueError when called on an empty dataframe.
Consolidation only occurs when:
- Sentences are sequential in both reuse and original corpora
- All sentences within a sequence belong to a single reuse corpus and original corpus (seemingly sequential sentences that span multiple files are not consolidated)

The DataFrame is expected to include standard quote pair fields; for consolidated quotes, fields are aggregated as follows:
- `match_score`: average across the group
- `id` and `sent_index` (both `reuse` and `original`): first value in the group
- `reuse_text` and `original_text`: combined with a whitespace delimiter
- all other fields: unique values combined, delimited by semicolon and space

The returned DataFrame includes a new column `num_sentences`, which documents the number of sentences in a group (1 for unconsolidated quotes).
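The per-group aggregation rules can be sketched on plain dicts (the real function uses polars expressions). Field names follow the docs; the `page` field stands in for "all other fields":

```python
# Sketch of consolidating one group of sequential quote rows per the rules
# above: average score, first id, whitespace-joined text, unique other values.
def demo_consolidate_group(rows: list[dict]) -> dict:
    out = {
        "match_score": sum(r["match_score"] for r in rows) / len(rows),
        "reuse_id": rows[0]["reuse_id"],  # first value in group
        "reuse_text": " ".join(r["reuse_text"] for r in rows),
        "num_sentences": len(rows),
    }
    # any other field: unique values joined by "; ", preserving order
    out["page"] = "; ".join(dict.fromkeys(str(r["page"]) for r in rows))
    return out

group = [
    {"match_score": 0.8, "reuse_id": "s1", "reuse_text": "Der Wert", "page": 12},
    {"match_score": 0.6, "reuse_id": "s2", "reuse_text": "der Ware.", "page": 12},
]
print(demo_consolidate_group(group))
```

Joining text with whitespace reconstructs the multi-sentence passage, while `num_sentences` records how many rows were merged.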
pairs
¶
Library for finding sentence-level quote pairs.
Note: Currently this script only supports one original and reuse corpus.
| FUNCTION | DESCRIPTION |
|---|---|
| `build_vector_index` | Builds an index for a given set of embeddings with the specified number of trees. |
| `get_sentence_pairs` | Given an array of original and reuse sentence embeddings, identify pairs with high similarity. |
| `load_sent_corpus` | Takes a sentence corpus file and loads it into a polars DataFrame, with sentence embeddings. |
| `compile_quote_pairs` | Combine sentence metadata from original and reuse corpora with detected sentence pair identifiers. |
| `find_quote_pairs` | For a set of original sentence corpora and one reuse sentence corpus, finds likely quote pairs. |
build_vector_index
¶
Builds an index for a given set of embeddings with the specified number of trees.
get_sentence_pairs
¶
get_sentence_pairs(
    original_vecs: NDArray,
    reuse_vecs: NDArray,
    score_cutoff: float,
    show_progress_bar: bool = False,
) -> DataFrame
Given arrays of original and reuse sentence embeddings, identify original-reuse pairs with high similarity (i.e., likely quotation). Returns a polars DataFrame of sentence pairs with the following information:
- `original_index`: the index of the original sentence
- `reuse_index`: the index of the reuse sentence
- `match_score`: the quality of the match

Uses embeddings and a vector index to find the nearest original sentence for each reuse sentence. Sentence pairs are filtered to those with a match score (cosine similarity) above the specified cutoff.
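The matching step can be sketched in pure Python, replacing the vector index with a brute-force nearest-neighbor scan (remarx uses an index for speed; this is a conceptual sketch, not the remarx code):

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Sketch of get_sentence_pairs: for each reuse vector, take the nearest
# original vector by cosine similarity and keep pairs above the cutoff.
def demo_sentence_pairs(original_vecs, reuse_vecs, score_cutoff):
    pairs = []
    for reuse_index, rv in enumerate(reuse_vecs):
        scores = [cosine(ov, rv) for ov in original_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= score_cutoff:
            pairs.append({"original_index": best,
                          "reuse_index": reuse_index,
                          "match_score": scores[best]})
    return pairs

originals = [[1.0, 0.0], [0.0, 1.0]]
reuses = [[0.9, 0.1], [-1.0, 0.0]]
print(demo_sentence_pairs(originals, reuses, score_cutoff=0.5))
```

Because each reuse sentence is matched to at most one original sentence, the output has at most one row per reuse index.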
load_sent_corpus
¶
load_sent_corpus(
    sentence_corpus: Path, col_pfx: str | None = None, show_progress_bar: bool = False
) -> tuple[DataFrame, NDArray]
Takes a sentence corpus file, loads it into a polars DataFrame, and generates sentence embeddings for the text of each sentence in the corpus. Optionally supports adding a prefix to all column names in the DataFrame.
The resulting dataframe has the same fields as the input corpus, with the following adjustments:
- a new field `index` corresponding to the row index
- the sentence id field `sent_id` is renamed to `id`
- all field names are prefixed if a column prefix is specified

Returns a tuple of DataFrame and numpy array with embedding vectors.
compile_quote_pairs
¶
compile_quote_pairs(
    original_corpus: DataFrame, reuse_corpus: DataFrame, detected_pairs: DataFrame
) -> DataFrame
Combine sentence metadata from original and reuse corpora with detected
sentence pair identifiers to form quote pairs. The original and reuse
corpus dataframes must contain a row index column named original_index and
reuse_index respectively. Ideally, these dataframes should be built using
load_sent_corpus.
Returns a dataframe with the following fields:
- `match_score`: estimated quality of the match
- All other fields in order from the reuse corpus, except the row index
- All other fields in order from the original corpus, except the row index
find_quote_pairs
¶
find_quote_pairs(
    original_corpus: list[Path],
    reuse_corpus: Path,
    output_path: Path,
    score_cutoff: float = 0.225,
    consolidate: bool = True,
    show_progress_bar: bool = False,
    benchmark: bool = False,
) -> None
For a set of original sentence corpora and one reuse sentence corpus, finds the likely sentence-level quote pairs, which are saved as a CSV file.
Optional parameters allow configuring the score_cutoff threshold for including quote pairs, and consolidation of consecutive sentences (on by default).
When benchmark is enabled, summary information is logged to report
on corpus size and timings to generate embeddings and search for pairs.