remarx

remarx: Marx quote identification for the CDH Citing Marx project.

MODULE DESCRIPTION
app

This module contains the UI and utilities for the remarx app

quotation

This module contains libraries for embedding generation and quotation detection.

sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

utils

Utility functions for the remarx package

utils

Utility functions for the remarx package

FUNCTION DESCRIPTION
configure_logging

Configure logging for the remarx application.

configure_logging

configure_logging(
    log_destination: Path | TextIO | None = None,
    log_level: int = INFO,
    stanza_log_level: int = ERROR,
) -> Path | None

Configure logging for the remarx application. Supports logging to any text stream, a specified file, or an auto-generated timestamped file.

PARAMETER DESCRIPTION
log_destination

Where to write logs. Can be:

  • None (default): creates a timestamped log file in the ./logs/ directory
  • pathlib.Path: write to the specified file path
  • io.TextIOBase (e.g. sys.stdout or sys.stderr): write to the given stream

TYPE: Path | TextIO | None DEFAULT: None

log_level

Logging level for the remarx logger (defaults to logging.INFO)

TYPE: int DEFAULT: INFO

stanza_log_level

Logging level for the stanza logger (defaults to logging.ERROR)

TYPE: int DEFAULT: ERROR

RETURNS DESCRIPTION
Path | None

Path to the created log file if file logging is used, None if stream logging
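
A minimal usage sketch follows; the import path remarx.utils.configure_logging is assumed from the module layout above.

import logging
import sys

from remarx.utils import configure_logging  # import path assumed

# Stream logging: write DEBUG and above to stdout; returns None
configure_logging(log_destination=sys.stdout, log_level=logging.DEBUG)

# File logging with defaults: creates a timestamped log file under
# ./logs/ and returns its path
log_file = configure_logging()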

app

This module contains the UI and utilities for the remarx app

MODULE DESCRIPTION
corpus_builder

The marimo notebook corresponding to the remarx application.

quote_finder

The marimo notebook corresponding to the remarx application.

utils

Utility methods associated with the remarx app

utils

Utility methods associated with the remarx app

FUNCTION DESCRIPTION
lifespan

Lifespan context manager to open the browser when the server starts

redirect_root

Redirect the root path to corpus-builder, since the app currently has no home page

launch_app

Launch the remarx app in the default web browser

get_current_log_file

Get the path to the current log file from the root logger's file handler.

create_header

Create the header for the remarx notebooks

create_temp_input

Context manager to create a temporary file with the file contents and name of a file uploaded via a web browser

lifespan async

lifespan(app: FastAPI) -> AsyncGenerator[None, None]

Lifespan context manager to open the browser when the server starts

redirect_root async

redirect_root() -> RedirectResponse

Redirect the root path to corpus-builder, since the app currently has no home page

launch_app

launch_app() -> None

Launch the remarx app in the default web browser

get_current_log_file

get_current_log_file() -> Path | None

Get the path to the current log file from the root logger's file handler. Returns None if logging is not configured to a file.

create_header

create_header() -> None

Create the header for the remarx notebooks

create_temp_input

create_temp_input(file_upload: FileUploadResults) -> Generator[Path, None, None]

Context manager to create a temporary file with the file contents and name of a file uploaded via a web browser, as returned by marimo.ui.file. This should be used in a with statement.

RETURNS DESCRIPTION
Generator[Path, None, None]

Yields the path to the temporary file
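
A usage sketch, assuming the import path remarx.app.utils and marimo's file-upload element; the access pattern for the uploaded value is an assumption about the marimo API.

import marimo as mo

from remarx.app.utils import create_temp_input  # import path assumed

uploader = mo.ui.file()  # file-upload element inside a marimo notebook

def read_uploaded_file() -> str:
    # assuming a single uploaded file in uploader.value
    with create_temp_input(uploader.value[0]) as tmp_path:
        return tmp_path.read_text()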

sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

MODULE DESCRIPTION
corpus

Functionality for loading and chunking input files for sentence corpus creation.

segment

Provides functionality to break down input text into individual sentences.

segment

Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.

FUNCTION DESCRIPTION
segment_text

Segment a string of text into sentences with character indices.

segment_text

segment_text(text: str, language: str = 'de') -> list[tuple[int, str]]

Segment a string of text into sentences with character indices.

PARAMETER DESCRIPTION
text

Input text to be segmented into sentences

TYPE: str

language

Language code for the Stanza pipeline

TYPE: str DEFAULT: 'de'

RETURNS DESCRIPTION
list[tuple[int, str]]

List of tuples where each tuple contains (start_char_index, sentence_text)
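
A quick sketch of the return shape; the character offsets in the comment are illustrative.

from remarx.sentence.segment import segment_text  # import path assumed

sentences = segment_text("Erster Satz. Zweiter Satz.", language="de")
# e.g. [(0, 'Erster Satz.'), (13, 'Zweiter Satz.')]
for start_idx, sentence in sentences:
    print(start_idx, sentence)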

corpus

Functionality for loading and chunking input files for sentence corpus creation.

MODULE DESCRIPTION
alto_input

Functionality related to parsing ALTO XML content packaged within a zipfile.

base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

create

Preliminary script and method to create sentence corpora from input files.

tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.

text_input

Input class for handling basic text file as input for corpus creation.

CLASS DESCRIPTION
ALTOInput

Preliminary FileInput implementation for ALTO XML delivered as a zipfile.

FileInput

Base class for file input for sentence corpus creation

TEIDocument

Custom neuxml.xmlmap.XmlObject instance for a TEI XML document.

TEIinput

Input class for TEI/XML content. Takes a single input file and yields text content by page, with page number.

TEIPage

Custom neuxml.xmlmap.XmlObject instance for a page of content within a TEI XML document.

TextInput

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

ATTRIBUTE DESCRIPTION
TEI_TAG

Convenience access to namespaced TEI tag names

ALTOInput dataclass

ALTOInput(input_file: Path, filename_override: str = None)

Bases: FileInput

Preliminary FileInput implementation for ALTO XML delivered as a zipfile. Iterates through ALTO XML members and stubs out chunk yielding for future parsing.

METHOD DESCRIPTION
get_text

Iterate over ALTO XML files contained in the zipfile and return a generator of text content.

ATTRIBUTE DESCRIPTION
field_names

List of field names for sentences originating from ALTO XML content.

TYPE: tuple[str, ...]

file_type

Supported file extension for ALTO zipfiles (.zip)

TYPE: str

field_names class-attribute
field_names: tuple[str, ...] = (*FileInput.field_names, 'section_type')

List of field names for sentences originating from ALTO XML content.

file_type class-attribute
file_type: str = '.zip'

Supported file extension for ALTO zipfiles (.zip)

get_text
get_text() -> Generator[dict[str, str], None, None]

Iterate over ALTO XML files contained in the zipfile and return a generator of text content.

FileInput dataclass

FileInput(input_file: Path, filename_override: str = None)

Base class for file input for sentence corpus creation

METHOD DESCRIPTION
get_text

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit)

get_extra_metadata

Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).

get_sentences

Get sentences for this file, with associated metadata.

subclasses

List of available file input classes.

subclass_by_type

Dictionary of subclass by supported file extension for available input classes.

supported_types

Unique list of supported file extensions for available input classes.

create

Instantiate and return the appropriate input class for the specified input file.

ATTRIBUTE DESCRIPTION
input_file

Reference to input file. Source of content for sentences.

TYPE: Path

filename_override

Optional filename override, e.g. when using temporary files as input

TYPE: str

field_names

List of field names for sentences from text input files.

TYPE: tuple[str, ...]

file_type

Supported file extension; subclasses must define

TYPE: str

file_name

Input file name. Associated with sentences in generated corpus.

TYPE: str

input_file instance-attribute
input_file: Path

Reference to input file. Source of content for sentences.

filename_override class-attribute instance-attribute
filename_override: str = None

Optional filename override, e.g. when using temporary files as input

field_names class-attribute
field_names: tuple[str, ...] = ('sent_id', 'file', 'sent_index', 'text')

List of field names for sentences from text input files.

file_type class-attribute
file_type: str

Supported file extension; subclasses must define

file_name cached property
file_name: str

Input file name. Associated with sentences in generated corpus.

get_text
get_text() -> Generator[dict[str, str]]

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). Subclasses must implement; no default implementation.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text and any other metadata that applies to this unit of text.

get_extra_metadata
get_extra_metadata(
    chunk_info: dict[str, Any], _char_idx: int, sentence: str
) -> dict[str, Any]

Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).

RETURNS DESCRIPTION
dict[str, Any]

Dictionary of additional metadata fields to include, or an empty dict

get_sentences
get_sentences() -> Generator[dict[str, Any]]

Get sentences for this file, with associated metadata.

RETURNS DESCRIPTION
Generator[dict[str, Any]]

Generator of one dictionary per sentence; dictionary always includes: text (text content), file (filename), sent_index (sentence index within the document), and sent_id (sentence id). It may include other metadata, depending on the input file type.

subclasses classmethod
subclasses() -> list[type[Self]]

List of available file input classes.

subclass_by_type classmethod
subclass_by_type() -> dict[str, type[Self]]

Dictionary of subclass by supported file extension for available input classes.

supported_types classmethod
supported_types() -> list[str]

Unique list of supported file extensions for available input classes.

create classmethod
create(input_file: Path, filename_override: str | None = None) -> Self

Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.

RAISES DESCRIPTION
ValueError

if input_file is not a supported type
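
A sketch of the factory in use, assuming the import path remarx.sentence.corpus and an illustrative file name.

from pathlib import Path

from remarx.sentence.corpus import FileInput  # import path assumed

# the factory selects the input class by file extension (.xml -> TEIinput)
file_input = FileInput.create(Path("volume.xml"))
for sentence in file_input.get_sentences():
    print(sentence["sent_id"], sentence["text"])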

TEI_TAG module-attribute

TEI_TAG = TagNames(**{tag: f'{{{TEI_NAMESPACE}}}{tag}' for tag in TagNames._fields})

Convenience access to namespaced TEI tag names

TEIDocument

Bases: BaseTEIXmlObject

Custom neuxml.xmlmap.XmlObject instance for a TEI XML document. Customized for MEGA TEI XML.

METHOD DESCRIPTION
init_from_file

Class method to initialize a new TEIDocument from a file.

ATTRIBUTE DESCRIPTION
all_pages

List of page objects, identified by page begin tag (pb). Includes all pages (standard and manuscript edition).

pages

Standard pages for this document. Returns a list of TEIPage objects, omitting any pages marked as manuscript edition.

TYPE: list[TEIPage]

pages_by_number

Dictionary lookup of standard pages by page number.

TYPE: dict[str, TEIPage]

all_pages class-attribute instance-attribute
all_pages = NodeListField('//t:text//t:pb', TEIPage)

List of page objects, identified by page begin tag (pb). Includes all pages (standard and manuscript edition), because the XPath is significantly faster without filtering.

pages cached property
pages: list[TEIPage]

Standard pages for this document. Returns a list of TEIPage objects for this document, omitting any pages marked as manuscript edition.

pages_by_number cached property
pages_by_number: dict[str, TEIPage]

Dictionary lookup of standard pages by page number.

init_from_file classmethod
init_from_file(path: Path) -> Self

Class method to initialize a new TEIDocument from a file.
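
A brief sketch of working with a parsed document; the file name and page number are illustrative.

from pathlib import Path

from remarx.sentence.corpus import TEIDocument  # import path assumed

doc = TEIDocument.init_from_file(Path("mega_volume.xml"))
page = doc.pages_by_number["5"]  # look up a standard page by page number
print(str(page))  # body text followed by footnotes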

TEIinput dataclass

TEIinput(input_file: Path, filename_override: str = None)

Bases: FileInput

Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.

METHOD DESCRIPTION
__post_init__

After default initialization, parse the input file as a TEIDocument and store it as xml_doc.

get_text

Get document content as plain text. The document's content is yielded in segments.

get_extra_metadata

Calculate extra metadata including line number for a sentence in TEI documents

ATTRIBUTE DESCRIPTION
xml_doc

Parsed XML document; initialized from inherited input_file

TYPE: TEIDocument

field_names

List of field names for sentences from TEI XML input files

TYPE: tuple[str, ...]

file_type

Supported file extension for TEI/XML input

xml_doc class-attribute instance-attribute
xml_doc: TEIDocument = field(init=False)

Parsed XML document; initialized from inherited input_file

field_names class-attribute
field_names: tuple[str, ...] = (
    *FileInput.field_names,
    "page_number",
    "section_type",
    "line_number",
)

List of field names for sentences from TEI XML input files

file_type class-attribute instance-attribute
file_type = '.xml'

Supported file extension for TEI/XML input

__post_init__
__post_init__() -> None

After default initialization, parse the input file as a TEIDocument and store it as xml_doc.

get_text
get_text() -> Generator[dict[str, str]]

Get document content as plain text. The document's content is yielded in segments, with each segment corresponding to a dictionary containing its text content, page number, and section type ("text" or "footnote"). Body text is yielded once per page, while each footnote is yielded individually.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with dictionaries of text content, with page number and section_type ("text" or "footnote").

get_extra_metadata
get_extra_metadata(
    chunk_info: dict[str, Any], char_idx: int, sentence: str
) -> dict[str, Any]

Calculate extra metadata including line number for a sentence in TEI documents based on the character position within the text chunk (page body or footnote).

RETURNS DESCRIPTION
dict[str, Any]

Dictionary with line_number for the sentence (None if not found)

TEIPage

Bases: BaseTEIXmlObject

Custom neuxml.xmlmap.XmlObject instance for a page of content within a TEI XML document.

METHOD DESCRIPTION
is_footnote_content

Helper function that checks if an element or any of its ancestors is footnote content.

get_page_footnotes

Filters footnotes to keep only the footnotes that belong to this page.

get_body_text_line_number

Return the TEI line number for the line at or before char_pos.

find_preceding_lb

Find the closest preceding lb element for an element.

get_body_text

Extract body text content for this page, excluding footnotes and editorial content.

get_footnote_text

Get all footnote content as a single string, with footnotes separated by double newlines.

__str__

Page text contents as a string, with body text and footnotes.

ATTRIBUTE DESCRIPTION
number

page number

edition

page edition, if any

text_nodes

list of all text nodes following this tag

following_footnotes

list of footnote elements within this page and following pages

next_page

the next standard page break after this one, or None if this is the last page

number class-attribute instance-attribute
number = StringField('@n')

page number

edition class-attribute instance-attribute
edition = StringField('@ed')

page edition, if any

text_nodes class-attribute instance-attribute
text_nodes = StringListField('following::text()')

list of all text nodes following this tag

following_footnotes class-attribute instance-attribute
following_footnotes = NodeListField("following::t:note[@type='footnote']", TEIFootnote)

list of footnote elements within this page and following pages

next_page class-attribute instance-attribute
next_page = NodeField('following::t:pb[not(@ed)][1]', 'self')

the next standard page break after this one, or None if this is the last page

is_footnote_content staticmethod
is_footnote_content(el: _Element) -> bool

Helper function that checks if an element or any of its ancestors is footnote content.

get_page_footnotes
get_page_footnotes() -> list[TEIFootnote]

Filters footnotes to keep only the footnotes that belong to this page. Only includes footnotes that occur between this pb and the next standard pb[not(@ed)].

get_body_text_line_number
get_body_text_line_number(char_pos: int) -> int | None

Return the TEI line number for the line at or before char_pos. Returns None if no line number can be determined.

find_preceding_lb staticmethod
find_preceding_lb(element: _Element) -> _Element | None

Find the closest preceding lb element for an element. Needed to find the lb relative to immediately following inline markup, e.g. an lb immediately followed by text.

get_body_text
get_body_text() -> str

Extract body text content for this page, excluding footnotes and editorial content. While collecting the text, build a mapping of character offsets to TEI line numbers.

get_footnote_text
get_footnote_text() -> str

Get all footnote content as a single string, with footnotes separated by double newlines.

__str__
__str__() -> str

Page text contents as a string, with body text and footnotes.

TextInput dataclass

TextInput(input_file: Path, filename_override: str = None)

Bases: FileInput

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

METHOD DESCRIPTION
get_text

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit).

ATTRIBUTE DESCRIPTION
file_type

Supported file extension for text input

file_type class-attribute instance-attribute
file_type = '.txt'

Supported file extension for text input

get_text
get_text() -> Generator[dict[str, str]]

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). Default implementation does no chunking, no additional metadata.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text and any other metadata that applies to this unit of text.

base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

To initialize the appropriate subclass for a supported file type, use FileInput.create().

For a list of supported file types across all registered input classes, use FileInput.supported_types().

Subclasses must define a supported file_type extension and implement the get_text method. For discovery, input classes must be imported in remarx.sentence.corpus.__init__ and included in __all__ to ensure they are found as available input classes.
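
To illustrate these requirements, a hypothetical subclass might look like the sketch below; MarkdownInput is invented for this example and is not part of remarx, and real discovery would additionally require registering it in remarx.sentence.corpus.__init__ as described above.

from dataclasses import dataclass
from typing import Generator

from remarx.sentence.corpus import FileInput  # import path assumed

@dataclass
class MarkdownInput(FileInput):
    """Hypothetical input class for Markdown files."""

    file_type = ".md"

    def get_text(self) -> Generator[dict[str, str], None, None]:
        # no chunking: yield the whole file as a single unit of text
        yield {"text": self.input_file.read_text(encoding="utf-8")}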

tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.

text_input

Input class for handling basic text file as input for corpus creation.

quotation

This module contains libraries for embedding generation and quotation detection.

MODULE DESCRIPTION
embeddings

Library for generating sentence embeddings from pretrained Sentence Transformer models.

pairs

Library for finding sentence-level quote pairs.

embeddings

Library for generating sentence embeddings from pretrained Sentence Transformer models.

FUNCTION DESCRIPTION
get_sentence_embeddings

Extract embeddings for each sentence using the specified pretrained Sentence Transformers model.

get_sentence_embeddings

get_sentence_embeddings(
    sentences: list[str],
    model_name: str = "paraphrase-multilingual-mpnet-base-v2",
    show_progress_bar: bool = False,
) -> NDArray

Extract embeddings for each sentence using the specified pretrained Sentence Transformers model (default is paraphrase-multilingual-mpnet-base-v2). Returns a numpy array of the embeddings with shape [# sents, # dims].

PARAMETER DESCRIPTION
sentences

List of sentences to generate embeddings for

TYPE: list[str]

model_name

Name of the pretrained sentence transformer model to use (default: paraphrase-multilingual-mpnet-base-v2)

TYPE: str DEFAULT: 'paraphrase-multilingual-mpnet-base-v2'

RETURNS DESCRIPTION
NDArray

2-dimensional numpy array of normalized sentence embeddings with shape [# sents, # dims]
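
A small usage sketch; the embedding dimension depends on the chosen model.

from remarx.quotation.embeddings import get_sentence_embeddings  # import path assumed

embeddings = get_sentence_embeddings(["Ein Satz.", "Noch ein Satz."])
print(embeddings.shape)  # (2, n_dims); n_dims depends on the model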

pairs

Library for finding sentence-level quote pairs.

Note: Currently this script only supports one original corpus and one reuse corpus.

FUNCTION DESCRIPTION
build_vector_index

Builds an index for a given set of embeddings with the specified number of trees.

get_sentence_pairs

For a set of original and reuse sentences, identify original-reuse sentence pairs where quotation is likely.

load_sent_df

For a given sentence corpus, create a polars DataFrame suitable for finding sentence-level quote pairs.

compile_quote_pairs

Link sentence metadata to the detected sentence pairs from the given original and reuse sentence corpus dataframes.

find_quote_pairs

For a given original and reuse sentence corpus, finds the likely sentence-level quote pairs.

build_vector_index

build_vector_index(embeddings: NDArray) -> Index

Builds an index for a given set of embeddings with the specified number of trees.

get_sentence_pairs

get_sentence_pairs(
    original_sents: list[str],
    reuse_sents: list[str],
    score_cutoff: float,
    show_progress_bar: bool = False,
) -> DataFrame

For a set of original and reuse sentences, identify original-reuse sentence pairs where quotation is likely. Returns these sentence pairs as a polars DataFrame including for each pair:

  • original_index: the index of the original sentence
  • reuse_index: the index of the reuse sentence
  • match_score: the quality of the match

Likely quote pairs are identified through the sentences' embeddings. The Annoy library is used to find the nearest original sentence for each reuse sentence. Then likely quote pairs are determined by those sentence pairs with a match score (cosine similarity) above the specified cutoff. Optionally, the parameters for Annoy may be specified.
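
A minimal sketch of the in-memory API; the sentences and cutoff are illustrative.

from remarx.quotation.pairs import get_sentence_pairs  # import path assumed

pairs = get_sentence_pairs(
    original_sents=["Die Ware ist ein äußerer Gegenstand."],
    reuse_sents=["Die Ware ist, wie Marx schreibt, ein äußerer Gegenstand."],
    score_cutoff=0.225,
)
# pairs is a polars DataFrame with original_index, reuse_index, match_score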

load_sent_df

load_sent_df(sentence_corpus: Path, col_pfx: str = '') -> DataFrame

For a given sentence corpus, create a polars DataFrame suitable for finding sentence-level quote pairs. Optionally, a prefix can be added to all column names.

The resulting dataframe has the same fields as the input corpus except with:

  • a new field index corresponding to the row index
  • the sentence id field sent_id is renamed to id
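
For example, a sketch assuming the sentence corpus is stored as a CSV file (name illustrative) and that the prefix is applied to the new columns as well.

from pathlib import Path

from remarx.quotation.pairs import load_sent_df  # import path assumed

original_df = load_sent_df(Path("original_sentences.csv"), col_pfx="original_")
# with the prefix, columns include original_index and original_id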

compile_quote_pairs

compile_quote_pairs(
    original_corpus: DataFrame, reuse_corpus: DataFrame, detected_pairs: DataFrame
) -> DataFrame

Link sentence metadata to the detected sentence pairs from the given original and reuse sentence corpus dataframes to form quote pairs. The original and reuse corpus dataframes must contain a row index column named original_index and reuse_index respectively. Ideally, these dataframes should be built using load_sent_df.

Returns a dataframe with the following fields:

  • match_score: Estimated quality of the match
  • All other fields in order from the reuse corpus except its row index
  • All other fields in order from the original corpus except its row index

find_quote_pairs

find_quote_pairs(
    original_corpus: Path,
    reuse_corpus: Path,
    out_csv: Path,
    score_cutoff: float = 0.225,
    show_progress_bar: bool = False,
) -> None

For a given original and reuse sentence corpus, finds the likely sentence-level quote pairs. These quote pairs are saved as a CSV. Optionally, the required quality for quote pairs can be modified via score_cutoff.
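
A closing sketch of this end-to-end entry point; file names are illustrative, and the input corpora are assumed to be sentence corpus files as produced by the corpus builder.

from pathlib import Path

from remarx.quotation.pairs import find_quote_pairs  # import path assumed

find_quote_pairs(
    original_corpus=Path("marx_sentences.csv"),
    reuse_corpus=Path("newspaper_sentences.csv"),
    out_csv=Path("quote_pairs.csv"),
    score_cutoff=0.225,
)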