remarx
¶
remarx: Marx quote identification for the CDH Citing Marx project.
| MODULE | DESCRIPTION |
|---|---|
| `app` | This module contains the UI and utilities for the remarx app. |
| `quotation` | This module contains libraries for embedding generation and quotation detection. |
| `sentence` | This module contains libraries for sentence segmentation and sentence corpus construction. |
| `utils` | Utility functions for the remarx package. |
utils
¶
Utility functions for the remarx package
| CLASS | DESCRIPTION |
|---|---|
| `CorpusPath` | Paths for the default corpus directory structure. |

| FUNCTION | DESCRIPTION |
|---|---|
| `get_default_corpus_path` | Return default corpus directories and optionally create them if missing. |
| `get_default_quote_output_path` | Return the default quote finder output directory path and optionally create it if missing. |
| `configure_logging` | Configure logging for the remarx application. |
CorpusPath
dataclass
¶
Paths for the default corpus directory structure.
Populates unspecified directories based on the default data folder. Supports expansion of "~" or "~user" paths.
| METHOD | DESCRIPTION |
|---|---|
| `__post_init__` | Populate unset directories using the default data root, expanding "~" or "~user" values. |
| `ready` | Return True if both default corpus directories already exist. |
| `ensure_directories` | Create the corpus directories if they do not exist. |
__post_init__
¶
Populate unset directories using the default data root, expanding "~"
or "~user" values with pathlib.Path.expanduser() so shell-style root
paths are accepted. Callers can override any directory; otherwise the
root defaults to DEFAULT_CORPUS_ROOT under remarx-data and the
original and reuse directories live as its subfolders.
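The expansion behavior described above can be sketched with the standard library alone. This is a hypothetical stand-in (class and field names are illustrative, not the remarx implementation) showing how `Path.expanduser()` handles shell-style roots in `__post_init__`:

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Hypothetical stand-in for CorpusPath: expands "~" in user-supplied roots
# and derives the original/reuse subfolders when they are not overridden.
@dataclass
class DemoCorpusPath:
    root: Path = Path("~/remarx-data")  # stand-in for DEFAULT_CORPUS_ROOT
    original: Path | None = None
    reuse: Path | None = None

    def __post_init__(self) -> None:
        # expanduser() turns "~" or "~user" into an absolute home-relative path
        self.root = Path(self.root).expanduser()
        if self.original is None:
            self.original = self.root / "original"
        if self.reuse is None:
            self.reuse = self.root / "reuse"

paths = DemoCorpusPath()
print(paths.original)  # e.g. /home/you/remarx-data/original
```

Callers that pass explicit `original` or `reuse` paths bypass the derived defaults, matching the "callers can override any directory" behavior.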
ensure_directories
¶
Create the corpus directories if they do not exist.
get_default_corpus_path
¶
get_default_corpus_path(create: bool = False) -> tuple[bool, CorpusPath]
Return default corpus directories and optionally create them if missing.
get_default_quote_output_path
¶
Return the default quote finder output directory path and optionally create it if missing.
| PARAMETER | DESCRIPTION |
|---|---|
| `create` | If True, create the directory if it doesn't exist |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[bool, Path]` | Tuple of (ready flag, path to quote output directory) |
configure_logging
¶
configure_logging(
    log_destination: Path | TextIO | None = None, log_level: int = INFO
) -> Path | None
Configure logging for the remarx application. Supports logging to any text stream, a specified file, or auto-generated timestamped file.
| PARAMETER | DESCRIPTION |
|---|---|
| `log_destination` | Where to write logs. None (default): create a timestamped log file in the ./logs/ directory. pathlib.Path: write to the specified file path. Any text stream (e.g., sys.stdout, sys.stderr, or another io.TextIOBase): write to the given stream. TYPE: `Path \| TextIO \| None` |
| `log_level` | Logging level for the remarx logger (defaults to logging.INFO). TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `Path \| None` | Path to the created log file if file logging is used; None for stream logging |
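The destination dispatch described above can be sketched with the standard `logging` module. This is a hedged illustration of the same three cases (None, Path, stream); the helper name and logger name are placeholders, not remarx internals:

```python
import io
import logging
from datetime import datetime
from pathlib import Path

# Hypothetical sketch of a configure_logging-style helper: None means a
# timestamped file under ./logs/, a Path means that exact file, and a text
# stream gets a StreamHandler. Returns the log file path or None.
def demo_configure_logging(destination=None, level=logging.INFO):
    logger = logging.getLogger("demo_remarx")
    logger.setLevel(level)
    logger.handlers.clear()
    if destination is None:
        destination = Path("logs") / f"demo_{datetime.now():%Y%m%d_%H%M%S}.log"
    if isinstance(destination, Path):
        destination.parent.mkdir(parents=True, exist_ok=True)
        logger.addHandler(logging.FileHandler(destination))
        return destination  # file logging: report the path
    logger.addHandler(logging.StreamHandler(destination))
    return None  # stream logging: no file path to report

stream = io.StringIO()
demo_configure_logging(stream)
logging.getLogger("demo_remarx").info("hello")
```

The None-vs-Path-vs-stream return contract mirrors the documented `Path | None` return type.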
app
¶
This module contains the UI and utilities for the remarx app
| MODULE | DESCRIPTION |
|---|---|
| `corpus_builder` | The marimo notebook corresponding to the corpus builder. |
| `log_viewer` | Utilities for rendering remarx logs inside marimo notebooks. |
| `quote_finder` | The marimo notebook corresponding to the quote finder. |
| `utils` | Utility methods associated with the remarx app. |
utils
¶
Utility methods associated with the remarx app
| FUNCTION | DESCRIPTION |
|---|---|
| `lifespan` | Lifespan context manager to open the browser when the server starts. |
| `redirect_root` | Redirect the root path to corpus-builder, since the app currently has no home page. |
| `launch_app` | Launch the remarx app in the default web browser. |
| `get_current_log_file` | Get the path to the current log file from the root logger's file handler. |
| `create_header` | Create the header for the remarx notebooks. |
| `create_temp_input` | Context manager to create a temporary file with the contents and name of an uploaded file. |
| `summarize_corpus_selection` | Summarize a sentence corpus CSV selected in the UI. |
| `handle_default_corpus_creation` | Update default corpus directory state based on the create button. |
lifespan
async
¶
Lifespan context manager to open browser when server starts
redirect_root
async
¶
Redirect root path to corpus-builder, since app currently has no home page
get_current_log_file
¶
Get the path to the current log file from the root logger's file handler. Returns None if logging is not configured to a file.
create_temp_input
¶
Context manager to create a temporary file with the file contents and name of a file uploaded to a web browser as returned by marimo.ui.file. This should be used in with statements.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[Path, None, None]` | Yields the path to the temporary file |
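A minimal sketch of this pattern, assuming the uploaded file is available as a name plus raw bytes (as marimo.ui.file exposes); the helper name and signature here are illustrative, not the remarx API:

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Generator

# Hypothetical sketch of create_temp_input: write the uploaded contents to a
# temp directory under the original filename and yield the path; everything
# is cleaned up when the with-block exits.
@contextmanager
def demo_temp_input(name: str, contents: bytes) -> Generator[Path, None, None]:
    with tempfile.TemporaryDirectory() as tmpdir:
        # keep the original filename so extension-based input detection works
        path = Path(tmpdir) / name
        path.write_bytes(contents)
        yield path

with demo_temp_input("essay.txt", b"Ein Satz.") as p:
    print(p.name, p.read_bytes())
```

Preserving the uploaded filename matters because corpus input classes dispatch on the file extension.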
summarize_corpus_selection
¶
summarize_corpus_selection(
    selection: _HasPath | str | Path | None,
) -> dict[str, int | str] | None
Summarize a sentence corpus CSV selected in the UI.
Accepts either a file-browser selection (with a path attribute) or a
filesystem path and returns corpus statistics that can be displayed in the
Marimo table.
handle_default_corpus_creation
¶
handle_default_corpus_creation(
    button: run_button,
    default_dirs_initial: CorpusPath,
    *,
    ready_message: str = ":white_check_mark: Default corpus folders are ready.",
    missing_message: str = ":x: Default corpus folders were not found."
) -> tuple[bool, CorpusPath, str, str]
Update default corpus directory state based on the create button.
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[bool, CorpusPath, str, str]` | Tuple of (ready flag, directories object, status message, callout kind) |
sentence
¶
segment
¶
Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.
| FUNCTION | DESCRIPTION |
|---|---|
| `segment_text` | Segment a string of text into sentences with character indices. |
segment_text
¶
Segment a string of text into sentences with character indices.
Automatically downloads the spaCy model on first use if it is not installed.
| PARAMETER | DESCRIPTION |
|---|---|
| `text` | Input text to be segmented into sentences |
| `model` | spaCy model name; defaults to "de_core_news_sm" |

| RETURNS | DESCRIPTION |
|---|---|
| `list[tuple[int, str]]` | List of tuples where each tuple contains (start_char_index, sentence_text) |
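For illustration of the return shape only: remarx segments with a spaCy model, but a naive regex splitter shows what `(start_char_index, sentence_text)` tuples look like. The function below is a toy stand-in, not the remarx algorithm:

```python
import re

# Illustrative only: a crude splitter on sentence-final punctuation that
# returns (start_char_index, sentence_text) tuples like segment_text does.
def naive_segment_text(text: str) -> list[tuple[int, str]]:
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        sentence = match.group().strip()
        if sentence:
            # start index of the first non-space character of the sentence
            leading = len(match.group()) - len(match.group().lstrip())
            sentences.append((match.start() + leading, sentence))
    return sentences

print(naive_segment_text("Erster Satz. Zweiter Satz!"))
```

The character indices let downstream code (e.g. TEI line-number lookup) map each sentence back into the source text.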
corpus
¶
Functionality for loading and chunking input files for sentence corpus creation.
| MODULE | DESCRIPTION |
|---|---|
| `alto_input` | Functionality related to parsing ALTO XML content packaged within a zipfile. |
| `base_input` | Base file input class with common functionality. Provides a factory method for initializing known input classes. |
| `create` | Preliminary script and method to create sentence corpora from input files. |
| `tei_input` | Functionality related to parsing MEGA TEI/XML content. |
| `text_input` | Input class for handling basic text files as input for corpus creation. |
| CLASS | DESCRIPTION |
|---|---|
| `ALTOInput` | FileInput implementation for ALTO XML delivered as a zipfile. |
| `FileInput` | Base class for file input for sentence corpus creation. |
| `TEIDocument` | Custom `neuxml.xmlmap.XmlObject` instance for TEI XML documents. |
| `TEIinput` | Input class for TEI/XML content. Takes a single input file and yields text content by page. |
| `TextInput` | Basic text file input handling for sentence corpus creation. Takes a single text input file. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `TEI_TAG` | Convenience access to namespaced TEI tag names |
ALTOInput
dataclass
¶
Bases: FileInput
FileInput implementation for ALTO XML delivered as a zipfile. Iterates through ALTO XML members and yields text blocks with ALTO metadata.
| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Iterate over ALTO XML files contained in the zipfile and return a generator of text content. |
| `update_current_metadata` | Update current article metadata. |
| `check_zipfile_path` | Check an individual file included in the zip archive to determine if parsing should be attempted. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `field_names` | List of field names for sentences originating from ALTO XML content. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension for ALTO zipfiles (.zip) |
| `default_include` | Default content sections to include |
| `filter_sections` | Whether to filter text sections by block type |
field_names
class-attribute
¶
field_names: tuple[str, ...] = (
    *(field_names),
    "section_type",
    "title",
    "author",
    "page_number",
    "page_file",
)
List of field names for sentences originating from ALTO XML content.
file_type
class-attribute
¶
Supported file extension for ALTO zipfiles (.zip)
default_include
class-attribute
¶
Default content sections to include
filter_sections
class-attribute
instance-attribute
¶
Whether to filter text sections by block type
get_text
¶
Iterate over ALTO XML files contained in the zipfile and return a generator of text content.
update_current_metadata
¶
Update current article metadata.
check_zipfile_path
¶
Check an individual file included in the zip archive to determine if parsing should be attempted and if it is a valid ALTO XML file. Returns AltoDocument if valid, otherwise None.
FileInput
dataclass
¶
Base class for file input for sentence corpus creation
| METHOD | DESCRIPTION |
|---|---|
| `include_sentence` | Return True if a sentence should be included in the corpus. |
| `get_text` | Get plain-text contents for this input file with any desired chunking. |
| `get_extra_metadata` | Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number). |
| `get_sentences` | Get sentences for this file, with associated metadata. |
| `subclasses` | List of available file input classes. |
| `subclass_by_type` | Dictionary of subclass by supported file extension for available input classes. |
| `supported_types` | Unique list of supported file extensions for available input classes. |
| `create` | Instantiate and return the appropriate input class for the specified input file. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `input_file` | Reference to input file. Source of content for sentences. |
| `filename_override` | Optional filename override, e.g. when using temporary files as input |
| `field_names` | List of field names for sentences from text input files. |
| `file_type` | Supported file extension; subclasses must define |
| `file_name` | Input file name. Associated with sentences in generated corpus. |
input_file
instance-attribute
¶
Reference to input file. Source of content for sentences.
filename_override
class-attribute
instance-attribute
¶
Optional filename override, e.g. when using temporary files as input
field_names
class-attribute
¶
List of field names for sentences from text input files.
file_name
cached
property
¶
Input file name. Associated with sentences in generated corpus.
include_sentence
¶
Return True if a sentence should be included in the corpus.
Drops sentences that are:
- Punctuation/digits-only (or the letter 'p' alone, e.g. p.)
- Fewer than min_words tokens (whitespace split)
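The two drop rules above can be sketched in plain Python. This is an assumption-laden illustration (the `min_words` default and the exact character checks are placeholders, not the remarx implementation):

```python
# Sketch of the include_sentence filtering rules; MIN_WORDS is a placeholder.
MIN_WORDS = 2

def demo_include_sentence(sentence: str, min_words: int = MIN_WORDS) -> bool:
    stripped = sentence.strip()
    # drop sentences that are punctuation/digits-only, or the letter "p"
    # alone (page references like "p. 42")
    letters = [c for c in stripped if c.isalpha()]
    if not letters or letters == ["p"]:
        return False
    # drop sentences with fewer than min_words whitespace-separated tokens
    return len(stripped.split()) >= min_words

print(demo_include_sentence("p. 42"))  # page reference: excluded
print(demo_include_sentence("Das Kapital erschien 1867."))
```

Filtering at corpus-build time keeps page numbers and stray OCR fragments from polluting the embedding index later.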
get_text
¶
Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). Subclasses must implement; no default implementation.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
get_extra_metadata
¶
Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).
| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary of additional metadata fields to include, or an empty dict |
get_sentences
¶
Get sentences for this file, with associated metadata.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, Any]]` | Generator of one dictionary per sentence, including text and associated metadata fields. |
subclass_by_type
classmethod
¶
Dictionary of subclass by supported file extension for available input classes.
supported_types
classmethod
¶
Unique list of supported file extensions for available input classes.
create
classmethod
¶
Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | if input_file is not a supported type |
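The factory-by-extension pattern described for `create` can be sketched with a small class hierarchy. Class names echo the docs, but the bodies are placeholders, not the remarx code:

```python
from __future__ import annotations

from pathlib import Path

# Minimal sketch of the FileInput factory: map supported file extensions to
# subclasses and dispatch on the input file's suffix.
class DemoFileInput:
    file_type = ""

    def __init__(self, input_file: Path, filename_override: str | None = None):
        self.input_file = input_file
        self.filename_override = filename_override

    @classmethod
    def subclass_by_type(cls) -> dict[str, type[DemoFileInput]]:
        # registry built from subclasses, keyed by supported extension
        return {sub.file_type: sub for sub in cls.__subclasses__()}

    @classmethod
    def supported_types(cls) -> list[str]:
        return sorted(cls.subclass_by_type())

    @classmethod
    def create(cls, input_file: Path, filename_override=None) -> DemoFileInput:
        subclass = cls.subclass_by_type().get(input_file.suffix)
        if subclass is None:
            raise ValueError(f"{input_file} is not a supported type")
        return subclass(input_file, filename_override)

class DemoTextInput(DemoFileInput):
    file_type = ".txt"

class DemoTEIInput(DemoFileInput):
    file_type = ".xml"

print(type(DemoFileInput.create(Path("essay.txt"))).__name__)
```

This mirrors why the docs say input classes must be imported in `remarx.sentence.corpus.__init__`: a subclass that is never imported never appears in the registry.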
TEI_TAG
module-attribute
¶
Convenience access to namespaced TEI tag names
TEIDocument
¶
Bases: BaseTEIXmlObject
Custom `neuxml.xmlmap.XmlObject` instance for a TEI XML document.
Customized for MEGA TEI XML.
| METHOD | DESCRIPTION |
|---|---|
| `init_from_file` | Class method to initialize a new TEIDocument from a file. |
init_from_file
classmethod
¶
Class method to initialize a new `TEIDocument` from a file.
TEIinput
dataclass
¶
Bases: FileInput
Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.
| METHOD | DESCRIPTION |
|---|---|
| `__post_init__` | After default initialization, parse the input file as a TEIDocument. |
| `get_text` | Get document content as plain text; the content is yielded in segments. |
| `get_line_number` | Return the TEI line number for the specified text index and character index. |
| `get_extra_metadata` | Calculate extra metadata, including line number, for a sentence in TEI documents. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `xml_doc` | Parsed XML document; initialized from inherited input_file. TYPE: `TEIDocument` |
| `field_names` | List of field names for sentences from TEI XML input files. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension for TEI/XML input |
xml_doc
class-attribute
instance-attribute
¶
xml_doc: TEIDocument = field(init=False)
Parsed XML document; initialized from inherited input_file
field_names
class-attribute
¶
field_names: tuple[str, ...] = (
    *(field_names),
    "page_number",
    "section_type",
    "line_number",
)
List of field names for sentences from TEI XML input files
file_type
class-attribute
instance-attribute
¶
Supported file extension for TEI/XML input
__post_init__
¶
After default initialization, parse the input file as a TEIDocument and store it as xml_doc.
get_text
¶
Get document content as plain text. The document's content is yielded in segments, with each segment corresponding to a dictionary containing its text content, page number, and section type ("text" or "footnote"). Body text is yielded once per page, while each footnote is yielded individually.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with dictionaries of text content with page number and section_type ("text" or "footnote"). |
get_line_number
¶
Return the TEI line number for the specified text index and
character index. Returns the line number at or before char_index;
line number offsets must be populated by get_text().
Returns None if line number cannot be determined.
get_extra_metadata
¶
Calculate extra metadata including line number for a sentence in TEI documents based on the character position within the text chunk (page body or footnote).
| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary with line_number for the sentence, or an empty dict |
TextInput
dataclass
¶
Bases: FileInput
Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.
| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Get plain-text contents for this file with any desired chunking. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `file_type` | Supported file extension for text input |
file_type
class-attribute
instance-attribute
¶
Supported file extension for text input
get_text
¶
Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). Default implementation does no chunking, no additional metadata.
| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
base_input
¶
Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.
To initialize the appropriate subclass for a supported file type, use FileInput.create().
For a list of supported file types across all registered input classes, use FileInput.supported_types().
Subclasses must define a supported file_type extension and implement
the get_text method. For discovery, input classes must be imported in
remarx.sentence.corpus.__init__ and included in __all__ to ensure
they are found as available input classes.
tei_input
¶
Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.
text_input
¶
Input class for handling a basic text file as input for corpus creation.
quotation
¶
This module contains libraries for embedding generation and quotation detection.
| MODULE | DESCRIPTION |
|---|---|
| `consolidate` | Functionality for consolidating sequential quotes into passages. |
| `embeddings` | Library for generating sentence embeddings from pretrained Sentence Transformer models. |
| `find_quotes` | Command-line script to identify sentence-level quotation pairs between corpora. |
| `pairs` | Library for finding sentence-level quote pairs. |
embeddings
¶
Library for generating sentence embeddings from pretrained Sentence Transformer models.
| FUNCTION | DESCRIPTION |
|---|---|
| `get_cached_embeddings` | Get sentence embeddings, with file caching based on source file. |
| `get_sentence_embeddings` | Extract embeddings for each sentence using the specified pretrained Sentence Transformers model. |
get_cached_embeddings
¶
get_cached_embeddings(
    source_file: Path,
    sentences: list[str],
    model_name: str = DEFAULT_MODEL,
    show_progress_bar: bool = False,
) -> tuple[NDArray, bool]
Get sentence embeddings, with file caching based on source file.
Returns a tuple of embeddings array and a boolean indicating whether the data was loaded from cache.
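The caching idea can be sketched under stated assumptions: cache computed vectors next to the source file, keyed on a digest of the sentences, and reuse them when nothing changed. remarx's actual cache format and invalidation strategy may differ; the embedding function here is a deliberately fake stand-in:

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of file-based embedding caching: recompute only when
# the sentence list's digest no longer matches the cached one.
def demo_cached_embeddings(source_file: Path, sentences, compute):
    digest = hashlib.sha256("\n".join(sentences).encode()).hexdigest()
    cache_file = source_file.with_suffix(".embeddings.json")
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if cached["digest"] == digest:
            return cached["vectors"], True  # loaded from cache
    vectors = [compute(s) for s in sentences]
    cache_file.write_text(json.dumps({"digest": digest, "vectors": vectors}))
    return vectors, False  # freshly computed

# toy "embedding": character count and word count stand in for model output
def fake_embed(s):
    return [len(s), len(s.split())]

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "corpus.csv"
    src.write_text("...")
    _, from_cache = demo_cached_embeddings(src, ["a b", "c"], fake_embed)
    print(from_cache)
    _, from_cache = demo_cached_embeddings(src, ["a b", "c"], fake_embed)
    print(from_cache)
```

The boolean in the returned tuple mirrors the documented cache-hit flag.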
get_sentence_embeddings
¶
get_sentence_embeddings(
    sentences: list[str],
    model_name: str = DEFAULT_MODEL,
    show_progress_bar: bool = False,
) -> NDArray
Extract embeddings for each sentence using the specified pretrained Sentence Transformers model (default is paraphrase-multilingual-mpnet-base-v2). Returns a numpy array of the embeddings with shape [# sents, # dims].
| PARAMETER | DESCRIPTION |
|---|---|
| `sentences` | List of sentences to generate embeddings for. TYPE: `list[str]` |
| `model_name` | Name of the pretrained sentence transformer model to use (default: paraphrase-multilingual-mpnet-base-v2). TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `NDArray` | 2-dimensional numpy array of normalized sentence embeddings with shape [# sents, # dims] |
consolidate
¶
Functionality for consolidating sequential quotes into passages.
| FUNCTION | DESCRIPTION |
|---|---|
identify_sequences |
Given a polars dataframe, identify and label rows that are sequential |
consolidate_quotes |
Consolidate quotes that are sequential in both original and reuse texts. |
identify_sequences
¶
Given a polars dataframe, identify and label rows that are sequential for the specified field, within the specified group field. Returns a modified dataframe with the following columns, prefixed by the field name:
- `_sequential`: boolean indicating whether a row is in a sequence
- `_group`: group identifier; uses the field value for the first row in a sequence
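The run-labeling logic can be illustrated in pure Python (the real function operates on a polars DataFrame): a row is in a sequence when its index value is adjacent to a neighbor's, and each row carries the first value of its run as the group identifier. The function below is a sketch, not the remarx implementation:

```python
# Sketch of identify_sequences on a plain list of index values: returns
# (in_sequence, group_id) per row, where group_id is the first value of
# the row's run of consecutive values.
def demo_identify_sequences(indices: list[int]) -> list[tuple[bool, int]]:
    labeled = []
    group_start = None
    for pos, value in enumerate(indices):
        prev_adjacent = pos > 0 and indices[pos - 1] == value - 1
        next_adjacent = pos + 1 < len(indices) and indices[pos + 1] == value + 1
        in_sequence = prev_adjacent or next_adjacent
        # start a new group unless we continue the previous run
        if not (prev_adjacent and group_start is not None):
            group_start = value
        labeled.append((in_sequence, group_start))
    return labeled

print(demo_identify_sequences([3, 4, 5, 9]))
```

Rows sharing a group identifier are the candidates that `consolidate_quotes` later merges into a single passage.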
consolidate_quotes
¶
Consolidate quotes that are sequential in both original and reuse texts. Required fields:
- `reuse_sent_index` and `original_sent_index` must be present for aggregation, and must be numeric
- `reuse_file` and `original_file` must be present to ensure aggregation only happens for sequences within specific input files

If required fields are not present, raises polars.exceptions.ColumnNotFoundError. Raises ValueError when called on an empty dataframe.
Consolidation only occurs when:
- Sentences are sequential in both reuse and original corpora
- All sentences within a sequence belong to a single reuse corpus and original corpus (seemingly sequential sentences that span multiple files are not consolidated)

The DataFrame is expected to include standard quote pair fields; for consolidated quotes, fields are aggregated as follows:
- `match_score`: average across the group
- `id` and `sent_index` (both `reuse` and `original`): first value in the group
- `reuse_text` and `original_text`: combined with a whitespace delimiter
- all other fields: unique values combined, delimited by semicolon and space

The returned DataFrame includes a new column `num_sentences`, which documents the number of sentences in a group (1 for unconsolidated quotes).
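The per-group aggregation rules can be sketched on plain dicts (the real function uses polars expressions). Field names follow the docs; the `page` field stands in for "all other fields":

```python
# Sketch of consolidating one group of sequential quote rows per the rules
# above: average score, first id, whitespace-joined text, unique other values.
def demo_consolidate_group(rows: list[dict]) -> dict:
    out = {
        "match_score": sum(r["match_score"] for r in rows) / len(rows),
        "reuse_id": rows[0]["reuse_id"],  # first value in group
        "reuse_text": " ".join(r["reuse_text"] for r in rows),
        "num_sentences": len(rows),
    }
    # any other field: unique values joined by "; ", preserving order
    out["page"] = "; ".join(dict.fromkeys(str(r["page"]) for r in rows))
    return out

group = [
    {"match_score": 0.8, "reuse_id": "s1", "reuse_text": "Der Wert", "page": 12},
    {"match_score": 0.6, "reuse_id": "s2", "reuse_text": "der Ware.", "page": 12},
]
print(demo_consolidate_group(group))
```

Joining text with whitespace reconstructs the multi-sentence passage, while `num_sentences` records how many rows were merged.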
pairs
¶
Library for finding sentence-level quote pairs.
Note: Currently this script only supports one original and reuse corpus.
| FUNCTION | DESCRIPTION |
|---|---|
| `build_vector_index` | Builds an index for a given set of embeddings with the specified number of trees. |
| `get_sentence_pairs` | Given an array of original and reuse sentence embeddings, identify pairs with high similarity. |
| `load_sent_corpus` | Takes a sentence corpus file and loads it into a polars DataFrame, with sentence embeddings. |
| `compile_quote_pairs` | Combine sentence metadata from original and reuse corpora with detected sentence pair identifiers. |
| `find_quote_pairs` | For a set of original sentence corpora and one reuse sentence corpus, finds likely quote pairs. |
build_vector_index
¶
Builds an index for a given set of embeddings with the specified number of trees.
get_sentence_pairs
¶
get_sentence_pairs(
    original_vecs: NDArray,
    reuse_vecs: NDArray,
    score_cutoff: float,
    show_progress_bar: bool = False,
) -> DataFrame
Given arrays of original and reuse sentence embeddings, identify original-reuse pairs with high similarity (i.e., likely quotation). Returns a polars DataFrame of sentence pairs with the following information:
- `original_index`: the index of the original sentence
- `reuse_index`: the index of the reuse sentence
- `match_score`: the quality of the match

Uses embeddings and a vector index to find the nearest original sentence for each reuse sentence. Sentence pairs are filtered to those with a match score (cosine similarity) above the specified cutoff.
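The matching step can be sketched in pure Python, replacing the vector index with a brute-force nearest-neighbor scan (remarx uses an index for speed; this is a conceptual sketch, not the remarx code):

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Sketch of get_sentence_pairs: for each reuse vector, take the nearest
# original vector by cosine similarity and keep pairs above the cutoff.
def demo_sentence_pairs(original_vecs, reuse_vecs, score_cutoff):
    pairs = []
    for reuse_index, rv in enumerate(reuse_vecs):
        scores = [cosine(ov, rv) for ov in original_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= score_cutoff:
            pairs.append({"original_index": best,
                          "reuse_index": reuse_index,
                          "match_score": scores[best]})
    return pairs

originals = [[1.0, 0.0], [0.0, 1.0]]
reuses = [[0.9, 0.1], [-1.0, 0.0]]
print(demo_sentence_pairs(originals, reuses, score_cutoff=0.5))
```

Because each reuse sentence is matched to at most one original sentence, the output has at most one row per reuse index.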
load_sent_corpus
¶
load_sent_corpus(
    sentence_corpus: Path, col_pfx: str | None = None, show_progress_bar: bool = False
) -> tuple[DataFrame, NDArray]
Takes a sentence corpus file, loads it into a polars DataFrame, and generates sentence embeddings for the text of each sentence in the corpus. Optionally supports adding a prefix to all column names in the DataFrame.
The resulting dataframe has the same fields as the input corpus, with the following adjustments:
- a new field `index` corresponding to the row index
- the sentence id field `sent_id` is renamed to `id`
- all field names are prefixed if a column prefix is specified

Returns a tuple of DataFrame and numpy array with embedding vectors.
compile_quote_pairs
¶
compile_quote_pairs(
    original_corpus: DataFrame, reuse_corpus: DataFrame, detected_pairs: DataFrame
) -> DataFrame
Combine sentence metadata from original and reuse corpora with detected
sentence pair identifiers to form quote pairs. The original and reuse
corpus dataframes must contain a row index column named original_index and
reuse_index respectively. Ideally, these dataframes should be built using
load_sent_corpus.
Returns a dataframe with the following fields:
- `match_score`: estimated quality of the match
- All other fields in order from the reuse corpus, except the row index
- All other fields in order from the original corpus, except the row index
find_quote_pairs
¶
find_quote_pairs(
    original_corpus: list[Path],
    reuse_corpus: Path,
    output_path: Path,
    score_cutoff: float = 0.225,
    consolidate: bool = True,
    show_progress_bar: bool = False,
    benchmark: bool = False,
) -> None
For a set of original sentence corpora and one reuse sentence corpus, finds the likely sentence-level quote pairs, which are saved as a CSV file.
Optional parameters allow configuring the score_cutoff threshold for including quote pairs, and consolidation of consecutive sentences (on by default).
When benchmark is enabled, summary information is logged to report
on corpus size and timings to generate embeddings and search for pairs.