# remarx

Marx quote identification for the CDH Citing Marx project.

| MODULE | DESCRIPTION |
|---|---|
| `app` | This module contains the UI and utilities for the remarx app. |
| `quotation` | This module contains libraries for embedding generation and quotation detection. |
| `sentence` | This module contains libraries for sentence segmentation and sentence corpus construction. |
| `utils` | Utility functions for the remarx package. |
## utils

Utility functions for the remarx package.

| FUNCTION | DESCRIPTION |
|---|---|
| `configure_logging` | Configure logging for the remarx application. |
### configure_logging

```python
configure_logging(
    log_destination: Path | TextIO | None = None,
    log_level: int = INFO,
    stanza_log_level: int = ERROR,
) -> Path | None
```

Configure logging for the remarx application. Supports logging to any text stream, a specified file, or an auto-generated timestamped file.
| PARAMETER | DESCRIPTION |
|---|---|
| `log_destination` | Where to write logs. Can be: `None` (default), which creates a timestamped log file in the `./logs/` directory; a `pathlib.Path`, which writes to the specified file path; or any `io.TextIOBase` (e.g. `sys.stdout` or `sys.stderr`), which writes to the given stream. TYPE: `Path \| TextIO \| None` |
| `log_level` | Logging level for the remarx logger (defaults to `logging.INFO`). TYPE: `int` |
| `stanza_log_level` | Logging level for the stanza logger (defaults to `logging.ERROR`). TYPE: `int` |
| RETURNS | DESCRIPTION |
|---|---|
| `Path \| None` | Path to the created log file if file logging is used; `None` if logging to a stream. |
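
A minimal usage sketch (the `remarx.utils` import path is inferred from this page's module layout):

```python
import sys
from pathlib import Path

from remarx.utils import configure_logging  # import path assumed from this page

# Default: create a timestamped log file under ./logs/ and return its path
log_path = configure_logging()

# Log to a specific file instead
configure_logging(log_destination=Path("remarx.log"))

# Log to a stream; no file is created, so the return value is None
configure_logging(log_destination=sys.stderr)
```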
          
## app

This module contains the UI and utilities for the remarx app.

| MODULE | DESCRIPTION |
|---|---|
| `corpus_builder` | The marimo notebook corresponding to the corpus builder. |
| `quote_finder` | The marimo notebook corresponding to the quote finder. |
| `utils` | Utility methods associated with the remarx app. |
### utils

Utility methods associated with the remarx app.

| FUNCTION | DESCRIPTION |
|---|---|
| `lifespan` | Lifespan context manager to open browser when the server starts. |
| `redirect_root` | Redirect root path to corpus-builder, since the app currently has no home page. |
| `launch_app` | Launch the remarx app in the default web browser. |
| `get_current_log_file` | Get the path to the current log file from the root logger's file handler. |
| `create_header` | Create the header for the remarx notebooks. |
| `create_temp_input` | Context manager to create a temporary file with the contents and name of an uploaded file. |
#### lifespan (async)

Lifespan context manager to open browser when the server starts.

#### redirect_root (async)

Redirect root path to corpus-builder, since the app currently has no home page.

#### get_current_log_file

Get the path to the current log file from the root logger's file handler. Returns `None` if logging is not configured to a file.
#### create_temp_input

Context manager to create a temporary file with the file contents and name of a file uploaded to a web browser, as returned by `marimo.ui.file`. This should be used in `with` statements.

| RETURNS | DESCRIPTION |
|---|---|
| `Generator[Path, None, None]` | Yields the path to the temporary file. |
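
A minimal usage sketch inside a marimo notebook (the import path and the exact shape of the upload value are assumptions for illustration):

```python
import marimo as mo

from remarx.app.utils import create_temp_input  # import path assumed

file_upload = mo.ui.file()  # file-upload widget rendered in the notebook

# After the user uploads a file, materialize it as a temporary file on disk
with create_temp_input(file_upload.value[0]) as tmp_path:
    contents = tmp_path.read_text()
# the temporary file is cleaned up when the with block exits
```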
          
## sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

### segment

Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.

| FUNCTION | DESCRIPTION |
|---|---|
| `segment_text` | Segment a string of text into sentences with character indices. |
#### segment_text

Segment a string of text into sentences with character indices.

| PARAMETER | DESCRIPTION |
|---|---|
| `text` | Input text to be segmented into sentences. TYPE: `str` |
| `language` | Language code for the Stanza pipeline. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[tuple[int, str]]` | List of tuples where each tuple contains `(start_char_index, sentence_text)`. |
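
A minimal usage sketch (import path inferred from this page; the language code shown is illustrative):

```python
from remarx.sentence.segment import segment_text  # import path assumed

for start_idx, sent in segment_text("Erster Satz. Zweiter Satz.", language="de"):
    print(start_idx, sent)
# 0 Erster Satz.
# 13 Zweiter Satz.
```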
          
### corpus

Functionality for loading and chunking input files for sentence corpus creation.

| MODULE | DESCRIPTION |
|---|---|
| `alto_input` | Functionality related to parsing ALTO XML content packaged within a zipfile. |
| `base_input` | Base file input class with common functionality. Provides a factory method for initialization of known input classes. |
| `create` | Preliminary script and method to create sentence corpora from input files. |
| `tei_input` | Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora. |
| `text_input` | Input class for handling basic text file as input for corpus creation. |

| CLASS | DESCRIPTION |
|---|---|
| `ALTOInput` | Preliminary FileInput implementation for ALTO XML delivered as a zipfile. |
| `FileInput` | Base class for file input for sentence corpus creation. |
| `TEIDocument` | Custom :class:`neuxml.xmlmap.XmlObject` instance for a TEI XML document. |
| `TEIinput` | Input class for TEI/XML content. Takes a single input file and yields text content by page. |
| `TEIPage` | Custom :class:`neuxml.xmlmap.XmlObject` instance for a page of content within a TEI XML document. |
| `TextInput` | Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `TEI_TAG` | Convenience access to namespaced TEI tag names. |
#### ALTOInput (dataclass)

Bases: `FileInput`

Preliminary FileInput implementation for ALTO XML delivered as a zipfile. Iterates through ALTO XML members and stubs out chunk yielding for future parsing.

| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Iterate over ALTO XML files contained in the zipfile and return a generator of text content. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `field_names` | List of field names for sentences originating from ALTO XML content. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension for ALTO zipfiles (`.zip`). |

##### field_names (class-attribute)

```python
field_names: tuple[str, ...] = (*(field_names), "section_type")
```

List of field names for sentences originating from ALTO XML content.

##### file_type (class-attribute)

Supported file extension for ALTO zipfiles (`.zip`).

##### get_text

Iterate over ALTO XML files contained in the zipfile and return a generator of text content.
#### FileInput (dataclass)

Base class for file input for sentence corpus creation.

| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Get plain-text contents for this input file with any desired chunking. |
| `get_extra_metadata` | Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number). |
| `get_sentences` | Get sentences for this file, with associated metadata. |
| `subclasses` | List of available file input classes. |
| `subclass_by_type` | Dictionary of subclass by supported file extension for available input classes. |
| `supported_types` | Unique list of supported file extensions for available input classes. |
| `create` | Instantiate and return the appropriate input class for the specified input file. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `input_file` | Reference to input file. Source of content for sentences. |
| `filename_override` | Optional filename override, e.g. when using temporary files as input. |
| `field_names` | List of field names for sentences from text input files. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension; subclasses must define. |
| `file_name` | Input file name. Associated with sentences in generated corpus. |
##### input_file (instance-attribute)

Reference to input file. Source of content for sentences.

##### filename_override (class-attribute, instance-attribute)

Optional filename override, e.g. when using temporary files as input.

##### field_names (class-attribute)

List of field names for sentences from text input files.

##### file_name (cached property)

Input file name. Associated with sentences in generated corpus.

##### get_text

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). Subclasses must implement; no default implementation.

| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
##### get_extra_metadata

Hook method for subclasses to override to provide extra metadata for a sentence (e.g. line number).

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary of additional metadata fields to include, or an empty dict. |

##### get_sentences

Get sentences for this file, with associated metadata.

| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, Any]]` | Generator of one dictionary per sentence, including the sentence text and associated metadata. |
##### subclass_by_type (classmethod)

Dictionary of subclass by supported file extension for available input classes.

##### supported_types (classmethod)

Unique list of supported file extensions for available input classes.

##### create (classmethod)

Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If `input_file` is not a supported type. |
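
A minimal sketch of the factory workflow (import path inferred from this page; the filename is illustrative):

```python
from pathlib import Path

from remarx.sentence.corpus import FileInput  # import path assumed

print(FileInput.supported_types())  # unique list of supported extensions

# create() picks the input subclass matching the file extension and
# raises ValueError for unsupported types
text_input = FileInput.create(Path("marx_excerpt.txt"))

for sentence in text_input.get_sentences():
    print(sentence)  # one dict per sentence, with metadata fields
```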
          
#### TEI_TAG (module-attribute)

Convenience access to namespaced TEI tag names.
#### TEIDocument

Bases: `BaseTEIXmlObject`

Custom :class:`neuxml.xmlmap.XmlObject` instance for a TEI XML document. Customized for MEGA TEI XML.

| METHOD | DESCRIPTION |
|---|---|
| `init_from_file` | Class method to initialize a new :class:`TEIDocument` from a file. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `all_pages` | List of page objects, identified by page begin tag (`pb`). Includes all pages (standard and manuscript edition). |
| `pages` | Standard pages for this document. TYPE: `list[TEIPage]` |
| `pages_by_number` | Dictionary lookup of standard pages by page number. TYPE: `dict[str, TEIPage]` |

##### all_pages (class-attribute, instance-attribute)

```python
all_pages = NodeListField('//t:text//t:pb', TEIPage)
```

List of page objects, identified by page begin tag (`pb`). Includes all pages (standard and manuscript edition), because the XPath is significantly faster without filtering.

##### pages (cached property)

```python
pages: list[TEIPage]
```

Standard pages for this document. Returns a list of TEIPage objects for this document, omitting any pages marked as manuscript edition.

##### pages_by_number (cached property)

```python
pages_by_number: dict[str, TEIPage]
```

Dictionary lookup of standard pages by page number.

##### init_from_file (classmethod)

Class method to initialize a new :class:`TEIDocument` from a file.
#### TEIinput (dataclass)

Bases: `FileInput`

Input class for TEI/XML content. Takes a single input file and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.

| METHOD | DESCRIPTION |
|---|---|
| `__post_init__` | After default initialization, parse the input file as a TEIDocument. |
| `get_text` | Get document content as plain text, yielded in segments. |
| `get_extra_metadata` | Calculate extra metadata, including line number, for a sentence in TEI documents. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `xml_doc` | Parsed XML document; initialized from inherited `input_file`. TYPE: `TEIDocument` |
| `field_names` | List of field names for sentences from TEI XML input files. TYPE: `tuple[str, ...]` |
| `file_type` | Supported file extension for TEI/XML input. |

##### xml_doc (class-attribute, instance-attribute)

```python
xml_doc: TEIDocument = field(init=False)
```

Parsed XML document; initialized from inherited `input_file`.

##### field_names (class-attribute)

```python
field_names: tuple[str, ...] = (
    *(field_names),
    "page_number",
    "section_type",
    "line_number",
)
```

List of field names for sentences from TEI XML input files.

##### file_type (class-attribute, instance-attribute)

Supported file extension for TEI/XML input.

##### `__post_init__`

After default initialization, parse the input file as a TEIDocument and store it as `xml_doc`.

##### get_text

Get document content as plain text. The document's content is yielded in segments, with each segment corresponding to a dictionary containing its text content, page number, and section type ("text" or "footnote"). Body text is yielded once per page, while each footnote is yielded individually.

| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with dictionaries of text content, with page number and section_type ("text" or "footnote"). |
##### get_extra_metadata

Calculate extra metadata, including line number, for a sentence in TEI documents based on the character position within the text chunk (page body or footnote).

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary with `line_number` for the sentence (`None` if not found). |
#### TEIPage

Bases: `BaseTEIXmlObject`

Custom :class:`neuxml.xmlmap.XmlObject` instance for a page of content within a TEI XML document.

| METHOD | DESCRIPTION |
|---|---|
| `is_footnote_content` | Helper function that checks if an element or any of its ancestors is footnote content. |
| `get_page_footnotes` | Filters footnotes to keep only the footnotes that belong to this page. |
| `get_body_text_line_number` | Return the TEI line number for the line at or before `char_pos`. |
| `find_preceding_lb` | Find the closest preceding `<lb/>` element for an element. |
| `get_body_text` | Extract body text content for this page, excluding footnotes and editorial content. |
| `get_footnote_text` | Get all footnote content as a single string, with footnotes separated by double newlines. |
| `__str__` | Page text contents as a string, with body text and footnotes. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `number` | Page number. |
| `edition` | Page edition, if any. |
| `text_nodes` | List of all text nodes following this tag. |
| `following_footnotes` | List of footnote elements within this page and following pages. |
| `next_page` | The next standard page break after this one, or `None` if this is the last page. |

##### text_nodes (class-attribute, instance-attribute)

List of all text nodes following this tag.

##### following_footnotes (class-attribute, instance-attribute)

List of footnote elements within this page and following pages.

##### next_page (class-attribute, instance-attribute)

The next standard page break after this one, or `None` if this is the last page.
##### is_footnote_content (staticmethod)

Helper function that checks if an element or any of its ancestors is footnote content.

##### get_page_footnotes

Filters footnotes to keep only the footnotes that belong to this page. Only includes footnotes that occur between this `pb` and the next standard `pb[not(@ed)]`.

##### get_body_text_line_number

Return the TEI line number for the line at or before `char_pos`. Returns `None` if no line number can be determined.

##### find_preceding_lb (staticmethod)

Find the closest preceding `<lb/>` element for an element. Needed to find the `<lb/>` relative to immediately following inline markup, e.g. a `<lb/>` immediately followed by text inside an inline tag.

##### get_body_text

Extract body text content for this page, excluding footnotes and editorial content. While collecting the text, build a mapping of character offsets to TEI line numbers.

##### get_footnote_text

Get all footnote content as a single string, with footnotes separated by double newlines.
#### TextInput (dataclass)

Bases: `FileInput`

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

| METHOD | DESCRIPTION |
|---|---|
| `get_text` | Get plain-text contents for this file with any desired chunking. |

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `file_type` | Supported file extension for text input. |

##### file_type (class-attribute, instance-attribute)

Supported file extension for text input.

##### get_text

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). The default implementation does no chunking and adds no additional metadata.

| RETURNS | DESCRIPTION |
|---|---|
| `Generator[dict[str, str]]` | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
#### base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

To initialize the appropriate subclass for a supported file type, use `FileInput.create()`. For a list of supported file types across all registered input classes, use `FileInput.supported_types()`.

Subclasses must define a supported `file_type` extension and implement the `get_text` method. For discovery, input classes must be imported in `remarx.sentence.corpus.__init__` and included in `__all__` to ensure they are found as available input classes, as the sketch below illustrates.
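
To illustrate the subclass contract, a hypothetical input class might look like this (the class, its file type, and the field values are invented for illustration):

```python
from collections.abc import Generator
from dataclasses import dataclass

from remarx.sentence.corpus import FileInput  # import path assumed


@dataclass
class MarkdownInput(FileInput):
    """Hypothetical input class for Markdown files."""

    file_type: str = ".md"

    def get_text(self) -> Generator[dict[str, str], None, None]:
        # no chunking: yield the whole file as a single segment
        yield {"text": self.input_file.read_text()}
```

For discovery, such a class would also need to be imported in `remarx.sentence.corpus.__init__` and added to `__all__`, as described above.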
#### tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating a sentence corpus with associated metadata from the TEI.

#### text_input

Input class for handling a basic text file as input for corpus creation.
## quotation

This module contains libraries for embedding generation and quotation detection.

| MODULE | DESCRIPTION |
|---|---|
| `embeddings` | Library for generating sentence embeddings from pretrained Sentence Transformer models. |
| `pairs` | Library for finding sentence-level quote pairs. |

### embeddings

Library for generating sentence embeddings from pretrained Sentence Transformer models.

| FUNCTION | DESCRIPTION |
|---|---|
| `get_sentence_embeddings` | Extract embeddings for each sentence using the specified pretrained Sentence Transformers model. |
#### get_sentence_embeddings

```python
get_sentence_embeddings(
    sentences: list[str],
    model_name: str = "paraphrase-multilingual-mpnet-base-v2",
    show_progress_bar: bool = False,
) -> NDArray
```

Extract embeddings for each sentence using the specified pretrained Sentence Transformers model (default is `paraphrase-multilingual-mpnet-base-v2`). Returns a numpy array of the embeddings with shape `[# sents, # dims]`.

| PARAMETER | DESCRIPTION |
|---|---|
| `sentences` | List of sentences to generate embeddings for. TYPE: `list[str]` |
| `model_name` | Name of the pretrained sentence transformer model to use (default: `paraphrase-multilingual-mpnet-base-v2`). TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `NDArray` | 2-dimensional numpy array of normalized sentence embeddings with shape `[# sents, # dims]`. |
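
A minimal usage sketch (import path inferred from this page; the sentences are illustrative):

```python
from remarx.quotation.embeddings import get_sentence_embeddings  # path assumed

sentences = [
    "Die Philosophen haben die Welt nur verschieden interpretiert.",
    "The philosophers have only interpreted the world in various ways.",
]
embeddings = get_sentence_embeddings(sentences, show_progress_bar=True)
print(embeddings.shape)  # (2, 768) with the default mpnet-based model
```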
          
### pairs

Library for finding sentence-level quote pairs.

Note: currently this script only supports one original and one reuse corpus.

| FUNCTION | DESCRIPTION |
|---|---|
| `build_vector_index` | Builds an index for a given set of embeddings with the specified number of trees. |
| `get_sentence_pairs` | For a set of original and reuse sentences, identify original-reuse pairs where quotation is likely. |
| `load_sent_df` | For a given sentence corpus, create a polars DataFrame suitable for finding quote pairs. |
| `compile_quote_pairs` | Link sentence metadata to the detected sentence pairs to form quote pairs. |
| `find_quote_pairs` | For a given original and reuse sentence corpus, find the likely sentence-level quote pairs. |

#### build_vector_index

Builds an index for a given set of embeddings with the specified number of trees.
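
Since `get_sentence_pairs` below notes that the Annoy library is used, the index plausibly wraps code along these lines (a sketch under that assumption, not the actual implementation):

```python
import numpy as np
from annoy import AnnoyIndex


def build_vector_index_sketch(embeddings: np.ndarray, n_trees: int) -> AnnoyIndex:
    """Illustrative only: build an angular-distance Annoy index over embeddings."""
    index = AnnoyIndex(embeddings.shape[1], "angular")
    for i, vector in enumerate(embeddings):
        index.add_item(i, vector)
    index.build(n_trees)  # more trees: better recall, slower build
    return index
```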
#### get_sentence_pairs

```python
get_sentence_pairs(
    original_sents: list[str],
    reuse_sents: list[str],
    score_cutoff: float,
    show_progress_bar: bool = False,
) -> DataFrame
```

For a set of original and reuse sentences, identify original-reuse sentence pairs where quotation is likely. Returns these sentence pairs as a polars DataFrame including, for each pair:

- `original_index`: the index of the original sentence
- `reuse_index`: the index of the reuse sentence
- `match_score`: the quality of the match

Likely quote pairs are identified through the sentences' embeddings. The Annoy library is used to find the nearest original sentence for each reuse sentence. Likely quote pairs are then those sentence pairs with a match score (cosine similarity) above the specified cutoff. Optionally, the parameters for Annoy may be specified.
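
A minimal usage sketch (import path inferred from this page; the sentences are illustrative):

```python
from remarx.quotation.pairs import get_sentence_pairs  # path assumed

originals = ["Ein Gespenst geht um in Europa.", "Die Arbeiter haben kein Vaterland."]
reuses = ["He recalls that a spectre is haunting Europe."]

pairs_df = get_sentence_pairs(originals, reuses, score_cutoff=0.225)
print(pairs_df)  # columns: original_index, reuse_index, match_score
```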
#### load_sent_df

For a given sentence corpus, create a polars DataFrame suitable for finding sentence-level quote pairs. Optionally, a prefix can be added to all column names.

The resulting dataframe has the same fields as the input corpus, except with:

- a new field `index` corresponding to the row index
- the sentence id field `sent_id` renamed to `id`
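
For orientation, the described reshaping corresponds to polars operations along these lines (a sketch of the transformation, not the function's actual body):

```python
import polars as pl

corpus = pl.DataFrame({"sent_id": ["s1", "s2"], "text": ["First.", "Second."]})

# add a row-index column and rename the sentence id field
sent_df = corpus.with_row_index("index").rename({"sent_id": "id"})
print(sent_df.columns)  # ['index', 'id', 'text']
```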
#### compile_quote_pairs

```python
compile_quote_pairs(
    original_corpus: DataFrame, reuse_corpus: DataFrame, detected_pairs: DataFrame
) -> DataFrame
```

Link sentence metadata to the detected sentence pairs from the given original and reuse sentence corpus dataframes to form quote pairs. The original and reuse corpus dataframes must contain a row index column named `original_index` and `reuse_index` respectively. Ideally, these dataframes should be built using `load_sent_df`.

Returns a dataframe with the following fields:

- `match_score`: estimated quality of the match
- all other fields, in order, from the reuse corpus except its row index
- all other fields, in order, from the original corpus except its row index
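
Putting the pieces together (import path inferred from this page; the `prefix` keyword for `load_sent_df`, the column names, and the file paths are all assumptions for illustration):

```python
from remarx.quotation.pairs import (  # path assumed
    compile_quote_pairs,
    get_sentence_pairs,
    load_sent_df,
)

# prefixes produce the original_index / reuse_index columns required above
original_df = load_sent_df("original_corpus.csv", prefix="original_")
reuse_df = load_sent_df("reuse_corpus.csv", prefix="reuse_")

detected = get_sentence_pairs(
    original_df["original_text"].to_list(),
    reuse_df["reuse_text"].to_list(),
    score_cutoff=0.225,
)
quote_pairs = compile_quote_pairs(original_df, reuse_df, detected)
```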
 
#### find_quote_pairs

```python
find_quote_pairs(
    original_corpus: Path,
    reuse_corpus: Path,
    out_csv: Path,
    score_cutoff: float = 0.225,
    show_progress_bar: bool = False,
) -> None
```

For a given original and reuse sentence corpus, finds the likely sentence-level quote pairs. These quote pairs are saved as a CSV. Optionally, the required match quality for quote pairs can be adjusted via `score_cutoff`.
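
A minimal end-to-end sketch (import path inferred from this page; the paths are illustrative and assume CSV corpora produced by the sentence corpus tools above):

```python
from pathlib import Path

from remarx.quotation.pairs import find_quote_pairs  # path assumed

find_quote_pairs(
    original_corpus=Path("marx_sentences.csv"),
    reuse_corpus=Path("newspaper_sentences.csv"),
    out_csv=Path("quote_pairs.csv"),
    score_cutoff=0.3,  # raise the cutoff to keep only stronger matches
)
```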