remarx

remarx: Marx quote identification for the CDH Citing Marx project.

MODULE DESCRIPTION
app

The marimo notebook corresponding to the remarx application. The application can be launched by running the command remarx-app or via marimo.

app_utils

Utility methods associated with the remarx app

sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

app

The marimo notebook corresponding to the remarx application. The application can be launched by running the command remarx-app or via marimo.

Example Usage:

`remarx-app`

`marimo run app.py`

app_utils

Utility methods associated with the remarx app

FUNCTION DESCRIPTION
launch_app

Launch the remarx app in a web browser.

create_temp_input

Context manager to create a temporary file with the contents and name of an uploaded file.

launch_app

launch_app() -> None

Launch the remarx app in a web browser.

create_temp_input

create_temp_input(file_upload: FileUploadResults) -> Generator[Path, None, None]

Context manager to create a temporary file with the contents and name of a file uploaded via the browser, as returned by marimo.ui.file. This should be used in a with statement.

RETURNS DESCRIPTION
Generator[Path, None, None]

Yields the path to the temporary file
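
Example usage (a minimal sketch; the remarx.app_utils import path and the marimo file-upload handling are assumptions based on the descriptions above):

```python
import marimo as mo

from remarx.app_utils import create_temp_input  # import path assumed from the module layout

# A marimo file-upload element; its value is a sequence of FileUploadResults,
# one per uploaded file.
file_browser = mo.ui.file()

# Once a file has been uploaded, copy its contents to a temporary file so that
# code expecting a real path on disk can read it.
with create_temp_input(file_browser.value[0]) as temp_path:
    data = temp_path.read_bytes()  # temp_path is a pathlib.Path to the temporary copy
```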

sentence

This module contains libraries for sentence segmentation and sentence corpus construction.

MODULE DESCRIPTION
corpus

Functionality for loading and chunking input files for sentence corpus creation.

segment

Provides functionality to break down input text into individual sentences.

segment

Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.

FUNCTION DESCRIPTION
segment_text

Segment a string of text into sentences with character indices.

segment_text

segment_text(text: str, language: str = 'de') -> list[tuple[int, str]]

Segment a string of text into sentences with character indices.

PARAMETER DESCRIPTION
text

Input text to be segmented into sentences

TYPE: str

language

Language code for the Stanza pipeline

TYPE: str DEFAULT: 'de'

RETURNS DESCRIPTION
list[tuple[int, str]]

List of tuples where each tuple contains (start_char_index, sentence_text)
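
Example usage (a minimal sketch; the remarx.sentence.segment import path is assumed from the module layout):

```python
from remarx.sentence.segment import segment_text  # import path assumed

text = (
    "Die Philosophen haben die Welt nur verschieden interpretiert. "
    "Es kommt darauf an, sie zu verändern."
)

# Each tuple pairs the character offset where a sentence starts with the sentence
# text, so sentences can be mapped back to their position in the input string.
for start_index, sentence in segment_text(text, language="de"):
    print(start_index, sentence)
```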

corpus

Functionality for loading and chunking input files for sentence corpus creation.

MODULE DESCRIPTION
base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

create

Preliminary script and method to create sentence corpora from input files.

tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.

text_input

Input class for handling basic text file as input for corpus creation.

CLASS DESCRIPTION
FileInput

Base class for file input for sentence corpus creation

TEIDocument

Custom eulxml.xmlmap.XmlObject instance for a TEI XML document.

TEIinput

Input class for TEI/XML content. Takes a single input file and yields text content by page, with page number.

TEIPage

Custom eulxml.xmlmap.XmlObject instance for a page of content within a TEI XML document.

TextInput

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

ATTRIBUTE DESCRIPTION
TEI_TAG

Convenience access to namespaced TEI tag names

FileInput dataclass

FileInput(input_file: Path, filename_override: str = None)

Base class for file input for sentence corpus creation

METHOD DESCRIPTION
get_text

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic units).

get_sentences

Get sentences for this file, with associated metadata.

subclasses

List of available file input classes.

subclass_by_type

Dictionary of subclass by supported file extension for available input classes.

supported_types

Unique list of supported file extensions for available input classes.

create

Instantiate and return the appropriate input class for the specified input file.

ATTRIBUTE DESCRIPTION
input_file

Reference to input file. Source of content for sentences.

TYPE: Path

filename_override

Optional filename override, e.g. when using temporary files as input

TYPE: str

field_names

List of field names for sentences from text input files.

TYPE: tuple[str, ...]

file_type

Supported file extension; subclasses must define this

TYPE: str

file_name

Input file name. Associated with sentences in generated corpus.

TYPE: str

input_file instance-attribute
input_file: Path

Reference to input file. Source of content for sentences.

filename_override class-attribute instance-attribute
filename_override: str = None

Optional filename override, e.g. when using temporary files as input

field_names class-attribute
field_names: tuple[str, ...] = ('file', 'sent_index', 'text')

List of field names for sentences from text input files.

file_type class-attribute
file_type: str

Supported file extension; subclasses must define this

file_name cached property
file_name: str

Input file name. Associated with sentences in generated corpus.

get_text
get_text() -> Generator[dict[str, str]]

Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic units). Subclasses must implement this; there is no default implementation.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text and any other metadata that applies to this unit of text.

get_sentences
get_sentences() -> Generator[dict[str, Any]]

Get sentences for this file, with associated metadata.

RETURNS DESCRIPTION
Generator[dict[str, Any]]

Generator of one dictionary per sentence; dictionary always includes: text (text content), file (filename), sent_index (sentence index within the document). It may include other metadata, depending on the input file type.

subclasses classmethod
subclasses() -> list[type[Self]]

List of available file input classes.

subclass_by_type classmethod
subclass_by_type() -> dict[str, type[Self]]

Dictionary of subclass by supported file extension for available input classes.

supported_types classmethod
supported_types() -> list[str]

Unique list of supported file extensions for available input classes.

create classmethod
create(input_file: Path, filename_override: str | None = None) -> Self

Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.

RAISES DESCRIPTION
ValueError

if input_file is not a supported type
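
Example usage (a minimal sketch; the remarx.sentence.corpus import path and the input file name are hypothetical):

```python
from pathlib import Path

from remarx.sentence.corpus import FileInput  # import path assumed

print(FileInput.supported_types())  # e.g. ['.txt', '.xml']

# The factory inspects the file extension and returns the matching input class
# (TextInput, TEIinput, ...); unsupported types raise ValueError.
doc = FileInput.create(Path("capital_volume1.txt"))  # hypothetical input file

for sentence in doc.get_sentences():
    # Each dictionary includes at least 'file', 'sent_index', and 'text'.
    print(sentence["sent_index"], sentence["text"])
```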

TEI_TAG module-attribute

TEI_TAG = TagNames(**{tag: f'{{{TEI_NAMESPACE}}}{tag}' for tag in TagNames._fields})

Convenience access to namespaced TEI tag names

TEIDocument

Bases: BaseTEIXmlObject

Custom eulxml.xmlmap.XmlObject instance for a TEI XML document. Customized for MEGA TEI XML.

METHOD DESCRIPTION
init_from_file

Class method to initialize a new TEIDocument from a file.

ATTRIBUTE DESCRIPTION
all_pages

List of page objects, identified by page begin tag (pb). Includes all pages (standard and manuscript edition).

pages

Standard pages for this document. Returns a list of TEIPage objects, omitting any pages marked as manuscript edition.

TYPE: list[TEIPage]

all_pages class-attribute instance-attribute
all_pages = NodeListField('//t:text//t:pb', TEIPage)

List of page objects, identified by page begin tag (pb). Includes all pages (standard and manuscript edition), because the XPath is significantly faster without filtering.

pages cached property
pages: list[TEIPage]

Standard pages for this document. Returns a list of TEIPage objects for this document, omitting any pages marked as manuscript edition.

init_from_file classmethod
init_from_file(path: Path) -> Self

Class method to initialize a new TEIDocument from a file.

TEIinput dataclass

TEIinput(input_file: Path, filename_override: str = None)

Bases: FileInput

Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.

METHOD DESCRIPTION
__post_init__

After default initialization, parse the input file as a TEIDocument and store it as xml_doc.

get_text

Get document content as plain text, chunked by page with page number.

ATTRIBUTE DESCRIPTION
xml_doc

Parsed XML document; initialized from inherited input_file

TYPE: TEIDocument

field_names

List of field names for sentences from TEI XML input files

TYPE: tuple[str, ...]

file_type

Supported file extension for TEI/XML input

xml_doc class-attribute instance-attribute
xml_doc: TEIDocument = field(init=False)

Parsed XML document; initialized from inherited input_file

field_names class-attribute
field_names: tuple[str, ...] = (*FileInput.field_names, 'page_number')

List of field names for sentences from TEI XML input files

file_type class-attribute instance-attribute
file_type = '.xml'

Supported file extension for TEI/XML input

__post_init__
__post_init__() -> None

After default initialization, parse the input file as a TEIDocument and store it as xml_doc.

get_text
get_text() -> Generator[dict[str, str]]

Get document content as plain text. Chunked by page; each dictionary includes the page number.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text content by page, with page number.
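
Example usage (a sketch; the file name and dictionary keys are assumptions based on the field names listed above):

```python
from pathlib import Path

from remarx.sentence.corpus import TEIinput  # import path assumed

tei = TEIinput(input_file=Path("mega_volume.xml"))  # hypothetical MEGA TEI/XML file

for page in tei.get_text():
    # Each dictionary holds the plain text of one standard-edition page along
    # with its page number; manuscript-edition pages are skipped.
    print(page["page_number"], len(page["text"]))
```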

TEIPage

Bases: BaseTEIXmlObject

Custom eulxml.xmlmap.XmlObject instance for a page of content within a TEI XML document.

METHOD DESCRIPTION
text_contents

Generator of text content on this page, between the current and following page begin tags.

__str__

Page text contents as a string

ATTRIBUTE DESCRIPTION
number

page number

edition

page edition, if any

text_nodes

list of all text nodes following this tag

number class-attribute instance-attribute
number = StringField('@n')

page number

edition class-attribute instance-attribute
edition = StringField('@ed')

page edition, if any

text_nodes class-attribute instance-attribute
text_nodes = StringListField('following::text()')

list of all text nodes following this tag

text_contents
text_contents() -> Generator[str]

Generator of text content on this page, between the current and following page begin tags. MEGA-specific logic: ignores page indicators for the manuscript edition (pb tags with ed="manuscript"); assumes standard pb tags have no edition.

__str__
__str__() -> str

Page text contents as a string

TextInput dataclass

TextInput(input_file: Path, filename_override: str = None)

Bases: FileInput

Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.

METHOD DESCRIPTION
get_text

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic units).

ATTRIBUTE DESCRIPTION
file_type

Supported file extension for text input

file_type class-attribute instance-attribute
file_type = '.txt'

Supported file extension for text input

get_text
get_text() -> Generator[dict[str, str]]

Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic units). The default implementation does no chunking and adds no additional metadata.

RETURNS DESCRIPTION
Generator[dict[str, str]]

Generator with a dictionary of text and any other metadata that applies to this unit of text.

base_input

Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.

To initialize the appropriate subclass for a supported file type, use FileInput.create().

For a list of supported file types across all registered input classes, use FileInput.supported_types().

Subclasses must define a supported file_type extension and implement the get_text method. For discovery, input classes must be imported in remarx.sentence.corpus.__init__ and included in __all__ to ensure they are found as available input classes.
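
A minimal sketch of what a new input class might look like under these conventions (the Markdown example, class name, and base_input module path are hypothetical):

```python
from collections.abc import Generator
from dataclasses import dataclass

from remarx.sentence.corpus.base_input import FileInput  # module path assumed


@dataclass
class MarkdownInput(FileInput):
    """Hypothetical input class for Markdown files."""

    # Extension handled by this subclass; used by FileInput.create() for dispatch.
    file_type = ".md"

    def get_text(self) -> Generator[dict[str, str], None, None]:
        # No chunking: yield the entire file as a single unit of text.
        yield {"text": self.input_file.read_text(encoding="utf-8")}
```

Such a class would also need to be imported in remarx.sentence.corpus.__init__ and listed in __all__, as noted above, so that FileInput.subclasses() and FileInput.create() can find it.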

tei_input

Functionality related to parsing MEGA TEI/XML content with the goal of creating sentence corpora with associated metadata from the TEI.

text_input

Input class for handling basic text file as input for corpus creation.