remarx
remarx: Marx quote identification for the CDH Citing Marx project.
MODULE | DESCRIPTION |
---|---|
app | The marimo notebook corresponding to the remarx application. |
app_utils | Utility methods associated with the remarx app |
sentence | This module contains libraries for sentence segmentation and sentence corpus construction. |
app
The marimo notebook corresponding to the remarx application. The application can be launched by running the command `remarx-app` or via marimo.
Example Usage:
`remarx-app`
`marimo run app.py`
app_utils
Utility methods associated with the remarx app
FUNCTION | DESCRIPTION |
---|---|
launch_app | Launch the remarx app in a web browser. |
create_temp_input | Context manager to create a temporary file with the file contents and name of a file uploaded to a web browser, as returned by `marimo.ui.file`. |
create_temp_input
Context manager to create a temporary file with the file contents and name of a file uploaded to a web browser, as returned by `marimo.ui.file`. This should be used in `with` statements.
RETURNS | DESCRIPTION |
---|---|
Generator[Path, None, None] | Yields the path to the temporary file |
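The following is a minimal sketch of how a context manager of this kind might work. It assumes the upload exposes a file name and raw bytes; `FakeUpload` and `temp_input_sketch` are hypothetical names used for illustration, not the remarx implementation.

```python
import tempfile
from collections.abc import Generator
from contextlib import contextmanager
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FakeUpload:
    """Hypothetical stand-in for an uploaded file (name plus raw bytes)."""

    name: str
    contents: bytes


@contextmanager
def temp_input_sketch(upload: FakeUpload) -> Generator[Path, None, None]:
    """Write the upload to a temporary file and yield its path."""
    # A temporary directory lets the file keep its original name
    with tempfile.TemporaryDirectory() as tmp_dir:
        temp_path = Path(tmp_dir) / Path(upload.name).name
        temp_path.write_bytes(upload.contents)
        yield temp_path
    # Leaving the with block removes the directory and the file in it


upload = FakeUpload(name="sample.txt", contents=b"Workers of the world, unite!")
with temp_input_sketch(upload) as path:
    saved_name = path.name
    saved_text = path.read_text(encoding="utf-8")
still_exists = path.exists()
```

Because cleanup happens when the `with` block exits, the yielded path should only be used inside that block.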
sentence
segment
Provides functionality to break down input text into individual sentences and return them as tuples containing the character index where each sentence begins and the sentence text itself.
FUNCTION | DESCRIPTION |
---|---|
segment_text | Segment a string of text into sentences with character indices. |
segment_text
Segment a string of text into sentences with character indices.
PARAMETER | DESCRIPTION |
---|---|
text | Input text to be segmented into sentences |
language | Language code for the Stanza pipeline |

RETURNS | DESCRIPTION |
---|---|
list[tuple[int, str]] | List of tuples where each tuple contains (start_char_index, sentence_text) |
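To illustrate the shape of the return value, here is a small sketch that uses a naive regular expression in place of the Stanza pipeline. `segment_text_sketch` is a hypothetical name and the splitting rule is an assumption for illustration; only the `(start_char_index, sentence_text)` tuple format is taken from the description above.

```python
import re


def segment_text_sketch(text: str) -> list[tuple[int, str]]:
    """Naive stand-in: split on sentence-final punctuation, keep start offsets."""
    # Each match starts at a non-space character and runs to the next
    # ., ! or ? that is followed by whitespace or the end of the text
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)
    return [(match.start(), match.group()) for match in pattern.finditer(text)]


text = "A spectre is haunting Europe. It is the spectre of communism."
sentences = segment_text_sketch(text)

# Each start index points back into the original string
for start, sentence in sentences:
    assert text[start : start + len(sentence)] == sentence
```

Keeping the character index alongside each sentence makes it possible to locate the sentence in the original text later.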
corpus
Functionality for loading and chunking input files for sentence corpus creation.
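MODULE | DESCRIPTION |
---|---|
base_input | Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types. |
create | Preliminary script and method to create sentence corpora from input |
tei_input | Functionality related to parsing MEGA TEI/XML content with the goal of creating a sentence corpus with associated metadata from the TEI. |
text_input | Input class for handling basic text file as input for corpus creation. |

CLASS | DESCRIPTION |
---|---|
FileInput | Base class for file input for sentence corpus creation |
TEIDocument | Custom `eulxml.xmlmap.XmlObject` instance for TEI XML document. |
TEIinput | Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. |
TEIPage | Custom `eulxml.xmlmap.XmlObject` instance for a page of content within a TEI XML document. |
TextInput | Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking. |

ATTRIBUTE | DESCRIPTION |
---|---|
TEI_TAG | Convenience access to namespaced TEI tag names |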
FileInput
dataclass
Base class for file input for sentence corpus creation
METHOD | DESCRIPTION |
---|---|
get_text | Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). |
get_sentences | Get sentences for this file, with associated metadata. |
subclasses | List of available file input classes. |
subclass_by_type | Dictionary of subclass by supported file extension for available input classes. |
supported_types | Unique list of supported file extensions for available input classes. |
create | Instantiate and return the appropriate input class for the specified input file. |

ATTRIBUTE | DESCRIPTION |
---|---|
input_file | Reference to input file. Source of content for sentences. |
filename_override | Optional filename override, e.g. when using temporary files as input |
field_names | List of field names for sentences from text input files. |
file_type | Supported file extension; subclasses must define |
file_name | Input file name. Associated with sentences in generated corpus. |
input_file
instance-attribute
Reference to input file. Source of content for sentences.
filename_override
class-attribute
instance-attribute
Optional filename override, e.g. when using temporary files as input
field_names
class-attribute
List of field names for sentences from text input files.
file_name
cached
property
Input file name. Associated with sentences in generated corpus.
get_text
Get plain-text contents for this input file with any desired chunking (e.g. pages or other semantic unit). Subclasses must implement; no default implementation.
RETURNS | DESCRIPTION |
---|---|
Generator[dict[str, str]] | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
get_sentences
Get sentences for this file, with associated metadata.
RETURNS | DESCRIPTION |
---|---|
Generator[dict[str, Any]] | Generator of one dictionary per sentence; dictionary always includes: |
subclass_by_type
classmethod
Dictionary mapping each supported file extension to its subclass, for all available input classes.
supported_types
classmethod
Unique list of supported file extensions for available input classes.
create
classmethod
Instantiate and return the appropriate input class for the specified input file. Takes an optional filename override parameter, which is passed through to the input class.
RAISES | DESCRIPTION |
---|---|
ValueError | If `input_file` is not a supported type |
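The factory behavior described here can be sketched roughly as follows. `InputSketch`, `TextSketch`, and `XmlSketch` are hypothetical classes, and the `.txt` and `.xml` extensions and the lookup by file suffix are assumptions for illustration rather than the actual remarx code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import ClassVar, Optional


@dataclass
class InputSketch:
    """Hypothetical base class that mirrors the documented idea."""

    input_file: Path
    filename_override: Optional[str] = None
    file_type: ClassVar[str] = ""

    @classmethod
    def subclass_by_type(cls) -> dict[str, type["InputSketch"]]:
        # Map each direct subclass's declared extension to the subclass itself
        return {sub.file_type: sub for sub in cls.__subclasses__()}

    @classmethod
    def supported_types(cls) -> list[str]:
        return sorted(cls.subclass_by_type())

    @classmethod
    def create(
        cls, input_file: Path, filename_override: Optional[str] = None
    ) -> "InputSketch":
        input_class = cls.subclass_by_type().get(input_file.suffix)
        if input_class is None:
            raise ValueError(f"{input_file.suffix} is not a supported file type")
        # The override is passed through unchanged to the chosen class
        return input_class(input_file=input_file, filename_override=filename_override)


@dataclass
class TextSketch(InputSketch):
    file_type: ClassVar[str] = ".txt"  # assumed extension


@dataclass
class XmlSketch(InputSketch):
    file_type: ClassVar[str] = ".xml"  # assumed extension


made = InputSketch.create(Path("capital.txt"), filename_override="upload.txt")
```

With this layout, supporting a new file type only requires defining another subclass; the factory picks it up without further changes.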
TEI_TAG
module-attribute
Convenience access to namespaced TEI tag names
TEIDocument
Bases: BaseTEIXmlObject
Custom `eulxml.xmlmap.XmlObject` instance for TEI XML document. Customized for MEGA TEI XML.
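METHOD | DESCRIPTION |
---|---|
init_from_file | Class method to initialize a new `TEIDocument` from a file. |

ATTRIBUTE | DESCRIPTION |
---|---|
all_pages | List of page objects, identified by page begin tag (pb). Includes all pages (standard and manuscript edition), because the XPath is significantly faster without filtering. |
pages | Standard pages for this document. Returns a list of TEIPage objects for this document, omitting any pages marked as manuscript edition. |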
all_pages
class-attribute
instance-attribute
all_pages = NodeListField('//t:text//t:pb', TEIPage)
List of page objects, identified by page begin tag (pb). Includes all pages (standard and manuscript edition), because the XPath is significantly faster without filtering.
pages
cached
property
pages: list[TEIPage]
Standard pages for this document. Returns a list of TEIPage objects for this document, omitting any pages marked as manuscript edition.
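The select-everything-then-filter approach can be sketched with the standard library. This example uses `xml.etree.ElementTree` rather than eulxml, and it assumes manuscript pages are marked with an `ed="manuscript"` attribute; that attribute value and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace, bound to the "t" prefix used in the XPath
TEI_NS = {"t": "http://www.tei-c.org/ns/1.0"}

# Invented sample; the ed="manuscript" marker is an assumption
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="12"/><p>First page.</p>
    <pb n="IV" ed="manuscript"/><p>Still first page.</p>
    <pb n="13"/><p>Second page.</p>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Step 1: select every page begin tag with a simple, unfiltered path
all_pages = root.findall(".//t:text//t:pb", TEI_NS)

# Step 2: filter in Python, dropping pages marked as manuscript edition
standard_pages = [pb for pb in all_pages if pb.get("ed") != "manuscript"]
page_numbers = [pb.get("n") for pb in standard_pages]
```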
init_from_file
classmethod
Class method to initialize a new `TEIDocument` from a file.
TEIinput
dataclass
Bases: FileInput
Input class for TEI/XML content. Takes a single input file, and yields text content by page, with page number. Customized for MEGA TEI/XML: follows standard edition page numbering and ignores pages marked as manuscript edition.
METHOD | DESCRIPTION |
---|---|
__post_init__ | After default initialization, parse the input file as a TEIDocument and store it as xml_doc. |
get_text | Get document content as plain text. Chunked by page, dictionary includes page number. |

ATTRIBUTE | DESCRIPTION |
---|---|
xml_doc | Parsed XML document; initialized from inherited input_file |
field_names | List of field names for sentences from TEI XML input files |
file_type | Supported file extension for TEI/XML input |
xml_doc
class-attribute
instance-attribute
xml_doc: TEIDocument = field(init=False)
Parsed XML document; initialized from inherited input_file
field_names
class-attribute
field_names: tuple[str, ...] = (*(field_names), 'page_number')
List of field names for sentences from TEI XML input files
file_type
class-attribute
instance-attribute
Supported file extension for TEI/XML input
__post_init__
After default initialization, parse the input file as a TEIDocument and store it as xml_doc.
get_text
Get document content as plain text. Chunked by page, dictionary includes page number.
RETURNS | DESCRIPTION |
---|---|
Generator[dict[str, str]] | Generator with a dictionary of text content by page, with page number. |
TEIPage
Bases: BaseTEIXmlObject
Custom `eulxml.xmlmap.XmlObject` instance for a page of content within a TEI XML document.
METHOD | DESCRIPTION |
---|---|
text_contents | Generator of text content on this page, between the current and following page begin tags. |
__str__ | Page text contents as a string |

ATTRIBUTE | DESCRIPTION |
---|---|
number | Page number |
edition | Page edition, if any |
text_nodes | List of all text nodes following this tag |
text_nodes
class-attribute
instance-attribute
list of all text nodes following this tag
text_contents
Generator of text content on this page, between the current and following page begin tags. MEGA specific logic: ignores page indicators for the manuscript edition
(
TextInput
dataclass
Bases: FileInput
Basic text file input handling for sentence corpus creation. Takes a single text input file and returns text without chunking.
METHOD | DESCRIPTION |
---|---|
get_text | Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). |

ATTRIBUTE | DESCRIPTION |
---|---|
file_type | Supported file extension for text input |
file_type
class-attribute
instance-attribute
Supported file extension for text input
get_text
Get plain-text contents for this file with any desired chunking (e.g. pages or other semantic unit). Default implementation does no chunking, no additional metadata.
RETURNS | DESCRIPTION |
---|---|
Generator[dict[str, str]] | Generator with a dictionary of text and any other metadata that applies to this unit of text. |
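A non-chunking implementation along these lines might look like the sketch below. `get_text_sketch` is a hypothetical free function, and the `"text"` dictionary key is an assumed name for illustration; the source only states that a dictionary of text and metadata is yielded.

```python
import tempfile
from collections.abc import Generator
from pathlib import Path


def get_text_sketch(input_file: Path) -> Generator[dict[str, str], None, None]:
    """Yield the whole file as a single unit of text, with no extra metadata."""
    # No chunking: one dictionary for the entire file
    # (the "text" key is an assumption, not confirmed by the source)
    yield {"text": input_file.read_text(encoding="utf-8")}


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text("One sentence. Another sentence.", encoding="utf-8")
    chunks = list(get_text_sketch(sample))
```

Using a generator even for the single-chunk case keeps the interface the same as for inputs that yield one dictionary per page.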
base_input
Base file input class with common functionality. Provides a factory method for initialization of known input classes based on supported file types.
To initialize the appropriate subclass for a supported file type, use FileInput.create().
For a list of supported file types across all registered input classes, use FileInput.supported_types().
Subclasses must define a supported `file_type` extension and implement the `get_text` method. For discovery, input classes must be imported in `remarx.sentence.corpus.__init__` and included in `__all__` to ensure they are found as available input classes.
tei_input
Functionality related to parsing MEGA TEI/XML content with the goal of creating a sentence corpus with associated metadata from the TEI.
text_input
Input class for handling basic text file as input for corpus creation.