Welcome to Impresso Text Preparation’s documentation!
Impresso Text Preparation is a library that merges the former “Text-Importer” package with the “Text-Rebuilder” from Impresso-pycommons. The merge brings together in one place all the code used for the first, unified preparation step shared by all data sources: creating the Impresso canonical and rebuilt formats from the data provided by partners.
As a result, the library has two main modules:
- Importers: the first step of data processing. They convert OCR and OLR data (which come in a variety of formats, e.g. Olive XML and various flavors of Mets/Alto XML) into Impresso’s unified Canonical JSON format, which represents newspaper issues and pages.
- Rebuilders: the second step of data processing, in which the content items (articles, images, tables, headers, etc.) are extracted from the canonical format and “rebuilt” in preparation for the semantic augmentation and NLP processing that follow in the pipeline.
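
The two-step flow described above can be pictured with a small, self-contained sketch. It does not use this library's API: the function names (`to_canonical`, `rebuild_items`), the field names, and the sample data are all hypothetical and far simpler than the actual Canonical JSON schema. It only illustrates the idea of first normalizing source data into one page-level format and then regrouping the text by content item.

```python
import json

# Hypothetical source record, standing in for parsed OCR/OLR output.
# Every field name here is an illustrative assumption, not a real schema.
source_page = {
    "page": "example-1900-01-01-a-p0001",
    "blocks": [
        ("example-1900-01-01-a-i0001", "Hello world"),
        ("example-1900-01-01-a-i0002", "Second item"),
        ("example-1900-01-01-a-i0001", "again"),
    ],
}


def to_canonical(page):
    """Step 1 (importer-like): normalize a source page into one unified shape."""
    return {
        "id": page["page"],
        "regions": [
            {"item_id": item_id, "tokens": text.split()}
            for item_id, text in page["blocks"]
        ],
    }


def rebuild_items(pages):
    """Step 2 (rebuilder-like): regroup tokens by content item across pages."""
    grouped = {}
    for page in pages:
        for region in page["regions"]:
            grouped.setdefault(region["item_id"], []).extend(region["tokens"])
    return [
        {"id": item_id, "text": " ".join(tokens)}
        for item_id, tokens in grouped.items()
    ]


canonical_page = to_canonical(source_page)
rebuilt = rebuild_items([canonical_page])
for item in rebuilt:
    print(json.dumps(item))
```

In this sketch a content item whose text is split over several regions ends up as a single record, which is the kind of regrouping the rebuilt format is meant to provide for downstream processing.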
Contents:
- Installation
- Overview
- Preprocessing
- Importers
- Rebuilders
- Utilities
  - add_property()
  - coords_to_xy()
  - coords_to_xywh()
  - draw_box_on_img()
  - empty_folder()
  - get_issue_schema()
  - get_page_schema()
  - get_reading_order()
  - read_xml()
  - rescale_coords()
  - validate_audio_schema()
  - validate_issue_schema()
  - validate_page_schema()
  - verify_imported_issues()
  - write_error()
  - write_jsonlines_file()
  - insert_whitespace()