Welcome to Impresso Text Preparation’s documentation!
Impresso Text Preparation is a library resulting from the merge of the earlier “Text-Importer” package and the “Text-Rebuilder” from Impresso-pycommons. The goal of this merge was to gather in one place all the code that performs the first unified preparation of all data sources: creating the Impresso canonical and rebuilt formats from the data provided by partners.
This grouping means that the library has two main modules:
- Importers: the first step of data processing. They convert OCR and OLR data (which come in a variety of formats, e.g. Olive XML and various flavors of Mets/Alto XML) into Impresso’s unified Canonical JSON format, which represents newspaper issues and pages.
- Rebuilders: the second step of data processing, in which the content items (articles, images, tables, headers, etc.) are extracted from the canonical format and “rebuilt” in preparation for the semantic augmentation and NLP processing that follow in the pipeline.
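The two-step flow can be pictured with a small, self-contained Python sketch. It does not use the library's API: the functions `import_issue` and `rebuild_content_items`, the input layout, and all field names are hypothetical stand-ins chosen for illustration, and the real canonical and rebuilt schemas are richer than this.

```python
import json

# Hypothetical source-specific OCR output (illustrative layout, not a real format).
raw_blocks = [
    {"block_type": "article", "words": ["Hello", "world", "."]},
    {"block_type": "image", "words": []},
]


def import_issue(blocks, issue_id):
    """Toy 'importer' step: normalise source data into one issue record.

    The field names are assumptions, not the actual canonical schema.
    """
    return {
        "id": issue_id,
        "content_items": [
            {
                "id": f"{issue_id}-i{n:04d}",
                "type": block["block_type"],
                "tokens": list(block["words"]),
            }
            for n, block in enumerate(blocks, start=1)
        ],
    }


def rebuild_content_items(issue):
    """Toy 'rebuilder' step: yield one flat record per content item."""
    for item in issue["content_items"]:
        yield {
            "id": item["id"],
            "issue_id": issue["id"],
            "type": item["type"],
            "text": " ".join(item["tokens"]),
        }


canonical_issue = import_issue(raw_blocks, "EXAMPLE-1900-01-01-a")
rebuilt = list(rebuild_content_items(canonical_issue))

# Serialise the rebuilt records as JSON Lines, one content item per line.
jsonl = "\n".join(json.dumps(record) for record in rebuilt)
print(jsonl)
```

The point of the sketch is the separation of concerns: format-specific parsing stays in the import step, so the rebuild step only ever sees one unified structure.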
Contents:
- Installation
- Overview
- Preprocessing
- Importers
- Rebuilders
- Utilities
  - add_property()
  - coords_to_xy()
  - coords_to_xywh()
  - draw_box_on_img()
  - empty_folder()
  - get_issue_schema()
  - get_page_schema()
  - get_reading_order()
  - read_xml()
  - rescale_coords()
  - validate_audio_schema()
  - validate_issue_schema()
  - validate_page_schema()
  - verify_imported_issues()
  - write_error()
  - write_jsonlines_file()
  - insert_whitespace()