Welcome to Impresso Text Preparation’s documentation!
Impresso Text Preparation is a library that merges the former “Text-Importer” package with the “Text-Rebuilder” from Impresso-pycommons. The merge gathers in one place all the code that performs the first, unified preparation step for every data source: creating the Impresso canonical and rebuilt formats from the data provided by partners.
As a result, the library has two main modules:
- Importers: the first step of data processing. They convert OCR and OLR data (which arrives in a variety of formats, e.g. Olive XML and various flavors of Mets/Alto XML) into Impresso’s unified Canonical JSON format, which represents newspaper issues and pages.
- Rebuilders: the second step of data processing. The content items (articles, images, tables, headers, etc.) are extracted from the canonical format and “rebuilt” in preparation for the semantic augmentation and NLP processing that follow in the pipeline.
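The two-step flow described above can be illustrated with a minimal sketch. It is not the library's API: the functions `to_canonical` and `rebuild`, the field names (`id`, `items`, `type`, `text`, `fulltext`), and the identifier pattern are all hypothetical stand-ins chosen for illustration, and the real canonical and rebuilt schemas are richer than this.

```python
import json

# NOTE: everything below is a hypothetical, simplified illustration.
# Function names and field names are assumptions, not the Impresso schema.


def to_canonical(issue_id, raw_items):
    """Step 1 (importer-like): normalise source-specific records into one shape."""
    return {
        "id": issue_id,
        "items": [
            {
                # Number content items in reading order, starting at 1.
                "id": f"{issue_id}-i{n:04d}",
                "type": item["kind"],
                "text": item["text"].strip(),
            }
            for n, item in enumerate(raw_items, start=1)
        ],
    }


def rebuild(canonical_issue, wanted_types=("article",)):
    """Step 2 (rebuilder-like): extract selected content items as flat records."""
    return [
        {"id": item["id"], "type": item["type"], "fulltext": item["text"]}
        for item in canonical_issue["items"]
        if item["type"] in wanted_types
    ]


# Made-up input standing in for parsed OCR/OLR output from one partner format.
raw = [
    {"kind": "article", "text": "  First article text. "},
    {"kind": "image", "text": ""},
    {"kind": "article", "text": "Second article text."},
]

issue = to_canonical("EXAMPLE-1900-01-01-a", raw)
rebuilt = rebuild(issue)
print(json.dumps(rebuilt, indent=2))
```

Keeping the two steps separate means that every source format only needs its own importer, while a single rebuilder can operate on the shared canonical shape.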