.. Impresso TextImporter documentation master file, created by
   sphinx-quickstart on Mon Aug 12 14:50:13 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Impresso Text Preparation's documentation!
=====================================================

Impresso Text Preparation is a library resulting from the merge of the
previous "Text-Importer" package and the "Text-Rebuilder" from
Impresso-pycommons. The goal of this merge was to gather in one place all the
code used for the first, unified preparation step applied to all data sources:
creating the Impresso `canonical` and `rebuilt` formats from the data provided
by partners.

As a result, the library has two main modules:

- Importers: the first step of data processing. They convert OCR and OLR data
  (which comes in a variety of formats, e.g. Olive XML and various flavors of
  METS/ALTO XML) into Impresso's unified Canonical JSON format, which
  represents newspaper issues and pages.
- Rebuilders: the second step of data processing, in which the content items
  (articles, images, tables, headers, etc.) are extracted from the canonical
  format and "rebuilt" in preparation for the semantic augmentation and NLP
  processing that follow in the pipeline.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   install
   architecture
   importers
   rebuilders
   utils

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`