Importers
=========

Available importers
-------------------

The *Impresso Importers* already support a number of formats (and flavours of standard formats), while a few others are currently being developed.

The following importer CLI scripts are already available:

- :py:mod:`text_preparation.scripts.oliveimporter`: importer for the *Olive XML format*, used by
  RERO to encode and deliver the majority of its newspaper data.
- :py:mod:`text_preparation.scripts.reroimporter`: importer for the *Mets/ALTO flavor* used by
  RERO to encode and deliver part of its data.
- :py:mod:`text_preparation.scripts.luximporter`: importer for the *Mets/ALTO flavor* used by the
  Bibliothèque nationale de Luxembourg (BNL) to encode and deliver its newspaper data.
- :py:mod:`text_preparation.scripts.bnfimporter`: importer for the *Mets/ALTO flavor* used by the
  Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data.
- :py:mod:`text_preparation.scripts.bnfen_importer`: importer for the *Mets/ALTO flavor* used by the
  Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data for the Europeana collection.
- :py:mod:`text_preparation.scripts.bcul_importer`: importer for the *ABBYY format* used by the
  Bibliothèque Cantonale Universitaire de Lausanne (BCUL) to encode and deliver the newspaper data
  available on the Scriptorium interface.
- :py:mod:`text_preparation.scripts.swaimporter`: importer for the *ALTO flavor* used by the
  Basel University Library.
- :py:mod:`text_preparation.scripts.blimporter`: importer for the *Mets/ALTO flavor* used by the
  British Library (BL) to encode and deliver its newspaper data.
- :py:mod:`text_preparation.scripts.tetml`: generic importer for the *TETML format*, produced by
  PDFlib TET.
- :py:mod:`text_preparation.scripts.fedgaz`: importer for the *TETML format* with a separate metadata file
  and a heuristic article segmentation, used to parse the Federal Gazette.

For further details on any of these implementations, please refer to its documentation:

.. toctree::
   :maxdepth: 1

   importers/olive
   importers/mets-alto
   importers/lux
   importers/rero
   importers/swa
   importers/bl
   importers/bnf
   importers/bnf-en
   importers/bcul
   importers/tetml
   importers/fedgaz

Command-line interface
----------------------

.. note::

   All importers share the same command-line interface; only a few options are
   importer-specific (see documentation below).

.. automodule:: text_preparation.importers.generic_importer

Configuration file
------------------

The selection of the actual newspaper data to be imported can be controlled by means of a configuration file (JSON format). The path to this file is passed via the ``--config_file=`` CLI parameter.

This JSON file contains three properties:

- ``newspapers``: a dictionary containing the newspaper IDs to be imported (e.g. GDL);
- ``exclude_newspapers``: a list of the newspaper IDs to be excluded;
- ``year_only``: a boolean flag indicating whether date ranges are expressed by using years or more granular dates (in the format ``YYYY/MM/DD``).

.. note::

   When ingesting large amounts of data, these configuration files can help you organise
   your data imports into batches or homogeneous collections.

Here is a simple configuration file:

.. code-block:: json

   {
     "newspapers": {
       "GDL": []
     },
     "exclude_newspapers": [],
     "year_only": false
   }

This is what a more complex config file looks like (only GDL contents dated from 1950 to 1960 are processed):

.. code-block:: json

   {
     "newspapers": {
       "GDL": "1950/01/01-1960/12/31"
     },
     "exclude_newspapers": [],
     "year_only": false
   }

Writing a new importer
----------------------

Writing a new importer is easy and entails two pieces of code:

1. **functions to detect the data** to import;
2. **classes that handle the conversion into JSON** of your OCR format, either implemented from scratch or adapted from one of the existing importers.

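As a rough illustration of the first piece, and of how the configuration file described earlier might drive the selection, here is a minimal sketch. It assumes a hypothetical ``<newspaper>/<YYYY>/<MM>/<DD>`` folder layout. ``DetectedIssue``, ``detect_issues``, ``parse_range`` and ``select_issues`` are illustrative names rather than the package's API, and the year-only range format (``1950-1960``) is an assumption.

```python
from collections import namedtuple
from datetime import date, datetime
from pathlib import Path

# Hypothetical record describing one detected issue (not the package's own type).
DetectedIssue = namedtuple("DetectedIssue", ["newspaper", "date", "path"])


def detect_issues(base_dir):
    """Find issue folders, assuming a <newspaper>/<YYYY>/<MM>/<DD> layout."""
    issues = []
    for day_dir in sorted(Path(base_dir).glob("*/*/*/*")):
        if not day_dir.is_dir():
            continue
        newspaper, year, month, day = day_dir.relative_to(base_dir).parts
        try:
            issue_date = date(int(year), int(month), int(day))
        except ValueError:
            continue  # skip folders that do not encode a valid date
        issues.append(DetectedIssue(newspaper, issue_date, day_dir))
    return issues


def parse_range(value, year_only):
    """Turn "start-end" into two dates; an empty value means no restriction."""
    if not value:
        return date.min, date.max
    start, end = value.split("-")
    if year_only:
        # Assumed year-only format: "1950-1960"
        return date(int(start), 1, 1), date(int(end), 12, 31)
    fmt = "%Y/%m/%d"
    return (datetime.strptime(start, fmt).date(),
            datetime.strptime(end, fmt).date())


def select_issues(issues, config):
    """Keep only the issues requested by the configuration dictionary."""
    selected = []
    for issue in issues:
        if issue.newspaper in config["exclude_newspapers"]:
            continue
        if issue.newspaper not in config["newspapers"]:
            continue
        start, end = parse_range(
            config["newspapers"][issue.newspaper], config["year_only"]
        )
        if start <= issue.date <= end:
            selected.append(issue)
    return selected
```

With the second configuration example, ``select_issues(detect_issues(base_dir), config)`` would keep only the GDL issues dated between 1950/01/01 and 1960/12/31 inclusive. Keeping detection and selection as two separate functions mirrors the pair of *detect* and *select* functions that the Olive importer passes to the generic CLI.
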
Once these two pieces of code are in place, they can be plugged into the functions defined in :mod:`text_preparation.importers.generic_importer` so as to create a dedicated CLI script for your specific format.

For example, this is the content of ``oliveimporter.py``:

.. code-block:: python

   from text_preparation.importers import generic_importer
   from text_preparation.importers.olive.classes import OliveNewspaperIssue
   from text_preparation.importers.olive.detect import (olive_detect_issues,
                                                        olive_select_issues)

   if __name__ == '__main__':
       generic_importer.main(
           OliveNewspaperIssue,
           olive_detect_issues,
           olive_select_issues
       )

**How should the code of a new text importer be structured?**

We recommend complying with the following structure:

- :mod:`text_preparation.importers..detect` will contain functions to find the data to be imported;
- :mod:`text_preparation.importers..helpers` (optional) will contain ancillary functions;
- :mod:`text_preparation.importers..parsers` (optional) will contain functions/classes to parse the data;
- :mod:`text_preparation/scripts/.py` will contain a CLI script to run the importer.

Detect data to import
~~~~~~~~~~~~~~~~~~~~~

- The importer needs to know which data should be imported.
- Information about the newspaper contents is often encoded as part of folder names etc., so it needs to be extracted and made explicit by means of :ref:`Canonical identifiers`.
- Add some sample data to ``text_preparation/data/sample/``.

For example: :py:func:`~text_preparation.importers.olive.detect.olive_detect_issues`

Implement abstract classes
~~~~~~~~~~~~~~~~~~~~~~~~~~

These two classes are passed to the importer's generic command-line interface, see :py:func:`text_preparation.importers.generic_importer.main`

.. autoclass:: text_preparation.importers.classes.NewspaperIssue
   :members:

.. autoclass:: text_preparation.importers.classes.NewspaperPage
   :members:

Write an importer CLI script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This script imports the new :class:`NewspaperIssue` class, together with the newly defined *detect* functions, and passes them to the ``main()`` function of the generic importer CLI :func:`text_preparation.importers.generic_importer.main`.

Test
~~~~

Create a new test file named ``test__importer.py`` and add it to ``tests/importers/``.

This file should contain at the very minimum a test called :func:`test_import_issues`, which

- detects input data from ``text_preparation/data/sample/``
- writes any output to ``text_preparation/data/out/``.