
Available importers

The Impresso TextImporter already supports a number of formats (and flavours of standard formats), while a few others are currently being developed.

The following importer CLI scripts are already available:

  • text_importer.scripts.oliveimporter: importer for the Olive XML format, used by RERO to encode and deliver the majority of its newspaper data.

  • text_importer.scripts.reroimporter: importer for the Mets/ALTO flavor used by RERO to encode and deliver part of its data.

  • text_importer.scripts.luximporter: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de Luxembourg (BNL) to encode and deliver its newspaper data.

  • text_importer.scripts.bnfimporter: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data.

  • text_importer.scripts.bnfen_importer: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data for the Europeana collection.

  • text_importer.scripts.bcul_importer: importer for the ABBYY format used by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL) to encode and deliver the newspaper data which is on the Scriptorium interface.

  • text_importer.scripts.swaimporter: importer for the ALTO flavor used by the Basel University Library.

  • text_importer.scripts.blimporter: importer for the Mets/ALTO flavor used by the British Library (BL) to encode and deliver its newspaper data.

  • text_importer.scripts.tetml: generic importer for the TETML format, produced by PDFlib TET.

  • text_importer.scripts.fedgaz: importer for the TETML format with separate metadata file and a heuristic article segmentation, used to parse the Federal Gazette.

For further details on any of these implementations, please refer to its documentation.

Command-line interface


All importers share the same command-line interface; only a few options are importer-specific (see documentation below).

Functions and CLI script to convert any OCR data into Impresso’s format.


<importer-name>importer.py --input-dir=<id> (--clear | --incremental) [--output-dir=<od> --image-dirs=<imd> --temp-dir=<td> --chunk-size=<cs> --s3-bucket=<b> --config-file=<cf> --log-file=<f> --verbose --scheduler=<sch> --access-rights=<ar> --git-repo=<gr> --num-workers=<nw>]
<importer-name>importer.py --version


  • --input-dir=<id>: Base directory containing one sub-directory for each journal.

  • --image-dirs=<imd>: Directory containing (canonical) images and their metadata (use , to separate multiple dirs).

  • --output-dir=<od>: Base directory where to write the output files.

  • --temp-dir=<td>: Temporary directory to extract .zip archives.

  • --config-file=<cf>: Configuration file for selective import.

  • --s3-bucket=<b>: If provided, writes output to an S3 drive, in the specified bucket.

  • --scheduler=<sch>: Tell dask to use an existing scheduler (otherwise it’ll create one).

  • --log-file=<f>: Log file; when missing, the log is printed to stdout.

  • --access-rights=<ar>: Access right file if relevant (only for olive and rero importers).

  • --chunk-size=<cs>: Chunk size in years used to group issues when importing.

  • --git-repo=<gr>: Local path to the “impresso-text-acquisition” git directory (including it).

  • --num-workers=<nw>: Number of workers to use for local dask cluster.

  • --verbose: Verbose log messages (good for debugging).

  • --clear: Removes the output folder (if already existing).

  • --incremental: Skips issues already present in output directory.

  • --version: Prints version and exits.
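As an illustration of the usage pattern, the following sketch assembles an argument list for one importer run. The helper build_importer_command is hypothetical and not part of the package, and the script name and directories are placeholders. It only builds the list, enforcing that exactly one of --clear or --incremental is given; it does not start the importer.

```python
def build_importer_command(script, input_dir, mode, **options):
    """Return an argv list for one importer run (hypothetical helper).

    `mode` must be "clear" or "incremental", mirroring the mutually
    exclusive (--clear | --incremental) group of the usage pattern.
    """
    if mode not in ("clear", "incremental"):
        raise ValueError("mode must be 'clear' or 'incremental'")
    argv = ["python", script, f"--input-dir={input_dir}", f"--{mode}"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Boolean switch such as --verbose.
            argv.append(flag)
        elif value is not None and value is not False:
            argv.append(f"{flag}={value}")
    return argv


cmd = build_importer_command(
    "oliveimporter.py",            # placeholder script path
    "/data/original",              # placeholder input directory
    "incremental",
    output_dir="/data/canonical",  # placeholder output directory
    chunk_size=10,
    verbose=True,
)
print(" ".join(cmd))
```

The resulting list could then be handed to subprocess.run if the script is available on the given path.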

Configuration file

The selection of the actual newspaper data to be imported can be controlled by means of a configuration file (JSON format). The path to this file is passed via the --config-file= CLI parameter.

This JSON file contains three properties:

  • newspapers: a dictionary containing the newspaper IDs to be imported (e.g. GDL);

  • exclude_newspapers: a list of the newspaper IDs to be excluded;

  • year_only: a boolean flag indicating whether date ranges are expressed by using years or more granular dates (in the format YYYY/MM/DD).


When ingesting large amounts of data, these configuration files can help you organise your data imports into batches or homogeneous collections.

Here is a simple configuration file:

{
  "newspapers": {
      "GDL": []
  },
  "exclude_newspapers": [],
  "year_only": false
}

This is what a more complex config file looks like (only contents for the decade 1950-1960 of GDL are processed):

{
  "newspapers": {
      "GDL": "1950/01/01-1960/12/31"
  },
  "exclude_newspapers": [],
  "year_only": false
}
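When many batches are needed, such files can also be generated from a script. The sketch below is only an illustration: build_import_config is a hypothetical helper, and it uses nothing beyond the three properties described in this section.

```python
import json


def build_import_config(newspapers, exclude_newspapers=(), year_only=False):
    """Assemble the three documented properties into one dict (hypothetical helper)."""
    return {
        "newspapers": dict(newspapers),
        "exclude_newspapers": list(exclude_newspapers),
        "year_only": bool(year_only),
    }


# Same selection as the date-range example: GDL restricted to 1950-1960.
config = build_import_config({"GDL": "1950/01/01-1960/12/31"})
config_text = json.dumps(config, indent=2)
print(config_text)
```

Writing config_text to a file gives something that can be passed to the importer as its configuration file.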


This module contains generic helper functions for the text-importer module.

text_importer.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) → dict[str, Any]

Add a property and value to a given object dict computed with a given function.

  • object_dict (dict[str, Any]) – Object to which the property is added.

  • prop_name (str) – Name of the property to add.

  • prop_function (Callable[[str], str]) – Function computing the property value.

  • function_input (str) – Input to prop_function for this object.


Returns:

Updated object.

Return type:

dict[str, Any]
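The behaviour described for add_property can be pictured with the sketch below. It is a re-implementation written for illustration (add_property_sketch is not the library function), and the page ID used as input is only an example value.

```python
from typing import Any, Callable


def add_property_sketch(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    """Set object_dict[prop_name] to prop_function(function_input) and return the dict."""
    object_dict[prop_name] = prop_function(function_input)
    return object_dict


page = {"id": "GDL-1950-01-01-a-p0001"}  # example ID, for illustration only
page = add_property_sketch(
    page, "journal", lambda page_id: page_id.split("-")[0], page["id"]
)
print(page)
```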

text_importer.utils.empty_folder(dir_path: str) → None

Empty a directory given its path if it exists.


dir_path (str) – Path to the directory to empty.
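A minimal sketch of such a helper, assuming that the directory itself is kept and only its contents are removed (the source does not say which). The name empty_folder_sketch is hypothetical.

```python
import os
import shutil
import tempfile


def empty_folder_sketch(dir_path: str) -> None:
    """Delete everything inside dir_path, keeping the directory itself."""
    if not os.path.isdir(dir_path):
        return  # nothing to do when the directory does not exist
    for name in os.listdir(dir_path):
        entry_path = os.path.join(dir_path, name)
        if os.path.isdir(entry_path) and not os.path.islink(entry_path):
            shutil.rmtree(entry_path)
        else:
            os.remove(entry_path)


# Demonstration on a throwaway temporary directory.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "page.json"), "w").close()
os.mkdir(os.path.join(demo_dir, "sub"))
empty_folder_sketch(demo_dir)
remaining = os.listdir(demo_dir)
```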

text_importer.utils.get_access_right(journal: str, _date: date, access_rights: dict[str, dict[str, str]]) → str

Fetch the access rights for a specific journal and publication date.

  • journal (str) – Journal name.

  • _date (date) – Publication date of the journal issue.

  • access_rights (dict[str, dict[str, str]]) – Access rights for various journals.


Returns:

Access rights for the specific journal issue.

Return type:

str


text_importer.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/newspaper/issue.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.


schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/issue.schema.json”.


Returns:

Newspaper issue schema based on canonical format.

Return type:

Namespace


text_importer.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/newspaper/page.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.


schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/page.schema.json”.


Returns:

Newspaper page schema based on canonical format.

Return type:

Namespace


text_importer.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'text_importer') → PosixPath

Return the resource at path in package, using a context manager.


The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.

  • file_manager (contextlib.ExitStack) – Context manager.

  • path (str) – Path to the desired resource in given package.

  • package (str, optional) – Package name. Defaults to “text_importer”.


Returns:

Path to desired managed resource.

Return type:

PosixPath


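The calling pattern for get_pkg_resource (create the ExitStack first, close it when the resource is no longer needed) can be sketched with the standard library alone. This is an assumption about how such a helper might be built on importlib.resources, not the package's actual code; the stdlib json package stands in for a real data package.

```python
from contextlib import ExitStack
from importlib import resources
from pathlib import Path


def get_pkg_resource_sketch(file_manager: ExitStack, path: str, package: str) -> Path:
    """Return a filesystem path to `path` inside `package`, kept alive by file_manager."""
    ref = resources.files(package) / path
    # as_file() yields a real path; registering it on the ExitStack defers cleanup
    # until the caller closes the stack.
    return file_manager.enter_context(resources.as_file(ref))


with ExitStack() as file_manager:
    # The stdlib "json" package is only a stand-in for illustration.
    init_path = get_pkg_resource_sketch(file_manager, "__init__.py", "json")
    found = init_path.exists()
```

Keeping the ExitStack outside the helper lets one stack manage several resources and release them together.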
text_importer.utils.get_reading_order(items: list[dict[str, Any]]) → dict[str, int]

Generate a reading order for items based on their id and the pages they span.

This reading order can be used to display the content items properly in a table of contents without skipping from page to page.


items (list[dict[str, Any]]) – List of items to reorder for the ToC.


Returns:

A dictionary mapping item IDs to their reading order.

Return type:

dict[str, int]
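One way to derive such an ordering is to sort items by the pages they span and break ties by ID. The sketch below assumes that each item stores its ID under item["m"]["id"] and its page numbers under item["m"]["pp"]; both the layout and the exact tie-breaking rule are assumptions, and reading_order_sketch is not the library function.

```python
from typing import Any


def reading_order_sketch(items: list[dict[str, Any]]) -> dict[str, int]:
    """Order items by the pages they span, then by ID; return 1-based positions."""
    # Assumed layout: ID in item["m"]["id"], page numbers in item["m"]["pp"].
    ordered = sorted(items, key=lambda item: (item["m"]["pp"], item["m"]["id"]))
    return {item["m"]["id"]: position for position, item in enumerate(ordered, start=1)}


items = [
    {"m": {"id": "GDL-1950-01-01-a-i0003", "pp": [2]}},
    {"m": {"id": "GDL-1950-01-01-a-i0001", "pp": [1]}},
    {"m": {"id": "GDL-1950-01-01-a-i0002", "pp": [1, 2]}},
]
order = reading_order_sketch(items)
```

Because Python compares lists element by element, an item on page 1 sorts before one spanning pages 1 and 2, which in turn sorts before an item starting on page 2.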

text_importer.utils.init_logger(_logger: RootLogger, log_level: int, log_file: str) → RootLogger

Initialise the logger.

  • _logger (logging.RootLogger) – Logger instance to initialise.

  • log_level (int) – Desired logging level (e.g. logging.INFO).

  • log_file (str) – Path to destination file for logging output. If no output file is provided (log_file is None), logs will be written to standard output.


Returns:

The initialised logger object.

Return type:

RootLogger


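The file-or-stdout behaviour described for log_file can be sketched with the standard logging module. The handler setup and the message format below are assumptions made for illustration, and init_logger_sketch is not the library function.

```python
import logging
import sys
from typing import Optional


def init_logger_sketch(
    logger: logging.Logger, log_level: int, log_file: Optional[str]
) -> logging.Logger:
    """Attach one handler to `logger`: a file handler if log_file is set, else stdout."""
    logger.setLevel(log_level)
    if log_file is not None:
        handler = logging.FileHandler(log_file, mode="w")
    else:
        handler = logging.StreamHandler(sys.stdout)
    # The format string is an arbitrary choice for this sketch.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


logger = init_logger_sketch(logging.getLogger("importer-demo"), logging.INFO, None)
logger.info("Logger initialised")
```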
text_importer.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) → None

Verify that the imported issues fit expectations.

Two verifications are done: the number of content items, and their IDs.

  • actual_issue_json (dict[str, Any]) – Created issue JSON.

  • expected_issue_json (dict[str, Any]) – Expected issue JSON.
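The two checks (number of content items, then their IDs) might look like the sketch below. It assumes that content items are listed under the key "i" and that each carries its ID at item["m"]["id"]; this layout is an assumption, and verify_issue_sketch is not the library function.

```python
from typing import Any


def verify_issue_sketch(actual: dict[str, Any], expected: dict[str, Any]) -> None:
    """Check that both issues hold the same number of content items with the same IDs."""
    # Assumed layout: content items under "i", each with its ID at item["m"]["id"].
    actual_ids = {item["m"]["id"] for item in actual["i"]}
    expected_ids = {item["m"]["id"] for item in expected["i"]}
    assert len(actual["i"]) == len(expected["i"]), "content item count differs"
    assert actual_ids == expected_ids, "content item IDs differ"


expected_issue = {"i": [{"m": {"id": "GDL-1950-01-01-a-i0001"}}]}
actual_issue = {"i": [{"m": {"id": "GDL-1950-01-01-a-i0001"}}]}
verify_issue_sketch(actual_issue, expected_issue)  # passes silently
```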

text_importer.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) → None

Write the given error of a failed import to the failed_log file.

Adapted from impresso-text-acquisition/text_importer/importers/core.py to allow using an issue or page ID, and to provide the function in which the error took place.

  • thing_id (str) – Canonical ID of the object/file for which the error occurred.

  • origin_function (str) – Function in which the exception occurred.

  • error (Exception) – Error that occurred and should be logged.

  • failed_log (str) – Path to log file for failed imports.

text_importer.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) → None

Write the given contents to a JSONL file given its path.

Filelocks are used here to prevent concurrent writing to the files.

  • filepath (str) – Path to the JSONL file to write to.

  • contents (str | list[str]) – Dump contents to write to the file.

  • content_type (str) – Type of content that is being written to the file.

  • failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
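To illustrate why a file lock matters here, the sketch below guards an append to a JSONL file with an advisory lock from the standard fcntl module. This is a stand-in technique, available on POSIX systems only, and not necessarily the locking mechanism the package uses. The function name, the ".lock" sidecar file and the append mode are all assumptions.

```python
import fcntl
import json
import os
import tempfile


def write_jsonlines_sketch(filepath, contents):
    """Append one or more JSON strings to filepath, one per line, under an exclusive lock."""
    lines = [contents] if isinstance(contents, str) else list(contents)
    # A sidecar lock file (assumed naming) serialises writers across processes.
    with open(filepath + ".lock", "w") as lock_handle:
        fcntl.flock(lock_handle, fcntl.LOCK_EX)  # blocks while another writer holds the lock
        try:
            with open(filepath, "a", encoding="utf-8") as out:
                for line in lines:
                    out.write(line.rstrip("\n") + "\n")
        finally:
            fcntl.flock(lock_handle, fcntl.LOCK_UN)


target = os.path.join(tempfile.mkdtemp(), "issues.jsonl")
write_jsonlines_sketch(target, json.dumps({"id": "GDL-1950-01-01-a"}))
write_jsonlines_sketch(target, [json.dumps({"id": "GDL-1950-01-02-a"})])
with open(target, encoding="utf-8") as handle:
    written = [json.loads(line) for line in handle]
```

Without the lock, two workers appending to the same file at once could interleave partial lines and leave the JSONL file unparseable.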