Importers

Available importers

The Impresso Importers already support a number of formats (and flavors of standard formats), while a few others are currently being developed.

The following importer CLI scripts are already available:

  • text_preparation.scripts.oliveimporter: importer for the Olive XML format, used by RERO to encode and deliver the majority of its newspaper data.

  • text_preparation.scripts.reroimporter: importer for the Mets/ALTO flavor used by RERO to encode and deliver part of its data.

  • text_preparation.scripts.luximporter: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de Luxembourg (BNL) to encode and deliver its newspaper data.

  • text_preparation.scripts.bnfimporter: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data.

  • text_preparation.scripts.bnfen_importer: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data for the Europeana collection.

  • text_preparation.scripts.bcul_importer: importer for the ABBYY format used by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL) to encode and deliver the newspaper data available on the Scriptorium interface.

  • text_preparation.scripts.swaimporter: importer for the ALTO flavor of the Basel University Library.

  • text_preparation.scripts.blimporter: importer for the Mets/ALTO flavor used by the British Library (BL) to encode and deliver its newspaper data.

  • text_preparation.scripts.tetml: generic importer for the TETML format, produced by PDFlib TET.

  • text_preparation.scripts.fedgaz: importer for the TETML format with separate metadata file and a heuristic article segmentation, used to parse the Federal Gazette.

For further details on any of these implementations, please refer to their documentation below.

Command-line interface

Note

All importers share the same command-line interface; only a few options are import-specific (see documentation below).

Functions and CLI script to convert any OCR data into Impresso’s format.

Usage:

<importer-name>importer.py --input-dir=<id> (--clear | --incremental) [--output-dir=<od> --image-dirs=<imd> --temp-dir=<td> --chunk-size=<cs> --s3-bucket=<b> --config-file=<cf> --log-file=<f> --verbose --scheduler=<sch> --access-rights=<ar> --git-repo=<gr> --num-workers=<nw>]
<importer-name>importer.py --version

Options:

--input-dir=<id>        Base directory containing one sub-directory for each journal
--image-dirs=<imd>      Directory containing (canonical) images and their metadata (use commas to separate multiple directories)
--output-dir=<od>       Base directory where to write the output files
--temp-dir=<td>         Temporary directory to extract .zip archives
--config-file=<cf>      Configuration file for selective import
--s3-bucket=<b>         If provided, writes output to an S3 drive, in the specified bucket
--scheduler=<sch>       Tell dask to use an existing scheduler (otherwise it will create one)
--log-file=<f>          Log file; when missing, logs are printed to stdout
--access-rights=<ar>    Access rights file, if relevant (only for the olive and rero importers)
--chunk-size=<cs>       Chunk size in years used to group issues when importing
--git-repo=<gr>         Local path to the "impresso-text-acquisition" git directory (including it)
--num-workers=<nw>      Number of workers to use for the local dask cluster
--verbose               Verbose log messages (good for debugging)
--clear                 Removes the output folder (if it already exists)
--incremental           Skips issues already present in the output directory
--version               Prints version and exits
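For example, a minimal local run of the Olive importer could look as follows (the paths are placeholders; adjust them to your setup):

oliveimporter.py --input-dir=/path/to/original/data \
    --output-dir=/path/to/canonical/output \
    --clear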

Configuration file

The selection of the actual newspaper data to be imported can be controlled by means of a configuration file (JSON format). The path to this file is passed via the --config-file=<cf> CLI parameter.

This JSON file contains three properties:

  • newspapers: a dictionary whose keys are the newspaper IDs to be imported (e.g. GDL) and whose values are optional date ranges restricting the import (an empty value means all available issues);

  • exclude_newspapers: a list of the newspaper IDs to be excluded;

  • year_only: a boolean flag indicating whether date ranges are expressed using years only or more granular dates (in the format YYYY/MM/DD).

Note

When ingesting large amounts of data, these configuration files can help you organise your data imports into batches or homogeneous collections.

Here is a simple configuration file:

{
  "newspapers": {
      "GDL": []
    },
  "exclude_newspapers": [],
  "year_only": false
}

This is what a more complex configuration file looks like (here, only the contents of GDL for the decade 1950-1960 are processed):

{
  "newspapers": {
      "GDL": "1950/01/01-1960/12/31"
    },
  "exclude_newspapers": [],
  "year_only": false
}
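Assuming this configuration is saved as e.g. import_config.json (a placeholder path), it can be passed to any of the importers via the --config-file option:

oliveimporter.py --input-dir=/path/to/original/data \
    --output-dir=/path/to/canonical/output \
    --config-file=import_config.json \
    --incremental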

Writing a new importer

Writing a new importer is easy and entails implementing two pieces of code:

  1. functions to detect the data to import;

  2. classes that handle the conversion of your OCR format into JSON, either written from scratch or adapted from one of the existing importers.

Once these two pieces of code are in place, they can be plugged into the functions defined in text_preparation.importers.generic_importer so as to create a dedicated CLI script for your specific format.

For example, this is the content of oliveimporter.py:

from text_preparation.importers import generic_importer
from text_preparation.importers.olive.classes import OliveNewspaperIssue
from text_preparation.importers.olive.detect import (olive_detect_issues,
                                                     olive_select_issues)

if __name__ == '__main__':
    generic_importer.main(
        OliveNewspaperIssue,
        olive_detect_issues,
        olive_select_issues
    )

How should the code of a new text importer be structured? We recommend complying with the following structure:

  • text_preparation.importers.<new_importer>.detect will contain functions to find the data to be imported;

  • text_preparation.importers.<new_importer>.helpers (optional) will contain ancillary functions;

  • text_preparation.importers.<new_importer>.parsers (optional) will contain functions/classes to parse the data;

  • text_preparation/scripts/<new_importer>.py will contain a CLI script to run the importer.

Detect data to import

  • the importer needs to know which data should be imported

  • information about the newspaper contents is often encoded as part of folder names, etc.; it thus needs to be extracted and made explicit by means of Canonical identifiers

  • add some sample data to text_preparation/data/sample/<new_format>

For example: olive_detect_issues()
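As a rough illustration, a detect function typically walks the input directory, extracts journal, date and edition from folder names, and returns one descriptor per issue. The sketch below is hypothetical and defines its own minimal IssueDir stand-in; a real importer should use the IssueDir structure shipped with text_preparation (whose exact fields may differ) and a folder layout matching its source data.

import os
from collections import namedtuple
from datetime import date

# Hypothetical stand-in: a real importer should use the IssueDir
# structure provided by text_preparation (its fields may differ).
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def myformat_detect_issues(base_dir: str) -> list[IssueDir]:
    """Detect issues, assuming <journal>/<YYYY-MM-DD-e> sub-folders."""
    issues = []
    for journal in os.listdir(base_dir):
        journal_path = os.path.join(base_dir, journal)
        if not os.path.isdir(journal_path):
            continue
        for folder in os.listdir(journal_path):
            try:
                # e.g. "1900-01-02-a" -> publication date + edition letter
                year, month, day, edition = folder.split("-")
                issue_date = date(int(year), int(month), int(day))
            except ValueError:
                continue  # not an issue folder, skip it
            issues.append(
                IssueDir(journal, issue_date, edition,
                         os.path.join(journal_path, folder))
            )
    return issues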

Implement abstract classes

These two classes are passed to the importer's generic command-line interface; see text_preparation.importers.generic_importer.main().

class text_preparation.importers.classes.NewspaperIssue(issue_dir: IssueDir)

Abstract class representing a newspaper issue.

Each text importer needs to define a subclass of NewspaperIssue which specifies the logic to handle OCR data in a given format (e.g. Olive).

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

Attributes:

  • id (str): Canonical Issue ID (e.g. GDL-1900-01-02-a).

  • edition (str): Lower-case letter ordering issues of the same day.

  • journal (str): Newspaper unique identifier or name.

  • path (str): Path to the directory containing the issue's OCR data.

  • date (datetime.date): Publication date of the issue.

  • issue_data (dict[str, Any]): Issue data according to the canonical format.

  • pages (list): List of NewspaperPage instances from this issue.

  • rights (str): Access rights applicable to this issue.

property issuedir: IssueDir

IssueDirectory corresponding to this issue.

to_json() → str

Validates self.issue_data and serializes it to a string.

Note

Validation adds a substantial overhead to computing time. When serializing large amounts of issues, it is therefore advisable to bypass schema validation.

class text_preparation.importers.classes.NewspaperPage(_id: str, number: int)

Abstract class representing a newspaper page.

Each text importer needs to define a subclass of NewspaperPage which specifies the logic to handle OCR data in a given format (e.g. Alto).

Parameters:
  • _id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

  • number (int) – Page number.

Attributes:

  • id (str): Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

  • number (int): Page number.

  • page_data (dict[str, Any]): Page data according to the canonical format.

  • issue (NewspaperIssue | None): Issue this page is from.

abstract add_issue(issue: NewspaperIssue) → None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

abstract parse() → None

Process the page XML file and transform it into the canonical Page format.

Note

Parsing is lazy: the page contents are not processed upon creation of the page object, but only once the parse() method is called.

to_json() → str

Validates self.page_data and serializes it to a string.

Note

Validation adds a substantial overhead to computing time. When serializing large amounts of pages, it is therefore advisable to bypass schema validation.
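To make this contract concrete, here is a minimal, hypothetical skeleton of the two subclasses a new importer provides; the actual parsing logic (and any additional constructor arguments) depends entirely on the source format.

from text_preparation.importers.classes import NewspaperIssue, NewspaperPage


class MyFormatNewspaperPage(NewspaperPage):
    """Page of an issue in the (hypothetical) MyFormat OCR format."""

    def add_issue(self, issue: NewspaperIssue) -> None:
        # Keep a reference to the parent issue, so that contextual
        # information remains available when parsing the page.
        self.issue = issue

    def parse(self) -> None:
        # Called lazily: read the page's OCR file here and fill
        # self.page_data with the canonical page representation.
        ...


class MyFormatNewspaperIssue(NewspaperIssue):
    """Issue in the (hypothetical) MyFormat OCR format."""

    def __init__(self, issue_dir):
        super().__init__(issue_dir)
        # Find the issue's pages and build self.issue_data
        # from the source files.
        self._find_pages()

    def _find_pages(self) -> None:
        # Instantiate one MyFormatNewspaperPage per page file found
        # under self.path and append it to self.pages.
        ...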

Write an importer CLI script

This script passes the newly defined NewspaperIssue subclass, together with the newly defined detect functions, to the main() function of the generic importer CLI, text_preparation.importers.generic_importer.main().
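Following the oliveimporter.py example above, the script for a hypothetical MyFormat importer (text_preparation/scripts/myformatimporter.py) would look like this:

from text_preparation.importers import generic_importer
# Hypothetical modules, following the package structure recommended above:
from text_preparation.importers.myformat.classes import MyFormatNewspaperIssue
from text_preparation.importers.myformat.detect import (
    myformat_detect_issues,
    myformat_select_issues,
)

if __name__ == '__main__':
    generic_importer.main(
        MyFormatNewspaperIssue,
        myformat_detect_issues,
        myformat_select_issues
    )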

Test

Create a new test file named test_<new_importer>_importer.py and add it to tests/importers/.

This file should contain, at the very minimum, a test called test_import_issues(), which

  • detects input data from text_preparation/data/sample/<new_format>;

  • writes any output to text_preparation/data/out/.
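A minimal sketch of such a test, again with hypothetical MyFormat names, could look as follows:

import os

from text_preparation.importers.myformat.classes import MyFormatNewspaperIssue
from text_preparation.importers.myformat.detect import myformat_detect_issues

INPUT_DIR = "text_preparation/data/sample/myformat"
OUTPUT_DIR = "text_preparation/data/out/"


def test_import_issues():
    # Detect the sample issues shipped with the repository.
    issue_dirs = myformat_detect_issues(INPUT_DIR)
    assert len(issue_dirs) > 0

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for issue_dir in issue_dirs:
        issue = MyFormatNewspaperIssue(issue_dir)
        # Serialize each issue to the canonical JSON format.
        out_file = os.path.join(OUTPUT_DIR, f"{issue.id}.json")
        with open(out_file, "w", encoding="utf-8") as f:
            f.write(issue.to_json())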