TextImporter

Available importers

The Impresso TextImporter already supports a number of formats (and flavors of standard formats), while a few others are currently being developed.

The following importer CLI scripts are already available:

  • text_importer.scripts.oliveimporter: importer for the Olive XML format, used by RERO to encode and deliver the majority of its newspaper data.

  • text_importer.scripts.reroimporter: importer for the Mets/ALTO flavor used by RERO to encode and deliver part of its data.

  • text_importer.scripts.luximporter: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de Luxembourg (BNL) to encode and deliver its newspaper data.

  • text_importer.scripts.bnfimporter: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data.

  • text_importer.scripts.bnfen_importer: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data for the Europeana collection.

  • text_importer.scripts.bcul_importer: importer for the ABBYY format used by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL) to encode and deliver the newspaper data available on the Scriptorium interface.

  • text_importer.scripts.swaimporter: importer for the ALTO flavor used by the Basel University Library.

  • text_importer.scripts.blimporter: importer for the Mets/ALTO flavor used by the British Library (BL) to encode and deliver its newspaper data.

  • text_importer.scripts.tetml: generic importer for the TETML format, produced by PDFlib TET.

  • text_importer.scripts.fedgaz: importer for the TETML format with separate metadata file and a heuristic article segmentation, used to parse the Federal Gazette.

For further details on any of these implementations, please refer to their respective documentation.

Command-line interface

Note

All importers share the same command-line interface; only a few options are import-specific (see documentation below).

Functions and CLI script to convert any OCR data into Impresso’s format.

Usage:

<importer-name>importer.py --input-dir=<id> (--clear | --incremental) [--output-dir=<od> --image-dirs=<imd> --temp-dir=<td> --chunk-size=<cs> --s3-bucket=<b> --config-file=<cf> --log-file=<f> --verbose --scheduler=<sch> --access-rights=<ar> --git-repo=<gr> --num-workers=<nw>]
<importer-name>importer.py --version

Options:
--input-dir=<id>

Base directory containing one sub-directory for each journal

--image-dirs=<imd>

Directory containing (canonical) images and their metadata (use , to separate multiple dirs)

--output-dir=<od>

Base directory where to write the output files

--temp-dir=<td>

Temporary directory to extract .zip archives

--config-file=<cf>

Configuration file for selective import

--s3-bucket=<b>

If provided, writes output to S3, in the specified bucket

--scheduler=<sch>

Tell dask to use an existing scheduler (otherwise it’ll create one)

--log-file=<f>

Log file; when missing, logs are printed to stdout

--access-rights=<ar>

Access rights file, if relevant (only for the olive and rero importers)

--chunk-size=<cs>

Chunk size in years used to group issues when importing

--git-repo=<gr>

Local path to the “impresso-text-acquisition” git directory (including the directory itself).

--num-workers=<nw>

Number of workers to use for local dask cluster

--verbose

Verbose log messages (good for debugging)

--clear

Removes the output folder (if already existing)

--incremental

Skips issues already present in output directory

--version

Prints version and exits.
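
For example, a run of the Olive importer over local data might look as follows (a hedged illustration; all paths and values are placeholders):

oliveimporter.py --input-dir=/data/original/RERO \
  --output-dir=/data/canonical \
  --access-rights=/data/access_rights.json \
  --log-file=olive-import.log \
  --num-workers=8 \
  --incremental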

Configuration file

The selection of the actual newspaper data to be imported can be controlled by means of a configuration file (JSON format). The path to this file is passed via the --config-file=<cf> CLI parameter.

This JSON file contains three properties:

  • newspapers: a dictionary whose keys are the newspaper IDs to be imported (e.g. GDL) and whose values can optionally restrict the date range to import (see the examples below);

  • exclude_newspapers: a list of the newspaper IDs to be excluded;

  • year_only: a boolean flag indicating whether date ranges are expressed using years only, or more granular dates (in the format YYYY/MM/DD).

Note

When ingesting large amounts of data, these configuration files can help you organise your data imports into batches or homogeneous collections.

Here is a simple configuration file:

{
  "newspapers": {
      "GDL": []
    },
  "exclude_newspapers": [],
  "year_only": false
}

Here is a more complex configuration file, where only the contents of GDL for the years 1950-1960 are processed:

{
  "newspapers": {
      "GDL": "1950/01/01-1960/12/31"
    },
  "exclude_newspapers": [],
  "year_only": false
}
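
As a hedged sketch, if year_only were set to true, the same selection would presumably be expressed with years alone (the exact year-range syntax is an assumption based on the year_only description above):

{
  "newspapers": {
      "GDL": "1950-1960"
    },
  "exclude_newspapers": [],
  "year_only": true
}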

Utilities

This module contains generic helper functions for the text-importer module.

text_importer.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) → dict[str, Any]

Add a property to a given object dict, with its value computed by a given function.

Parameters:
  • object_dict (dict[str, Any]) – Object to which the property is added.

  • prop_name (str) – Name of the property to add.

  • prop_function (Callable[[str], str]) – Function computing the property value.

  • function_input (str) – Input to prop_function for this object.

Returns:

Updated object.

Return type:

dict[str, Any]
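
As an illustration, here is a minimal hedged usage sketch: we attach a hypothetical n_tokens property (the property name and function are invented for this example) whose value is computed from a raw text string:

from text_importer.utils import add_property

item = {"id": "GDL-1950-01-01-a-i0001"}
item = add_property(item, "n_tokens", lambda text: str(len(text.split())), "an OCR text sample")
# item == {"id": "GDL-1950-01-01-a-i0001", "n_tokens": "4"}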

text_importer.utils.empty_folder(dir_path: str) → None

Empty a directory given its path, if it exists.

Parameters:

dir_path (str) – Path to the directory to empty.
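
A one-line usage sketch (the path is a placeholder):

from text_importer.utils import empty_folder

empty_folder("/tmp/text_importer_extractions")  # empties the directory if it exists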

text_importer.utils.get_access_right(journal: str, _date: date, access_rights: dict[str, dict[str, str]]) → str

Fetch the access rights for a specific journal and publication date.

Parameters:
  • journal (str) – Journal name.

  • _date (date) – Publication date of the journal.

  • access_rights (dict[str, dict[str, str]]) – Access rights for various journals.

Returns:

Access rights for specific journal issue.

Return type:

str

text_importer.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/newspaper/issue.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the JSON schema file. Defaults to “impresso-schemas/json/newspaper/issue.schema.json”.

Returns:

Newspaper issue schema based on canonical format.

Return type:

pjs.util.Namespace

text_importer.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/newspaper/page.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the JSON schema file. Defaults to “impresso-schemas/json/newspaper/page.schema.json”.

Returns:

Newspaper page schema based on canonical format.

Return type:

pjs.util.Namespace
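
Both functions load one of the impresso JSON schemas and return the generated classes, which can then be used to build and validate canonical issue and page objects. A minimal hedged sketch (what the returned Namespace exposes depends on the schemas themselves):

from text_importer.utils import get_issue_schema, get_page_schema

IssueSchema = get_issue_schema()
PageSchema = get_page_schema()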

text_importer.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'text_importer') → PosixPath

Return the resource at path in package, using a context manager.

Note

The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.

Parameters:
  • file_manager (contextlib.ExitStack) – Context manager.

  • path (str) – Path to the desired resource in given package.

  • package (str, optional) – Package name. Defaults to “text_importer”.

Returns:

Path to desired managed resource.

Return type:

pathlib.PosixPath
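
A minimal usage sketch following the note above: the ExitStack is instantiated first and closed once the resource is no longer needed:

from contextlib import ExitStack
from text_importer.utils import get_pkg_resource

file_manager = ExitStack()
schema_path = get_pkg_resource(file_manager, "impresso-schemas/json/newspaper/issue.schema.json")
# ... read the schema from schema_path ...
file_manager.close()  # release the managed resource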

text_importer.utils.get_reading_order(items: list[dict[str, Any]]) → dict[str, int]

Generate a reading order for items based on their ID and the pages they span.

This reading order can be used to display the content items properly in a table of contents without skipping from page to page.

Parameters:

items (list[dict[str, Any]]) – List of items to reorder for the ToC.

Returns:

A dictionary mapping item IDs to their reading order.

Return type:

dict[str, int]
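
A hedged sketch, assuming the canonical content-item layout in which each item’s metadata block m carries its id and the list of pages pp it spans (this layout is an assumption of the example):

from text_importer.utils import get_reading_order

items = [
    {"m": {"id": "GDL-1950-01-01-a-i0002", "pp": [1, 2]}},
    {"m": {"id": "GDL-1950-01-01-a-i0001", "pp": [1]}},
]
order = get_reading_order(items)
# e.g. {"GDL-1950-01-01-a-i0001": 1, "GDL-1950-01-01-a-i0002": 2}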

text_importer.utils.init_logger(_logger: RootLogger, log_level: int, log_file: str) → RootLogger

Initialise the logger.

Parameters:
  • _logger (logging.RootLogger) – Logger instance to initialise.

  • log_level (int) – Desired logging level (e.g. logging.INFO).

  • log_file (str) – Path to destination file for logging output. If no output file is provided (log_file is None) logs will be written to standard output.

Returns:

The initialised logger object.

Return type:

logging.RootLogger
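
For instance, to log at INFO level to a file (passing None as log_file would send logs to standard output instead):

import logging
from text_importer.utils import init_logger

logger = init_logger(logging.getLogger(), logging.INFO, "import.log")
logger.info("Importer initialised.")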

text_importer.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) → None

Verify that the imported issues fit expectations.

Two verifications are done: the number of content items, and their IDs.

Parameters:
  • actual_issue_json (dict[str, Any]) – Created issue JSON.

  • expected_issue_json (dict[str, Any]) – Expected issue JSON.

text_importer.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) → None

Write the given error of a failed import to the failed_log file.

Adapted from impresso-text-acquisition/text_importer/importers/core.py to allow using an issue or page ID, and to provide the function in which the error took place.

Parameters:
  • thing_id (str) – Canonical ID of the object/file for which the error occurred.

  • origin_function (str) – Function in which the exception occurred.

  • error (Exception) – Error that occurred and should be logged.

  • failed_log (str) – Path to log file for failed imports.
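
A hedged usage sketch; the canonical page ID and the function name are invented for the example:

from text_importer.utils import write_error

try:
    raise ValueError("malformed OCR box")
except Exception as e:
    write_error("GDL-1950-01-01-a-p0001", "parse_page", e, "failed.log")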

text_importer.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) → None

Write the given contents to a JSONL file given its path.

File locks are used here to prevent concurrent writing to the file.

Parameters:
  • filepath (str) – Path to the JSONL file to write to.

  • contents (str | list[str]) – Dump contents to write to the file.

  • content_type (str) – Type of content that is being written to the file.

  • failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
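
A minimal sketch, assuming content_type is a free-form label describing what is being written (the value “page” is illustrative):

import json
from text_importer.utils import write_jsonlines_file

records = [json.dumps({"id": "example-1"}), json.dumps({"id": "example-2"})]
write_jsonlines_file("out/pages.jsonl", records, "page")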