Overview

Data architecture

The Impresso TextImporter is part of the data architecture defined within the framework of the impresso project to store and process a large-scale archive of historical newspapers. To understand the importer's logic, it is worth touching upon the key points of the architecture into which it fits.

Canonical identifiers

Canonical identifiers are defined at the following levels:

  1. newspaper issue

  2. newspaper page

  3. content item (e.g. article, advertisement, weather forecast, obituary, etc.)

Issue IDs

  • template: {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}

  • examples: GDL-1900-01-02-a, luxzeit1858-1858-12-7-a

Page IDs

  • template: {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-p{page_number}

  • examples: GDL-1900-01-02-a-p0004, luxzeit1858-1858-12-7-a-p0002

Content item IDs

  • template: {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-i{item_number}

  • examples: GDL-1900-01-02-a-i0048, JDG-1901-01-01-a-i0031

Some things to note about these templates:

  • newspaper_id is an arbitrary string, containing no white space, that unambiguously identifies a given newspaper

  • page_number is a four-digit integer, zero-padded on the left

  • edition: in case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: a for the first, b for the second, etc.

  • item_number: a four-digit integer, zero-padded on the left; NB: content item IDs are expected to remain stable across any two runs of the importer given the same input data.
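As an illustration, the templates above can be rendered with a few string-formatting helpers (a sketch: these function names are hypothetical and not part of the TextImporter API; month and day are zero-padded here, following the GDL examples above):

```python
from datetime import date

def make_issue_id(newspaper_id: str, d: date, edition: str) -> str:
    # {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}
    return f"{newspaper_id}-{d.year}-{d.month:02}-{d.day:02}-{edition}"

def make_page_id(issue_id: str, page_number: int) -> str:
    # page numbers are four-digit, zero-padded integers
    return f"{issue_id}-p{page_number:04}"

def make_content_item_id(issue_id: str, item_number: int) -> str:
    # content item numbers follow the same four-digit convention
    return f"{issue_id}-i{item_number:04}"

issue = make_issue_id("GDL", date(1900, 1, 2), "a")
print(issue)                            # GDL-1900-01-02-a
print(make_page_id(issue, 4))           # GDL-1900-01-02-a-p0004
print(make_content_item_id(issue, 48))  # GDL-1900-01-02-a-i0048
```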

Data packaging

The JSON data produced by the TextImporter are packaged into .bz2 archives for efficient storage. Each archive consists of one JSON-lines file, where each line contains a JSON document. The JSON schemas are described here.

In Impresso we use a distributed S3 storage solution to store newspaper data, which are then accessed at processing time.

Issue data

Issue data are packaged by newspaper and by year (as individual issue documents tend to be very small files). Each archive contains, one document per line, all issues of a newspaper that appeared in that year.

Example: GDL-1900-issues.jsonl.bz2 contains all issues of the Gazette de Lausanne published in 1900.

Page data

Page data are packaged by newspaper issue. Each archive contains, one document per line, the JSON documents of all pages belonging to a given newspaper issue (edition).

Example: GDL-1900-01-01-a-pages.jsonl.bz2 contains all pages of the Gazette de Lausanne (= GDL) issue published on January 1, 1900.
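The packaging format itself can be reproduced with the Python standard library alone (a sketch of the file format, not the importer's actual serialization code; the document fields shown are placeholders):

```python
import bz2
import json

issues = [
    {"id": "GDL-1900-01-02-a", "pp": 4},
    {"id": "GDL-1900-01-03-a", "pp": 6},
]

# Write one JSON document per line into a bz2-compressed archive.
with bz2.open("GDL-1900-issues.jsonl.bz2", "wt", encoding="utf-8") as f:
    for issue in issues:
        f.write(json.dumps(issue) + "\n")

# Read the archive back, one document per line.
with bz2.open("GDL-1900-issues.jsonl.bz2", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["id"])  # GDL-1900-01-02-a
```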

Image data

Image data are expected to be delivered via a dedicated IIIF endpoint and are typically stored in an image server. To each newspaper page corresponds one image file.

Note

In case the canonical ID of a page and the internal ID of its image differ, the content provider is expected to provide a mapping between the two identifier systems.

Processing

Core functions to perform large-scale import of OCR data.

Most of the functions in this module are meant to be used in conjunction with Dask, the library we use to parallelize the ingestion process and run it on distributed computing infrastructures.

Note

The function import_issues() is the most important in this module: it keeps everything together by calling all other functions.

text_importer.importers.core.cleanup(upload_success: bool, filepath: str) None

Remove a file if it has been successfully uploaded to S3.

Copied and adapted from impresso-pycommons.

Parameters:
  • upload_success (bool) – Whether the upload was successful

  • filepath (str) – Path to the uploaded file

text_importer.importers.core.compress_issues(key: Tuple[str, int], issues: list[NewspaperIssue], output_dir: str | None = None, failed_log: str | None = None) Tuple[str, str, list[dict[str, int]]]

Compress issues of the same journal and year, and save them in a JSON file.

First check if the file exists; if so, load it and then overwrite or add the newly generated issues. The compressed .bz2 output file is a JSON-lines file, where each line corresponds to an individual issue document in the canonical format. Finally, yearly statistics are computed on the issues and included in the returned values.

Parameters:
  • key (Tuple[str, int]) – Newspaper ID and year of input issues (e.g. (GDL, 1900)).

  • issues (list[NewspaperIssue]) – A list of NewspaperIssue instances.

  • output_dir (str | None, optional) – Output directory. Defaults to None.

  • failed_log (str | None, optional) – Path to the log file used when an instantiation was not successful. Defaults to None.

Returns:

Label following the template <NEWSPAPER>-<YEAR>, the path to the compressed .bz2 file, and the statistics computed on the issues.

Return type:

Tuple[str, str, list[dict[str, int]]]

text_importer.importers.core.compress_pages(key: str, json_files: list[str], output_dir: str, suffix: str = '', failed_log: str | None = None) Tuple[str, str]

Merge a set of JSON line files into a single compressed archive.

Parameters:
  • key (str) – Canonical ID of the newspaper issue (e.g. GDL-1900-01-02-a).

  • json_files (list[str]) – Paths of input JSON line files.

  • output_dir (str) – Directory where to write the output file.

  • suffix (str, optional) – Suffix to add to the filename. Defaults to “”.

Returns:

Sorting key [0] and path to serialized file [1].

Return type:

Tuple[str, str]

text_importer.importers.core.dir2issue(issue: IssueDir, issue_class: Type[NewspaperIssue], failed_log: str | None = None, image_dirs: str | None = None, temp_dir: str | None = None) NewspaperIssue | None

Instantiate a NewspaperIssue object from an IssueDir.

Any instantiation leading to an exception is logged to a specific file only containing issues which could not be imported.

Parameters:
  • issue (IssueDir) – IssueDir representing the issue to instantiate.

  • issue_class (Type[NewspaperIssue]) – Type of NewspaperIssue to use.

  • failed_log (str | None, optional) – Path to the log file used if the instantiation was not successful. Defaults to None.

  • image_dirs (str | None, optional) – Path to the directory containing the information on images, only for Olive importer. Defaults to None.

  • temp_dir (str | None, optional) – Temporary directory to unpack the issue’s zip archive into. Defaults to None.

Returns:

A new NewspaperIssue instance, or None if the instantiation triggered an exception.

Return type:

NewspaperIssue | None

text_importer.importers.core.dirs2issues(issues: list[IssueDir], issue_class: Type[NewspaperIssue], failed_log: str | None = None, image_dirs: str | None = None, temp_dir: str | None = None) list[NewspaperIssue]

Instantiate the NewspaperIssue objects to import to Impresso’s format.

Any NewspaperIssue for which the instantiation is unsuccessful will be logged, along with the triggered error.

Parameters:
  • issues (list[IssueDir]) – List of issues to instantiate and import.

  • issue_class (Type[NewspaperIssue]) – Type of NewspaperIssue to use.

  • failed_log (str | None, optional) – Path to the log file used when an instantiation was not successful. Defaults to None.

  • image_dirs (str | None, optional) – Path to the directory containing the information on images, only for Olive importer. Defaults to None.

  • temp_dir (str | None, optional) – Temporary directory to unpack zip archives of issues into. Defaults to None.

Returns:

List of NewspaperIssue instances to import.

Return type:

list[NewspaperIssue]
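The skip-and-log pattern used here can be sketched generically (`instantiate_all` is a hypothetical name, and `factory` stands in for the NewspaperIssue constructor):

```python
def instantiate_all(inputs, factory, failed_log=None):
    """Apply `factory` to each input, logging failures instead of aborting."""
    results = []
    for item in inputs:
        try:
            results.append(factory(item))
        except Exception as e:
            # A failed instantiation is recorded and does not stop the run.
            if failed_log is not None:
                with open(failed_log, "a", encoding="utf-8") as f:
                    f.write(f"Error for {item}: {e}\n")
    return results
```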

text_importer.importers.core.import_issues(issues: list[IssueDir], out_dir: str, s3_bucket: str | None, issue_class: Type[NewspaperIssue], image_dirs: str | None, temp_dir: str | None, chunk_size: int | None, manifest: DataManifest, client: Client | None = None) None

Import a bunch of newspaper issues.

Parameters:
  • issues (list[IssueDir]) – Issues to import.

  • out_dir (str) – Output directory for the json files.

  • s3_bucket (str | None) – Output s3 bucket for the json files.

  • issue_class (Type[NewspaperIssue]) – Newspaper issue class to import (a child of NewspaperIssue).

  • image_dirs (str | None) – Directory of images for the Olive format (can be multiple).

  • temp_dir (str | None) – Temporary directory for extracting archives (applies only to importers that make use of ZipArchive).

  • chunk_size (int | None) – Chunk size in years used to process issues.

text_importer.importers.core.issue2pages(issue: NewspaperIssue) list[NewspaperPage]

Flatten an issue into a list of its pages.

As an issue consists of several pages, this function is useful in order to process each page in a truly parallel fashion.

Parameters:

issue (NewspaperIssue) – Issue to collect the pages of.

Returns:

List of pages of the given issue.

Return type:

list[NewspaperPage]
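With dicts standing in for issue objects, the flattening looks as follows: the page lists of all issues end up in one flat list, so that each page can be parsed as an independent task.

```python
issues = [
    {"id": "GDL-1900-01-02-a", "pages": ["p0001", "p0002"]},
    {"id": "GDL-1900-01-03-a", "pages": ["p0001"]},
]

# Flatten: one list of pages, regardless of the issue they belong to.
pages = [page for issue in issues for page in issue["pages"]]
print(len(pages))  # 3
```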

text_importer.importers.core.process_pages(pages: list[NewspaperPage], failed_log: str) list[NewspaperPage]

Given a list of pages, trigger the .parse() method of each page.

Parameters:
  • pages (list[NewspaperPage]) – Input newspaper pages.

  • failed_log (str) – File path of failed log.

Returns:

A list of processed pages.

Return type:

list[NewspaperPage]

text_importer.importers.core.remove_filelocks(output_dir: str) None

Remove all files ending with .lock in a directory.

Parameters:

output_dir (str) – Path to directory containing file locks.

text_importer.importers.core.serialize_pages(pages: list[NewspaperPage], output_dir: str | None = None) list[Tuple[IssueDir, str]]

Serialize a list of pages to an output directory.

Parameters:
  • pages (list[NewspaperPage]) – Input newspaper pages.

  • output_dir (str | None, optional) – Path to the output directory. Defaults to None.

Returns:

A list of tuples (IssueDir, path), where the IssueDir object represents the issue to which the pages belong, and path is the path to the individual page JSON file.

Return type:

list[Tuple[IssueDir, str]]

text_importer.importers.core.upload_issues(sort_key: str, filepath: str, bucket_name: str | None = None, failed_log: str | None = None) Tuple[bool, str]

Upload an issues JSON-line file to a given S3 bucket.

sort_key is expected to be the concatenation of newspaper ID and year.

Parameters:
  • sort_key (str) – Key used to group articles (e.g. “GDL-1900”).

  • filepath (str) – Path of the file to upload to S3.

  • bucket_name (str | None, optional) – Name of S3 bucket where to upload the file. Defaults to None.

  • failed_log (str | None, optional) – Path to file where to log errors.

Returns:

Whether the upload was successful and the path to the uploaded file.

Return type:

Tuple[bool, str]

text_importer.importers.core.upload_pages(sort_key: str, filepath: str, bucket_name: str | None = None, failed_log: str | None = None) Tuple[bool, str]

Upload a page JSON file to a given S3 bucket.

Parameters:
  • sort_key (str) – Key used to group articles (e.g. “GDL-1900-01-01-a”).

  • filepath (str) – Path of the file to upload to S3.

  • bucket_name (str | None, optional) – Name of S3 bucket where to upload the file. Defaults to None.

  • failed_log (str | None, optional) – Path to file where to log errors.

Returns:

Whether the upload was successful and the path to the uploaded file.

Return type:

Tuple[bool, str]

text_importer.importers.core.write_error(thing: NewspaperIssue | NewspaperPage | IssueDir | str, error: Exception | str, failed_log: str | None) None

Write the given error of a failed import to the failed_log file.

Parameters:
  • thing (NewspaperIssue | NewspaperPage | IssueDir | str) – Object for which the error occurred, or corresponding canonical ID.

  • error (Exception | str) – Error that occurred and should be logged.

  • failed_log (str | None) – Path to the log file for failed imports.