Overview

Data architecture

The Impresso TextImporter is part of the data architecture defined within the framework of the impresso project to store and process a large-scale archive of historical newspapers. To understand the importer's logic, it is worth touching upon the key points of the architecture into which it fits.

Canonical identifiers

Canonical identifiers are defined at the following levels:

  1. newspaper issue

  2. newspaper page

  3. content item (e.g. article, advertisement, weather forecast, obituary, etc.)

Issue IDs

  • template: {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}

  • examples: GDL-1900-01-02-a, luxzeit1858-1858-12-7-a

Page IDs

  • template: {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-p{page_number}

  • examples: GDL-1900-01-02-a-p0004, luxzeit1858-1858-12-7-a-p0002

Content item IDs

  • template: {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-i{item_number}

  • examples: GDL-1900-01-02-a-i0048, JDG-1901-01-01-a-i0031

Some things to note about these templates:

  • newspaper_id is an arbitrary string, containing no white space, that unambiguously identifies a given newspaper

  • page_number is a four-digit integer, zero-padded on the left

  • edition: in case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: a for the first, b for the second, etc.

  • item_number: a four-digit integer, zero-padded on the left; NB: content item IDs are expected to remain stable across any two runs of the importer given the same input data.
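As an illustration, the templates above can be rendered with a few string-formatting helpers (a sketch: these function names are hypothetical and not part of the TextImporter API; month and day are zero-padded here, following the GDL examples above):

```python
from datetime import date

def make_issue_id(newspaper_id: str, d: date, edition: str) -> str:
    # {newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}
    return f"{newspaper_id}-{d.year}-{d.month:02}-{d.day:02}-{edition}"

def make_page_id(issue_id: str, page_number: int) -> str:
    # page numbers are four-digit, zero-padded integers
    return f"{issue_id}-p{page_number:04}"

def make_content_item_id(issue_id: str, item_number: int) -> str:
    # content item numbers follow the same four-digit convention
    return f"{issue_id}-i{item_number:04}"

issue = make_issue_id("GDL", date(1900, 1, 2), "a")
print(issue)                            # GDL-1900-01-02-a
print(make_page_id(issue, 4))           # GDL-1900-01-02-a-p0004
print(make_content_item_id(issue, 48))  # GDL-1900-01-02-a-i0048
```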

Data packaging

The JSON data produced by the TextImporter are packaged into .bz2 archives for efficient storage. Each archive consists of one JSON-lines file, where each line contains a JSON document. The JSON schemas are described here.

In Impresso we use a distributed S3 storage solution to store newspaper data, which are then accessed at processing time.

Issue data

Issue data are packaged by newspaper and by year (as individual issue documents tend to be very small files). Each archive contains, one document per line, all issues of a newspaper that appeared in that year.

Example: GDL-1900-issues.jsonl.bz2 contains all issues of the Gazette de Lausanne published in 1900.

Page data

Page data are packaged by newspaper issue. Each archive contains, one document per line, the JSON documents of all pages belonging to a given newspaper issue (edition).

Example: GDL-1900-01-01-a-pages.jsonl.bz2 contains all pages of the Gazette de Lausanne (= GDL) issue published on January 1, 1900.
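The packaging format itself can be reproduced with the Python standard library alone (a sketch of the file format, not the importer's actual serialization code; the document fields shown are placeholders):

```python
import bz2
import json

issues = [
    {"id": "GDL-1900-01-02-a", "pp": 4},
    {"id": "GDL-1900-01-03-a", "pp": 6},
]

# Write one JSON document per line into a bz2-compressed archive.
with bz2.open("GDL-1900-issues.jsonl.bz2", "wt", encoding="utf-8") as f:
    for issue in issues:
        f.write(json.dumps(issue) + "\n")

# Read the archive back, one document per line.
with bz2.open("GDL-1900-issues.jsonl.bz2", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["id"])  # GDL-1900-01-02-a
```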

Image data

Image data are expected to be delivered via a dedicated IIIF endpoint and are typically stored in an image server. To each newspaper page corresponds one image file.

Note

In case the canonical ID of a page and the internal ID of its image differ, the content provider is expected to provide a mapping between the two identifier systems.

Processing

Core functions to perform large-scale import of OCR data.

Most of the functions in this module are meant to be used in conjunction with Dask, the library we use to parallelize the ingestion process and run it on distributed computing infrastructures.

Note

The function import_issues() is the most important in this module: it keeps everything together by calling all other functions.

text_importer.importers.core.cleanup(upload_success: bool, filepath: str) None

Remove a file if it has been successfully uploaded to S3.

Copied and adapted from impresso-pycommons.

Parameters:
  • upload_success (bool) – Whether the upload was successful

  • filepath (str) – Path to the uploaded file

text_importer.importers.core.compress_issues(key: Tuple[str, int], issues: list[NewspaperIssue], output_dir: str | None = None, failed_log: str | None = None) Tuple[str, str, list[dict[str, int]]]

Compress issues of the same journal and year, and save them in a JSON file.

First check if the file exists; if so, load it and then overwrite or add the newly generated issues. The compressed .bz2 output file is a JSON-lines file, where each line corresponds to an individual issue document in the canonical format. Finally, yearly statistics are computed on the issues and included in the returned values.

Parameters:
  • key (Tuple[str, int]) – Newspaper ID and year of input issues (e.g. (GDL, 1900)).

  • issues (list[NewspaperIssue]) – A list of NewspaperIssue instances.

  • output_dir (str | None, optional) – Output directory. Defaults to None.

  • failed_log (str | None, optional) – Path to the log file used when an instantiation was not successful. Defaults to None.

Returns:

Label following the template <NEWSPAPER>-<YEAR>, the path to the compressed .bz2 file, and the statistics computed on the issues.

Return type:

Tuple[str, str, list[dict[str, int]]]

text_importer.importers.core.compress_pages(key: str, json_files: list[str], output_dir: str, suffix: str = '', failed_log: str | None = None) Tuple[str, str]

Merge a set of JSON line files into a single compressed archive.

Parameters:
  • key (str) – Canonical ID of the newspaper issue (e.g. GDL-1900-01-02-a).

  • json_files (list[str]) – Paths of input JSON line files.

  • output_dir (str) – Directory where to write the output file.

  • suffix (str, optional) – Suffix to add to the filename. Defaults to “”.

Returns:

Sorting key [0] and path to serialized file [1].

Return type:

Tuple[str, str]

text_importer.importers.core.dir2issue(issue: IssueDir, issue_class: Type[NewspaperIssue], failed_log: str | None = None, image_dirs: str | None = None, temp_dir: str | None = None) NewspaperIssue | None

Instantiate a NewspaperIssue object from an IssueDir.

Any instantiation leading to an exception is logged to a specific file only containing issues which could not be imported.

Parameters:
  • issue (IssueDir) – IssueDir representing the issue to instantiate.

  • issue_class (Type[NewspaperIssue]) – Type of NewspaperIssue to use.

  • failed_log (str | None, optional) – Path to the log file used if the instantiation was not successful. Defaults to None.

  • image_dirs (str | None, optional) – Path to the directory containing the information on images, only for Olive importer. Defaults to None.

  • temp_dir (str | None, optional) – Temporary directory to unpack the issue’s zip archive into. Defaults to None.

Returns:

A new NewspaperIssue instance, or None if the instantiation triggered an exception.

Return type:

NewspaperIssue | None

text_importer.importers.core.dirs2issues(issues: list[IssueDir], issue_class: Type[NewspaperIssue], failed_log: str | None = None, image_dirs: str | None = None, temp_dir: str | None = None) list[NewspaperIssue]

Instantiate the NewspaperIssue objects to import to Impresso’s format.

Any NewspaperIssue for which the instantiation is unsuccessful will be logged, along with the triggered error.

Parameters:
  • issues (list[IssueDir]) – List of issues to instantiate and import.

  • issue_class (Type[NewspaperIssue]) – Type of NewspaperIssue to use.

  • failed_log (str | None, optional) – Path to the log file used when an instantiation was not successful. Defaults to None.

  • image_dirs (str | None, optional) – Path to the directory containing the information on images, only for Olive importer. Defaults to None.

  • temp_dir (str | None, optional) – Temporary directory to unpack zip archives of issues into. Defaults to None.

Returns:

List of NewspaperIssue instances to import.

Return type:

list[NewspaperIssue]
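The skip-and-log pattern used here can be sketched generically (`instantiate_all` is a hypothetical name, and `factory` stands in for the NewspaperIssue constructor):

```python
def instantiate_all(inputs, factory, failed_log=None):
    """Apply `factory` to each input, logging failures instead of aborting."""
    results = []
    for item in inputs:
        try:
            results.append(factory(item))
        except Exception as e:
            # A failed instantiation is recorded and does not stop the run.
            if failed_log is not None:
                with open(failed_log, "a", encoding="utf-8") as f:
                    f.write(f"Error for {item}: {e}\n")
    return results
```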

text_importer.importers.core.import_issues(issues: list[IssueDir], out_dir: str, s3_bucket: str | None, issue_class: Type[NewspaperIssue], image_dirs: str | None, temp_dir: str | None, chunk_size: int | None, manifest: DataManifest, client: Client | None = None) None

Import a bunch of newspaper issues.

Parameters:
  • issues (list[IssueDir]) – Issues to import.

  • out_dir (str) – Output directory for the json files.

  • s3_bucket (str | None) – Output s3 bucket for the json files.

  • issue_class (Type[NewspaperIssue]) – Newspaper issue class to import (a child of NewspaperIssue).

  • image_dirs (str | None) – Directory of images for the Olive format (can be multiple).

  • temp_dir (str | None) – Temporary directory for extracting archives (applies only to importers that make use of ZipArchive).

  • chunk_size (int | None) – Chunk size in years used to process issues.

text_importer.importers.core.issue2pages(issue: NewspaperIssue) list[NewspaperPage]

Flatten an issue into a list of its pages.

As an issue consists of several pages, this function is useful in order to process each page in a truly parallel fashion.

Parameters:

issue (NewspaperIssue) – Issue to collect the pages of.

Returns:

List of pages of the given issue.

Return type:

list[NewspaperPage]
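With dicts standing in for issue objects, the flattening looks as follows: the page lists of all issues end up in one flat list, so that each page can be parsed as an independent task.

```python
issues = [
    {"id": "GDL-1900-01-02-a", "pages": ["p0001", "p0002"]},
    {"id": "GDL-1900-01-03-a", "pages": ["p0001"]},
]

# Flatten: one list of pages, regardless of the issue they belong to.
pages = [page for issue in issues for page in issue["pages"]]
print(len(pages))  # 3
```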

text_importer.importers.core.process_pages(pages: list[NewspaperPage], failed_log: str) list[NewspaperPage]

Given a list of pages, trigger the .parse() method of each page.

Parameters:
  • pages (list[NewspaperPage]) – Input newspaper pages.

  • failed_log (str) – File path of failed log.

Returns:

A list of processed pages.

Return type:

list[NewspaperPage]

text_importer.importers.core.remove_filelocks(output_dir: str) None

Remove all files ending with .lock in a directory.

Parameters:

output_dir (str) – Path to directory containing file locks.

text_importer.importers.core.serialize_pages(pages: list[NewspaperPage], output_dir: str | None = None) list[Tuple[IssueDir, str]]

Serialize a list of pages to an output directory.

Parameters:
  • pages (list[NewspaperPage]) – Input newspaper pages.

  • output_dir (str | None, optional) – Path to the output directory. Defaults to None.

Returns:

A list of tuples (IssueDir, path), where the IssueDir object represents the issue to which the pages belong, and path is the path to the individual page JSON file.

Return type:

list[Tuple[IssueDir, str]]

text_importer.importers.core.upload_issues(sort_key: str, filepath: str, bucket_name: str | None = None, failed_log: str | None = None) Tuple[bool, str]

Upload an issues JSON-line file to a given S3 bucket.

sort_key is expected to be the concatenation of newspaper ID and year.

Parameters:
  • sort_key (str) – Key used to group articles (e.g. “GDL-1900”).

  • filepath (str) – Path of the file to upload to S3.

  • bucket_name (str | None, optional) – Name of S3 bucket where to upload the file. Defaults to None.

  • failed_log (str | None, optional) – Path to file where to log errors.

Returns:

Whether the upload was successful and the path to the uploaded file.

Return type:

Tuple[bool, str]

text_importer.importers.core.upload_pages(sort_key: str, filepath: str, bucket_name: str | None = None, failed_log: str | None = None) Tuple[bool, str]

Upload a page JSON file to a given S3 bucket.

Parameters:
  • sort_key (str) – Key used to group articles (e.g. “GDL-1900-01-01-a”).

  • filepath (str) – Path of the file to upload to S3.

  • bucket_name (str | None, optional) – Name of S3 bucket where to upload the file. Defaults to None.

  • failed_log (str | None, optional) – Path to file where to log errors.

Returns:

Whether the upload was successful and the path to the uploaded file.

Return type:

Tuple[bool, str]

text_importer.importers.core.write_error(thing: NewspaperIssue | NewspaperPage | IssueDir | str, error: Exception | str, failed_log: str | None) None

Write the given error of a failed import to the failed_log file.

Parameters:
  • thing (NewspaperIssue | NewspaperPage | IssueDir | str) – Object for which the error occurred, or corresponding canonical ID.

  • error (Exception | str) – Error that occurred and should be logged.

  • failed_log (str | None) – Path to the log file for failed imports.