Overview
Data architecture
Impresso Text Preparation, composed of the Importer and the Rebuilder, is the main part of the data architecture defined in the framework of the impresso project to store and process a large-scale archive of historical newspapers. To understand the importer’s logic, it is worth touching upon the key points of the architecture into which it fits.
Canonical identifiers
Canonical identifiers are defined at the following levels:
newspaper issue
newspaper page
content item (e.g. article, advertisement, weather forecast, obituary, etc.)
Issue IDs
template:
{newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}
examples:
GDL-1900-01-02-a
luxzeit1858-1858-12-7-a
Page IDs
template:
{newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-p{page_number}
examples:
GDL-1900-01-02-a-p0004
luxzeit1858-1858-12-7-a-p0002
Content item IDs
template:
{newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-i{item_number}
examples:
GDL-1900-01-02-a-i0048
JDG-1901-01-01-a-i0031
Some things to note about these templates:
newspaper_id: an arbitrary string without white spaces that unambiguously identifies a given newspaper
page_number: a four-digit integer (zeroes are used for padding)
edition: in case of newspapers published multiple times per day, a lowercase letter indicates the edition: a for the first, b for the second, etc.
item_number: a four-digit integer (zeroes are used for padding); NB: content item IDs are expected to remain stable across any two runs of the importer given the same input data.
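To make these templates concrete, here is a minimal sketch of how such identifiers can be assembled in Python (the helper functions are hypothetical, not part of the importer):

from datetime import date

def make_issue_id(newspaper_id: str, d: date, edition: str) -> str:
    # Hypothetical helper following the issue ID template above.
    return f"{newspaper_id}-{d.year}-{d.month:02d}-{d.day:02d}-{edition}"

def make_content_item_id(issue_id: str, item_number: int) -> str:
    # Content item numbers are four-digit, zero-padded integers.
    return f"{issue_id}-i{item_number:04d}"

issue_id = make_issue_id("GDL", date(1900, 1, 2), "a")
print(issue_id)                             # GDL-1900-01-02-a
print(make_content_item_id(issue_id, 48))   # GDL-1900-01-02-a-i0048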
Data packaging
The JSON data produced by the Importer and the Rebuilder are packaged into .bz2 archives for efficient storage. Each archive consists of one JSON-line file, where each line contains a JSON document. The JSON schemas are described here.
In Impresso we use an S3 solution for distributed storage: newspaper data are stored on S3 and accessed from there at processing time.
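As a minimal sketch of how such an archive can be consumed (assuming a local copy of the file, and that each document exposes its canonical ID under an id field, as per the schemas referenced above):

import bz2
import json

# Each line of the archive is one JSON document in the canonical format.
with bz2.open("GDL-1900-issues.jsonl.bz2", mode="rt", encoding="utf-8") as f:
    issues = [json.loads(line) for line in f]

print(len(issues))      # number of issues of GDL published in 1900
print(issues[0]["id"])  # canonical issue ID (field name assumed), e.g. GDL-1900-01-02-a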
Issue data
They are packaged by newspaper and by year (as they tend to be very small files). Each archive contains, one document per line, all issues of a newspaper that appeared in that year.
Example: GDL-1900-issues.jsonl.bz2 contains all issues of the Gazette de Lausanne published in 1900.
Page data
They are packaged by newspaper issue. Each archive contains, one document per line, all JSON pages belonging to a given newspaper issue (edition).
Example: GDL-1900-01-01-a-pages.jsonl.bz2 contains all pages of the issue of the Gazette de Lausanne (= GDL) published on January 1, 1900.
Rebuilt data
They are packaged by newspaper and by year. Each archive contains, one document per line, all JSON content-items belonging to a given newspaper and year.
Example: GDL-1900.jsonl.bz2 contains all rebuilt data of the Gazette de Lausanne (= GDL) published in 1900.
Image data
They are expected to be delivered via a dedicated IIIF endpoint and are typically stored on an image server. Each newspaper page corresponds to one image file.
Note
In case the canonical ID of a page and the internal ID of its image differ, the content provider is expected to be able to provide a mapping of the two identifier systems.
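Such a mapping can be as simple as a lookup table from canonical page IDs to provider-internal image IDs. The exact exchange format is agreed upon with each provider, so the example below is purely illustrative:

# Purely illustrative: the actual mapping format is provider-specific.
page_id_to_image_id = {
    "GDL-1900-01-02-a-p0001": "provider-image-000123",
    "GDL-1900-01-02-a-p0002": "provider-image-000124",
}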
Processing
Core functions to perform large-scale import of OCR data.
Most of the functions in this module are meant to be used in conjunction with Dask, the library we use to parallelize the ingestion process and run it on distributed computing infrastructures.
Note
The function import_issues() is the most important in this module, as it ties everything together by calling all the other functions (a usage sketch follows its documentation below).
- text_preparation.importers.core.cleanup(upload_success: bool, filepath: str) → None
Remove a file if it has been successfully uploaded to S3.
Copied and adapted from impresso-pycommons.
- Parameters:
upload_success (bool) – Whether the upload was successful
filepath (str) – Path to the uploaded file
- text_preparation.importers.core.compress_issues(key: Tuple[str, int], issues: list[NewspaperIssue], output_dir: str | None = None, failed_log: str | None = None) → Tuple[str, str, list[dict[str, int]]]
Compress issues of the same journal-year and save them in a JSON file.
First check whether the file exists; if so, load it and then overwrite/add the newly generated issues. The compressed .bz2 output file is a JSON-line file, where each line corresponds to an individual issue document in the canonical format. Finally, yearly statistics are computed on the issues and included in the returned values.
- Parameters:
key (Tuple[str, int]) – Newspaper ID and year of input issues (e.g. (GDL, 1900)).
issues (list[NewspaperIssue]) – A list of NewspaperIssue instances.
output_dir (str | None, optional) – Output directory. Defaults to None.
failed_log (str | None, optional) – Path to the log file used when an instantiation was not successful. Defaults to None.
- Returns:
Label following the template <NEWSPAPER>-<YEAR>, the path to the compressed .bz2 file, and the statistics computed on the issues.
- Return type:
Tuple[str, str, list[dict[str, int]]]
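The (newspaper ID, year) key mirrors the packaging scheme described above. As a sketch of how issues might be grouped before compression (assuming each NewspaperIssue exposes its canonical ID as an id attribute):

from collections import defaultdict

# Group issues by (newspaper_id, year), the key expected by compress_issues.
# `issues` is assumed to be a list of instantiated NewspaperIssue objects.
by_year = defaultdict(list)
for issue in issues:
    newspaper_id, year, *_ = issue.id.split("-")
    by_year[(newspaper_id, int(year))].append(issue)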
- text_preparation.importers.core.compress_pages(key: str, json_files: list[str], output_dir: str, suffix: str = '', failed_log: str | None = None) → Tuple[str, str]
Merge a set of JSON line files into a single compressed archive.
- Parameters:
key (str) – Canonical ID of the newspaper issue (e.g. GDL-1900-01-02-a).
json_files (list[str]) – Paths of input JSON line files.
output_dir (str) – Directory where to write the output file.
suffix (str, optional) – Suffix to add to the filename. Defaults to “”.
failed_log (str | None, optional) – Path to the file where errors are logged. Defaults to None.
- Returns:
Sorting key [0] and path to serialized file [1].
- Return type:
Tuple[str, str]
- text_preparation.importers.core.dir2issue(issue: IssueDir, issue_class: Type[NewspaperIssue], failed_log: str | None = None, image_dirs: str | None = None, temp_dir: str | None = None) → NewspaperIssue | None
Instantiate a NewspaperIssue object from an IssueDir.
Any instantiation leading to an exception is logged to a dedicated file containing only the issues that could not be imported.
- Parameters:
issue (IssueDir) – IssueDir representing the issue to instantiate.
issue_class (Type[NewspaperIssue]) – Type of NewspaperIssue to use.
failed_log (str | None, optional) – Path to the log file used if the instantiation was not successful. Defaults to None.
image_dirs (str | None, optional) – Path to the directory containing the information on images, only for Olive importer. Defaults to None.
temp_dir (str | None, optional) – Temporary directory to unpack the issue’s zip archive into. Defaults to None.
- Returns:
A new NewspaperIssue instance, or None if the instantiation triggered an exception.
- Return type:
NewspaperIssue | None
- text_preparation.importers.core.dirs2issues(issues: list[IssueDir], issue_class: Type[NewspaperIssue], failed_log: str | None = None, image_dirs: str | None = None, temp_dir: str | None = None) → list[NewspaperIssue]
Instantiate the NewspaperIssue objects to import to Impresso’s format.
Any NewspaperIssue for which the instantiation is unsuccessful will be logged, along with the triggered error.
- Parameters:
issues (list[IssueDir]) – List of issues to instantiate and import.
issue_class (Type[NewspaperIssue]) – Type of NewspaperIssue to use.
failed_log (str | None, optional) – Path to the log file used when an instantiation was not successful. Defaults to None.
image_dirs (str | None, optional) – Path to the directory containing the information on images, only for Olive importer. Defaults to None.
temp_dir (str | None, optional) – Temporary directory to unpack zip archives of issues into. Defaults to None.
- Returns:
List of NewspaperIssue instances to import.
- Return type:
list[NewspaperIssue]
- text_preparation.importers.core.import_issues(issues: list[IssueDir], out_dir: str, s3_bucket: str | None, issue_class: Type[NewspaperIssue], image_dirs: str | None, temp_dir: str | None, chunk_size: int | None, manifest: DataManifest, client: Client | None = None) → None
Import a bunch of newspaper issues.
- Parameters:
issues (list[IssueDir]) – Issues to import.
out_dir (str) – Output directory for the json files.
s3_bucket (str | None) – Output s3 bucket for the json files.
issue_class (Type[NewspaperIssue]) – Newspaper issue class to import (child of NewspaperIssue).
image_dirs (str | None) – Directory of images for the Olive format (can be multiple).
temp_dir (str | None) – Temporary directory for extracting archives (applies only to importers that make use of ZipArchive).
chunk_size (int | None) – Chunk size in years used to process issues.
manifest (DataManifest) – Data manifest recording information on the import.
client (Client | None, optional) – Dask client used to run the import. Defaults to None.
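A sketch of a typical invocation, assuming the issues to import were already detected by an importer-specific step and that a DataManifest was set up beforehand (all argument values below are placeholders):

from dask.distributed import Client

from text_preparation.importers.core import import_issues

client = Client(n_workers=8)  # Dask client used to parallelize the ingestion

# `issues`, `MyNewspaperIssue` and `manifest` are assumed to come from the
# importer-specific detection step and the manifest setup, respectively.
import_issues(
    issues=issues,
    out_dir="/tmp/canonical-out",
    s3_bucket="canonical-data",    # placeholder bucket name
    issue_class=MyNewspaperIssue,  # a child class of NewspaperIssue
    image_dirs=None,               # only needed for the Olive importer
    temp_dir="/tmp/unpack",        # for importers reading zip archives
    chunk_size=None,               # or an integer number of years per chunk
    manifest=manifest,
    client=client,
)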
- text_preparation.importers.core.issue2pages(issue: NewspaperIssue) → list[NewspaperPage]
Flatten an issue into a list of its pages.
As an issue consists of several pages, this function is useful in order to process each page in a truly parallel fashion.
- Parameters:
issue (NewspaperIssue) – Issue to collect the pages of.
- Returns:
List of pages of the given issue.
- Return type:
list[NewspaperPage]
- text_preparation.importers.core.process_pages(pages: list[NewspaperPage], failed_log: str) → list[NewspaperPage]
Given a list of pages, trigger the .parse() method of each page.
- Parameters:
pages (list[NewspaperPage]) – Input newspaper pages.
failed_log (str) – File path of failed log.
- Returns:
A list of processed pages.
- Return type:
list[NewspaperPage]
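Together, issue2pages() and process_pages() enable page-level parallelism. A sketch of how they might be combined with a Dask bag (the failed-log path is a placeholder, and `issues` is assumed to hold instantiated NewspaperIssue objects):

import dask.bag as db

from text_preparation.importers.core import issue2pages, process_pages

pages = (
    db.from_sequence(issues)
    .map(issue2pages)   # one list of pages per issue
    .flatten()          # one element per page
    .map_partitions(process_pages, failed_log="failed.log")
    .compute()
)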
- text_preparation.importers.core.remove_filelocks(output_dir: str) → None
Remove all files ending with .lock in a directory.
- Parameters:
output_dir (str) – Path to directory containing file locks.
- text_preparation.importers.core.serialize_pages(pages: list[NewspaperPage], output_dir: str | None = None) → list[Tuple[IssueDir, str]]
Serialize a list of pages to an output directory.
- Parameters:
pages (list[NewspaperPage]) – Input newspaper pages.
output_dir (str | None, optional) – Path to the output directory. Defaults to None.
- Returns:
A list of tuples (IssueDir, path), where the IssueDir object represents the issue to which the pages belong, and path is the path to the individual page JSON file.
- Return type:
list[Tuple[IssueDir, str]]
- text_preparation.importers.core.upload_issues(sort_key: str, filepath: str, bucket_name: str | None = None, failed_log: str | None = None) → Tuple[bool, str]
Upload an issues JSON-line file to a given S3 bucket.
sort_key is expected to be the concatenation of newspaper ID and year.
- Parameters:
sort_key (str) – Key used to group articles (e.g. “GDL-1900”).
filepath (str) – Path of the file to upload to S3.
bucket_name (str | None, optional) – Name of the S3 bucket to upload the file to. Defaults to None.
failed_log (str | None, optional) – Path to the file where errors are logged. Defaults to None.
- Returns:
Whether the upload was successful, and the path to the uploaded file.
- Return type:
Tuple[bool, str]
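upload_issues() pairs naturally with cleanup() documented above: its two return values can be fed straight into cleanup(), so that local files are removed only once they are safely on S3. A sketch (bucket name and file path are placeholders):

from text_preparation.importers.core import cleanup, upload_issues

success, path = upload_issues(
    "GDL-1900",
    "/tmp/canonical-out/GDL-1900-issues.jsonl.bz2",
    bucket_name="canonical-data",
)
cleanup(success, path)  # delete the local copy only if the upload succeeded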
- text_preparation.importers.core.upload_pages(sort_key: str, filepath: str, bucket_name: str | None = None, failed_log: str | None = None) → Tuple[bool, str]
Upload a page JSON file to a given S3 bucket.
- Parameters:
sort_key (str) – The key used to group articles (e.g. “GDL-1900-01-01-a”).
filepath (str) – Path of the file to upload to S3.
bucket_name (str | None, optional) – Name of the S3 bucket to upload the file to. Defaults to None.
failed_log (str | None, optional) – Path to the file where errors are logged. Defaults to None.
- Returns:
Whether the upload was successful, and the path to the uploaded file.
- Return type:
Tuple[bool, str]
- text_preparation.importers.core.write_error(thing: NewspaperIssue | NewspaperPage | IssueDir | str, error: Exception | str, failed_log: str | None) → None
Write the given error of a failed import to the failed_log file.
- Parameters:
thing (NewspaperIssue | NewspaperPage | IssueDir | str) – Object for which the error occurred, or corresponding canonical ID.
error (Exception | str) – Error that occurred and should be logged.
failed_log (str | None) – Path to the log file for failed imports.