Rebuilders
Once the canonical data has been generated, it is rebuilt into two variants of the rebuilt format:
- Solr Rebuilder
: Returns the base format for all subsequent text processing steps, keeping only the text data and important information (line breaks, regions, etc.).
- Passim Rebuilder
: Returns the base format for the text-reuse processing, which is done with a software called Passim.
This format is very similar to the Solr rebuilt format, but has slightly different properties and property names.
Both of these are generated with the text_preparation.rebuilders.rebuilder module, and the format to produce can be selected with the format parameter.
Rebuild functions
A set of functions to transform JSON files in impresso’s canonical format into a number of JSON-based formats for different purposes.
Functions and CLI to rebuild text from impresso’s canonical format. For EPFL members, this script can be scaled by running it with Runai, as documented at https://github.com/impresso/impresso-infrastructure/blob/main/howtos/runai.md. TODO update the runai functionalities.
- Usage:
rebuilder.py rebuild_articles --input-bucket=<b> --log-file=<f> --output-dir=<od> --filter-config=<fc> [--format=<fo> --scheduler=<sch> --output-bucket=<ob> --verbose --clear --languages=<lgs> --nworkers=<nw> --git-repo=<gr> --temp-dir=<tp> --prev-manifest=<pm>]
Options:
- --input-bucket=<b>
S3 bucket where canonical JSON data will be read from
- --output-bucket=<ob>
Rebuilt data will be uploaded to the specified s3 bucket (otherwise no upload)
- --log-file=<f>
Path to log file
- --scheduler=<sch>
Tell Dask to use an existing scheduler (otherwise it will create one)
- --filter-config=<fc>
A JSON configuration file specifying which newspaper issues will be rebuilt
- --verbose
Set logging level to DEBUG (default is INFO)
- --clear
Remove output directory before and after rebuilding
- --format=<fo>
Rebuilt format to use (can be “solr” or “passim”)
- --languages=<lgs>
Languages to keep when filtering the articles to rebuild.
- --nworkers=<nw>
Number of workers for the (local) Dask client.
- --git-repo=<gr>
Local path to the “impresso-text-acquisition” git directory (including it).
- --temp-dir=<tp>
Temporary directory in which to clone the impresso-data-release git repository.
- --prev-manifest=<pm>
Optional S3 path to the previous manifest to use for the manifest generation.
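For example, a typical invocation might look as follows (the bucket names, file paths, and language list are illustrative, not actual values):

```
python rebuilder.py rebuild_articles --input-bucket=canonical-data --log-file=rebuild.log --output-dir=/tmp/rebuilt --filter-config=config.json --format=solr --output-bucket=rebuilt-data --nworkers=8
```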
- text_preparation.rebuilders.rebuilder.cleanup(upload_success: bool, filepath: str) None
Remove a file from local fs if it has been successfully uploaded to S3.
- Parameters:
upload_success (bool) – Whether the upload was successful
filepath (str) – Path to the uploaded file
- text_preparation.rebuilders.rebuilder.compress(key: str, json_files: list, output_dir: str) tuple[str, str]
Merge a set of JSON line files into a single compressed archive.
- Parameters:
key (str) – alias-year “key” of a given issue (e.g. GDL-1900).
json_files (list) – Input JSON line files.
output_dir (str) – Directory where to write the output file.
- Returns:
sorting key [0] and path to serialized file [1].
- Return type:
tuple[str, str]
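A minimal usage sketch, assuming some rebuilt JSON-line files for the key GDL-1900 already exist locally (all paths are hypothetical):

```python
from text_preparation.rebuilders.rebuilder import compress

# Merge the per-issue JSON-line files into one compressed archive
# for the alias-year key.
sort_key, archive_path = compress(
    "GDL-1900",
    ["/tmp/rebuilt/GDL-1900-part-1.jsonl", "/tmp/rebuilt/GDL-1900-part-2.jsonl"],
    "/tmp/rebuilt",
)
```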
- text_preparation.rebuilders.rebuilder.filter_and_process_cis(issues_bag, input_bucket: str, issue_medium: str, _format: str)
Process the issues into rebuilt CIs.
- Parameters:
issues_bag (Dask Bag) – Dask Bag containing all the issues to filter and rebuild.
input_bucket (str) – Input bucket where to find the supports (pages or audios).
issue_medium (str) – Source medium of the given issue.
_format (str) – Target rebuilt format (should be one of “solr” and “passim”).
- Raises:
NotImplementedError – The format is not valid
- Returns:
Resulting rebuilt CIs.
- Return type:
Dask Bag
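A sketch of how this function might be called (the bucket name and the medium value "newspaper" are assumptions):

```python
import dask.bag as db

from text_preparation.rebuilders.rebuilder import filter_and_process_cis

# `issues` is assumed to be a list of canonical issues already read from S3.
issues_bag = db.from_sequence(issues, partition_size=50)
rebuilt_cis = filter_and_process_cis(issues_bag, "canonical-bucket", "newspaper", "solr")
```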
- text_preparation.rebuilders.rebuilder.main() None
- text_preparation.rebuilders.rebuilder.rebuild_issues(issues: list[IssueDir], input_bucket: str, output_dir: str, dask_client: Client, _format: str = 'solr', filter_language: list[str] = None) tuple[str, list, list[dict[str, int | str]]]
Rebuild a set of newspaper issues into a given format.
- Parameters:
issues (list[IssueDir]) – Issues to rebuild.
input_bucket (str) – Name of input s3 bucket.
output_dir (str) – Local directory where to store the rebuilt files.
dask_client (Client) – Dask client object.
_format (str, optional) – Format in which to rebuild the CIs. Defaults to “solr”.
filter_language (list[str], optional) – List of languages to filter. Defaults to None.
- Returns:
- Alias-year key for the issues, the resulting files dumped, and the statistics computed on them for the manifest.
- Return type:
tuple[str, list, list[dict[str, int | str]]]
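A minimal sketch of a call, assuming `issues` is a list of IssueDir objects selected via the filter configuration (bucket name and paths are illustrative):

```python
from dask.distributed import Client

from text_preparation.rebuilders.rebuilder import rebuild_issues

client = Client(n_workers=8)  # local Dask cluster
key, files, stats = rebuild_issues(
    issues,
    "canonical-bucket",
    "/tmp/rebuilt",
    client,
    _format="solr",
    filter_language=["fr"],
)
```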
- text_preparation.rebuilders.rebuilder.upload(sort_key: str, filepath: str, bucket_name: str | None = None) tuple[bool, str]
Upload a file to a given S3 bucket.
- Parameters:
sort_key (str) – alias-year key used to group CIs (e.g. “GDL-1900”).
filepath (str) – Path of the file to upload to S3.
bucket_name (str | None, optional) – Name of S3 bucket where to upload the file. Defaults to None.
- Returns:
- a tuple with [0] whether the upload was successful (boolean) and
[1] the path of the uploaded file (string)
- Return type:
tuple[bool, str]
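Together with compress and cleanup, upload closes the per-key pipeline. A hedged sketch (the bucket name and `json_files` list are assumptions):

```python
from text_preparation.rebuilders.rebuilder import cleanup, compress, upload

# Compress the rebuilt files for one alias-year key, upload the archive,
# and remove the local copy only if the upload succeeded.
sort_key, archive_path = compress("GDL-1900", json_files, "/tmp/rebuilt")
success, uploaded_path = upload(sort_key, archive_path, bucket_name="rebuilt-data")
cleanup(success, uploaded_path)
```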
Helpers
Helper functions for the text rebuilder.py script.
- text_preparation.rebuilders.helpers.ci_has_problem(ci: dict[str, Any]) bool
Helper function to keep CIs with problems.
- Parameters:
ci (dict[str, Any]) – Input CI
- Returns:
Whether a problem was detected in the CI.
- Return type:
bool
- text_preparation.rebuilders.helpers.ci_without_problem(ci: dict[str, Any]) bool
Helper function to keep CIs without problems, and log others.
- Parameters:
ci (dict[str, Any]) – Input CI
- Returns:
Whether the CI was problem-free.
- Return type:
bool
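Both predicates lend themselves to Dask bag filtering; a sketch, assuming `cis` is a list of CI dicts:

```python
import dask.bag as db

from text_preparation.rebuilders.helpers import ci_has_problem, ci_without_problem

# Split the CIs into problem-free items (to rebuild) and problematic ones (to inspect).
cis_bag = db.from_sequence(cis)
clean_cis = cis_bag.filter(ci_without_problem)
faulty_cis = cis_bag.filter(ci_has_problem)
```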
- text_preparation.rebuilders.helpers.get_iiif_and_coords(ci: dict[str, Any]) tuple[str | None, str | None]
Fetch the iiif link and image coordinates from CI metadata.
Adapts to the various cases currently present in the canonical data, see https://github.com/impresso/impresso-text-acquisition/issues/117.
- Parameters:
ci (dict[str, Any]) – Content item to retrieve the information from.
- Returns:
- IIIF link and coordinates as string or
None if part of the information is missing from the content item
- Return type:
tuple[str | None, str | None]
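A usage sketch, assuming `ci` is a content item in canonical format:

```python
from text_preparation.rebuilders.helpers import get_iiif_and_coords

iiif_link, coords = get_iiif_and_coords(ci)
if iiif_link is None or coords is None:
    # Part of the image information is missing from this content item.
    print("missing IIIF information")
```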
- text_preparation.rebuilders.helpers.pages_to_article(article: dict[str, Any], pages: list[dict[str, Any]]) dict[str, Any]
Return all text regions belonging to a given article.
- Parameters:
article (dict[str, Any]) – Article/CI for which to fetch the regions.
pages (list[dict[str, Any]]) – Pages from which to fetch the regions.
- Returns:
Article completed with the regions extracted from the pages.
- Return type:
dict[str, Any]
- text_preparation.rebuilders.helpers.read_issue(issue: IssueDir, bucket_name: str, s3_client=None) tuple[IssueDir, dict[str, Any]]
Read the data from S3 for a given canonical issue.
- Parameters:
issue (IssueDir) – Input issue to fetch from S3
bucket_name (str) – S3 bucket’s name
s3_client (boto3.resources.factory.s3.ServiceResource, optional) – open connection to S3 storage. Defaults to None.
- Returns:
Input issue and its JSON canonical representation.
- Return type:
tuple[IssueDir, dict[str, Any]]
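A usage sketch, assuming `issue_dir` is an IssueDir pointing to a canonical issue (the bucket name is illustrative):

```python
from text_preparation.rebuilders.helpers import read_issue

issue_dir, issue_json = read_issue(issue_dir, "canonical-data")
```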
- text_preparation.rebuilders.helpers.read_issue_supports(issue: IssueDir, issue_json: dict[str, Any], is_audio: bool, bucket: str | None = None) tuple[IssueDir, dict[str, Any]]
Read all pages/audio records of a given issue from S3 in parallel, and add them to it.
The found and read files will then be added to the issue’s canonical json representation in the properties rr or pp based on is_audio.
- Parameters:
issue (IssueDir) – IssueDir object for which to read the pages/audios.
issue_json (dict[str, Any]) – issue_data dict of the given issue.
is_audio (bool) – Whether the issue corresponds to audio data.
bucket (str | None, optional) – Bucket where to go fetch the pages/audios. Defaults to None.
- Returns:
The given issue, with the pages/audios data added.
- Return type:
tuple[IssueDir, dict[str, Any]]
- text_preparation.rebuilders.helpers.read_page(page_key: str, bucket_name: str, s3_client) dict[str, Any] | None
Read the data from S3 for a given canonical page.
- Parameters:
page_key (str) – S3 key to the page
bucket_name (str) – S3 bucket’s name
s3_client (boto3.resources.factory.s3.ServiceResource) – open connection to S3 storage.
- Returns:
The page’s JSON representation or None if the page could not be read.
- Return type:
dict[str, Any] | None
- text_preparation.rebuilders.helpers.rebuild_for_passim(content_item: dict[str, Any]) dict[str, Any]
Rebuilds the text of an article content-item to be used with passim.
TODO Check that this works with passim!
- Parameters:
content_item (dict[str, Any]) – The content-item to rebuild using its metadata.
- Returns:
The rebuilt content-item built for passim.
- Return type:
dict[str, Any]
- text_preparation.rebuilders.helpers.rebuild_for_solr(content_item: dict[str, Any]) dict[str, Any]
Rebuilds the text of an article content-item given its metadata as input.
Note
This rebuild function is designed especially for ingesting the newspaper data into our Solr index.
- Parameters:
content_item (dict[str, Any]) – The content-item to rebuild using its metadata.
- Returns:
The rebuilt content-item following the Impresso JSON Schema.
- Return type:
dict[str, Any]
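A sketch applying both rebuild functions to the same joined content item (`ci` is assumed to come from rejoin_cis, documented below):

```python
from text_preparation.rebuilders.helpers import rebuild_for_passim, rebuild_for_solr

solr_doc = rebuild_for_solr(ci)      # Impresso JSON Schema flavour
passim_doc = rebuild_for_passim(ci)  # passim flavour of the same CI
```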
- text_preparation.rebuilders.helpers.reconstruct_iiif_link(content_item: dict[str, Any]) str
Construct the IIIF link to the CI’s image based on its metadata.
An IIIF Image API link and the image coordinates are first fetched from the content item. Different importers (and endpoints) have different formats, requiring different processing. In addition, some inconsistencies exist in the canonical data. This function adapts to these variations; more details are given in issue: https://github.com/impresso/impresso-text-acquisition/issues/117
- Parameters:
content_item (dict[str, Any]) – Content item in canonical format.
- Returns:
- IIIF link to the image area of the content item if present in the CI metadata, else None.
- Return type:
str
- text_preparation.rebuilders.helpers.rejoin_cis(issue: IssueDir, issue_json: dict[str, Any]) list[dict[str, Any]]
Rejoin the CIs of a given issue using its physical supports (pages or audio records).
- Parameters:
issue (IssueDir) – Issue directory of issue to be processed.
issue_json (dict[str, Any]) – Canonical JSON of the issue from which to rejoin CIs.
- Returns:
Processed content-items for the issue.
- Return type:
list[dict[str, Any]]
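Combined with read_issue and read_issue_supports above, a hedged sketch of how an issue’s CIs might be reconstructed (the bucket name and is_audio flag are illustrative):

```python
from text_preparation.rebuilders.helpers import (
    read_issue,
    read_issue_supports,
    rejoin_cis,
)

# Read the canonical issue, attach its pages (or audio records), then rejoin the CIs.
issue_dir, issue_json = read_issue(issue_dir, "canonical-data")
issue_dir, issue_json = read_issue_supports(
    issue_dir, issue_json, is_audio=False, bucket="canonical-data"
)
cis = rejoin_cis(issue_dir, issue_json)
```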
- text_preparation.rebuilders.helpers.text_apply_breaks(fulltext, breaks)
Apply breaks to the text returned by rebuild_for_solr.
The purpose of this function is to debug (visually) the rebuild_for_solr function. It applies to fulltext the character offsets contained in breaks (e.g. line breaks, paragraph breaks, etc.).
- Parameters:
fulltext (str) – input text
breaks (list of int) – a list of character offsets
- Returns:
a list of text chunks
- Return type:
list
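For illustration, the expected splitting semantics can be reproduced with plain string slicing (a sketch, not the library implementation):

```python
def split_at_offsets(fulltext: str, breaks: list[int]) -> list[str]:
    # Cut `fulltext` at each character offset, keeping all characters.
    chunks, start = [], 0
    for offset in breaks:
        chunks.append(fulltext[start:offset])
        start = offset
    chunks.append(fulltext[start:])
    return chunks

print(split_at_offsets("One two three", [4, 8]))  # ['One ', 'two ', 'three']
```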
Config file example
(from file: text_preparation.config.rebuilt_cofig.cluster.json):
[{"GDL": [1948, 1999]}, {"GDL": [1900, 1948]}, {"GDL": [1850, 1900]}, {"schmiede": [1916, 1920]}]
Several newspaper titles can be added to the same configuration file. If a newspaper title covers a very large amount of data (many issues and/or many years), it is advised to split its processing into parts, as shown above with GDL. This reduces the memory and computing needs, and ensures that minimal output is lost if the process stops early due to an error.
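Such a chunked configuration can also be generated programmatically; a minimal sketch (the year ranges and output filename are illustrative):

```python
import json

# Split GDL's long run of years into 50-year chunks, as advised above.
config = [{"GDL": [start, min(start + 50, 1999)]} for start in range(1850, 1999, 50)]
with open("my_rebuilt_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f)
```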
Running using Runai
Members of Impresso and EPFL can use the computing platform Runai to produce the rebuilt data. Instructions for running the rebuild in this way are available [here](https://github.com/impresso/impresso-infrastructure).