Rebuilders
Once the canonical data has been generated, it is rebuilt into two variants of the rebuilt format:
- Solr Rebuilder
: Returns the base format for all subsequent text processing steps, keeping only the text data and important information (line breaks, regions, etc.).
- Passim Rebuilder
: Returns the base format for the text-reuse processing, which is done with a software called Passim.
This format is very similar to the Solr rebuilt format, but has slightly different properties and property names.
Both of these are generated with the text_preparation.rebuilders.rebuilder module, and the format to produce can be selected with the format parameter.
Rebuild functions
A set of functions to transform JSON files in impresso’s canonical format into a number of JSON-based formats for different purposes.
Functions and CLI to rebuild text from impresso’s canonical format. For EPFL members, this script can be scaled by running it with Runai, as documented at https://github.com/impresso/impresso-infrastructure/blob/main/howtos/runai.md. TODO update the runai functionalities.
- Usage:
rebuilder.py rebuild_articles --input-bucket=<b> --log-file=<f> --output-dir=<od> --filter-config=<fc> [--format=<fo> --scheduler=<sch> --output-bucket=<ob> --verbose --clear --languages=<lgs> --nworkers=<nw> --git-repo=<gr> --temp-dir=<tp> --prev-manifest=<pm>]
Options:
- --input-bucket=<b>
S3 bucket where canonical JSON data will be read from
- --output-bucket=<ob>
Rebuilt data will be uploaded to the specified s3 bucket (otherwise no upload)
- --log-file=<f>
Path to log file
- --scheduler=<sch>
Tell Dask to use an existing scheduler (otherwise it will create one)
- --filter-config=<fc>
A JSON configuration file specifying which newspaper issues will be rebuilt
- --verbose
Set logging level to DEBUG (default is INFO)
- --clear
Remove output directory before and after rebuilding
- --format=<fo>
Rebuilt format to use (can be “solr” or “passim”)
- --languages=<lgs>
Languages to keep when filtering the articles to rebuild.
- --nworkers=<nw>
Number of workers for the (local) Dask client.
- --git-repo=<gr>
Local path to the “impresso-text-acquisition” git directory (including it).
- --temp-dir=<tp>
Temporary directory in which to clone the impresso-data-release git repository.
- --prev-manifest=<pm>
Optional S3 path to the previous manifest to use for the manifest generation.
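For example, a typical invocation might look as follows (the bucket names, file paths, and language list are illustrative, not actual values):

```
python rebuilder.py rebuild_articles --input-bucket=canonical-data --log-file=rebuild.log --output-dir=/tmp/rebuilt --filter-config=config.json --format=solr --output-bucket=rebuilt-data --nworkers=8
```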
- text_preparation.rebuilders.rebuilder.cleanup(upload_success: bool, filepath: str) None
Remove a file from local fs if it has been successfully uploaded to S3.
- Parameters:
upload_success (bool) – Whether the upload was successful
filepath (str) – Path to the uploaded file
- text_preparation.rebuilders.rebuilder.compress(key: str, json_files: list, output_dir: str) tuple[str, str]
Merge a set of JSON line files into a single compressed archive.
- Parameters:
key (str) – alias-year “key” of a given issue (e.g. GDL-1900).
json_files (list) – Input JSON line files.
output_dir (str) – Directory where to write the output file.
- Returns:
sorting key [0] and path to serialized file [1].
- Return type:
tuple[str, str]
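A minimal usage sketch, assuming some rebuilt JSON-line files for the key GDL-1900 already exist locally (all paths are hypothetical):

```python
from text_preparation.rebuilders.rebuilder import compress

# Merge the per-issue JSON-line files into one compressed archive
# for the alias-year key.
sort_key, archive_path = compress(
    "GDL-1900",
    ["/tmp/rebuilt/GDL-1900-part-1.jsonl", "/tmp/rebuilt/GDL-1900-part-2.jsonl"],
    "/tmp/rebuilt",
)
```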
- text_preparation.rebuilders.rebuilder.filter_and_process_cis(issues_bag, input_bucket: str, issue_medium: str, _format: str)
Process the issues into rebuilt CIs.
- Parameters:
issues_bag (Dask Bag) – Dask Bag containing all the issues to filter and rebuild.
input_bucket (str) – Input bucket where to find the supports (pages or audios).
issue_medium (str) – Source medium of the given issue.
_format (str) – Target rebuilt format (should be one of “solr” and “passim”).
- Raises:
NotImplementedError – The format is not valid
- Returns:
Resulting rebuilt CIs.
- Return type:
Dask Bag
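A sketch of how this function might be called (the bucket name and the medium value "newspaper" are assumptions):

```python
import dask.bag as db

from text_preparation.rebuilders.rebuilder import filter_and_process_cis

# `issues` is assumed to be a list of canonical issues already read from S3.
issues_bag = db.from_sequence(issues, partition_size=50)
rebuilt_cis = filter_and_process_cis(issues_bag, "canonical-bucket", "newspaper", "solr")
```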
- text_preparation.rebuilders.rebuilder.main() None
- text_preparation.rebuilders.rebuilder.rebuild_issues(issues: list[IssueDir], input_bucket: str, output_dir: str, dask_client: Client, _format: str = 'solr', filter_language: list[str] = None) tuple[str, list, list[dict[str, int | str]]]
Rebuild a set of newspaper issues into a given format.
- Parameters:
issues (list[IssueDir]) – Issues to rebuild.
input_bucket (str) – Name of input s3 bucket.
output_dir (str) – Local directory where to store the rebuilt files.
dask_client (Client) – Dask client object.
_format (str, optional) – Format in which to rebuild the CIs. Defaults to “solr”.
filter_language (list[str], optional) – List of languages to filter. Defaults to None.
- Returns:
- Alias-year key for the issues, the resulting files dumped, and the statistics computed on them for the manifest.
- Return type:
tuple[str, list, list[dict[str, int | str]]]
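A minimal sketch of a call, assuming `issues` is a list of IssueDir objects selected via the filter configuration (bucket name and paths are illustrative):

```python
from dask.distributed import Client

from text_preparation.rebuilders.rebuilder import rebuild_issues

client = Client(n_workers=8)  # local Dask cluster
key, files, stats = rebuild_issues(
    issues,
    "canonical-bucket",
    "/tmp/rebuilt",
    client,
    _format="solr",
    filter_language=["fr"],
)
```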
- text_preparation.rebuilders.rebuilder.upload(sort_key: str, filepath: str, bucket_name: str | None = None) tuple[bool, str]
Upload a file to a given S3 bucket.
- Parameters:
sort_key (str) – alias-year key used to group CIs (e.g. “GDL-1900”).
filepath (str) – Path of the file to upload to S3.
bucket_name (str | None, optional) – Name of S3 bucket where to upload the file. Defaults to None.
- Returns:
- a tuple with [0] whether the upload was successful (boolean) and
[1] the path of the uploaded file (string)
- Return type:
tuple[bool, str]
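Together with compress and cleanup, upload closes the per-key pipeline. A hedged sketch (the bucket name and `json_files` list are assumptions):

```python
from text_preparation.rebuilders.rebuilder import cleanup, compress, upload

# Compress the rebuilt files for one alias-year key, upload the archive,
# and remove the local copy only if the upload succeeded.
sort_key, archive_path = compress("GDL-1900", json_files, "/tmp/rebuilt")
success, uploaded_path = upload(sort_key, archive_path, bucket_name="rebuilt-data")
cleanup(success, uploaded_path)
```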
Helpers
Helper functions for the text rebuilder.py script.
- text_preparation.rebuilders.helpers.ci_has_problem(ci: dict[str, Any]) bool
Helper function to keep CIs with problems.
- Parameters:
ci (dict[str, Any]) – Input CI
- Returns:
Whether a problem was detected in the CI.
- Return type:
bool
- text_preparation.rebuilders.helpers.ci_without_problem(ci: dict[str, Any]) bool
Helper function to keep CIs without problems, and log others.
- Parameters:
ci (dict[str, Any]) – Input CI
- Returns:
Whether the CI was problem-free.
- Return type:
bool
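Both predicates lend themselves to Dask bag filtering; a sketch, assuming `cis` is a list of CI dicts:

```python
import dask.bag as db

from text_preparation.rebuilders.helpers import ci_has_problem, ci_without_problem

# Split the CIs into problem-free items (to rebuild) and problematic ones (to inspect).
cis_bag = db.from_sequence(cis)
clean_cis = cis_bag.filter(ci_without_problem)
faulty_cis = cis_bag.filter(ci_has_problem)
```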
- text_preparation.rebuilders.helpers.get_iiif_and_coords(ci: dict[str, Any]) tuple[str | None, str | None]
Fetch the iiif link and image coordinates from CI metadata.
Adapts to the various cases currently present in the canonical data, see https://github.com/impresso/impresso-text-acquisition/issues/117.
- Parameters:
ci (dict[str, Any]) – Content item to retrieve the information from.
- Returns:
- IIIF link and coordinates as string or
None if part of the information is missing from the content item
- Return type:
tuple[str | None, str | None]
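A usage sketch, assuming `ci` is a content item in canonical format:

```python
from text_preparation.rebuilders.helpers import get_iiif_and_coords

iiif_link, coords = get_iiif_and_coords(ci)
if iiif_link is None or coords is None:
    # Part of the image information is missing from this content item.
    print("missing IIIF information")
```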
- text_preparation.rebuilders.helpers.pages_to_article(article: dict[str, Any], pages: list[dict[str, Any]]) dict[str, Any]
Return all text regions belonging to a given article.
- Parameters:
article (dict[str, Any]) – Article/CI for which to fetch the regions.
pages (list[dict[str, Any]]) – Pages from which to fetch the regions.
- Returns:
Article completed with the regions extracted from the pages.
- Return type:
dict[str, Any]
- text_preparation.rebuilders.helpers.read_issue(issue: IssueDir, bucket_name: str, s3_client=None) tuple[IssueDir, dict[str, Any]]
Read the data from S3 for a given canonical issue.
- Parameters:
issue (IssueDir) – Input issue to fetch from S3
bucket_name (str) – S3 bucket’s name
s3_client (boto3.resources.factory.s3.ServiceResource, optional) – open connection to S3 storage. Defaults to None.
- Returns:
Input issue and its JSON canonical representation.
- Return type:
tuple[IssueDir, dict[str, Any]]
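A usage sketch, assuming `issue_dir` is an IssueDir pointing to a canonical issue (the bucket name is illustrative):

```python
from text_preparation.rebuilders.helpers import read_issue

issue_dir, issue_json = read_issue(issue_dir, "canonical-data")
```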
- text_preparation.rebuilders.helpers.read_issue_supports(issue: IssueDir, issue_json: dict[str, Any], is_audio: bool, bucket: str | None = None) tuple[IssueDir, dict[str, Any]]
Read all pages/audio records of a given issue from S3 in parallel, and add them to it.
The found and read files will then be added to the issue’s canonical json representation in the properties rr or pp based on is_audio.
- Parameters:
issue (IssueDir) – IssueDir object for which to read the pages/audios.
issue_json (dict[str, Any]) – issue_data dict of the given issue.
is_audio (bool) – Whether the issue corresponds to audio data.
bucket (str | None, optional) – Bucket where to go fetch the pages/audios. Defaults to None.
- Returns:
The given issue, with the pages/audios data added.
- Return type:
tuple[IssueDir, dict[str, Any]]
- text_preparation.rebuilders.helpers.read_page(page_key: str, bucket_name: str, s3_client) dict[str, Any] | None
Read the data from S3 for a given canonical page.
- Parameters:
page_key (str) – S3 key to the page
bucket_name (str) – S3 bucket’s name
s3_client (boto3.resources.factory.s3.ServiceResource) – open connection to S3 storage.
- Returns:
The page’s JSON representation or None if the page could not be read.
- Return type:
dict[str, Any] | None
- text_preparation.rebuilders.helpers.rebuild_for_passim(content_item: dict[str, Any]) dict[str, Any]
Rebuilds the text of an article content-item to be used with passim.
TODO Check that this works with passim!
- Parameters:
content_item (dict[str, Any]) – The content-item to rebuild using its metadata.
- Returns:
The rebuilt content-item built for passim.
- Return type:
dict[str, Any]
- text_preparation.rebuilders.helpers.rebuild_for_solr(content_item: dict[str, Any]) dict[str, Any]
Rebuilds the text of an article content-item given its metadata as input.
Note
This rebuild function is designed especially for ingesting the newspaper data into our Solr index.
- Parameters:
content_item (dict[str, Any]) – The content-item to rebuild using its metadata.
- Returns:
The rebuilt content-item following the Impresso JSON Schema.
- Return type:
dict[str, Any]
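A sketch applying both rebuild functions to the same joined content item (`ci` is assumed to come from rejoin_cis, documented below):

```python
from text_preparation.rebuilders.helpers import rebuild_for_passim, rebuild_for_solr

solr_doc = rebuild_for_solr(ci)      # Impresso JSON Schema flavour
passim_doc = rebuild_for_passim(ci)  # passim flavour of the same CI
```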
- text_preparation.rebuilders.helpers.reconstruct_iiif_link(content_item: dict[str, Any]) str
Construct the IIIF link to the CI’s image based on its metadata.
An IIIF Image API link and the image coordinates are first fetched from the content item. Different importers (and endpoints) have different formats, requiring different processing. In addition, some inconsistencies exist in the canonical data. This function adapts to these variations; more details are given in issue: https://github.com/impresso/impresso-text-acquisition/issues/117
- Parameters:
content_item (dict[str, Any]) – Content item in canonical format.
- Returns:
- IIIF link to the image area of the content item if present in the CI metadata, else None.
- Return type:
str
- text_preparation.rebuilders.helpers.rejoin_cis(issue: IssueDir, issue_json: dict[str, Any]) list[dict[str, Any]]
Rejoin the CIs of a given issue using its physical supports (pages or audio records).
- Parameters:
issue (IssueDir) – Issue directory of issue to be processed.
issue_json (dict[str, Any]) – Canonical JSON of the issue from which to rejoin CIs.
- Returns:
Processed content-items for the issue.
- Return type:
list[dict[str, Any]]
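Combined with read_issue and read_issue_supports above, a hedged sketch of how an issue’s CIs might be reconstructed (the bucket name and is_audio flag are illustrative):

```python
from text_preparation.rebuilders.helpers import (
    read_issue,
    read_issue_supports,
    rejoin_cis,
)

# Read the canonical issue, attach its pages (or audio records), then rejoin the CIs.
issue_dir, issue_json = read_issue(issue_dir, "canonical-data")
issue_dir, issue_json = read_issue_supports(
    issue_dir, issue_json, is_audio=False, bucket="canonical-data"
)
cis = rejoin_cis(issue_dir, issue_json)
```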
- text_preparation.rebuilders.helpers.text_apply_breaks(fulltext, breaks)
Apply breaks to the text returned by rebuild_for_solr.
The purpose of this function is to debug (visually) the rebuild_for_solr function. It applies to fulltext the character offsets contained in breaks (e.g. line breaks, paragraph breaks, etc.).
- Parameters:
fulltext (str) – input text
breaks (list of int) – a list of character offsets
- Returns:
a list of text chunks
- Return type:
list
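For illustration, the expected splitting semantics can be reproduced with plain string slicing (a sketch, not the library implementation):

```python
def split_at_offsets(fulltext: str, breaks: list[int]) -> list[str]:
    # Cut `fulltext` at each character offset, keeping all characters.
    chunks, start = [], 0
    for offset in breaks:
        chunks.append(fulltext[start:offset])
        start = offset
    chunks.append(fulltext[start:])
    return chunks

print(split_at_offsets("One two three", [4, 8]))  # ['One ', 'two ', 'three']
```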
Config file example
(from file: text_preparation.config.rebuilt_cofig.cluster.json):
[{"GDL": [1948, 1999]}, {"GDL": [1900, 1948]}, {"GDL": [1850, 1900]}, {"schmiede": [1916, 1920]}]
Several newspaper titles can be added to the same configuration file. If a newspaper title covers a very large amount of data (many issues and/or many years), it is advised to split its processing into parts, as shown above with GDL. This reduces the memory and computing needs, and ensures that minimal output is lost if the process stops early due to an error.
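Such a chunked configuration can also be generated programmatically; a minimal sketch (the year ranges and output filename are illustrative):

```python
import json

# Split GDL's long run of years into 50-year chunks, as advised above.
config = [{"GDL": [start, min(start + 50, 1999)]} for start in range(1850, 1999, 50)]
with open("my_rebuilt_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f)
```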
Running using Runai
Members of Impresso and EPFL can use the computing platform Runai to produce the rebuilt data. Instructions for running the rebuild in this way are available [here](https://github.com/impresso/impresso-infrastructure).