Preprocessing

Motivation

Unfortunately, not all data arrives ready for ingestion. In many cases, preprocessing steps are necessary to prepare the data and reduce the complexity of the importers.

This preprocessing can include any of the following steps:

  • Identification of the exact contents of the dumped data and of the OCR formats present.

  • Reorganization of the files to follow our preferred directory structure: alias > year > month > day > edition > issue files.

  • Copying of image files into the IIIF server location, often also requiring reorganization, renaming and conversion of the files.

  • Extraction of the OCR from PDFs, in cases where the OCR is embedded in them.

Other preprocessing steps might also be necessary and depend on each provider.

Since these steps are highly case-specific, we currently handle each situation with individual scripts tailored to each provider.
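The target layout described above (alias > year > month > day > edition) can be sketched as a small path-building helper. All names below are illustrative, not part of the actual importer code:

```python
from pathlib import Path

def issue_dir(base: str, alias: str, date: str, edition: str) -> Path:
    """Build the alias > year > month > day > edition issue path.

    `date` is assumed to be in YYYYMMDD form; this helper is a
    hypothetical sketch, not part of the preprocessing scripts.
    """
    year, month, day = date[:4], date[4:6], date[6:8]
    return Path(base) / alias / year / month / day / edition

# str(issue_dir("/mnt/dest", "gazette", "18200317", "a"))
# -> "/mnt/dest/gazette/1820/03/17/a"
```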

Existing Preprocessing Scripts

British Library

  • Reorganizes and copies existing data into the desired directory structure, separating images and OCR files for each issue.

  • Logs all copies made and optionally skips issues if the copy was already done, after verifying that all the desired files exist in the destination.

Script copying BL’s original OCR data and images into Impresso’s internal file structure, according to the devised alias-to-NLP mapping.

Example usage:

  • To copy OCR files:

` $ python reorganize_original_data.bl.py --log_file="ocr_logfile_2.log" --chunk_idx=2 `

  • To copy image files:

` $ python reorganize_original_data.bl.py --log_file="img_logfile_2.log" --file_type_ext=".jp2" --chunk_idx=2 --dest_base_dir="/mnt/impresso_images_BL" `

text_preparation.importer_scripts.preprocessing.bl_reorganize_original_data.check_if_to_be_copied(source_dir_files: list[str], dest_issue_dir: str, possible_date_formats, file_ext='.xml') tuple[bool, list[str]]

Determines whether files need to be copied to the destination issue directory.

This function checks if a copy operation should be performed by verifying:

  • Whether the destination issue directory already contains the required files.

  • Whether the source directory contains files matching the expected date and file extension.

Parameters:
  • source_dir_files (list[str]) – A list of file names available in the source directory.

  • dest_issue_dir (str) – The destination issue directory where files should be copied.

  • possible_date_formats (list[str]) – A list of possible date formats that should be present in the file names.

  • file_ext (str, optional) – The file extension to check for (default is “.xml”).

Returns:

A tuple containing:
  • bool: True if the copy operation needs to be performed, False otherwise.

  • list[str]: A list of source files that match the criteria for copying.

Return type:

tuple[bool, list[str]]

Raises:

AssertionError – If the provided file extension is not in POSSIBLE_EXTENTIONS.

Example

>>> check_if_to_be_copied(["18200317.xml", "18200317.jp2"], "/mnt/data/issues/18200317", ["18200317"])
(True, ["18200317.xml"])
text_preparation.importer_scripts.preprocessing.bl_reorganize_original_data.copy_files_for_NLP(nlp: str, alias: str, source_dir: str, dest_dir: str, file_ext: str, date_fmt_chars: list[str] = ['-', '', '_']) tuple[list[str], list[str]]

Copies files for a given NLP (BL title ID) into a structured directory.

This function processes all files in the specified source directory, extracts date information from issue directories, and organizes them into a structured destination directory following the format dest_dir/alias/nlp/YYYY/MM/DD.

Parameters:
  • nlp (str) – The NLP (BL title ID) to process.

  • alias (str) – The alias under which the NLP is categorized.

  • source_dir (str) – The root directory containing the source files.

  • dest_dir (str) – The destination root directory where files should be copied.

  • file_ext (str) – The file extension to be copied (e.g., “.xml”).

  • date_fmt_chars (list[str], optional) – A list of characters to format date strings when matching files (default is [“-”, “”, “_”]).

Returns:

A tuple containing:
  • list[str]: A list of issue directories with invalid structures or date errors.

  • list[str]: A list of files that failed to copy.

Return type:

tuple[list[str], list[str]]

Raises:

IOError – If file copying fails.

Example

>>> copy_files_for_NLP("NLP1", "aliasX", "/mnt/source", "/mnt/dest", ".xml")
([], [])  # No issues or failed copies.
text_preparation.importer_scripts.preprocessing.bl_reorganize_original_data.extract_date(root_path: str) tuple[bool, str, str, str]

Extracts the year, month, and day from a given root path.

This function assumes the root path follows a specific format and attempts to extract date information (YYYY, MM, DD). It also handles cases where the path contains unexpected .backup components and logs relevant errors.

Parameters:

root_path (str) – The file path from which to extract the date.

Returns:

A tuple containing:
  • bool: True if the date is valid, False otherwise.

  • str: Extracted year (YYYY) as a string.

  • str: Extracted month (MM) as a string.

  • str: Extracted day (DD) as a string.

Return type:

tuple[bool, str, str, str]

Raises:

ValueError – If the extracted date is not a valid calendar date.

Example

>>> extract_date("/mnt/project_impresso/original/BL_old/0002634/1820/0317/")
(True, "1820", "03", "17")
text_preparation.importer_scripts.preprocessing.bl_reorganize_original_data.main(log_file: str, source_base_dir: str = '/mnt/project_impresso/original/BL_old', dest_base_dir: str = '/mnt/impresso_ocr_BL', sample_data_dir: str = '/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL', title_alias_mapping_file: str = 'BL_title_alias_mapping.csv', file_type_ext: str = '.xml', chunk_size: int = 100, chunk_idx: int = 0, verbose: bool = False) None

Main function to process and copy BL original data into an impresso-structured directory.

This function reads a CSV file containing NLP-to-alias mappings, processes a chunk of NLPs by copying files from the source to the destination directory, and logs any issues encountered. It also tracks problem directories and failed file copies.

By default, it copies OCR files, but can be parametrized to copy images instead.

Parameters:
  • log_file (str) – Path to the log file.

  • source_base_dir (str, optional) – Root directory of the source NLP files. Defaults to “/mnt/project_impresso/original/BL_old”.

  • dest_base_dir (str, optional) – Root directory where processed files will be copied. Defaults to “/mnt/impresso_ocr_BL”.

  • sample_data_dir (str, optional) – Directory containing metadata files. Defaults to “/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL”.

  • title_alias_mapping_file (str, optional) – CSV file mapping aliases to NLPs. Defaults to “BL_title_alias_mapping.csv”.

  • file_type_ext (str, optional) – File extension to be processed (e.g., “.xml”). Defaults to “.xml”.

  • chunk_size (int, optional) – Number of NLP directories to process in each chunk. Defaults to 100.

  • chunk_idx (int, optional) – Index of the chunk to process. Defaults to 0.

  • verbose (bool, optional) – If True, sets logging level to DEBUG; otherwise, INFO. Defaults to False.

Raises:
  • AssertionError – If the provided file extension is not in POSSIBLE_EXTENTIONS.

  • Exception – If an error occurs while processing an NLP directory.

SWISSINFO

  • Extracts the OCR from PDF files, creating JSON files, and converts the images to JP2 format.

  • Reorganizes and renames the files accordingly.

This script processes PDF files by converting them to JPEG2000 images (JP2) and extracting OCR data.

The main functionalities include:

  • Rescaling the bounding box coordinates.

  • Processing documents to define their canonical path and id.

  • Converting PDF images to JP2 format.

  • Extracting OCR text and saving it as a JSON file.

Usage:

python script.py --log_file log.txt --input_base_dir /path/to/pdf --out_base_dir /path/to/output

text_preparation.importer_scripts.preprocessing.swissinfo_extract_ocr_from_pdfs.get_canonical_path(full_img_path: str) tuple[str, str]

Generate a canonical path from a radio bulletin image file path.

Extracts metadata from the file name (program, date, edition, language) and constructs a standardized (“canonical”) directory path in the format: SOC_<program>/<year>/<month>/<day>/<edition>. Also returns the language extracted from the filename in lowercase.

The filename is expected to follow this structure: <prefix>_<prefix>_<program>_<YYYYMMDD>_<LANG>[_<EDITION>].<ext>

Parameters:

full_img_path (str) – The full file path to the image.

Returns:

A tuple containing:
  • The canonical path as a string.

  • The language code as a lowercase string.

Return type:

tuple[str, str]

Raises:

ValueError – If the date string cannot be parsed or required elements are missing.
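Based on the filename structure above, the parsing can be sketched as follows. This is a hypothetical re-implementation for illustration only; the real get_canonical_path may differ, for instance in how a missing edition is defaulted (assumed here to be “a”):

```python
import os

def canonical_path_sketch(full_img_path: str) -> tuple[str, str]:
    """Illustrative sketch of the canonical-path derivation."""
    # Strip the directory and extension, then split the
    # <prefix>_<prefix>_<program>_<YYYYMMDD>_<LANG>[_<EDITION>] parts.
    name = os.path.splitext(os.path.basename(full_img_path))[0]
    parts = name.split("_")
    program, date, lang = parts[2], parts[3], parts[4]
    # Assumed default edition "a" when none is present (hypothetical).
    edition = parts[5].lower() if len(parts) > 5 else "a"
    path = f"SOC_{program}/{date[:4]}/{date[4:6]}/{date[6:8]}/{edition}"
    return path, lang.lower()

# canonical_path_sketch("/in/RB_SOC_News_19450508_FR_B.pdf")
# -> ("SOC_News/1945/05/08/b", "fr")
```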

text_preparation.importer_scripts.preprocessing.swissinfo_extract_ocr_from_pdfs.pdf_to_jp2_and_ocr_json(img_path: str, out_base_dir: str) tuple[str, bool]

Convert a PDF to JPEG 2000 images and extract OCR data into a JSON file.

This function processes a given PDF file by:

  1. Determining its canonical path and ID.

  2. Converting its pages to JP2 (JPEG 2000) format.

  3. Extracting OCR text and bounding box data from each page.

  4. Saving the extracted OCR data into a JSON file.

  5. Returning the canonical ID and the success status of the operation.

If the OCR JSON file already exists, the function skips processing and returns early.

Parameters:
  • img_path (str) – The file path of the input PDF document.

  • out_base_dir (str) – The base directory where the processed images and JSON should be saved.

Returns:

A tuple containing:
  • The canonical issue ID of the processed document.

  • A boolean indicating whether processing was successful (True) or if an error occurred (False).

Return type:

tuple[str, bool]

Raises:
  • OSError – If there is an issue with file I/O operations (e.g., saving JP2 images or writing the JSON).

  • Exception – If any other unexpected error occurs.

text_preparation.importer_scripts.preprocessing.swissinfo_extract_ocr_from_pdfs.process_blocks_of_page(page_num: int, page_text_dict: dict, page_image_size: tuple[float]) dict

Process OCR blocks from a page by cleaning, rescaling, and organizing them.

Cleans and prepares OCR block data for a specific page by:

  • Removing unnecessary keys (like images and masks) from each block.

  • Rescaling all bounding box coordinates to match the provided image size.

  • Separating blocks that contain lines from those that do not.

Parameters:
  • page_num (int) – The number of the page being processed (used for logging and output).

  • page_text_dict (dict) – A dictionary representing the OCR data for the page. Must contain “width”, “height”, and a list of “blocks”.

  • page_image_size (tuple[float]) – The target image size (width, height) for which bounding boxes should be rescaled.

Returns:

A dictionary containing the processed information for the page, with keys:
  • ”page_num”: The page number.

  • ”ocr_page_size”: The original OCR coordinate space (width, height).

  • ”jp2_img_size”: The target image size used for rescaling.

  • ”blocks_with_lines”: List of blocks that contain text lines.

  • ”blocks_without_lines”: List of blocks that do not contain text lines.

Return type:

dict

text_preparation.importer_scripts.preprocessing.swissinfo_extract_ocr_from_pdfs.remove_key_from_block(block: dict, key: str, page_num: int, block_idx: int) dict

Remove a specified key from a block dictionary if it exists, and log the action.

Parameters:
  • block (dict) – The block dictionary from which the key should be removed.

  • key (str) – The key to remove from the block.

  • page_num (int) – The page number associated with the block (for logging purposes).

  • block_idx (int) – The index of the block on the page (for logging purposes).

Returns:

The updated block dictionary with the key removed if it was present.

Return type:

dict

text_preparation.importer_scripts.preprocessing.swissinfo_extract_ocr_from_pdfs.rescale_block_coords(block: dict, curr_img_size: tuple[float], dest_img_size: tuple[float]) dict

Rescale bounding box coordinates in a block and its nested lines and spans.

Parameters:
  • block (dict) – A dictionary representing a layout block that may contain a “bbox”, and optionally a list of “lines”, each of which may contain “spans”.

  • curr_img_size (tuple[float]) – The current size of the image as (width, height).

  • dest_img_size (tuple[float]) – The target size of the image as (width, height).

Returns:

The updated block dictionary with rescaled bounding boxes added under the key “rescaled_bbox” at each relevant level (block, lines, spans).

Return type:

dict
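The per-box rescaling amounts to multiplying each coordinate by the ratio of destination to current size. A minimal sketch of that arithmetic (the real function also walks the nested “lines” and “spans” and stores results under “rescaled_bbox”):

```python
def rescale_bbox(
    bbox: tuple[float, float, float, float],
    curr_img_size: tuple[float, float],
    dest_img_size: tuple[float, float],
) -> tuple[float, float, float, float]:
    """Scale an (x0, y0, x1, y1) box from curr_img_size to dest_img_size."""
    # Independent scale factors along each axis.
    sx = dest_img_size[0] / curr_img_size[0]
    sy = dest_img_size[1] / curr_img_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

# rescale_bbox((10, 10, 20, 20), (100, 100), (200, 300))
# -> (20.0, 30.0, 40.0, 60.0)
```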

text_preparation.importer_scripts.preprocessing.swissinfo_extract_ocr_from_pdfs.save_as_jp2(pil_imgs: list[Image.Image], canonical_path: str, out_base_dir: str) tuple[list[str], bool]

Save a list of PIL images as JPEG 2000 (.jp2) files in a structured directory.

The function constructs output file paths using a canonical issue ID, derived from the given canonical_path. Each image is saved with a sequential page number in the format: “{canonical_issue_id}-pXXXX.jp2”

Parameters:
  • pil_imgs (list[Image.Image]) – A list of PIL images to be saved.

  • canonical_path (str) – The structured canonical path for the images.

  • out_base_dir (str) – The base output directory where the images should be saved.

Returns:

A tuple containing:
  • list[str]: The file paths where the images were saved.

  • bool: True if all images were saved successfully, False if an error occurred.

Return type:

tuple[list[str], bool]

Raises:

OSError – If there is an issue creating directories or saving images.
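The “{canonical_issue_id}-pXXXX.jp2” naming convention can be illustrated as below. The issue ID and the numbering start are assumptions, and the actual image encoding with Pillow/OpenJPEG is omitted; this only shows how the output paths are formed:

```python
def jp2_page_paths(canonical_issue_id: str, out_dir: str, n_pages: int) -> list[str]:
    """Hypothetical sketch of the JP2 output-path construction."""
    # Sequential, zero-padded page numbers (assumed to start at 1).
    return [
        f"{out_dir}/{canonical_issue_id}-p{page:04d}.jp2"
        for page in range(1, n_pages + 1)
    ]

# jp2_page_paths("SOC-ID", "/out", 2)
# -> ["/out/SOC-ID-p0001.jp2", "/out/SOC-ID-p0002.jp2"]
```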