Utilities

There are also some utilities shared between both modules; more general utility functions should be placed here as well.

This module contains generic helper functions for the text-importer module.

text_preparation.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) → dict[str, Any]

Add a property and value to a given object dict computed with a given function.

Parameters:

object_dict (dict[str, Any]) – Object to which the property is added.
prop_name (str) – Name of the property to add.
prop_function (Callable[[str], str]) – Function computing the property value.
function_input (str) – Input to prop_function for this object.

Returns:

Updated object.

Return type:

dict[str, Any]
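The behaviour described above can be sketched as follows. This is a minimal local re-implementation, assuming the function simply stores `prop_function(function_input)` under `prop_name` and returns the same dict; the real implementation may differ (for instance in whether it copies the dict). `add_property_sketch` and `make_slug` are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def add_property_sketch(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    """Assumed behaviour: store prop_function(function_input) under prop_name."""
    object_dict[prop_name] = prop_function(function_input)
    return object_dict


def make_slug(title: str) -> str:
    # Hypothetical property function, used only for this example.
    return title.lower().replace(" ", "-")


issue = add_property_sketch({"id": "demo-issue"}, "slug", make_slug, "Front Page")
print(issue)
```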

text_preparation.utils.coords_to_xy(coords: list[int | Any], as_int: bool = False) → list[int | Any]

Convert coordinates from xywh format to x1y1x2y2 format.

Parameters:

coords (list) – Coords in xywh format to convert.
as_int (bool, optional) – Whether to cast elements to int before conversion. Defaults to False.

Returns:

Resulting converted coordinates, now in x1y1x2y2 format.

Return type:

list[int | Any]
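As an illustration of the conversion, here is a small sketch that assumes xywh means `[x, y, width, height]` with `(x, y)` the top-left corner, so that `x2 = x + width` and `y2 = y + height`. `xywh_to_xyxy` is a hypothetical stand-in, not the module's own code.

```python
from typing import Any


def xywh_to_xyxy(coords: list[Any], as_int: bool = False) -> list[Any]:
    """Assumed conversion: [x, y, w, h] -> [x1, y1, x2, y2]."""
    if as_int:
        # Cast first, so that string or float inputs can be added safely.
        coords = [int(c) for c in coords]
    x, y, w, h = coords
    return [x, y, x + w, y + h]


box = xywh_to_xyxy([10, 20, 30, 40])
box_from_strings = xywh_to_xyxy(["10", "20", "30", "40"], as_int=True)
print(box, box_from_strings)
```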

text_preparation.utils.coords_to_xywh(coords: list[int | Any], as_int: bool = True) → list[int | Any]

Convert coordinates from x1y1x2y2 format to xywh format.

Parameters:

coords (list) – Coords in x1y1x2y2 format to convert.
as_int (bool, optional) – Whether to cast elements to int before conversion. Defaults to True.

Returns:

Resulting converted coordinates, now in xywh format.

Return type:

list[int | Any]
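The inverse conversion can be sketched in the same way, assuming width and height are the differences between the two corners. `xyxy_to_xywh` is a hypothetical name; note that casting with `int` truncates fractional values before the subtraction.

```python
from typing import Any


def xyxy_to_xywh(coords: list[Any], as_int: bool = True) -> list[Any]:
    """Assumed conversion: [x1, y1, x2, y2] -> [x, y, w, h]."""
    if as_int:
        # int() truncates towards zero, e.g. 40.5 becomes 40.
        coords = [int(c) for c in coords]
    x1, y1, x2, y2 = coords
    return [x1, y1, x2 - x1, y2 - y1]


xywh = xyxy_to_xywh([10, 20, 40, 60])
xywh_truncated = xyxy_to_xywh([10.9, 20.2, 40.5, 60.0])
print(xywh, xywh_truncated)
```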

text_preparation.utils.draw_box_on_img(base_img_path: str, coords_xy: list, img: PIL.Image = None, width: int = 10) → PIL.Image

Draw a bounding box on an image given coordinates in x1y1x2y2 format.

The image can be provided either through its path or as a PIL.Image object (the latter is useful if other bboxes have already been drawn on it).

Parameters:

base_img_path (str) – Path to the image to open as a PIL Image.
coords_xy (list) – Coordinates of the bbox to draw.
img (Image, optional) – PIL image if already loaded. Defaults to None.
width (int, optional) – Stroke width for the bbox. Defaults to 10.

Returns:

Resulting PIL Image with the bbox drawn on it.

Return type:

Image

text_preparation.utils.empty_folder(dir_path: str) → None

Empty a directory given its path, if it exists.

Parameters:

dir_path (str) – Path to the directory to empty.

text_preparation.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/canonical/issue.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/canonical/issue.schema.json”.

Returns:

Issue schema based on canonical format.

Return type:

pjs.util.Namespace

text_preparation.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/canonical/page.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/canonical/page.schema.json”.

Returns:

Printed page schema based on canonical format.

Return type:

pjs.util.Namespace

text_preparation.utils.get_reading_order(items: list[dict[str, Any]]) → dict[str, int]

Generate a reading order for items based on their id and the pages they span.

This reading order can be used to display the content items properly in a table of contents without skipping from page to page.

Parameters:

items (list[dict[str, Any]]) – List of items to reorder for the ToC.

Returns:

A dictionary mapping item IDs to their reading order.

Return type:

dict[str, int]
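One plausible way to build such a mapping is sketched below. The exact ordering rule is not stated here, so this assumes items are sorted by the first page they appear on and then by ID, with positions starting at 1. The `"id"` and `"pages"` keys and the function name are hypothetical and chosen for readability; the real item dicts may be structured differently.

```python
from typing import Any


def reading_order_sketch(items: list[dict[str, Any]]) -> dict[str, int]:
    """Assumed rule: order by first page, then by ID; positions start at 1."""
    ordered = sorted(items, key=lambda item: (min(item["pages"]), item["id"]))
    return {item["id"]: position for position, item in enumerate(ordered, start=1)}


items = [
    {"id": "i0003", "pages": [2]},
    {"id": "i0001", "pages": [1, 2]},
    {"id": "i0002", "pages": [1]},
]
order = reading_order_sketch(items)
print(order)
```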

text_preparation.utils.read_xml(file_path: str) → BeautifulSoup

Read the content of an XML file to a BeautifulSoup object.

Parameters:

file_path (str) – Path to the XML file.

Returns:

Resulting BeautifulSoup object.

Return type:

BeautifulSoup

text_preparation.utils.rescale_coords(coords: list[float], curr_size: tuple[float, float] = None, dest_size: tuple[float, float] = None, curr_res: float = None, dest_res: float = None, xy_format: bool = True, int_sc_factor: bool = False) → list[float]

Scales image or bounding box coordinates based on image size or resolution.

This function rescales a set of coordinates (coords) based on either:

- The current and target image sizes (curr_size and dest_size).
- The current and target resolutions (curr_res and dest_res).

If xy_format is False and curr_res/dest_res are not provided, the function estimates a resolution-based scaling factor using image sizes (width * height).

When xy_format is True, the function assumes coordinates are in “x1, y1, x2, y2” format. Otherwise, it assumes “x, y, width, height” format.

Parameters:

coords (list[float]) – List of coordinates to be scaled.
curr_size (tuple[float, float], optional) – Current image size (width, height). Required if xy_format=True. Defaults to None.
dest_size (tuple[float, float], optional) – Target image size (width, height). Required if xy_format=True. Defaults to None.
curr_res (float, optional) – Current image resolution (optional for xy_format=False).
dest_res (float, optional) – Target image resolution (optional for xy_format=False).
xy_format (bool, optional) – If True, treats coordinates as “x1, y1, x2, y2”. If False, treats coordinates as “x, y, width, height”. Defaults to True.
int_sc_factor (bool, optional) – If True, scales using integer division for factor calculation. Defaults to False.

Returns:

Scaled coordinates.

Return type:

list[float]

Raises:

ValueError – If required parameters (size or resolution) are missing.
ValueError – If curr_size or curr_res contain zero.

Example

>>> rescale_coords([10, 20, 30, 40], (100, 200), (200, 400))
[20.0, 40.0, 60.0, 80.0]
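The size-based case in the example above can be reproduced with a short sketch. It assumes that, for x1y1x2y2 coordinates, x values are multiplied by the width ratio and y values by the height ratio; the resolution-based path and the `int_sc_factor` option are not covered. `rescale_xyxy_sketch` is a hypothetical name.

```python
def rescale_xyxy_sketch(
    coords: list[float],
    curr_size: tuple[float, float],
    dest_size: tuple[float, float],
) -> list[float]:
    """Assumed size-based scaling for [x1, y1, x2, y2] coordinates."""
    if 0 in curr_size:
        # Mirrors the documented ValueError for a zero current size.
        raise ValueError("curr_size must not contain zero")
    x_factor = dest_size[0] / curr_size[0]
    y_factor = dest_size[1] / curr_size[1]
    x1, y1, x2, y2 = coords
    return [x1 * x_factor, y1 * y_factor, x2 * x_factor, y2 * y_factor]


scaled = rescale_xyxy_sketch([10, 20, 30, 40], (100, 200), (200, 400))
print(scaled)
```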

text_preparation.utils.validate_audio_schema(audio_json: dict, audio_schema: str = 'schemas/json/canonical/audio_record.schema.json') → None

text_preparation.utils.validate_issue_schema(issue_json: dict, issue_schema: str = 'schemas/json/canonical/issue.schema.json') → None

text_preparation.utils.validate_page_schema(page_json: dict, page_schema: str = 'schemas/json/canonical/page.schema.json') → None

text_preparation.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) → None

Verify that the imported issues fit expectations.

Two verifications are done: the number of content items, and their IDs.

Parameters:

actual_issue_json (dict[str, Any]) – Created issue json.
expected_issue_json (dict[str, Any]) – Expected issue json.
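The two checks can be sketched as follows. The layout of the issue JSON is an assumption here: content items are taken to live in a list under the key `"i"`, each with its ID at `item["m"]["id"]`. Whether the real function raises `AssertionError` or reports mismatches differently is also assumed; `verify_issue_sketch` is a hypothetical name.

```python
from typing import Any


def verify_issue_sketch(actual: dict[str, Any], expected: dict[str, Any]) -> None:
    """Assumed checks: same number of content items, and same set of IDs."""
    actual_items = actual["i"]      # assumed key for the content items
    expected_items = expected["i"]
    assert len(actual_items) == len(expected_items), "content item count differs"

    actual_ids = {item["m"]["id"] for item in actual_items}
    expected_ids = {item["m"]["id"] for item in expected_items}
    assert actual_ids == expected_ids, "content item IDs differ"


expected_issue = {"i": [{"m": {"id": "demo-i0001"}}, {"m": {"id": "demo-i0002"}}]}
actual_issue = {"i": [{"m": {"id": "demo-i0002"}}, {"m": {"id": "demo-i0001"}}]}
verify_issue_sketch(actual_issue, expected_issue)  # passes: same count, same IDs
```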

text_preparation.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) → None

Write the given error of a failed import to the failed_log file.

Adapted from impresso-text-acquisition/text_preparation/importers/core.py to allow using an issue or page ID and to record the function in which the error took place.

Parameters:

thing_id (str) – Canonical ID of the object/file for which the error occurred.
origin_function (str) – Function in which the exception occurred.
error (Exception) – Error that occurred and should be logged.
failed_log (str) – Path to log file for failed imports.

text_preparation.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) → None

Write the given contents to a JSONL file given its path.

Filelocks are used here to prevent concurrent writing to the files.

Parameters:

filepath (str) – Path to the JSONL file to write to.
contents (str | list[str]) – Dump contents to write to the file.
content_type (str) – Type of content that is being written to the file.
failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
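The idea of guarding writes with a lock can be sketched with the standard library alone. This is not the module's implementation: the lock below is a simple exclusive lock file standing in for whatever file-lock mechanism the function actually uses, and appending (rather than overwriting) is an assumption. `simple_lock` and `append_jsonl_sketch` are hypothetical names, and the `content_type` and `failed_log` parameters are left out.

```python
from __future__ import annotations

import json
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def simple_lock(lock_path: str, poll_seconds: float = 0.05):
    """Lock-file stand-in: create the file exclusively, remove it on exit."""
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_seconds)  # another writer currently holds the lock
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def append_jsonl_sketch(filepath: str, contents: str | list[str]) -> None:
    """Append one or more already-serialized JSON strings, one per line."""
    lines = [contents] if isinstance(contents, str) else contents
    with simple_lock(filepath + ".lock"):
        with open(filepath, "a", encoding="utf-8") as f:
            for line in lines:
                f.write(line.rstrip("\n") + "\n")


path = os.path.join(tempfile.mkdtemp(), "pages.jsonl")
append_jsonl_sketch(path, json.dumps({"id": "page-1"}))
append_jsonl_sketch(path, [json.dumps({"id": "page-2"}), json.dumps({"id": "page-3"})])
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```

A stale lock file left behind by a crashed process would block this sketch indefinitely, which is one reason a dedicated file-locking library is usually preferable in practice.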

Tokenization rules for various languages.

text_preparation.tokenization.insert_whitespace(token: str, following_token: str, previous_token: str, language: str) → bool

Determine whether a whitespace should be inserted after a token.

Parameters:

token (str) – Current token.
following_token (str) – Following token.
previous_token (str) – Previous token.
language (str) – Language of text.

Returns:

Whether a whitespace should be inserted after the token.

Return type:

bool
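To give a feel for what such a decision can look like, here is a purely illustrative sketch. The rules below are assumptions and not the module's actual tokenization rules: no space is inserted before common closing punctuation or after opening brackets, and French is assumed to keep a space before `: ; ! ?` while other languages do not. `previous_token` is accepted for signature parity but unused in this simplified version.

```python
def insert_whitespace_sketch(
    token: str, following_token: str, previous_token: str, language: str
) -> bool:
    """Illustrative rules only; the real per-language rules may differ."""
    no_space_before = {".", ",", ")", "]"}
    no_space_after = {"(", "["}
    if language != "fr":
        # Assumed: only French typography puts a space before these marks.
        no_space_before |= {":", ";", "!", "?"}

    if not following_token:
        return False  # nothing follows, so no separator is needed
    if token in no_space_after or following_token in no_space_before:
        return False
    return True


print(insert_whitespace_sketch("Hello", ",", "", "en"))
print(insert_whitespace_sketch("Bonjour", "!", "", "fr"))
```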