Utilities

Some utilities are shared between both modules; more general utility functions also belong here.

This module contains generic helper functions for the text-importer module.

text_preparation.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) dict[str, Any]

Add a property and value to a given object dict computed with a given function.

Parameters:
  • object_dict (dict[str, Any]) – Object to which the property is added.

  • prop_name (str) – Name of the property to add.

  • prop_function (Callable[[str], str]) – Function computing the property value.

  • function_input (str) – Input to prop_function for this object.

Returns:

Updated object.

Return type:

dict[str, Any]
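
The behaviour described above can be sketched as follows. This is a minimal illustration, assuming the function simply stores prop_function(function_input) under prop_name and returns the same dict; add_property_sketch and the sample data are hypothetical and not the library's implementation.

```python
from typing import Any, Callable


def add_property_sketch(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    """Hypothetical stand-in: compute a value and store it under prop_name."""
    # Assumption: the dict is updated in place and then returned.
    object_dict[prop_name] = prop_function(function_input)
    return object_dict


# Made-up object and property, for illustration only.
page = add_property_sketch({"id": "page-1"}, "title", str.upper, "front page")
print(page)
```

Passing the function and its input separately lets the same call pattern be reused for any property whose value is derived from a single string.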

text_preparation.utils.coords_to_xy(coords: list[int | Any], as_int: bool = False) list[int | Any]

Convert coordinates from xywh format to x1y1x2y2 format.

Parameters:
  • coords (list) – Coords in xywh format to convert.

  • as_int (bool, optional) – Whether to cast elements to int before conversion. Defaults to False.

Returns:

Resulting converted coordinates, now in x1y1x2y2 format.

Return type:

list[int | Any]

text_preparation.utils.coords_to_xywh(coords: list[int | Any], as_int: bool = True) list[int | Any]

Convert coordinates from x1y1x2y2 format to xywh format.

Parameters:
  • coords (list) – Coords in x1y1x2y2 format to convert.

  • as_int (bool, optional) – Whether to cast elements to int before conversion. Defaults to True.

Returns:

Resulting converted coordinates, now in xywh format.

Return type:

list[int | Any]
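
The two conversions above are inverses of each other. The sketch below illustrates the arithmetic, assuming xywh means (x, y, width, height) and x1y1x2y2 means (left, top, right, bottom); the function names are hypothetical stand-ins rather than the library's own code.

```python
def xywh_to_xyxy(coords, as_int=False):
    """Hypothetical stand-in: (x, y, w, h) -> (x1, y1, x2, y2)."""
    if as_int:
        coords = [int(c) for c in coords]
    x, y, w, h = coords
    # The bottom-right corner is the origin shifted by the box size.
    return [x, y, x + w, y + h]


def xyxy_to_xywh(coords, as_int=True):
    """Hypothetical stand-in: (x1, y1, x2, y2) -> (x, y, w, h)."""
    if as_int:
        coords = [int(c) for c in coords]
    x1, y1, x2, y2 = coords
    # Width and height are the differences between opposite corners.
    return [x1, y1, x2 - x1, y2 - y1]


box_xywh = [10, 20, 30, 40]
box_xyxy = xywh_to_xyxy(box_xywh)
round_trip = xyxy_to_xywh(box_xyxy)
print(box_xyxy, round_trip)
```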

text_preparation.utils.draw_box_on_img(base_img_path: str, coords_xy: list, img: Image = None, width: int = 10) Image

Draw a bounding box on an image given coordinates in x1y1x2y2 format.

The image can be provided either through its path or as a PIL.Image object (in particular when other bboxes have already been drawn on it).

Parameters:
  • base_img_path (str) – Path to the image to open as a PIL Image.

  • coords_xy (list) – Coordinates of the bbox to draw.

  • img (Image, optional) – PIL image if already loaded. Defaults to None.

  • width (int, optional) – Stroke width for the bbox. Defaults to 10.

Returns:

Resulting PIL Image with the bbox drawn on it.

Return type:

Image

text_preparation.utils.empty_folder(dir_path: str) None

Empty a directory given its path if it exists.

Parameters:

dir_path (str) – Path to the directory to empty.
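
One way to empty a directory with the standard library is sketched below. It assumes the directory itself is kept and only its contents are removed; the actual function may instead delete and recreate the directory. empty_folder_sketch is a hypothetical name.

```python
import os
import shutil
import tempfile


def empty_folder_sketch(dir_path: str) -> None:
    """Hypothetical stand-in: remove everything inside dir_path, keep dir_path."""
    if not os.path.isdir(dir_path):
        return  # nothing to do when the directory does not exist
    for name in os.listdir(dir_path):
        path = os.path.join(dir_path, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # remove sub-directories recursively
        else:
            os.remove(path)  # remove files and symlinks


# Build a throwaway directory with one file and one sub-directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as handle:
    handle.write("x")
os.makedirs(os.path.join(demo_dir, "sub"))

empty_folder_sketch(demo_dir)
remaining = os.listdir(demo_dir)
still_exists = os.path.isdir(demo_dir)
os.rmdir(demo_dir)  # clean up the now-empty demo directory
```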

text_preparation.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/canonical/issue.schema.json') Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/canonical/issue.schema.json”.

Returns:

Issue schema based on canonical format.

Return type:

pjs.util.Namespace

text_preparation.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/canonical/page.schema.json') Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/canonical/page.schema.json”.

Returns:

Printed page schema based on canonical format.

Return type:

pjs.util.Namespace

text_preparation.utils.get_reading_order(items: list[dict[str, Any]]) dict[str, int]

Generate a reading order for items based on their id and the pages they span.

This reading order can be used to display the content items properly in a table of contents without skipping from page to page.

Parameters:

items (list[dict[str, Any]]) – List of items to reorder for the ToC.

Returns:

A dictionary mapping item IDs to their reading order.

Return type:

dict[str, int]
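
A possible ordering scheme is sketched below. It relies on several assumptions that the description above does not confirm: each item carries its metadata under an "m" key, with an "id" ending in "-i" followed by a number and a "pp" list of page numbers; items are sorted by the pages they span and then by item number. The function name and the sample IDs are made up for illustration.

```python
from typing import Any


def reading_order_sketch(items: list[dict[str, Any]]) -> dict[str, int]:
    """Hypothetical stand-in: map item IDs to a 1-based reading position."""

    def sort_key(item: dict[str, Any]) -> tuple[list[int], int]:
        meta = item["m"]
        # Assumption: the ID ends with "-i<number>", e.g. "...-i0003".
        item_number = int(meta["id"].split("-i")[-1])
        # Sort by spanned pages first, then by item number.
        return (meta["pp"], item_number)

    ordered = sorted(items, key=sort_key)
    return {item["m"]["id"]: pos for pos, item in enumerate(ordered, start=1)}


# Made-up content items; IDs and page lists are illustrative only.
items = [
    {"m": {"id": "NEWS-1900-01-01-a-i0003", "pp": [2]}},
    {"m": {"id": "NEWS-1900-01-01-a-i0001", "pp": [1, 2]}},
    {"m": {"id": "NEWS-1900-01-01-a-i0002", "pp": [1]}},
]
order = reading_order_sketch(items)
print(order)
```

With this key, an item confined to page 1 is listed before an item that starts on page 1 and continues onto page 2, which in turn precedes items starting on page 2.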

text_preparation.utils.read_xml(file_path: str) BeautifulSoup

Read the content of an XML file to a BeautifulSoup object.

Parameters:

file_path (str) – Path to the XML file.

Returns:

Resulting BeautifulSoup object.

Return type:

BeautifulSoup

text_preparation.utils.rescale_coords(coords: list[float], curr_size: tuple[float, float] = None, dest_size: tuple[float, float] = None, curr_res: float = None, dest_res: float = None, xy_format: bool = True, int_sc_factor: bool = False) list[float]

Scales image or bounding box coordinates based on image size or resolution.

This function rescales a set of coordinates (coords) based on either:
  • The current and target image sizes (curr_size and dest_size).

  • The current and target resolutions (curr_res and dest_res).

If xy_format is False and curr_res/dest_res are not provided, the function estimates a resolution-based scaling factor using image sizes (width * height).

When xy_format is True, the function assumes coordinates are in “x1, y1, x2, y2” format. Otherwise, it assumes “x, y, width, height” format.

Parameters:
  • coords (list[float]) – List of coordinates to be scaled.

  • curr_size (tuple[float, float], optional) – Current image size (width, height). Required if xy_format=True. Defaults to None.

  • dest_size (tuple[float, float], optional) – Target image size (width, height). Required if xy_format=True. Defaults to None.

  • curr_res (float, optional) – Current image resolution (optional for xy_format=False).

  • dest_res (float, optional) – Target image resolution (optional for xy_format=False).

  • xy_format (bool, optional) – If True, treats coordinates as “x1, y1, x2, y2”. If False, treats coordinates as “x, y, width, height”. Defaults to True.

  • int_sc_factor (bool, optional) – If True, scales using integer division for factor calculation. Defaults to False.

Returns:

Scaled coordinates.

Return type:

list[float]

Raises:
  • ValueError – If required parameters (size or resolution) are missing.

  • ValueError – If curr_size or curr_res contain zero.
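
The size-based case can be sketched as below. This covers only coordinates in x1y1x2y2 format scaled by the ratio of target to current image size, and assumes x values scale with the width ratio and y values with the height ratio; the resolution-based path and int_sc_factor are left out. rescale_xyxy_sketch is a hypothetical name, not the library function.

```python
def rescale_xyxy_sketch(
    coords: list[float],
    curr_size: tuple[float, float],
    dest_size: tuple[float, float],
) -> list[float]:
    """Hypothetical stand-in: rescale (x1, y1, x2, y2) between image sizes."""
    curr_w, curr_h = curr_size
    dest_w, dest_h = dest_size
    if curr_w == 0 or curr_h == 0:
        raise ValueError("curr_size must not contain zero")
    # Assumption: x values scale with width, y values with height.
    x_factor = dest_w / curr_w
    y_factor = dest_h / curr_h
    x1, y1, x2, y2 = coords
    return [x1 * x_factor, y1 * y_factor, x2 * x_factor, y2 * y_factor]


scaled = rescale_xyxy_sketch([10, 20, 30, 40], (100, 200), (200, 400))
print(scaled)
```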

Example

>>> rescale_coords([10, 20, 30, 40], (100, 200), (200, 400))
[20.0, 40.0, 60.0, 80.0]

text_preparation.utils.validate_audio_schema(audio_json: dict, audio_schema: str = 'schemas/json/canonical/audio_record.schema.json') None

text_preparation.utils.validate_issue_schema(issue_json: dict, issue_schema: str = 'schemas/json/canonical/issue.schema.json') None

text_preparation.utils.validate_page_schema(page_json: dict, page_schema: str = 'schemas/json/canonical/page.schema.json') None

text_preparation.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) None

Verify that the imported issues fit expectations.

Two verifications are done: the number of content items, and their IDs.

Parameters:
  • actual_issue_json (dict[str, Any]) – Created issue json.

  • expected_issue_json (dict[str, Any]) – Expected issue json.
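
The two checks can be sketched as follows. The layout of the issue JSON is an assumption here: content items are taken to sit in a list under "i", each with its ID under "m" and "id". verify_issue_sketch and the sample IDs are hypothetical.

```python
from typing import Any


def verify_issue_sketch(actual: dict[str, Any], expected: dict[str, Any]) -> None:
    """Hypothetical stand-in: compare content item counts and IDs."""
    # Assumption: content items live under "i", IDs under item["m"]["id"].
    actual_ids = {item["m"]["id"] for item in actual["i"]}
    expected_ids = {item["m"]["id"] for item in expected["i"]}
    # Check 1: same number of content items.
    assert len(actual["i"]) == len(expected["i"]), "content item count differs"
    # Check 2: same set of content item IDs, regardless of order.
    assert actual_ids == expected_ids, "content item IDs differ"


# Made-up issues containing the same two items in a different order.
expected = {"i": [{"m": {"id": "a-i0001"}}, {"m": {"id": "a-i0002"}}]}
actual = {"i": [{"m": {"id": "a-i0002"}}, {"m": {"id": "a-i0001"}}]}
verify_issue_sketch(actual, expected)  # passes silently
```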

text_preparation.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) None

Write the given error of a failed import to the failed_log file.

Adapted from impresso-text-acquisition/text_preparation/importers/core.py to allow using an issue or page ID, and to record the function in which the error took place.

Parameters:
  • thing_id (str) – Canonical ID of the object/file for which the error occurred.

  • origin_function (str) – Function in which the exception occurred.

  • error (Exception) – Error that occurred and should be logged.

  • failed_log (str) – Path to log file for failed imports.
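
A minimal sketch of this kind of error logging is shown below. The line format is an assumption, since the real log layout is not documented here; write_error_sketch, the sample ID and the function name "parse_issue" are hypothetical.

```python
import os
import tempfile


def write_error_sketch(
    thing_id: str, origin_function: str, error: Exception, failed_log: str
) -> None:
    """Hypothetical stand-in: append one line per failed import to a log file."""
    # Assumed line format, chosen only for illustration.
    note = f"Error in {origin_function} for {thing_id}: {error!r}"
    with open(failed_log, "a", encoding="utf-8") as log_file:
        log_file.write(note + "\n")


log_path = os.path.join(tempfile.mkdtemp(), "failed.log")
try:
    int("not a number")
except ValueError as exc:
    # Made-up ID and origin function name.
    write_error_sketch("sample-issue-id", "parse_issue", exc, log_path)

with open(log_path, encoding="utf-8") as log_file:
    logged = log_file.read()
print(logged)
```

Opening the log in append mode keeps earlier failures in place, so one file can collect errors from a whole import run.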

text_preparation.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) None

Write the given contents to a JSONL file given its path.

Filelocks are used here to prevent concurrent writing to the files.

Parameters:
  • filepath (str) – Path to the JSONL file to write to.

  • contents (str | list[str]) – Dump contents to write to the file.

  • content_type (str) – Type of content that is being written to the file.

  • failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.

Tokenization rules for various languages.

text_preparation.tokenization.insert_whitespace(token: str, following_token: str, previous_token: str, language: str) bool

Determine whether a whitespace should be inserted after a token.

Parameters:
  • token (str) – Current token.

  • following_token (str) – Following token.

  • previous_token (str) – Previous token.

  • language (str) – Language of text.

Returns:

Whether a whitespace should be inserted after the token.

Return type:

bool