Utilities
Some utilities are shared between both modules; more general utility functions also belong here.
This module contains generic helper functions for the text-importer module.
- text_preparation.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) dict[str, Any]
Add a property and value to a given object dict computed with a given function.
- Parameters:
object_dict (dict[str, Any]) – Object to which the property is added.
prop_name (str) – Name of the property to add.
prop_function (Callable[[str], str]) – Function computing the property value.
function_input (str) – Input to prop_function for this object.
- Returns:
Updated object.
- Return type:
dict[str, Any]
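The behavior described above can be sketched as follows. This is a minimal stand-in, not the library code: the name add_property_sketch, the in-place update of the dict, the example ID, and the use of str.upper as the property function are all assumptions made for illustration.

```python
from typing import Any, Callable


def add_property_sketch(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    """Illustrative stand-in for add_property (assumed behavior)."""
    # Compute the value from the given input and store it under prop_name.
    # Assumption: the dict is updated in place and also returned.
    object_dict[prop_name] = prop_function(function_input)
    return object_dict


# Hypothetical object; str.upper is only a placeholder property function.
page = {"id": "example-page-id"}
updated = add_property_sketch(page, "label", str.upper, "front page")
print(updated)
```

Passing the function and its input separately lets the same call pattern be reused for any property that can be derived from a single string.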
- text_preparation.utils.coords_to_xy(coords: list[int | Any], as_int: bool = False) list[int | Any]
Convert coordinates from xywh format to x1y1x2y2 format.
- Parameters:
coords (list) – Coords in xywh format to convert.
as_int (bool, optional) – Whether to cast elements to int before conversion. Defaults to False.
- Returns:
Resulting converted coordinates, now in x1y1x2y2 format.
- Return type:
list[int | Any]
- text_preparation.utils.coords_to_xywh(coords: list[int | Any], as_int: bool = True) list[int | Any]
Convert coordinates from x1y1x2y2 format to xywh format.
- Parameters:
coords (list) – Coords in x1y1x2y2 format to convert.
as_int (bool, optional) – Whether to cast elements to int before conversion. Defaults to True.
- Returns:
Resulting converted coordinates, now in xywh format.
- Return type:
list[int | Any]
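The two conversions are inverses of each other. The arithmetic can be sketched as below; the helper names xywh_to_xyxy and xyxy_to_xywh are hypothetical, and the as_int casting option of the documented functions is deliberately left out.

```python
def xywh_to_xyxy(coords: list[float]) -> list[float]:
    """Convert [x, y, w, h] to [x1, y1, x2, y2] (illustrative sketch)."""
    x, y, w, h = coords
    # The second corner is the first corner shifted by width and height.
    return [x, y, x + w, y + h]


def xyxy_to_xywh(coords: list[float]) -> list[float]:
    """Convert [x1, y1, x2, y2] to [x, y, w, h] (illustrative sketch)."""
    x1, y1, x2, y2 = coords
    # Width and height are the differences between the two corners.
    return [x1, y1, x2 - x1, y2 - y1]


box_xywh = [10, 20, 30, 40]
box_xyxy = xywh_to_xyxy(box_xywh)
print(box_xyxy)                # [10, 20, 40, 60]
print(xyxy_to_xywh(box_xyxy))  # [10, 20, 30, 40]
```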
- text_preparation.utils.draw_box_on_img(base_img_path: str, coords_xy: list, img: PIL.Image = None, width: int = 10) PIL.Image
Draw a bounding box on an image given coordinates in x1y1x2y2 format.
The image can be provided either through its path or as a PIL.Image object (in particular when other bboxes have already been drawn on it).
- Parameters:
base_img_path (str) – Path to the image to open as a PIL Image.
coords_xy (list) – Coordinates of the bbox to draw.
img (Image, optional) – PIL image if already loaded. Defaults to None.
width (int, optional) – Stroke width for the bbox. Defaults to 10.
- Returns:
Resulting PIL Image with the bbox drawn on it.
- Return type:
Image
- text_preparation.utils.empty_folder(dir_path: str) None
Empty a directory given its path if it exists.
- Parameters:
dir_path (str) – Path to the directory to empty.
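One way to empty a directory with the standard library is sketched below. It assumes that the directory itself is kept and only its contents are removed, and that a missing directory is silently ignored; the name empty_folder_sketch is hypothetical and the actual implementation may differ.

```python
import os
import shutil
import tempfile


def empty_folder_sketch(dir_path: str) -> None:
    """Remove everything inside dir_path, keeping the directory (assumed behavior)."""
    if not os.path.isdir(dir_path):
        return  # Nothing to do if the directory does not exist.
    # Materialize the listing first so entries are not deleted mid-iteration.
    with os.scandir(dir_path) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)


# Demo on a throwaway temporary directory.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "a.txt"), "w").close()
os.makedirs(os.path.join(demo_dir, "sub"))
empty_folder_sketch(demo_dir)
remaining = os.listdir(demo_dir)
print(remaining)  # []
```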
- text_preparation.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/canonical/issue.schema.json') Namespace
Generate a list of python classes starting from a JSON schema.
- Parameters:
schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/canonical/issue.schema.json”.
- Returns:
Issue schema based on canonical format.
- Return type:
pjs.util.Namespace
- text_preparation.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/canonical/page.schema.json') Namespace
Generate a list of python classes starting from a JSON schema.
- Parameters:
schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/canonical/page.schema.json”.
- Returns:
Printed page schema based on canonical format.
- Return type:
pjs.util.Namespace
- text_preparation.utils.get_reading_order(items: list[dict[str, Any]]) dict[str, int]
Generate a reading order for items based on their id and the pages they span.
This reading order can be used to display the content items properly in a table of contents without skipping from page to page.
- Parameters:
items (list[dict[str, Any]]) – List of items to reorder for the ToC.
- Returns:
A dictionary mapping item IDs to their reading order.
- Return type:
dict[str, int]
- text_preparation.utils.read_xml(file_path: str) BeautifulSoup
Read the content of an XML file to a BeautifulSoup object.
- Parameters:
file_path (str) – Path to the XML file.
- Returns:
Resulting BeautifulSoup object.
- Return type:
BeautifulSoup
- text_preparation.utils.rescale_coords(coords: list[float], curr_size: tuple[float, float] = None, dest_size: tuple[float, float] = None, curr_res: float = None, dest_res: float = None, xy_format: bool = True, int_sc_factor: bool = False) list[float]
Scales image or bounding box coordinates based on image size or resolution.
This function rescales a set of coordinates (coords) based on either the current and target image sizes (curr_size and dest_size) or the current and target resolutions (curr_res and dest_res).
If xy_format is False and curr_res/dest_res are not provided, the function estimates a resolution-based scaling factor using image sizes (width * height).
When xy_format is True, the function assumes coordinates are in “x1, y1, x2, y2” format. Otherwise, it assumes “x, y, width, height” format.
- Parameters:
coords (list[float]) – List of coordinates to be scaled.
curr_size (tuple[float, float], optional) – Current image size (width, height). Required if xy_format=True. Defaults to None.
dest_size (tuple[float, float], optional) – Target image size (width, height). Required if xy_format=True. Defaults to None.
curr_res (float, optional) – Current image resolution (optional for xy_format=False).
dest_res (float, optional) – Target image resolution (optional for xy_format=False).
xy_format (bool, optional) – If True, treats coordinates as “x1, y1, x2, y2”. If False, treats coordinates as “x, y, width, height”. Defaults to True.
int_sc_factor (bool, optional) – If True, scales using integer division for factor calculation. Defaults to False.
- Returns:
Scaled coordinates.
- Return type:
list[float]
- Raises:
ValueError – If required parameters (size or resolution) are missing.
ValueError – If curr_size or curr_res contain zero.
Example
>>> rescale_coords([10, 20, 30, 40], (100, 200), (200, 400))
[20.0, 40.0, 60.0, 80.0]
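The size-based case for coordinates in "x1, y1, x2, y2" format can be sketched as follows. This is a simplified illustration under stated assumptions, not the library code: the name rescale_xyxy is hypothetical, x values are scaled by the width ratio and y values by the height ratio, and the resolution-based path, the xywh format, and int_sc_factor are omitted.

```python
def rescale_xyxy(
    coords: list[float],
    curr_size: tuple[float, float],
    dest_size: tuple[float, float],
) -> list[float]:
    """Scale [x1, y1, x2, y2] from curr_size to dest_size (illustrative sketch)."""
    if curr_size is None or dest_size is None:
        raise ValueError("curr_size and dest_size are required")
    if 0 in curr_size:
        raise ValueError("curr_size must not contain zero")
    # Sizes are (width, height): x values follow the width ratio,
    # y values follow the height ratio.
    x_factor = dest_size[0] / curr_size[0]
    y_factor = dest_size[1] / curr_size[1]
    x1, y1, x2, y2 = coords
    return [x1 * x_factor, y1 * y_factor, x2 * x_factor, y2 * y_factor]


scaled = rescale_xyxy([10, 20, 30, 40], (100, 200), (200, 400))
print(scaled)  # [20.0, 40.0, 60.0, 80.0]

# Non-uniform scaling: width halves while height doubles.
squeezed = rescale_xyxy([10, 20, 30, 40], (100, 200), (50, 400))
print(squeezed)  # [5.0, 40.0, 15.0, 80.0]
```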
- text_preparation.utils.validate_audio_schema(audio_json: dict, audio_schema: str = 'schemas/json/canonical/audio_record.schema.json') None
- text_preparation.utils.validate_issue_schema(issue_json: dict, issue_schema: str = 'schemas/json/canonical/issue.schema.json') None
- text_preparation.utils.validate_page_schema(page_json: dict, page_schema: str = 'schemas/json/canonical/page.schema.json') None
- text_preparation.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) None
Verify that the imported issues fit expectations.
Two verifications are done: the number of content items, and their IDs.
- Parameters:
actual_issue_json (dict[str, Any]) – Created issue JSON.
expected_issue_json (dict[str, Any]) – Expected issue json.
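The two verifications can be sketched as below. The JSON layout is an assumption made for illustration: content items are taken to live under an "i" key, each carrying its ID at item["m"]["id"]. The name verify_issue_sketch, the order-insensitive ID comparison, and the use of AssertionError are likewise hypothetical choices.

```python
from typing import Any


def verify_issue_sketch(actual: dict[str, Any], expected: dict[str, Any]) -> None:
    """Compare content item counts and IDs of two issue dicts (illustrative sketch)."""
    # Assumed layout: items under "i", each with an ID at item["m"]["id"].
    actual_ids = [item["m"]["id"] for item in actual["i"]]
    expected_ids = [item["m"]["id"] for item in expected["i"]]
    # Check 1: same number of content items.
    assert len(actual_ids) == len(expected_ids), "content item count differs"
    # Check 2: same content item IDs, regardless of order.
    assert set(actual_ids) == set(expected_ids), "content item IDs differ"


# Hypothetical issues with placeholder IDs.
expected_issue = {"i": [{"m": {"id": "item-1"}}, {"m": {"id": "item-2"}}]}
actual_issue = {"i": [{"m": {"id": "item-2"}}, {"m": {"id": "item-1"}}]}
verify_issue_sketch(actual_issue, expected_issue)  # Passes silently.
```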
- text_preparation.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) None
Write the given error of a failed import to the failed_log file.
Adapted from impresso-text-acquisition/text_preparation/importers/core.py to allow using an issue or page ID, and to provide the function in which the error took place.
- Parameters:
thing_id (str) – Canonical ID of the object/file for which the error occurred.
origin_function (str) – Function in which the exception occurred.
error (Exception) – Error that occurred and should be logged.
failed_log (str) – Path to log file for failed imports.
- text_preparation.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) None
Write the given contents to a JSONL file given its path.
Filelocks are used here to prevent concurrent writing to the files.
- Parameters:
filepath (str) – Path to the JSONL file to write to.
contents (str | list[str]) – Dump contents to write to the file.
content_type (str) – Type of content that is being written to the file.
failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
Tokenization rules for various languages.
- text_preparation.tokenization.insert_whitespace(token: str, following_token: str, previous_token: str, language: str) bool
Determine whether a whitespace should be inserted after a token.
- Parameters:
token (str) – Current token.
following_token (str) – Following token.
previous_token (str) – Previous token.
language (str) – Language of text.
- Returns:
Whether a whitespace should be inserted after the token.
- Return type:
bool