Utilities

Some utilities are shared between both modules; this is also where more general utility functions belong.

This module contains generic helper functions for the text-importer module.

text_preparation.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) → dict[str, Any]

Add a property to a given object dict, with its value computed by a given function.

Parameters:
  • object_dict (dict[str, Any]) – Object to which the property is added.

  • prop_name (str) – Name of the property to add.

  • prop_function (Callable[[str], str]) – Function computing the property value.

  • function_input (str) – Input to prop_function for this object.

Returns:

Updated object.

Return type:

dict[str, Any]
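As an illustration of the behaviour described above, here is a minimal stand-in sketch. The name add_property_sketch, the sample ID, and the choice to update the dict in place are assumptions made for this example; the packaged function may differ in those details.

```python
from typing import Any, Callable


def add_property_sketch(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    """Hypothetical stand-in: compute a value and store it under prop_name."""
    # Assumption: the dict is updated in place and then returned.
    object_dict[prop_name] = prop_function(function_input)
    return object_dict


# Made-up example object: derive a "year" property from its ID.
issue = {"id": "GDL-1900-01-02-a"}
updated = add_property_sketch(
    issue, "year", lambda issue_id: issue_id.split("-")[1], issue["id"]
)
print(updated)
```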

text_preparation.utils.empty_folder(dir_path: str) → None

Empty a directory given its path, if it exists.

Parameters:

dir_path (str) – Path to the directory to empty.
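One way such a helper could be written with the standard library is sketched below. The name empty_folder_sketch is hypothetical, and the sketch assumes that the directory itself is kept while its contents are removed, which the description above suggests but does not state outright.

```python
import os
import shutil
import tempfile


def empty_folder_sketch(dir_path: str) -> None:
    """Hypothetical stand-in: delete everything inside dir_path, keep dir_path."""
    if not os.path.isdir(dir_path):
        return  # nothing to do when the directory does not exist
    for name in os.listdir(dir_path):
        path = os.path.join(dir_path, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # real sub-directory: remove recursively
        else:
            os.remove(path)  # regular file or symlink


# Try it on a throwaway directory holding one file and one sub-directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    os.mkdir(os.path.join(tmp, "sub"))
    empty_folder_sketch(tmp)
    remaining = os.listdir(tmp)
    still_there = os.path.isdir(tmp)
```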

text_preparation.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/newspaper/issue.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/issue.schema.json”.

Returns:

Newspaper issue schema based on canonical format.

Return type:

pjs.util.Namespace

text_preparation.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/newspaper/page.schema.json') → Namespace

Generate a list of Python classes starting from a JSON schema.

Parameters:

schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/page.schema.json”.

Returns:

Newspaper page schema based on canonical format.

Return type:

pjs.util.Namespace

text_preparation.utils.get_reading_order(items: list[dict[str, Any]]) → dict[str, int]

Generate a reading order for items based on their id and the pages they span.

This reading order can be used to display the content items properly in a table of contents without skipping from page to page.

Parameters:

items (list[dict[str, Any]]) – List of items to reorder for the ToC.

Returns:

A dictionary mapping item IDs to their reading order.

Return type:

dict[str, int]
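The ordering idea can be sketched as follows. This is not the packaged implementation: the item layout (an "m" metadata dict holding "id" and a page list "pp"), the sort key (first page, then ID), and the 1-based ranks are all assumptions chosen for illustration.

```python
from typing import Any


def reading_order_sketch(items: list[dict[str, Any]]) -> dict[str, int]:
    """Hypothetical stand-in: rank items by first page, then by ID."""
    ranked = sorted(
        items, key=lambda item: (min(item["m"]["pp"]), item["m"]["id"])
    )
    # Assumption: ranks start at 1.
    return {item["m"]["id"]: rank for rank, item in enumerate(ranked, start=1)}


# Made-up content items, deliberately out of order.
items = [
    {"m": {"id": "i0003", "pp": [2]}},
    {"m": {"id": "i0001", "pp": [1, 2]}},
    {"m": {"id": "i0002", "pp": [1]}},
]
order = reading_order_sketch(items)
print(order)
```

Sorting on the first page before the ID keeps all items that start on the same page together, which is what prevents the table of contents from moving back and forth between pages.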

text_preparation.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) → None

Verify that the imported issues fit expectations.

Two verifications are done: the number of content items, and their IDs.

Parameters:
  • actual_issue_json (dict[str, Any]) – Created issue json.

  • expected_issue_json (dict[str, Any]) – Expected issue json.
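The two checks could look roughly like the sketch below. The issue layout (content items under "i", each with an ID at ["m"]["id"]) and the use of plain assert statements are assumptions for this example, as is the name verify_issue_sketch.

```python
from typing import Any


def verify_issue_sketch(
    actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]
) -> None:
    """Hypothetical stand-in: compare item counts, then item IDs."""
    actual_items = actual_issue_json["i"]
    expected_items = expected_issue_json["i"]
    # Check 1: same number of content items.
    assert len(actual_items) == len(expected_items), "item count differs"
    # Check 2: same set of content item IDs.
    actual_ids = {item["m"]["id"] for item in actual_items}
    expected_ids = {item["m"]["id"] for item in expected_items}
    assert actual_ids == expected_ids, "item IDs differ"


# Made-up expected issue with two content items.
expected = {"i": [{"m": {"id": "i0001"}}, {"m": {"id": "i0002"}}]}
same_items_reordered = {"i": [{"m": {"id": "i0002"}}, {"m": {"id": "i0001"}}]}
verify_issue_sketch(same_items_reordered, expected)  # passes silently
```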

text_preparation.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) → None

Write the given error of a failed import to the failed_log file.

Adapted from impresso-text-acquisition/text_preparation/importers/core.py to allow using an issue or page ID, and to provide the function in which the error took place.

Parameters:
  • thing_id (str) – Canonical ID of the object/file for which the error occurred.

  • origin_function (str) – Function in which the exception occurred.

  • error (Exception) – Error that occurred and should be logged.

  • failed_log (str) – Path to log file for failed imports.
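A minimal sketch of this kind of error logging is shown below. The log line format, the append mode, and the name write_error_sketch are assumptions; the actual function may format and store the entry differently.

```python
import os
import tempfile


def write_error_sketch(
    thing_id: str, origin_function: str, error: Exception, failed_log: str
) -> None:
    """Hypothetical stand-in: append one line per failed import to a log file."""
    # Assumed line format, not taken from the package.
    note = f"Error in {origin_function} for {thing_id}: {error}"
    with open(failed_log, "a", encoding="utf-8") as log_file:
        log_file.write(note + "\n")


# Log a made-up failure into a throwaway file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "failed.log")
    try:
        raise ValueError("missing OCR file")
    except ValueError as exc:
        write_error_sketch("GDL-1900-01-02-a", "import_issues", exc, log_path)
    with open(log_path, encoding="utf-8") as log_file:
        logged = log_file.read()
```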

text_preparation.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) → None

Write the given contents to a JSONL file given its path.

File locks are used here to prevent concurrent writes to the same file.

Parameters:
  • filepath (str) – Path to the JSONL file to write to.

  • contents (str | list[str]) – Dump contents to write to the file.

  • content_type (str) – Type of content that is being written to the file.

  • failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
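The sketch below shows the general shape of writing pre-serialized JSON strings to a JSONL file. It deliberately leaves out the file locking mentioned above so that it stays within the standard library. Appending rather than overwriting, the handling of failed_log, and the name write_jsonlines_sketch are all assumptions.

```python
import json
import os
import tempfile
from typing import Optional, Union


def write_jsonlines_sketch(
    filepath: str,
    contents: Union[str, list[str]],
    content_type: str,
    failed_log: Optional[str] = None,
) -> None:
    """Hypothetical stand-in: append JSON strings to a file, one per line."""
    lines = [contents] if isinstance(contents, str) else list(contents)
    try:
        # Assumption: records are appended. No lock is taken in this sketch.
        with open(filepath, "a", encoding="utf-8") as jsonl_file:
            for line in lines:
                jsonl_file.write(line.rstrip("\n") + "\n")
    except OSError as error:
        if failed_log is None:
            raise
        # Assumed behaviour: record the failure instead of raising.
        with open(failed_log, "a", encoding="utf-8") as log_file:
            log_file.write(f"Failed to write {content_type} to {filepath}: {error}\n")


# Write one record, then two more, and parse the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pages.jsonl")
    write_jsonlines_sketch(path, json.dumps({"id": "p1"}), "pages")
    write_jsonlines_sketch(
        path, [json.dumps({"id": "p2"}), json.dumps({"id": "p3"})], "pages"
    )
    with open(path, encoding="utf-8") as jsonl_file:
        records = [json.loads(line) for line in jsonl_file]
```

In a setting with several writers, the open-and-write step would need to be wrapped in a lock on the target file, as the description above indicates.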

Tokenization rules for various languages.

text_preparation.tokenization.insert_whitespace(token: str, following_token: str, previous_token: str, language: str) → bool

Determine whether a whitespace should be inserted after a token.

Parameters:
  • token (str) – Current token.

  • following_token (str) – Following token.

  • previous_token (str) – Previous token.

  • language (str) – Language of text.

Returns:

Whether a whitespace should be inserted after the token.

Return type:

bool
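To make the idea concrete, here is a toy sketch of a per-language whitespace decision and how it could drive detokenization. The rule tables, the language codes, and the names insert_whitespace_sketch and detokenize are invented for this example and are not the rules shipped in text_preparation.tokenization; previous_token is accepted but unused here.

```python
# Illustrative rule tables only; the real per-language rules may differ.
NO_SPACE_BEFORE = {
    "en": {".", ",", ";", ":", "!", "?", ")", "]"},
    "fr": {".", ",", ")", "]"},  # French typography keeps a space before ; : ! ?
}
NO_SPACE_AFTER = {"en": {"(", "["}, "fr": {"(", "["}}


def insert_whitespace_sketch(
    token: str, following_token: str, previous_token: str, language: str
) -> bool:
    """Hypothetical stand-in: decide whether a space follows token."""
    if not following_token:
        return False  # last token of the text
    if following_token in NO_SPACE_BEFORE.get(language, NO_SPACE_BEFORE["en"]):
        return False
    if token in NO_SPACE_AFTER.get(language, NO_SPACE_AFTER["en"]):
        return False
    return True


def detokenize(tokens: list[str], language: str) -> str:
    """Join tokens, asking the rule function after each one."""
    parts = []
    for index, token in enumerate(tokens):
        previous_token = tokens[index - 1] if index > 0 else ""
        following_token = tokens[index + 1] if index + 1 < len(tokens) else ""
        parts.append(token)
        if insert_whitespace_sketch(token, following_token, previous_token, language):
            parts.append(" ")
    return "".join(parts)


english = detokenize(["Hello", ",", "world", "(", "again", ")", "."], "en")
french = detokenize(["Bonjour", "!"], "fr")
```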