Utilities
Some utilities are shared between both modules; this is also where more general utility functions should go.
This module contains generic helper functions for the text-importer module.
- text_preparation.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) dict[str, Any]
Add a property and value to a given object dict computed with a given function.
- Parameters:
object_dict (dict[str, Any]) – Object to which the property is added.
prop_name (str) – Name of the property to add.
prop_function (Callable[[str], str]) – Function computing the property value.
function_input (str) – Input to prop_function for this object.
- Returns:
Updated object.
- Return type:
dict[str, Any]
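The behavior described above can be sketched as follows. This is an illustrative stand-in, not the library's code: the name `add_property_sketch`, the example ID, and the choice to update the dict in place are all assumptions.

```python
from typing import Any, Callable


def add_property_sketch(
    object_dict: dict[str, Any],
    prop_name: str,
    prop_function: Callable[[str], str],
    function_input: str,
) -> dict[str, Any]:
    """Hypothetical stand-in mirroring the documented signature."""
    # Compute the value from the input and store it under the given key.
    # Assumption: the dict is modified in place and also returned.
    object_dict[prop_name] = prop_function(function_input)
    return object_dict


# Hypothetical example: derive a "year" property from an ID-like string.
issue = {"id": "GDL-1900-01-02-a"}
updated = add_property_sketch(
    issue, "year", lambda issue_id: issue_id.split("-")[1], issue["id"]
)
```

Here `updated["year"]` holds the second dash-separated field of the ID.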
- text_preparation.utils.empty_folder(dir_path: str) None
Empty a directory given its path if it exists.
- Parameters:
dir_path (str) – Path to the directory to empty.
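A minimal sketch of this kind of helper, using only the standard library. It assumes the directory itself is kept and only its contents are removed; the actual function may behave differently, and `empty_folder_sketch` is a hypothetical name.

```python
import os
import shutil
import tempfile


def empty_folder_sketch(dir_path: str) -> None:
    """Hypothetical stand-in: remove everything inside dir_path."""
    # Nothing to do when the path does not exist or is not a directory.
    if not os.path.isdir(dir_path):
        return
    for entry in os.listdir(dir_path):
        entry_path = os.path.join(dir_path, entry)
        if os.path.isdir(entry_path) and not os.path.islink(entry_path):
            # Real sub-directories are removed recursively.
            shutil.rmtree(entry_path)
        else:
            # Files and symlinks are unlinked directly.
            os.remove(entry_path)


# Usage on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "a.txt"), "w").close()
    os.makedirs(os.path.join(tmp_dir, "sub", "nested"))
    empty_folder_sketch(tmp_dir)
    remaining = os.listdir(tmp_dir)
```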
- text_preparation.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/newspaper/issue.schema.json') Namespace
Generate a list of Python classes from a JSON schema.
- Parameters:
schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/issue.schema.json”.
- Returns:
Newspaper issue schema based on canonical format.
- Return type:
pjs.util.Namespace
- text_preparation.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/newspaper/page.schema.json') Namespace
Generate a list of Python classes from a JSON schema.
- Parameters:
schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/page.schema.json”.
- Returns:
Newspaper page schema based on canonical format.
- Return type:
pjs.util.Namespace
- text_preparation.utils.get_reading_order(items: list[dict[str, Any]]) dict[str, int]
Generate a reading order for items based on their id and the pages they span.
This reading order can be used to display the content items properly in a table of contents without skipping from page to page.
- Parameters:
items (list[dict[str, Any]]) – List of items to reorder for the ToC.
- Returns:
A dictionary mapping item IDs to their reading order.
- Return type:
dict[str, int]
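One way such an ordering could be computed is sketched below. The item layout (`m.id` and `m.pp`), the `-i` item-number suffix, the sort key, and the 1-based ranks are assumptions made for illustration, not details confirmed by this reference.

```python
from typing import Any


def reading_order_sketch(items: list[dict[str, Any]]) -> dict[str, int]:
    """Hypothetical stand-in: rank items by page span, then by item number."""

    def sort_key(item: dict[str, Any]) -> tuple[list[int], int]:
        meta = item["m"]
        # Assumption: IDs end in "-i<number>", e.g. "...-i0003".
        item_number = int(meta["id"].split("-i")[-1])
        # Page lists compare element by element, so [1] < [1, 2] < [2].
        return (meta["pp"], item_number)

    ordered = sorted(items, key=sort_key)
    return {item["m"]["id"]: rank for rank, item in enumerate(ordered, start=1)}


# Hypothetical items spanning two pages.
items = [
    {"m": {"id": "GDL-1900-01-02-a-i0003", "pp": [2]}},
    {"m": {"id": "GDL-1900-01-02-a-i0001", "pp": [1, 2]}},
    {"m": {"id": "GDL-1900-01-02-a-i0002", "pp": [1]}},
]
order = reading_order_sketch(items)
```

With this key, items confined to page 1 come first, then items that start on page 1 and continue, then items on page 2.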
- text_preparation.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) None
Verify that the imported issues fit expectations.
Two verifications are done: the number of content items, and their IDs.
- Parameters:
actual_issue_json (dict[str, Any]) – Created issue json.
expected_issue_json (dict[str, Any]) – Expected issue json.
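The two checks can be illustrated with a small sketch. It assumes content items live under an `i` key with their ID at `m.id`, and that a failed check raises `AssertionError`; both are assumptions, and `verify_issue_sketch` is a hypothetical name.

```python
from typing import Any


def verify_issue_sketch(actual: dict[str, Any], expected: dict[str, Any]) -> None:
    """Hypothetical stand-in for the two documented verifications."""
    actual_items = actual["i"]
    expected_items = expected["i"]
    # Check 1: same number of content items.
    assert len(actual_items) == len(expected_items), "content item counts differ"
    # Check 2: same set of content item IDs.
    actual_ids = {item["m"]["id"] for item in actual_items}
    expected_ids = {item["m"]["id"] for item in expected_items}
    assert actual_ids == expected_ids, "content item IDs differ"


# Hypothetical issue payloads.
expected_issue = {"i": [{"m": {"id": "X-i0001"}}, {"m": {"id": "X-i0002"}}]}
matching_issue = {"i": [{"m": {"id": "X-i0002"}}, {"m": {"id": "X-i0001"}}]}
wrong_ids_issue = {"i": [{"m": {"id": "X-i0001"}}, {"m": {"id": "X-i0009"}}]}

verify_issue_sketch(matching_issue, expected_issue)  # passes silently
```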
- text_preparation.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) None
Write the given error of a failed import to the failed_log file.
Adapted from impresso-text-acquisition/text_preparation/importers/core.py to allow using an issue or page ID, and to record the function in which the error took place.
- Parameters:
thing_id (str) – Canonical ID of the object/file for which the error occurred.
origin_function (str) – Function in which the exception occurred.
error (Exception) – Error that occurred and should be logged.
failed_log (str) – Path to log file for failed imports.
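A rough sketch of such a logging helper follows. The tab-separated line format and append mode are assumptions, since this reference does not specify the log layout; `write_error_sketch` is a hypothetical name.

```python
import os
import tempfile


def write_error_sketch(
    thing_id: str, origin_function: str, error: Exception, failed_log: str
) -> None:
    """Hypothetical stand-in: append one line per failed import."""
    # Assumed layout: ID, originating function, and the error's repr.
    note = f"{thing_id}\t{origin_function}\t{error!r}"
    with open(failed_log, "a", encoding="utf-8") as log_file:
        log_file.write(note + "\n")


# Usage with a temporary log file and a hypothetical function name.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "failed.log")
    write_error_sketch(
        "GDL-1900-01-02-a", "import_issue", ValueError("bad page"), log_path
    )
    with open(log_path, encoding="utf-8") as log_file:
        logged = log_file.read()
```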
- text_preparation.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) None
Write the given contents to a JSONL file given its path.
Filelocks are used here to prevent concurrent writing to the files.
- Parameters:
filepath (str) – Path to the JSONL file to write to.
contents (str | list[str]) – Dump contents to write to the file.
content_type (str) – Type of content that is being written to the file.
failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
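The locking idea can be sketched with the standard library alone. Note the substitution: this sketch uses `fcntl.flock` (Unix only) rather than whatever file-lock mechanism the package actually uses, it assumes append mode, and it leaves out `content_type` and `failed_log`. `write_jsonlines_sketch` is a hypothetical name.

```python
import fcntl
import json
import os
import tempfile
from typing import Union


def write_jsonlines_sketch(filepath: str, contents: Union[str, list[str]]) -> None:
    """Hypothetical stand-in: append JSON lines under an exclusive lock."""
    lines = [contents] if isinstance(contents, str) else contents
    with open(filepath, "a", encoding="utf-8") as jsonl_file:
        # Block until this process holds an exclusive lock on the file.
        fcntl.flock(jsonl_file, fcntl.LOCK_EX)
        try:
            for line in lines:
                # One serialized JSON document per line.
                jsonl_file.write(line.rstrip("\n") + "\n")
            # Flush while the lock is still held.
            jsonl_file.flush()
        finally:
            fcntl.flock(jsonl_file, fcntl.LOCK_UN)


# Usage: a single string, then a list of strings.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pages.jsonl")
    write_jsonlines_sketch(path, json.dumps({"id": "p1"}))
    write_jsonlines_sketch(path, [json.dumps({"id": "p2"}), json.dumps({"id": "p3"})])
    with open(path, encoding="utf-8") as jsonl_file:
        records = [json.loads(line) for line in jsonl_file]
```

Flushing before releasing the lock matters: otherwise buffered data could reach the file after another writer has acquired the lock.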
Tokenization rules for various languages.
- text_preparation.tokenization.insert_whitespace(token: str, following_token: str, previous_token: str, language: str) bool
Determine whether a whitespace should be inserted after a token.
- Parameters:
token (str) – Current token.
following_token (str) – Following token.
previous_token (str) – Previous token.
language (str) – Language of text.
- Returns:
Whether a whitespace should be inserted after the token.
- Return type:
bool
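To illustrate how a rule-based decision like this might look, here is a small sketch. The rule tables are invented for the example and do not reproduce the module's actual rules; `insert_whitespace_sketch` and `join_tokens` are hypothetical names.

```python
# Hypothetical rule tables: tokens that take no space before them, per language.
NO_SPACE_BEFORE = {
    "fr": {".", ",", ")", "]"},
    "de": {".", ",", ")", "]", ":", ";", "!", "?"},
}
# Hypothetical: opening brackets are never followed by a space.
NO_SPACE_AFTER = {"(", "["}


def insert_whitespace_sketch(
    token: str, following_token: str, previous_token: str, language: str
) -> bool:
    """Hypothetical stand-in mirroring the documented signature."""
    # previous_token is accepted for signature parity; this sketch ignores it.
    if token in NO_SPACE_AFTER:
        return False
    if following_token in NO_SPACE_BEFORE.get(language, set()):
        return False
    return True


def join_tokens(tokens: list[str], language: str) -> str:
    """Rebuild a text string from tokens using the sketch above."""
    parts = []
    for index, token in enumerate(tokens):
        following = tokens[index + 1] if index + 1 < len(tokens) else ""
        previous = tokens[index - 1] if index > 0 else ""
        parts.append(token)
        if insert_whitespace_sketch(token, following, previous, language):
            parts.append(" ")
    return "".join(parts).rstrip()


tokens = ["Voir", "(", "page", "2", ")", "?"]
french_text = join_tokens(tokens, "fr")
german_text = join_tokens(tokens, "de")
```

Under these invented rules the French variant keeps a space before the question mark, while the German variant does not.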