TextImporter
Available importers
The Impresso TextImporter already supports a number of formats (and flavours of standard formats), while a few others are currently being developed.
The following importer CLI scripts are already available:
text_importer.scripts.oliveimporter
: importer for the Olive XML format, used by RERO to encode and deliver the majority of its newspaper data.
text_importer.scripts.reroimporter
: importer for the Mets/ALTO flavor used by RERO to encode and deliver part of its data.
text_importer.scripts.luximporter
: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de Luxembourg (BNL) to encode and deliver its newspaper data.
text_importer.scripts.bnfimporter
: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data.
text_importer.scripts.bnfen_importer
: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data for the Europeana collection.
text_importer.scripts.bcul_importer
: importer for the ABBYY format used by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL) to encode and deliver the newspaper data available on the Scriptorium interface.
text_importer.scripts.swaimporter
: importer for the ALTO flavor used by the Basel University Library.
text_importer.scripts.blimporter
: importer for the Mets/ALTO flavor used by the British Library (BL) to encode and deliver its newspaper data.
text_importer.scripts.tetml
: generic importer for the TETML format, produced by PDFlib TET.
text_importer.scripts.fedgaz
: importer for the TETML format with a separate metadata file and a heuristic article segmentation, used to parse the Federal Gazette.
For further details on any of these implementations, please refer to their respective documentation.
Command-line interface
Note
All importers share the same command-line interface; only a few options are import-specific (see documentation below).
Functions and CLI script to convert any OCR data into Impresso’s format.
- Usage:
<importer-name>importer.py --input-dir=<id> (--clear | --incremental) [--output-dir=<od> --image-dirs=<imd> --temp-dir=<td> --chunk-size=<cs> --s3-bucket=<b> --config-file=<cf> --log-file=<f> --verbose --scheduler=<sch> --access-rights=<ar> --git-repo=<gr> --num-workers=<nw>]
<importer-name>importer.py --version
- Options:
- --input-dir=<id>
Base directory containing one sub-directory for each journal
- --image-dirs=<imd>
Directory containing (canonical) images and their metadata (use , to separate multiple dirs)
- --output-dir=<od>
Base directory where to write the output files
- --temp-dir=<td>
Temporary directory to extract .zip archives
- --config-file=<cf>
Configuration file for selective import
- --s3-bucket=<b>
If provided, writes output to an S3 drive, in the specified bucket
- --scheduler=<sch>
Tell dask to use an existing scheduler (otherwise it’ll create one)
- --log-file=<f>
Log file; when missing, the log is printed to stdout
- --access-rights=<ar>
Access right file if relevant (only for olive and rero importers)
- --chunk-size=<cs>
Chunk size in years used to group issues when importing
- --git-repo=<gr>
Local path to the “impresso-text-acquisition” git directory (the path should include the directory itself).
- --num-workers=<nw>
Number of workers to use for local dask cluster
- --verbose
Verbose log messages (good for debugging)
- --clear
Removes the output folder (if already existing)
- --incremental
Skips issues already present in output directory
- --version
Prints version and exits.
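For instance, importing Olive data with the options above could look as follows; the directories, bucket name and configuration file used here are purely illustrative:

oliveimporter.py --input-dir=/data/original/RERO \
    --output-dir=/data/canonical \
    --image-dirs=/data/images \
    --temp-dir=/tmp/olive-import \
    --access-rights=/data/access_rights.json \
    --config-file=import_GDL.json \
    --s3-bucket=canonical-data \
    --log-file=olive-import.log \
    --num-workers=8 \
    --clear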
Configuration file
The selection of the actual newspaper data to be imported can be controlled by
means of a configuration file (JSON format). The path to this file is passed via the --config-file=
CLI parameter.
This JSON file contains three properties:
newspapers
: a dictionary containing the newspaper IDs to be imported (e.g. GDL);
exclude_newspapers
: a list of the newspaper IDs to be excluded;
year_only
: a boolean flag indicating whether date ranges are expressed by using years or more granular dates (in the format YYYY/MM/DD).
Note
When ingesting large amounts of data, these configuration files can help you organise your data imports into batches or homogeneous collections.
Here is a simple configuration file:
{
"newspapers": {
"GDL": []
},
"exclude_newspapers": [],
"year_only": false
}
This is what a more complex configuration file looks like (only the contents of GDL for the period 1950-1960 are processed):
{
"newspapers": {
"GDL": "1950/01/01-1960/12/31"
},
"exclude_newspapers": [],
"year_only": false
}
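If year_only is set to true, date ranges are expected to use years only. Assuming such ranges are written as YYYY-YYYY (an assumption made for illustration, not a normative format), the same selection might be written as:

{
    "newspapers": {
        "GDL": "1950-1960"
    },
    "exclude_newspapers": [],
    "year_only": true
}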
Utilities
This module contains generic helper functions for the text-importer module.
- text_importer.utils.add_property(object_dict: dict[str, Any], prop_name: str, prop_function: Callable[[str], str], function_input: str) → dict[str, Any]
Add a property and value to a given object dict computed with a given function.
- Parameters:
object_dict (dict[str, Any]) – Object to which the property is added.
prop_name (str) – Name of the property to add.
prop_function (Callable[[str], str]) – Function computing the property value.
function_input (str) – Input to prop_function for this object.
- Returns:
Updated object.
- Return type:
dict[str, Any]
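As an illustration, the sketch below adds a title property to a content-item dictionary; the helper function, the property name and the item ID are made up for the example:

from text_importer.utils import add_property

def normalize_title(raw: str) -> str:
    # hypothetical property function: collapse extra whitespace
    return " ".join(raw.split())

item = {"id": "GDL-1950-01-02-a-i0001"}  # illustrative content-item dict
item = add_property(item, "title", normalize_title, "  Le  titre  ")
# item now also contains {"title": "Le titre"}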
- text_importer.utils.empty_folder(dir_path: str) → None
Empty a directory given its path if it exists.
- Parameters:
dir_path (str) – Path to the directory to empty.
- text_importer.utils.get_access_right(journal: str, _date: date, access_rights: dict[str, dict[str, str]]) → str
Fetch the access rights for a specific journal and publication date.
- Parameters:
journal (str) – Journal name.
_date (date) – Publication date of the journal.
access_rights (dict[str, dict[str, str]]) – Access rights for various journals.
- Returns:
Access rights for specific journal issue.
- Return type:
str
- text_importer.utils.get_issue_schema(schema_folder: str = 'impresso-schemas/json/newspaper/issue.schema.json') → Namespace
Generate a list of Python classes starting from a JSON schema.
- Parameters:
schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/issue.schema.json”.
- Returns:
Newspaper issue schema based on canonical format.
- Return type:
pjs.util.Namespace
- text_importer.utils.get_page_schema(schema_folder: str = 'impresso-schemas/json/newspaper/page.schema.json') → Namespace
Generate a list of Python classes starting from a JSON schema.
- Parameters:
schema_folder (str, optional) – Path to the schema folder. Defaults to “impresso-schemas/json/newspaper/page.schema.json”.
- Returns:
Newspaper page schema based on canonical format.
- Return type:
pjs.util.Namespace
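As a minimal sketch, both schema namespaces can be loaded once (for instance at module level), assuming the impresso-schemas files are available at the default paths shown above:

from text_importer.utils import get_issue_schema, get_page_schema

# load the canonical issue and page schemas from their default locations
IssueSchema = get_issue_schema()
PageSchema = get_page_schema()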
- text_importer.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'text_importer') → PosixPath
Return the resource at path in package, using a context manager.
Note
The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.
- Parameters:
file_manager (contextlib.ExitStack) – Context manager.
path (str) – Path to the desired resource in given package.
package (str, optional) – Package name. Defaults to “text_importer”.
- Returns:
Path to desired managed resource.
- Return type:
pathlib.PosixPath
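The note above translates into the following usage pattern; the resource path is illustrative:

from contextlib import ExitStack
from text_importer.utils import get_pkg_resource

file_manager = ExitStack()
# resolve a resource bundled with the text_importer package
schema_path = get_pkg_resource(file_manager, "impresso-schemas/json/newspaper/issue.schema.json")
print(schema_path)
# release the managed resource once it is no longer needed
file_manager.close()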
- text_importer.utils.get_reading_order(items: list[dict[str, Any]]) → dict[str, int]
Generate a reading order for items based on their id and the pages they span.
This reading order can be used to display the content items properly in a table of contents without skipping from page to page.
- Parameters:
items (list[dict[str, Any]]) – List of items to reorder for the ToC.
- Returns:
A dictionary mapping item IDs to their reading order.
- Return type:
dict[str, int]
- text_importer.utils.init_logger(_logger: RootLogger, log_level: int, log_file: str) → RootLogger
Initialise the logger.
- Parameters:
_logger (logging.RootLogger) – Logger instance to initialise.
log_level (int) – Desired logging level (e.g. logging.INFO).
log_file (str) – Path to destination file for logging output. If no output file is provided (log_file is None), logs will be written to standard output.
- Returns:
The initialised logger object.
- Return type:
logging.RootLogger
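For example, a minimal sketch setting up the root logger to write INFO-level messages to standard output (log_file set to None):

import logging
from text_importer.utils import init_logger

logger = init_logger(logging.getLogger(), logging.INFO, None)
logger.info("Importer initialised.")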
- text_importer.utils.verify_imported_issues(actual_issue_json: dict[str, Any], expected_issue_json: dict[str, Any]) → None
Verify that the imported issues fit expectations.
Two verifications are done: the number of content items, and their IDs.
- Parameters:
actual_issue_json (dict[str, Any]) – Created issue JSON.
expected_issue_json (dict[str, Any]) – Expected issue JSON.
- text_importer.utils.write_error(thing_id: str, origin_function: str, error: Exception, failed_log: str) → None
Write the given error of a failed import to the failed_log file.
Adapted from impresso-text-acquisition/text_importer/importers/core.py to allow using an issue or page ID, and to provide the function in which the error took place.
- Parameters:
thing_id (str) – Canonical ID of the object/file for which the error occurred.
origin_function (str) – Function in which the exception occurred.
error (Exception) – Error that occurred and should be logged.
failed_log (str) – Path to log file for failed imports.
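A sketch of how this helper can be wrapped around a failing import step; the page ID and the parsing function are hypothetical:

from text_importer.utils import write_error

def parse_page(page_id: str) -> None:
    # stand-in for an actual import step that may fail
    raise ValueError(f"could not parse {page_id}")

page_id = "GDL-1950-01-02-a-p0001"  # illustrative canonical page ID
try:
    parse_page(page_id)
except Exception as e:
    # record the failure together with the function where it occurred
    write_error(page_id, "parse_page", e, "failed_imports.log")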
- text_importer.utils.write_jsonlines_file(filepath: str, contents: str | list[str], content_type: str, failed_log: str | None = None) → None
Write the given contents to a JSONL file given its path.
File locks are used to prevent concurrent writing to the files.
- Parameters:
filepath (str) – Path to the JSONL file to write to.
contents (str | list[str]) – Dump contents to write to the file.
content_type (str) – Type of content that is being written to the file.
failed_log (str | None, optional) – Path to a log to keep track of failed operations. Defaults to None.
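For instance, a list of serialised issue dictionaries can be written to a JSONL file as follows; the file name, the issue dicts and the content_type value are illustrative:

import json
from text_importer.utils import write_jsonlines_file

issues = [{"id": "GDL-1950-01-02-a"}, {"id": "GDL-1950-01-03-a"}]  # illustrative issue dicts
contents = [json.dumps(issue) for issue in issues]
write_jsonlines_file("gdl-1950-issues.jsonl", contents, "issues", failed_log="failed_imports.log")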