Importers
Available importers
The Impresso Importers already support a number of formats (and flavors of standard formats), while a few others are currently being developed.
The following importer CLI scripts are already available:
text_preparation.scripts.oliveimporter
: importer for the Olive XML format, used by RERO to encode and deliver the majority of its newspaper data.
text_preparation.scripts.reroimporter
: importer for the Mets/ALTO flavor used by RERO to encode and deliver part of its data.
text_preparation.scripts.luximporter
: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de Luxembourg (BNL) to encode and deliver its newspaper data.
text_preparation.scripts.bnfimporter
: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data.
text_preparation.scripts.bnfen_importer
: importer for the Mets/ALTO flavor used by the Bibliothèque nationale de France (BNF) to encode and deliver its newspaper data for the Europeana collection.
text_preparation.scripts.bcul_importer
: importer for the ABBYY format used by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL) to encode and deliver the newspaper data available on the Scriptorium interface.
text_preparation.scripts.swaimporter
: importer for the ALTO flavor of the Basel University Library.
text_preparation.scripts.blimporter
: importer for the Mets/ALTO flavor used by the British Library (BL) to encode and deliver its newspaper data.
text_preparation.scripts.tetml
: generic importer for the TETML format, produced by PDFlib TET.
text_preparation.scripts.fedgaz
: importer for the TETML format with a separate metadata file and heuristic article segmentation, used to parse the Federal Gazette.
For further details on any of these implementations, please refer to their documentation:
Command-line interface
Note
All importers share the same command-line interface; only a few options are import-specific (see documentation below).
Functions and CLI script to convert any OCR data into Impresso’s format.
- Usage:
<importer-name>importer.py --input-dir=<id> (--clear | --incremental) [--output-dir=<od> --image-dirs=<imd> --temp-dir=<td> --chunk-size=<cs> --s3-bucket=<b> --config-file=<cf> --log-file=<f> --verbose --scheduler=<sch> --access-rights=<ar> --git-repo=<gr> --num-workers=<nw>]
<importer-name>importer.py --version
- Options:
- --input-dir=<id>
Base directory containing one sub-directory for each journal
- --image-dirs=<imd>
Directory containing (canonical) images and their metadata (use commas to separate multiple directories)
- --output-dir=<od>
Base directory where to write the output files
- --temp-dir=<td>
Temporary directory to extract .zip archives
- --config-file=<cf>
Configuration file for selective import
- --s3-bucket=<b>
If provided, writes output to an S3 drive, in the specified bucket
- --scheduler=<sch>
Tell dask to use an existing scheduler (otherwise it’ll create one)
- --log-file=<f>
Log file; when missing, the log is printed to stdout
- --access-rights=<ar>
Access right file if relevant (only for olive and rero importers)
- --chunk-size=<cs>
Chunk size in years used to group issues when importing
- --git-repo=<gr>
Local path to the “impresso-text-acquisition” git repository (including the repository directory itself).
- --num-workers=<nw>
Number of workers to use for local dask cluster
- --verbose
Verbose log messages (good for debugging)
- --clear
Removes the output folder (if already existing)
- --incremental
Skips issues already present in output directory
- --version
Prints version and exits.
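For instance, a local run of the Olive importer could look like the following (all paths, the configuration file and the worker count are purely illustrative):
python text_preparation/scripts/oliveimporter.py \
    --input-dir=/data/original/olive \
    --output-dir=/data/canonical \
    --temp-dir=/tmp/olive-import \
    --config-file=config/import_gdl.json \
    --log-file=logs/olive-import.log \
    --num-workers=8 \
    --clear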
Configuration file
The selection of the actual newspaper data to be imported can be controlled by
means of a configuration file (JSON format). The path to this file is passed via the --config-file=<cf>
CLI parameter.
This JSON file contains three properties:
newspapers
: a dictionary containing the newspaper IDs to be imported (e.g. GDL);
exclude_newspapers
: a list of the newspaper IDs to be excluded;
year_only
: a boolean flag indicating whether date ranges are expressed using years or more granular dates (in the format YYYY/MM/DD).
Note
When ingesting large amounts of data, these configuration files can help you organise your data imports into batches or homogeneous collections.
Here is a simple configuration file:
{
    "newspapers": {
        "GDL": []
    },
    "exclude_newspapers": [],
    "year_only": false
}
This is what a more complex configuration file looks like (only the contents of GDL for the decade 1950-1960 are processed):
{
    "newspapers": {
        "GDL": "1950/01/01-1960/12/31"
    },
    "exclude_newspapers": [],
    "year_only": false
}
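When year_only is set to true, date ranges are presumably expressed with years instead of full dates; the exact range syntax below (YYYY-YYYY) is an assumption based on the property's description:
{
    "newspapers": {
        "GDL": "1950-1960"
    },
    "exclude_newspapers": [],
    "year_only": true
}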
Writing a new importer
Writing a new importer is easy and entails implementing two pieces of code:
functions to detect the data to import;
classes that handle the conversion of your OCR format into JSON, written from scratch or adapted from one of the existing importers.
Once these two pieces of code are in place, they can be plugged into the functions defined in text_preparation.importers.generic_importer
so as to create a dedicated CLI script for your specific format.
For example, this is the content of oliveimporter.py:
from text_preparation.importers import generic_importer
from text_preparation.importers.olive.classes import OliveNewspaperIssue
from text_preparation.importers.olive.detect import (olive_detect_issues,
                                                     olive_select_issues)

if __name__ == '__main__':
    generic_importer.main(
        OliveNewspaperIssue,
        olive_detect_issues,
        olive_select_issues
    )
How should the code of a new text importer be structured? We recommend complying with the following structure:
text_preparation.importers.<new_importer>.detect
: will contain functions to find the data to be imported;
text_preparation.importers.<new_importer>.helpers
: (optional) will contain ancillary functions;
text_preparation.importers.<new_importer>.parsers
: (optional) will contain functions/classes to parse the data;
text_preparation/scripts/<new_importer>.py
: will contain a CLI script to run the importer.
Detect data to import
The importer needs to know which data should be imported.
Information about the newspaper contents is often encoded in folder and file names, so it needs to be extracted and made explicit by means of Canonical identifiers.
Add some sample data to text_preparation/data/sample/<new_format>.
For an example, see olive_detect_issues(); a sketch of such a function is given below.
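Here is what such a detect function might look like for a hypothetical "newformat" collection. The function name, the assumed folder naming scheme (one sub-directory per journal, then one per issue named like 1900-01-02-a), and the import path and fields of IssueDir are all assumptions; check the existing importers for the real pattern:
import os
from datetime import date

from impresso_commons.path.path_fs import IssueDir  # assumed location of IssueDir


def newformat_detect_issues(base_dir: str) -> list[IssueDir]:
    """Detect all issues of the hypothetical 'newformat' under base_dir."""
    issues = []
    for journal in os.listdir(base_dir):
        journal_path = os.path.join(base_dir, journal)
        if not os.path.isdir(journal_path):
            continue
        for issue_name in os.listdir(journal_path):
            # assumed folder naming scheme: "<yyyy>-<mm>-<dd>-<edition>"
            year, month, day, edition = issue_name.split("-")
            issues.append(
                IssueDir(  # field names are assumptions; check the actual definition
                    journal=journal,
                    date=date(int(year), int(month), int(day)),
                    edition=edition,
                    path=os.path.join(journal_path, issue_name),
                )
            )
    return issues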
Implement abstract classes
The two classes described below are passed to the importer's generic command-line interface;
see text_preparation.importers.generic_importer.main()
- class text_preparation.importers.classes.NewspaperIssue(issue_dir: IssueDir)
Abstract class representing a newspaper issue.
Each text importer needs to define a subclass of NewspaperIssue which specifies the logic to handle OCR data in a given format (e.g. Olive).
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lowercase letter ordering the issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to the directory containing the issue's OCR data.
- Type:
str
- date
Publication date of the issue.
- Type:
datetime.date
- issue_data
Issue data according to the canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- property issuedir: IssueDir
IssueDirectory corresponding to this issue.
- Type:
IssueDir
- to_json() → str
Validate self.issue_data and serialize it to a string.
Note
Validation adds substantial overhead to computing time. When serializing large amounts of issues, it is recommended to bypass schema validation.
- class text_preparation.importers.classes.NewspaperPage(_id: str, number: int)
Abstract class representing a newspaper page.
Each text importer needs to define a subclass of NewspaperPage which specifies the logic to handle OCR data in a given format (e.g. Alto).
- Parameters:
_id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
number (int) – Page number.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to the canonical format.
- Type:
dict[str, Any]
- issue
Issue this page belongs to.
- Type:
NewspaperIssue | None
- abstract add_issue(issue: NewspaperIssue) → None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- abstract parse() → None
Process the page XML file and transform it into the canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
- to_json() → str
Validate self.page_data and serialize it to a string.
Note
Validation adds substantial overhead to computing time. When serializing large amounts of pages, it is recommended to bypass schema validation.
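To make these contracts concrete, here is a minimal, purely hypothetical pair of subclasses for the fictional "newformat" flavor. Method bodies are schematic placeholders, and the comments flag assumptions made about what the abstract base classes provide:
from text_preparation.importers.classes import NewspaperIssue, NewspaperPage


class NewFormatNewspaperPage(NewspaperPage):
    """Page of the hypothetical 'newformat' OCR flavor."""

    def __init__(self, _id: str, number: int, page_path: str):
        super().__init__(_id, number)
        self.page_path = page_path  # where this page's OCR file lives

    def add_issue(self, issue: NewspaperIssue) -> None:
        # keep a reference to the parent issue, so that the page can
        # preserve contextual information coming from it
        self.issue = issue

    def parse(self) -> None:
        # called lazily, only when the page is actually processed:
        # read self.page_path and populate self.page_data with the
        # canonical page representation (schematic placeholder here)
        self.page_data = {"r": []}  # e.g. regions parsed from the OCR file


class NewFormatNewspaperIssue(NewspaperIssue):
    """Issue of the hypothetical 'newformat' OCR flavor."""

    def __init__(self, issue_dir):
        super().__init__(issue_dir)
        # self.id and self.path are assumed to be set by the base class
        self.pages = [
            NewFormatNewspaperPage(f"{self.id}-p{n:04}", n, path)
            for n, path in self._find_page_files()
        ]
        for page in self.pages:
            page.add_issue(self)
        # finally, populate self.issue_data according to the canonical format

    def _find_page_files(self):
        # hypothetical helper: yield (page_number, file_path) pairs
        # found under self.path
        return []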
Write an importer CLI script
This script passes the new NewspaperIssue
class, together with the newly
defined detect functions, to the main()
function of the generic importer CLI,
text_preparation.importers.generic_importer.main()
.
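Following the oliveimporter.py example above, the script for our hypothetical "newformat" importer would simply wire its own classes and detect/select functions into generic_importer.main() (all "newformat" names are placeholders, and newformat_select_issues would mirror olive_select_issues for config-driven selection):
from text_preparation.importers import generic_importer
from text_preparation.importers.newformat.classes import NewFormatNewspaperIssue
from text_preparation.importers.newformat.detect import (newformat_detect_issues,
                                                         newformat_select_issues)

if __name__ == '__main__':
    generic_importer.main(
        NewFormatNewspaperIssue,
        newformat_detect_issues,
        newformat_select_issues
    )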
Test
Create a new test file named test_<new_importer>_importer.py and add it to tests/importers/.
This file should contain, at the very minimum, a test called test_import_issues(), which:
detects input data from text_preparation/data/sample/<new_format>;
writes any output to text_preparation/data/out/.
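Continuing the hypothetical "newformat" example, a bare-bones version of this test could look as follows; the existing importer tests in tests/importers/ remain the authoritative reference:
import os

from text_preparation.importers.newformat.classes import NewFormatNewspaperIssue
from text_preparation.importers.newformat.detect import newformat_detect_issues


def test_import_issues():
    """Detect the sample data and convert it to the canonical format."""
    input_dir = "text_preparation/data/sample/newformat"
    output_dir = "text_preparation/data/out/"

    issue_dirs = newformat_detect_issues(input_dir)
    assert len(issue_dirs) > 0, "no sample issues were detected"

    os.makedirs(output_dir, exist_ok=True)
    for issue_dir in issue_dirs:
        issue = NewFormatNewspaperIssue(issue_dir)
        # to_json() validates issue_data against the canonical schema
        out_path = os.path.join(output_dir, f"{issue.id}.json")
        with open(out_path, "w") as f:
            f.write(issue.to_json())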