Writing a new importer

TL;DR

Writing a new importer is easy and entails implementing two pieces of code:

functions that detect the data to import;
classes that handle the conversion of your OCR format into JSON, either written from scratch or adapted from one of the existing importers.

Once these two pieces of code are in place, they can be plugged into the functions defined in text_importer.importers.generic_importer so as to create a dedicated CLI script for your specific format.

For example, this is the content of oliveimporter.py:

from text_importer.importers import generic_importer
from text_importer.importers.olive.classes import OliveNewspaperIssue
from text_importer.importers.olive.detect import (olive_detect_issues,
                                                  olive_select_issues)

if __name__ == '__main__':
    generic_importer.main(
        OliveNewspaperIssue,
        olive_detect_issues,
        olive_select_issues
    )

How should the code of a new text importer be structured? We recommend the following structure:

text_importer.importers.<new_importer>.detect will contain functions to find the data to be imported;
text_importer.importers.<new_importer>.helpers (optional) will contain ancillary functions;
text_importer.importers.<new_importer>.parsers (optional) will contain functions/classes to parse the data;
text_importer/scripts/<new_importer>.py will contain a CLI script to run the importer.

Detect data to import

The importer needs to know which data should be imported.
Information about the newspaper contents is often encoded in folder names and similar places, so it needs to be extracted and made explicit by means of canonical identifiers.
Add some sample data to text_importer/data/sample/<new_format>.

For example: olive_detect_issues()
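The detection step can be illustrated with a minimal sketch. It is not the library's implementation: the IssueDir tuple below is a local stand-in whose fields are assumed from the NewspaperIssue attributes documented on this page, the <journal>/<yyyy>/<mm>/<dd>/<edition> folder layout is hypothetical, and example_detect_issues() and canonical_issue_id() are illustrative names.

```python
from collections import namedtuple
from datetime import date
from pathlib import Path
from typing import List

# Local stand-in for the library's IssueDir (field names are an assumption).
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path", "rights"])


def example_detect_issues(base_dir: str, rights: str = "unknown") -> List[IssueDir]:
    """Walk a hypothetical <journal>/<yyyy>/<mm>/<dd>/<edition> layout.

    The default value of `rights` is a placeholder, not a value defined
    by the library.
    """
    issues = []
    for edition_dir in sorted(Path(base_dir).glob("*/*/*/*/*")):
        if not edition_dir.is_dir():
            continue
        journal, year, month, day, edition = edition_dir.relative_to(base_dir).parts
        try:
            issue_date = date(int(year), int(month), int(day))
        except ValueError:
            # Skip folders whose names do not encode a valid date.
            continue
        issues.append(IssueDir(journal, issue_date, edition, str(edition_dir), rights))
    return issues


def canonical_issue_id(issue_dir: IssueDir) -> str:
    """Make the folder-encoded information explicit, e.g. GDL-1900-01-02-a."""
    return f"{issue_dir.journal}-{issue_dir.date.isoformat()}-{issue_dir.edition}"
```

With a folder GDL/1900/01/02/a under base_dir, the sketch yields a single IssueDir whose canonical ID is GDL-1900-01-02-a. A real detect function would follow whatever naming scheme the data provider uses.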

Implement abstract classes

These two classes are passed to the importer’s generic command-line interface; see text_importer.importers.generic_importer.main().

class text_importer.importers.classes.NewspaperIssue(issue_dir: IssueDir)

Abstract class representing a newspaper issue.

Each text importer needs to define a subclass of NewspaperIssue which specifies the logic to handle OCR data in a given format (e.g. Olive).

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:: str

edition

Lower case letter ordering issues of the same day.

Type:: str

journal

Newspaper unique identifier or name.

Type:: str

path

Path to directory containing the issue’s OCR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical format.

Type:: dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:: list

rights

Access rights applicable to this issue.

Type:: str

property issuedir: IssueDir

IssueDirectory corresponding to this issue.

Type:: IssueDir

to_json() → str: Validate self.issue_data & serialize it to string.

Note

Validation adds substantial overhead to computing time. When serializing a large number of issues, it is advisable to bypass schema validation.
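To make the subclassing pattern concrete, here is a hedged sketch. The NewspaperIssue class below is a simplified local stand-in that only mirrors the attributes listed above. It is not the real base class, and its to_json() skips schema validation. ExampleNewspaperIssue, the one-XML-file-per-page layout, and the keys used in issue_data are all hypothetical and do not reflect the canonical schema.

```python
import json
from collections import namedtuple
from datetime import date
from pathlib import Path

# Local stand-in for the library's IssueDir (field names are an assumption).
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path", "rights"])


class NewspaperIssue:
    """Simplified stand-in for text_importer.importers.classes.NewspaperIssue."""

    def __init__(self, issue_dir):
        self.journal = issue_dir.journal
        self.date = issue_dir.date
        self.edition = issue_dir.edition
        self.path = issue_dir.path
        self.rights = issue_dir.rights
        self.id = f"{self.journal}-{self.date.isoformat()}-{self.edition}"
        self.issue_data = {}
        self.pages = []

    def to_json(self) -> str:
        # The documented method also validates the data; this stand-in
        # only serializes it.
        return json.dumps(self.issue_data)


class ExampleNewspaperIssue(NewspaperIssue):
    """Issue in a hypothetical format that stores one XML file per page."""

    def __init__(self, issue_dir):
        super().__init__(issue_dir)
        page_files = sorted(Path(self.path).glob("*.xml"))
        # Page IDs extend the issue ID with a zero-padded page number.
        # A real importer would also fill self.pages with NewspaperPage
        # instances; only the IDs are recorded here to keep the sketch short.
        page_ids = [f"{self.id}-p{n:04d}" for n, _ in enumerate(page_files, start=1)]
        # Illustrative keys only, not the canonical schema.
        self.issue_data = {"id": self.id, "rights": self.rights, "pages": page_ids}
```

All format-specific logic lives in the subclass constructor, so the generic importer only needs the class itself and an IssueDir to build each issue.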

class text_importer.importers.classes.NewspaperPage(_id: str, number: int)

Abstract class representing a newspaper page.

Each text importer needs to define a subclass of NewspaperPage which specifies the logic to handle OCR data in a given format (e.g. Alto).

Parameters:

_id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
number (int) – Page number.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:: str

number

Page number.

Type:: int

page_data

Page data according to canonical format.

Type:: dict[str, Any]

issue

Issue this page is from.

Type:: NewspaperIssue | None

abstract add_issue(issue: NewspaperIssue) → None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:: issue (NewspaperIssue) – Newspaper issue containing this page.

abstract parse() → None: Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.

to_json() → str: Validate self.page.data & serialize it to string.

Note

Validation adds substantial overhead to computing time. When serializing a large number of pages, it is advisable to bypass schema validation.
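The lazy parse() behavior described above can be sketched as follows. The NewspaperPage class here is a simplified local stand-in for the documented abstract class, without schema validation. ExampleNewspaperPage, its xml_path argument, the <line> element name, and the keys in page_data are hypothetical choices made for illustration.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class NewspaperPage(ABC):
    """Simplified stand-in for text_importer.importers.classes.NewspaperPage."""

    def __init__(self, _id: str, number: int):
        self.id = _id
        self.number = number
        self.page_data = {}
        self.issue = None

    @abstractmethod
    def add_issue(self, issue) -> None:
        """Attach the parent issue to this page."""

    @abstractmethod
    def parse(self) -> None:
        """Fill self.page_data from the page's source file."""

    def to_json(self) -> str:
        # The documented method also validates the data; this one only serializes.
        return json.dumps(self.page_data)


class ExampleNewspaperPage(NewspaperPage):
    """Page in a hypothetical XML format with one <line> element per text line."""

    def __init__(self, _id: str, number: int, xml_path: str):
        super().__init__(_id, number)
        # Only the path is stored here; the file is not read yet.
        self.xml_path = xml_path

    def add_issue(self, issue) -> None:
        self.issue = issue

    def parse(self) -> None:
        # The XML is read only when parse() is called explicitly.
        root = ET.parse(self.xml_path).getroot()
        lines = [el.text or "" for el in root.iter("line")]
        # Illustrative keys only, not the canonical schema.
        self.page_data = {"id": self.id, "lines": lines}
```

Creating an ExampleNewspaperPage leaves page_data empty until parse() runs, so many page objects can be created cheaply before any OCR file is opened.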

Write an importer CLI script

This script passes the new NewspaperIssue class, together with the newly defined detect functions, to the main() function of the generic importer CLI, text_importer.importers.generic_importer.main().

Test

Create a new test file named test_<new_importer>_importer.py and add it to tests/importers/.

This file should contain, at the very minimum, a test called test_import_issues(), which:

detects input data from text_importer/data/sample/<new_format>;
writes any output to text_importer/data/out/.