Writing a new importer

TLDR;

Writing a new importer is easy and entails implementing two pieces of code:

  1. implementing functions to detect the data to import;

  2. implementing from scratch classes that handle the conversion into JSON of your OCR format or adapt one of the existing importers.

Once these two pieces of code are in place, they can be plugged into the functions defined in text_importer.importers.generic_importer so as to create a dedicated CLI script for your specific format.

For example, this is the content of oliveimporter.py:

from text_importer.importers import generic_importer
from text_importer.importers.olive.classes import OliveNewspaperIssue
from text_importer.importers.olive.detect import (olive_detect_issues,
                                                  olive_select_issues)

if __name__ == '__main__':
    generic_importer.main(
        OliveNewspaperIssue,
        olive_detect_issues,
        olive_select_issues
    )

How should the code of a new text importer be structured? We recommend to comply to the following structure:

  • text_importer.importers.<new_importer>.detect will contain functions to find the data to be imported;

  • text_importer.importers.<new_importer>.helpers (optional) will contain ancillary functions;

  • text_importer.importers.<new_importer>.parsers (optional) will contain functions/classes to parse the data.

  • text_importer/scripts/<new_importer>.py: will contain a CLI script to run the importer.

Detect data to import

  • the importer needs to know which data should be imported

  • information about the newspaper contents is often encoded as part of folder names etc., thus it needs to be extracted and made explicit, by means of Canonical identifiers

  • add some sample data to text_importer/data/sample/<new_format>

For example: olive_detect_issues()

Implement abstract classes

These two classes are passed to the the importer’s generic command-line interface, see text_importer.importers.generic_importer.main()

class text_importer.importers.classes.NewspaperIssue(issue_dir: IssueDir)

Abstract class representing a newspaper issue.

Each text importer needs to define a subclass of NewspaperIssue which specifies the logic to handle OCR data in a given format (e.g. Olive).

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:

str

edition

Lower case letter ordering issues of the same day.

Type:

str

journal

Newspaper unique identifier or name.

Type:

str

path

Path to directory containing the issue’s OCR data.

Type:

str

date

Publication date of issue.

Type:

datetime.date

issue_data

Issue data according to canonical format.

Type:

dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:

list

rights

Access rights applicable to this issue.

Type:

str

property issuedir: IssueDir

IssueDirectory corresponding to this issue.

Type:

IssueDir

to_json() str

Validate self.issue_data & serialize it to string.

Note

Validation adds a substantial overhead to computing time. For serialization of large amounts of issues it is recommendable to bypass schema validation.

class text_importer.importers.classes.NewspaperPage(_id: str, number: int)

Abstract class representing a newspaper page.

Each text importer needs to define a subclass of NewspaperPage which specifies the logic to handle OCR data in a given format (e.g. Alto).

Parameters:
  • _id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

  • number (int) – Page number.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:

str

number

Page number.

Type:

int

page_data

Page data according to canonical format.

Type:

dict[str, Any]

issue

Issue this page is from.

Type:

NewspaperIssue | None

abstract add_issue(issue: NewspaperIssue) None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

abstract parse() None

Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.

to_json() str

Validate self.page.data & serialize it to string.

Note

Validation adds a substantial overhead to computing time. For serialization of large amounts of pages it is recommendable to bypass schema validation.

Write an importer CLI script

This script imports passes the new NewspaperIssue class, together with the-newly defined detect functions, to the main() function of the generic importer CLI text_importer.importers.generic_importer.main().

Test

Create a new test file named test_<new_importer>_importer.py and add it to tests/importers/.

This file should contain at the very minimum a test called test_import_issues(), which

  • detects input data from text_importer/data/sample/<new_format>

  • writes any output to text_importer/data/out/.