Writing a new importer
TL;DR
Writing a new importer is easy and entails implementing two pieces of code:
functions that detect the data to import;
classes that handle the conversion of your OCR format into JSON, either written from scratch or adapted from one of the existing importers.
Once these two pieces of code are in place, they can be plugged into the functions defined in text_importer.importers.generic_importer to create a dedicated CLI script for your specific format.
For example, this is the content of oliveimporter.py:
from text_importer.importers import generic_importer
from text_importer.importers.olive.classes import OliveNewspaperIssue
from text_importer.importers.olive.detect import (olive_detect_issues,
                                                  olive_select_issues)

if __name__ == '__main__':
    generic_importer.main(
        OliveNewspaperIssue,
        olive_detect_issues,
        olive_select_issues
    )
How should the code of a new text importer be structured? We recommend complying with the following structure:
text_importer.importers.<new_importer>.detect: will contain functions to find the data to be imported;
text_importer.importers.<new_importer>.helpers: (optional) will contain ancillary functions;
text_importer.importers.<new_importer>.parsers: (optional) will contain functions/classes to parse the data;
text_importer/scripts/<new_importer>.py: will contain a CLI script to run the importer.
Detect data to import
The importer needs to know which data should be imported.
Information about the newspaper contents is often encoded in folder names and similar places, so it needs to be extracted and made explicit by means of canonical identifiers.
Add some sample data to text_importer/data/sample/<new_format>.
For an example of a detect function, see olive_detect_issues().
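As an illustration of turning folder names into explicit identifiers, here is a minimal sketch. It is not the code of olive_detect_issues(): the folder layout <journal>/<yyyy>/<mm>/<dd>/<edition>, the IssueDir stand-in, and the helper names detect_issues() and canonical_issue_id() are all assumptions made for this example.

```python
import os
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for the library's IssueDir; the field names mirror the
# documented NewspaperIssue attributes (journal, date, edition, path).
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])


def canonical_issue_id(issue_dir):
    """Build an ID such as GDL-1900-01-02-a from an IssueDir."""
    return "{}-{}-{}".format(
        issue_dir.journal, issue_dir.date.isoformat(), issue_dir.edition
    )


def detect_issues(base_dir):
    """Walk an assumed <journal>/<yyyy>/<mm>/<dd>/<edition> folder layout."""
    issues = []
    for root, dirs, _files in os.walk(base_dir):
        dirs.sort()  # make the traversal order deterministic
        parts = os.path.relpath(root, base_dir).split(os.sep)
        if len(parts) != 5:
            continue  # not at the depth of an issue folder
        journal, year, month, day, edition = parts
        try:
            issue_date = date(int(year), int(month), int(day))
        except ValueError:
            continue  # skip folders that do not encode a valid date
        issues.append(IssueDir(journal, issue_date, edition, root))
        dirs[:] = []  # do not descend into the issue folder itself
    return issues
```

Calling detect_issues() on the sample data folder would then return one IssueDir per issue, and canonical_issue_id() makes the identifier explicit. A real detect function will need to follow whatever layout the data provider actually uses.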
Implement abstract classes
These two classes are passed to the importer’s generic command-line interface; see text_importer.importers.generic_importer.main().
- class text_importer.importers.classes.NewspaperIssue(issue_dir: IssueDir)
Abstract class representing a newspaper issue.
Each text importer needs to define a subclass of NewspaperIssue which specifies the logic to handle OCR data in a given format (e.g. Olive).
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- property issuedir: IssueDir
IssueDirectory corresponding to this issue.
- Type:
IssueDir
- to_json() -> str
Validate self.issue_data and serialize it to a string.
Note
Validation adds substantial overhead to computing time. When serializing a large number of issues, it is advisable to bypass schema validation.
- class text_importer.importers.classes.NewspaperPage(_id: str, number: int)
Abstract class representing a newspaper page.
Each text importer needs to define a subclass of NewspaperPage which specifies the logic to handle OCR data in a given format (e.g. Alto).
- Parameters:
_id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
number (int) – Page number.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
NewspaperIssue | None
- abstract add_issue(issue: NewspaperIssue) -> None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- abstract parse() -> None
Process the page XML file and transform it into the canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
- to_json() -> str
Validate self.page_data and serialize it to a string.
Note
Validation adds substantial overhead to computing time. When serializing a large number of pages, it is advisable to bypass schema validation.
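To make the subclassing pattern and the lazy parse() behavior concrete, here is a minimal, self-contained sketch. The NewspaperPage class below is a simplified stand-in for the real abstract class (it skips schema validation), and ToyXmlPage, its xml_string argument, the <line> elements, and the keys stored in page_data are all hypothetical, chosen only for illustration.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class NewspaperPage(ABC):
    """Simplified stand-in for text_importer.importers.classes.NewspaperPage."""

    def __init__(self, _id, number):
        self.id = _id
        self.number = number
        self.page_data = {}
        self.issue = None

    @abstractmethod
    def add_issue(self, issue):
        ...

    @abstractmethod
    def parse(self):
        ...

    def to_json(self):
        # The real method also validates against a schema; omitted here.
        return json.dumps(self.page_data)


class ToyXmlPage(NewspaperPage):
    """Hypothetical page class for a made-up XML format with <line> elements."""

    def __init__(self, _id, number, xml_string):
        super().__init__(_id, number)
        self.xml_string = xml_string  # kept unparsed until parse() is called

    def add_issue(self, issue):
        # Keep a reference to the parent issue for contextual information.
        self.issue = issue

    def parse(self):
        root = ET.fromstring(self.xml_string)
        lines = [el.text or "" for el in root.iter("line")]
        # These keys are illustrative, not the canonical page schema.
        self.page_data = {"id": self.id, "lines": lines}


page = ToyXmlPage("GDL-1900-01-02-a-p0004", 4, "<page><line>Hello</line></page>")
assert page.page_data == {}  # nothing is processed at creation time
page.parse()                 # the XML is only read here
print(page.to_json())
```

Deferring the work to parse() keeps object creation cheap, so an issue can list all its pages without reading every page file up front.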
Write an importer CLI script
This script imports the new NewspaperIssue class, together with the newly defined detect functions, and passes them to the main() function of the generic importer CLI, text_importer.importers.generic_importer.main().
Test
Create a new test file named test_<new_importer>_importer.py and add it to tests/importers/.
This file should contain, at the very minimum, a test called test_import_issues(), which:
detects input data from text_importer/data/sample/<new_format>;
writes any output to text_importer/data/out/.