British Library Mets/Alto importer

This importer extends the generic Mets/Alto importer and was developed to handle OCR newspaper data provided by the British Library.

BL Custom classes

This module contains the definition of BL importer classes.

The classes define newspaper Issue and Page objects which convert OCR data in the BL version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.

class text_preparation.importers.bl.classes.BlNewspaperIssue(issue_dir: IssueDir)

Newspaper Issue in BL (Mets/Alto) format.

All functions defined in this child class are specific to parsing BL Mets/Alto format.

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:

str

edition

Lowercase letter ordering issues published on the same day.

Type:

str

journal

Newspaper unique identifier or name.

Type:

str

path

Path to directory containing the issue’s OCR data.

Type:

str

date

Publication date of issue.

Type:

datetime.date

issue_data

Issue data according to canonical format.

Type:

dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:

list

rights

Access rights applicable to this issue.

Type:

str

image_properties

Metadata used to convert region OCR/OLR coordinates to IIIF-compliant ones.

Type:

dict[str, Any]

ark_id

Issue ARK identifier, used to build the IIIF links of the issue’s pages.

Type:

int
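
The canonical issue ID shown above (e.g. GDL-1900-01-02-a) appears to combine the journal, the publication date and the edition letter. The following is a minimal sketch of that composition, inferred only from the example; the helper name canonical_issue_id is hypothetical and is not part of the importer.

```python
from datetime import date


def canonical_issue_id(journal: str, pub_date: date, edition: str) -> str:
    """Join journal, ISO date and edition letter with hyphens.

    Hypothetical helper: the pattern is inferred from the documented
    example 'GDL-1900-01-02-a', not taken from the importer's code.
    """
    return f"{journal}-{pub_date.isoformat()}-{edition}"


print(canonical_issue_id("GDL", date(1900, 1, 2), "a"))
```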

class text_preparation.importers.bl.classes.BlNewspaperPage(_id: str, number: int, filename: str, basedir: str, encoding: str = 'utf-8')

Newspaper page in BL (Mets/Alto) format.

Parameters:
  • _id (str) – Canonical page ID.

  • number (int) – Page number.

  • filename (str) – Name of the Alto XML page file.

  • basedir (str) – Base directory where Alto files are located.

  • encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:

str

number

Page number.

Type:

int

page_data

Page data according to canonical format.

Type:

dict[str, Any]

issue

Issue this page is from.

Type:

NewspaperIssue

filename

Name of the Alto XML page file.

Type:

str

basedir

Base directory where Alto files are located.

Type:

str

encoding

Encoding of XML file. Defaults to ‘utf-8’.

Type:

str, optional

add_issue(issue: MetsAltoNewspaperIssue) → None

Add the given BlNewspaperIssue as an attribute of this page.

Parameters:

issue (MetsAltoNewspaperIssue) – Issue this page is from
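
The canonical page ID example above (GDL-1900-01-02-a-p0004) suggests that a page ID is the issue ID followed by "-p" and a four-digit, zero-padded page number. The sketch below assumes exactly that pattern; canonical_page_id is a hypothetical helper used for illustration only.

```python
def canonical_page_id(issue_id: str, number: int) -> str:
    """Append a zero-padded page suffix to a canonical issue ID.

    Hypothetical helper: the '-pNNNN' suffix is inferred from the
    documented example 'GDL-1900-01-02-a-p0004'.
    """
    return f"{issue_id}-p{number:04d}"


print(canonical_page_id("GDL-1900-01-02-a", 4))
```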

BL Detect functions

This module contains helper functions to find BL OCR data to import.

text_preparation.importers.bl.detect.BlIssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue, which can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

For newspapers published multiple times per day, a lowercase letter indicates the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:
  • journal (str) – Newspaper ID.

  • date (datetime.date) – Publication date of issue.

  • edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).

  • path (str) – Path to the directory containing the issue’s OCR data.

  • rights (str) – Access rights on the data (open, closed, etc.).

>>> from datetime import date
>>> i = BlIssueDir(
...     journal='0002088',
...     date=date(1832, 11, 23),
...     edition='a',
...     path='./BL/BLIP_20190920_01.zip',
...     rights='open_public'
... )
text_preparation.importers.bl.detect.detect_issues(base_dir: str, access_rights: str, tmp_dir: str) → list[IssueDirectory]

Detect newspaper issues to import within the filesystem.

This function expects the directory structure that the BL used to organize the dump of Mets/Alto OCR data.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data. This directory should contain zip files.

  • access_rights (str) – Not used for this importer, but argument is kept for uniformity.

  • tmp_dir (str) – Temporary directory to unzip archives.

Returns:

List of BlIssueDir instances to import.

Return type:

list[BlIssueDir]
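
As an illustration of the first step described above (finding zip files under base_dir and unzipping them into tmp_dir), here is a minimal sketch using only the standard library. It is not the importer's implementation: the helper names are hypothetical, and it assumes the archives sit directly inside base_dir and that each archive is extracted into its own subdirectory of tmp_dir.

```python
import zipfile
from pathlib import Path


def list_zip_archives(base_dir: str) -> list[Path]:
    """Return the zip archives found directly under base_dir, sorted by name.

    Assumption: archives are direct children of base_dir (no recursion).
    """
    return sorted(Path(base_dir).glob("*.zip"))


def unzip_archive(archive: Path, tmp_dir: str) -> Path:
    """Extract one archive into a subdirectory of tmp_dir named after it."""
    target = Path(tmp_dir) / archive.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Each extracted directory could then be passed to a function such as dir2issue() to build the issue objects; how the real importer walks the extracted tree is not shown here.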

text_preparation.importers.bl.detect.dir2issue(path: str) → IssueDirectory | None

Given the BLIP directory of an issue, create the BlIssueDir object.

TODO: update handling of rights and edition with full data.

Parameters:

path (str) – The BLIP directory path

Returns:

The corresponding Issue

Return type:

Optional[BlIssueDir]

text_preparation.importers.bl.detect.select_issues(base_dir: str, config: dict, access_rights: str, tmp_dir: str) → list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues(); the only difference is that config specifies rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • config (dict) – Config dictionary for filtering.

  • access_rights (str) – Path to access_rights.json file.

  • tmp_dir (str) – Temporary directory to unzip archives.

Returns:

List of BlIssueDir instances to import.

Return type:

list[BlIssueDir] | None
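
To make the idea of selective detection concrete, the sketch below filters a list of issue tuples by journal and date range. It deliberately does not reproduce the real config schema, which is documented elsewhere: filter_issues, its keyword arguments and the IssueDirSketch stand-in are all hypothetical and only mirror the fields listed for BlIssueDir.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for BlIssueDir, using the documented field names.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["journal", "date", "edition", "path", "rights"]
)


def filter_issues(issues, journals=None, start=None, end=None):
    """Keep issues matching the given journals and inclusive date range.

    A criterion left as None is not applied.
    """
    selected = []
    for issue in issues:
        if journals is not None and issue.journal not in journals:
            continue
        if start is not None and issue.date < start:
            continue
        if end is not None and issue.date > end:
            continue
        selected.append(issue)
    return selected
```

In this sketch the filtering happens after detection, so it could be applied to whatever list a detection step returns; whether the real select_issues() filters before or after unzipping is not stated in this section.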