British Library importers

The British Library shared data with us in several different formats. We will process them one at a time, based on the number of media titles they cover. All importers share the same detect functions but have their own classes, which act as submodules of the “bl” module.

BL Detect functions

This module contains helper functions to find BL OCR data to import.

text_preparation.importers.bl.detect.BlIssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

For newspapers published multiple times per day, a lowercase letter indicates the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:

provider (str) – Provider for this alias, here always “BL”
alias (str) – Newspaper alias.
date (datetime.date) – Publication date of the issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
nlp (str) – BL internal NLP for this issue (e.g. ‘0002088’).

>>> from datetime import date
>>> i = BlIssueDir(
...     provider='BL',
...     alias='LSGA',
...     date=date(1832, 11, 23),
...     edition='a',
...     path='./BL/LSGA/0002088/1832/11/23',
...     nlp='0002088'
... )
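
As a minimal sketch of how such a tuple could yield a canonical identifier, the snippet below defines a stand-in named tuple with the documented fields. `IssueDirSketch` and `canonical_issue_id` are hypothetical names, and the ID layout (alias, ISO date, edition letter) is only inferred from the example ID “GDL-1900-01-02-a” given further down; the actual implementation may differ.

```python
import datetime
from typing import NamedTuple


class IssueDirSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented BlIssueDir fields."""

    provider: str
    alias: str
    date: datetime.date
    edition: str
    path: str
    nlp: str


def canonical_issue_id(issue: IssueDirSketch) -> str:
    """Join alias, ISO date and edition letter with hyphens (assumed layout)."""
    return f"{issue.alias}-{issue.date.isoformat()}-{issue.edition}"


issue = IssueDirSketch(
    provider="BL",
    alias="LSGA",
    date=datetime.date(1832, 11, 23),
    edition="a",
    path="./BL/LSGA/0002088/1832/11/23",
    nlp="0002088",
)
print(canonical_issue_id(issue))  # LSGA-1832-11-23-a
```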

text_preparation.importers.bl.detect.detect_issues(base_dir: str, ocr_format: str = 'OmniPage-NLP', bl_issues_for_format: str | None = 'BL_{ocr_format}_issues.json', alias_filter: list[str] | None = None, exclude_list: list[str] | None = None) → list[IssueDirectory]

Detect BL issues to import within the filesystem.

Parameters:

base_dir (str) – Path to the base directory of newspaper data, this directory should contain directories corresponding to the BL aliases.
ocr_format (str, optional) – BL OCR format which is to be processed. Defaults to “OmniPage-NLP”.
bl_issues_for_format (str | None, optional) – Name of the file which contains the list of issues for the given OCR format. Defaults to BL_FORMAT_SPECIFIC_FILE.
alias_filter (list[str] | None, optional) – Aliases to consider. Defaults to None.
exclude_list (list[str] | None, optional) – Aliases to exclude. Defaults to None.

Returns:

List of BlIssueDir instances to import.

Return type:

list[BlIssueDir]
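
The two alias parameters and the file-name template can be illustrated with a small sketch. It assumes that `alias_filter` acts as an allow-list, that `exclude_list` acts as a deny-list, and that the `{ocr_format}` placeholder is filled with `str.format`; the helper names and the aliases “ABCD” and “WXYZ” are placeholders, not part of the library.

```python
from typing import Optional


def issues_filename(
    ocr_format: str, template: str = "BL_{ocr_format}_issues.json"
) -> str:
    """Fill the documented default template with the OCR format name (assumed)."""
    return template.format(ocr_format=ocr_format)


def keep_alias(
    alias: str,
    alias_filter: Optional[list[str]] = None,
    exclude_list: Optional[list[str]] = None,
) -> bool:
    """Return True if the alias passes both (optional) filters."""
    if alias_filter is not None and alias not in alias_filter:
        return False
    if exclude_list is not None and alias in exclude_list:
        return False
    return True


aliases = ["LSGA", "ABCD", "WXYZ"]  # "ABCD" and "WXYZ" are made-up aliases
kept = [
    a
    for a in aliases
    if keep_alias(a, alias_filter=["LSGA", "ABCD"], exclude_list=["ABCD"])
]
print(issues_filename("OmniPage-NLP"))  # BL_OmniPage-NLP_issues.json
print(kept)  # ['LSGA']
```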

text_preparation.importers.bl.detect.dir2issue(path: str) → IssueDirectory | None

Given the directory of an issue, create the BlIssueDir object.

Parameters:: path (str) – The issue directory path
Returns:: The corresponding Issue
Return type:: Optional[BlIssueDir]
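
A rough sketch of this kind of path parsing is shown below. It assumes the directory layout of the example path above (`alias/nlp/year/month/day`) and defaults the edition to ‘a’, since how the real function determines the edition is not documented here. `IssueDirSketch` and `dir2issue_sketch` are hypothetical names.

```python
import datetime
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class IssueDirSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented BlIssueDir fields."""

    provider: str
    alias: str
    date: datetime.date
    edition: str
    path: str
    nlp: str


def dir2issue_sketch(path: str, edition: str = "a") -> Optional[IssueDirSketch]:
    """Parse '<...>/<alias>/<nlp>/<year>/<month>/<day>' into an issue tuple."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        return None
    alias, nlp, year, month, day = parts[-5:]
    try:
        issue_date = datetime.date(int(year), int(month), int(day))
    except ValueError:
        # Non-numeric components or an impossible date: not an issue directory.
        return None
    return IssueDirSketch("BL", alias, issue_date, edition, path, nlp)


result = dir2issue_sketch("./BL/LSGA/0002088/1832/11/23")
print(result.alias, result.nlp, result.date)  # LSGA 0002088 1832-11-23
```

Returning `None` for paths that do not match mirrors the documented `Optional` return type.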

text_preparation.importers.bl.detect.select_issues(base_dir: str, config: dict, ocr_format: str = 'OmniPage-NLP', bl_issues_for_format: str | None = 'BL_{ocr_format}_issues.json') → list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues(), the only difference being that config specifies rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:

base_dir (str) – Path to the base directory of newspaper data, this directory should contain directories corresponding to the BL aliases.
config (dict) – Rules used to filter the data to import.
ocr_format (str, optional) – BL OCR format which is to be processed. Defaults to “OmniPage-NLP”.
bl_issues_for_format (str | None, optional) – Name of the file which contains the list of issues for the given OCR format. Defaults to BL_FORMAT_SPECIFIC_FILE.

Returns:

List of BlIssueDir instances to import.

Return type:

list[BlIssueDir] | None

1. BL OmniPage Custom classes

The first BL OCR format, which we call “OmniPage”, is based on the METS/ALTO standard. This importer extends the generic Mets/Alto importer. Almost 300 media titles from the BL are in this format.

This module contains the definition of BL importer classes for the OmniPage format.

The classes define newspaper Issue and Page objects which convert OCR data in the BL version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.

class text_preparation.importers.bl.omni.classes.BlOmniNewspaperIssue(issue_dir: IssueDirectory)

Newspaper Issue in BL (Mets/Alto) OmniPage-NLP format.

All functions defined in this child class are specific to parsing BL Mets/Alto format.

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:: str

edition

Lower case letter ordering issues of the same day.

Type:: str

alias

Newspaper unique alias (identifier or name).

Type:: str

path

Path to directory containing the issue’s OCR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical format.

Type:: dict[str, Any]

pages

List of CanonicalPage instances from this issue.

Type:: list

image_properties

Metadata used to convert region OCR/OLR coordinates to IIIF-compliant ones.

Type:: dict[str, Any]

ark_id

Issue ARK identifier, used for the IIIF links of the issue’s pages.

Type:: int

find_unlinked_image_cis(structlink: Tag, ci_counter: int, page_xmls: dict[str, Any]) → list[dict[str, Any]]

Find illustrations in ALTO pages that are not linked in the METS structlink.

Iterates through all pages, checking for image/illustration blocks (TYPE = “illustration” or “image”). If such blocks are not referenced in the structlink, creates new content items (CIs) for them.

Parameters:

structlink (Tag) – The METS structLink element containing linked regions.
ci_counter (int) – Counter used to generate unique CI IDs.
page_xmls (dict[str, Any]) – Mapping of page numbers to parsed ALTO XML documents.

Returns:

List of newly created image content items.

Return type:

list[dict[str, Any]]
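
The lookup described above can be sketched as follows. This is a simplified illustration rather than the method itself: it uses `xml.etree.ElementTree` instead of the `Tag` objects in the signature, takes the already-linked block IDs as a plain set instead of parsing the structLink, and returns a hypothetical content-item dictionary whose keys are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Any

IMAGE_TYPES = {"illustration", "image"}


def find_unlinked_images(
    linked_ids: set[str], ci_counter: int, page_xmls: dict[int, ET.Element]
) -> list[dict[str, Any]]:
    """Collect image blocks whose ID is not among the linked region IDs."""
    new_cis = []
    for page_no, root in sorted(page_xmls.items()):
        for block in root.iter():
            block_type = block.get("TYPE", "").lower()
            block_id = block.get("ID")
            if block_type in IMAGE_TYPES and block_id not in linked_ids:
                ci_counter += 1  # assumed: the counter advances per new item
                new_cis.append(
                    {
                        "counter": ci_counter,  # hypothetical key
                        "type": "image",
                        "page": page_no,
                        "block_id": block_id,
                    }
                )
    return new_cis


page = ET.fromstring(
    '<alto><ComposedBlock ID="b1" TYPE="Illustration"/>'
    '<ComposedBlock ID="b2" TYPE="image"/>'
    '<TextBlock ID="b3"/></alto>'
)
# "b1" is treated as already linked, so only "b2" yields a new item.
cis = find_unlinked_images({"b1"}, 0, {1: page})
print([ci["block_id"] for ci in cis])  # ['b2']
```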

class text_preparation.importers.bl.omni.classes.BlOmniNewspaperPage(_id: str, number: int, filename: str, basedir: str, page_size: tuple[int, int], encoding: str = 'utf-8')

Newspaper page in BL (Mets/Alto) OmniPage-NLP format.

Parameters:

_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:: str

number

Page number.

Type:: int

page_data

Page data according to canonical format.

Type:: dict[str, Any]

issue

Issue this page is from.

Type:: CanonicalIssue

filename

Name of the Alto XML page file.

Type:: str

basedir

Base directory where Alto files are located.

Type:: str

encoding

Encoding of XML file. Defaults to ‘utf-8’.

Type:: str, optional
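
Judging from the example IDs in this section, a canonical page ID appears to be the issue ID followed by “-p” and a zero-padded page number. The helper below is a hypothetical sketch of that pattern; the four-digit padding is inferred from the single example “GDL-1900-01-02-a-p0004”.

```python
def canonical_page_id(issue_id: str, number: int) -> str:
    """Append a zero-padded page suffix to an issue ID (assumed pattern)."""
    return f"{issue_id}-p{number:04d}"


print(canonical_page_id("GDL-1900-01-02-a", 4))  # GDL-1900-01-02-a-p0004
```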

add_issue(issue: MetsAltoCanonicalIssue) → None

Add the given BlOmniNewspaperIssue as an attribute of this page.

Parameters:: issue (MetsAltoCanonicalIssue) – Issue this page is from