BNL Mets/Alto importer

This importer extends the generic Mets/Alto importer, and it was developed to handle OCR newspaper data provided by the BNL.

BNL Custom classes

This module contains the definition of the Luxembourg importer classes.

The classes define newspaper Issue and Page objects, which convert OCR data in the BNL (Bibliothèque nationale du Luxembourg) version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.

class text_preparation.importers.lux.classes.LuxNewspaperIssue(issue_dir: IssueDir)

Class representing an issue in BNL data.

All functions defined in this child class are specific to parsing BNL (Luxembourg National Library) Mets/Alto format.

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:: str

edition

Lowercase letter used to order issues published on the same day.

Type:: str

alias

Newspaper unique alias (identifier or name).

Type:: str

path

Path to directory containing the issue’s OCR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical Issue format.

Type:: dict

pages

List of CanonicalPage instances from this issue.

Type:: list

image_properties

Metadata used to convert region OCR/OLR coordinates to IIIF-compliant ones.

Type:: dict

ark_id

Issue ARK identifier, used to build the IIIF links of the issue’s pages.

Type:: int

class text_preparation.importers.lux.classes.LuxNewspaperPage(_id: str, number: int, filename: str, basedir: str, encoding: str = 'utf-8')

Newspaper page in BNL (Mets/Alto) format.

Parameters:

_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:: str

number

Page number.

Type:: int

page_data

Page data according to canonical format.

Type:: dict[str, Any]

issue

Issue this page is from.

Type:: CanonicalIssue

filename

Name of the Alto XML page file.

Type:: str

basedir

Base directory where Alto files are located.

Type:: str

encoding

Encoding of XML file.

Type:: str, optional

add_issue(issue: MetsAltoCanonicalIssue) → None

Add to a page object its parent, i.e. the canonical issue.

This allows each page to preserve contextual information coming from the canonical issue.

Parameters:: issue (CanonicalIssue) – Canonical issue containing this page.

BNL Detect functions

This module contains helper functions to find BNL OCR data to be imported.

text_preparation.importers.lux.detect.LuxIssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. This metadata can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

When a newspaper is published multiple times per day, a lowercase letter indicates the edition: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:

provider (str) – Provider for this alias, here always “BNL”.
alias (str) – Newspaper alias.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.

>>> from datetime import date
>>> i = LuxIssueDir('BNL','armeteufel', date(1904,1,17), 'a', './protected_027/1497608_newspaper_armeteufel_1904-01-17/')
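The following is a minimal sketch of how the fields of such a tuple could be turned into canonical identifiers. It assumes the ID layout suggested by the examples in this documentation (e.g. GDL-1900-01-02-a for an issue and GDL-1900-01-02-a-p0004 for a page). IssueDirSketch, issue_id and page_id are hypothetical names used for illustration only and are not part of the importer.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for LuxIssueDir, carrying the five documented fields.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)


def issue_id(issue_dir):
    """Build an ID shaped like 'GDL-1900-01-02-a' (alias, ISO date, edition)."""
    return f"{issue_dir.alias}-{issue_dir.date.isoformat()}-{issue_dir.edition}"


def page_id(issue_dir, number):
    """Build an ID shaped like 'GDL-1900-01-02-a-p0004' (4-digit page number)."""
    return f"{issue_id(issue_dir)}-p{number:04d}"


i = IssueDirSketch(
    "BNL",
    "armeteufel",
    date(1904, 1, 17),
    "a",
    "./protected_027/1497608_newspaper_armeteufel_1904-01-17/",
)
print(issue_id(i))    # armeteufel-1904-01-17-a
print(page_id(i, 4))  # armeteufel-1904-01-17-a-p0004
```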

text_preparation.importers.lux.detect.detect_issues(base_dir: str) → list[IssueDirectory]

Detect newspaper issues to import within the filesystem.

This function expects the directory structure that BNL used to organize the dump of Mets/Alto OCR data.

Parameters:: base_dir (str) – Path to the base directory of newspaper data.
Returns:: List of LuxIssueDir instances, to be imported.
Return type:: list[LuxIssueDir]

text_preparation.importers.lux.detect.dir2issue(path: str) → IssueDirectory

Create a LuxIssueDir from a directory (BNL format).

Called internally by detect_issues().

Parameters:: path (str) – Path of issue.
Returns:: New LuxIssueDir object matching the path and rights.
Return type:: LuxIssueDir
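As an illustration of what such a conversion might involve, the sketch below extracts the alias and date from a directory name. The naming pattern is an assumption inferred solely from the example path shown for LuxIssueDir (a numeric prefix, the literal "newspaper", the alias, and an ISO date); the real BNL dump may follow other conventions. parse_issue_dirname and DIR_PATTERN are hypothetical names, and the handling of editions and rights is deliberately left out.

```python
import os
import re
from datetime import date, datetime

# Assumed pattern, inferred from one example directory name:
# <numeric id>_newspaper_<alias>_<YYYY-MM-DD>
DIR_PATTERN = re.compile(
    r"^(?P<num>\d+)_newspaper_(?P<alias>.+)_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_issue_dirname(path):
    """Return (alias, publication date) parsed from an issue directory path."""
    # normpath drops a trailing slash so basename returns the directory name.
    name = os.path.basename(os.path.normpath(path))
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected issue directory name: {name!r}")
    issue_date = datetime.strptime(match.group("date"), "%Y-%m-%d").date()
    return match.group("alias"), issue_date


alias, issue_date = parse_issue_dirname(
    "./protected_027/1497608_newspaper_armeteufel_1904-01-17/"
)
print(alias, issue_date)  # armeteufel 1904-01-17
```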

text_preparation.importers.lux.detect.select_issues(base_dir: str, config: dict) → list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues(); the only difference is that config specifies rules to filter the data to import. How to configure the filtering is described in a separate section of the documentation.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.

Returns:

List of LuxIssueDir instances to import.

Return type:

list[LuxIssueDir] | None

BNL Helper methods

This module contains helper functions used when parsing BNL OCR data.

text_preparation.importers.lux.helpers.convert_coordinates(hpos: int, vpos: int, width: int, height: int, x_res: float, y_res: float) → list[int]

Convert the coordinates to iiif-compliant ones using the resolution.

x = (coordinate['xResolution'] / 254.0) * coordinate['hpos']
y = (coordinate['yResolution'] / 254.0) * coordinate['vpos']
w = (coordinate['xResolution'] / 254.0) * coordinate['width']
h = (coordinate['yResolution'] / 254.0) * coordinate['height']

Parameters:

hpos (int) – Horizontal position coordinate of element.
vpos (int) – Vertical position coordinate of element.
width (int) – Width of element.
height (int) – Height of element.
x_res (float) – X-axis resolution of image.
y_res (float) – Y-axis resolution of image.

Returns:

Converted coordinates.

Return type:

list[int]
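The conversion can be sketched as follows, using the documented parameter names. The factor 254.0 is consistent with coordinates expressed in tenths of a millimetre and a resolution in dots per inch (1 inch = 25.4 mm), but that reading is an assumption. Rounding to the nearest integer is also an assumption: the documentation only states that a list[int] is returned. convert_coordinates_sketch is a hypothetical name, not the importer's function.

```python
def convert_coordinates_sketch(hpos, vpos, width, height, x_res, y_res):
    """Scale [hpos, vpos, width, height] by resolution / 254.0."""
    x = (x_res / 254.0) * hpos
    y = (y_res / 254.0) * vpos
    w = (x_res / 254.0) * width
    h = (y_res / 254.0) * height
    # Assumption: round to the nearest integer to obtain pixel values.
    return [round(x), round(y), round(w), round(h)]


# With a resolution of 300 on both axes, 254 units map to 300 pixels.
print(convert_coordinates_sketch(254, 508, 254, 127, 300.0, 300.0))
# [300, 600, 300, 150]
```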

text_preparation.importers.lux.helpers.div_has_body(div: Tag, body_type='body') → bool

Checks if the given div has a body among its direct children.

Parameters:

div (Tag) – div element to check.
body_type (str, optional) – Content type of a body. Defaults to ‘body’.

Returns:

True if one or more of div’s direct children have a body.

Return type:

bool
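A rough sketch of this check is given below. To stay within the standard library it uses xml.etree.ElementTree elements rather than the Tag objects the real function accepts, and it assumes that a div's content type is stored in a TYPE attribute and compared case-insensitively. Both points are assumptions, XML namespaces are ignored, and div_has_body_sketch is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def div_has_body_sketch(div, body_type="body"):
    """Return True if a direct child div has the given type (assumed TYPE attribute)."""
    # Iterating over an Element yields only its direct children.
    return any(
        child.tag == "div" and child.get("TYPE", "").lower() == body_type
        for child in div
    )


article = ET.fromstring(
    '<div TYPE="ARTICLE"><div TYPE="TITLE"/><div TYPE="BODY"/></div>'
)
nested = ET.fromstring(
    '<div TYPE="SECTION"><div TYPE="ARTICLE"><div TYPE="BODY"/></div></div>'
)
print(div_has_body_sketch(article))  # True
print(div_has_body_sketch(nested))   # False: the body is a grandchild
```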

text_preparation.importers.lux.helpers.encode_ark(ark: str) → str

Replaces (encodes) backslashes in the Ark identifier.

Parameters:: ark (str) – original ark identifier.
Returns:: New ark identifier with encoded backslashes.
Return type:: str

text_preparation.importers.lux.helpers.find_section_articles(section_div: Tag, content_items: list[dict[str, Any]]) → list[str]

Parse the articles inside the section div and get their content item ID.

Recover the content item canonical ID corresponding to each article using the legacy ID (from the OCR) of the articles found in div’s children.

Parameters:

section_div (Tag) – div with the articles for which to get CI IDs.
content_items (list[dict[str, Any]]) – Content items already identified.

Returns:

List of content item IDs for div’s children articles.

Return type:

list[str]

text_preparation.importers.lux.helpers.remove_section_cis(content_items: list[dict[str, Any]], sections: list[dict[str, Any]]) → tuple[list[dict[str, Any]], list[dict[str, Any]]]

Remove undesired content items based on the formed sections.

Some content items are contained within a section and should not be in the content items. Given the recovered section content items, they can be removed.

Parameters:

content_items (list[dict[str, Any]]) – Content items, to be filtered.
sections (list[dict[str, Any]]) – Formed section content items.

Returns:

Filtered content items, and the ones that were removed.

Return type:

tuple[list[dict[str, Any]], list[dict[str, Any]]]
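The filtering step might look like the sketch below. The dictionary layout is hypothetical: the keys "id" and "contained_ids" are invented for illustration, since the actual structure of content items and sections is not described here. remove_section_cis_sketch is likewise a hypothetical name.

```python
def remove_section_cis_sketch(content_items, sections):
    """Split content items into (kept, removed) based on section membership."""
    # Hypothetical layout: each section lists the IDs of the items it contains.
    contained = {
        ci_id for section in sections for ci_id in section["contained_ids"]
    }
    kept = [ci for ci in content_items if ci["id"] not in contained]
    removed = [ci for ci in content_items if ci["id"] in contained]
    return kept, removed


items = [{"id": "i0001"}, {"id": "i0002"}, {"id": "i0003"}]
sections = [{"id": "i0004", "contained_ids": ["i0002", "i0003"]}]
kept, removed = remove_section_cis_sketch(items, sections)
print([ci["id"] for ci in kept])     # ['i0001']
print([ci["id"] for ci in removed])  # ['i0002', 'i0003']
```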

text_preparation.importers.lux.helpers.section_is_article(section_div: Tag) → bool

Check if the given section div is an article.

This is the case when none of the div’s children, other than “BODY” and “BODY_CONTENT”, are of a non-article type, i.e. ads or obituaries.

Parameters:: section_div (Tag) – section div to check.
Returns:: True if given div is an article section.
Return type:: bool