RERO Mets/Alto importer

This importer extends the generic Mets/Alto importer. It was developed to handle the OCR newspaper data that RERO provides in Mets/Alto format (the rest of the data is in Olive format).

RERO Custom classes

This module contains the definition of the RERO importer classes.

The classes define newspaper Issue and Page objects, which convert OCR data from the RERO version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.

class text_preparation.importers.rero.classes.ReroNewspaperIssue(issue_dir: IssueDir)

Newspaper Issue in RERO (Mets/Alto) format.

All functions defined in this child class are specific to parsing RERO Mets/Alto format.

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:: str

edition

Lower case letter ordering issues of the same day.

Type:: str

journal

Newspaper unique identifier or name.

Type:: str

path

Path to directory containing the issue’s OCR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical format.

Type:: dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:: list

rights

Access rights applicable to this issue.

Type:: str

image_properties

Metadata used to convert region OCR/OLR coordinates to IIIF-compliant ones.

Type:: dict[str, Any]

ark_id

Issue ARK identifier, used in the IIIF links of the issue’s pages.

Type:: int

class text_preparation.importers.rero.classes.ReroNewspaperPage(_id: str, number: int, filename: str, basedir: str)

Newspaper page in RERO (Mets/Alto) format.

Parameters:

_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:: str

number

Page number.

Type:: int

page_data

Page data according to canonical format.

Type:: dict[str, Any]

issue

Issue this page is from.

Type:: NewspaperIssue

filename

Name of the Alto XML page file.

Type:: str

basedir

Base directory where Alto files are located.

Type:: str

encoding

Encoding of XML file. Defaults to ‘utf-8’.

Type:: str, optional

page_width

The page width used for the coordinate system.

Type:: float

add_issue(issue: MetsAltoNewspaperIssue) → None

Attach the parent newspaper issue to a page object.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:: issue (NewspaperIssue) – Newspaper issue containing this page.

text_preparation.importers.rero.classes.convert_coordinates(coords: list[float], resolution: dict[str, float], page_width: float) → list[int]

Convert coordinates from the coordinate system resolution to the true image resolution.

The coordinate system resolution is not necessarily the same as the true resolution of the image. A conversion, or rescaling, can thus be necessary. The function essentially computes fact = coordinate_width / true_width and converts each coordinate x to x / fact.

Parameters:

coords (list[float]) – List of coordinates to convert.
resolution (dict[str, float]) – True resolution of the images (keys x_resolution and y_resolution of the dict).
page_width (float) – The page width used for the coordinate system.

Returns:

The coordinates rescaled to match the true image resolution.

Return type:

list[int]
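
The rescaling described for convert_coordinates can be sketched as follows. This is a minimal illustration, not the importer's actual code: the name rescale_coordinates is hypothetical, and the sketch assumes that the true width is read from the x_resolution key, that one factor is applied to every coordinate, and that results are truncated to int.

```python
def rescale_coordinates(coords, resolution, page_width):
    """Rescale coordinates from the coordinate system to the true image size.

    Assumption: the true width is resolution["x_resolution"], and the same
    factor is applied to every value in coords.
    """
    true_width = resolution["x_resolution"]
    if page_width == 0 or true_width == 0:
        # No usable factor can be computed; return the input truncated to int.
        return [int(c) for c in coords]
    fact = page_width / true_width  # fact = coordinate_width / true_width
    return [int(c / fact) for c in coords]


# A coordinate system twice as wide as the true image halves every value.
rescaled = rescale_coordinates(
    [100.0, 200.0, 50.0, 80.0],
    {"x_resolution": 1000.0, "y_resolution": 1500.0},
    page_width=2000.0,
)
print(rescaled)
```

The guard against a zero width is a choice made for this sketch only; the documented function may handle that case differently.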

RERO Detect functions

This module contains helper functions to find RERO OCR data to be imported.

text_preparation.importers.rero.detect.Rero2IssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue, which can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

For newspapers published multiple times per day, a lowercase letter indicates the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:

journal (str) – Newspaper ID.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
rights (str) – Access rights on the data (open, closed, etc.).

>>> from datetime import date
>>> from text_preparation.importers.rero.detect import Rero2IssueDir
>>> i = Rero2IssueDir('BLB', date(1845, 12, 28), 'a', './BLB/data/BLB/18451228_01', 'open')
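
As an illustration of how such a tuple can yield canonical identifiers, the sketch below rebuilds IDs in the style of the examples given for the issue and page classes (GDL-1900-01-02-a and GDL-1900-01-02-a-p0004). It is not the importer's code: IssueDirSketch, canonical_issue_id and canonical_page_id are hypothetical names, and the ID layout is inferred from those two examples only.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in exposing the five documented fields.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["journal", "date", "edition", "path", "rights"]
)


def canonical_issue_id(issue_dir):
    """Build an ID shaped like 'GDL-1900-01-02-a' from the tuple's fields."""
    d = issue_dir.date
    return f"{issue_dir.journal}-{d.year:04d}-{d.month:02d}-{d.day:02d}-{issue_dir.edition}"


def canonical_page_id(issue_dir, page_number):
    """Append a zero-padded page suffix (assumed four digits, as in '-p0004')."""
    return f"{canonical_issue_id(issue_dir)}-p{page_number:04d}"


# Same values as the doctest example above.
issue = IssueDirSketch("BLB", date(1845, 12, 28), "a", "./BLB/data/BLB/18451228_01", "open")
print(canonical_issue_id(issue))
print(canonical_page_id(issue, 4))
```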

text_preparation.importers.rero.detect.detect_issues(base_dir: str, access_rights: str, data_dir: str = 'data') → list[IssueDirectory]

Detect newspaper issues to import within the filesystem.

This function expects the directory structure that RERO used to organize the dump of Mets/Alto OCR data.

TODO: add info on the file structure.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Path to access_rights.json file.
data_dir (str, optional) – Directory where data is stored (usually data/). Defaults to ‘data’.

Returns:

List of Rero2IssueDir instances to be imported.

Return type:

list[Rero2IssueDir]

text_preparation.importers.rero.detect.dir2issue(path: str, access_rights: dict) → IssueDirectory

Create a Rero2IssueDir from a directory (RERO format).

Parameters:

path (str) – Path of issue.
access_rights (dict) – Dictionary of access rights.

Returns:

New Rero2IssueDir object matching the path and rights.

Return type:

Rero2IssueDir
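
The directory layout is not documented here, so the following is only a sketch of what such a conversion could look like. It assumes paths shaped like the one in the Rero2IssueDir example ('./BLB/data/BLB/18451228_01'), that is, a journal directory containing a folder named YYYYMMDD_NN, where NN is taken to be the edition number. The names ParsedIssueDir and parse_issue_dir are hypothetical, and because the structure of the access rights dictionary is not described, the sketch takes the rights value directly as a string.

```python
import os
import string
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in exposing the five documented fields.
ParsedIssueDir = namedtuple(
    "ParsedIssueDir", ["journal", "date", "edition", "path", "rights"]
)


def parse_issue_dir(path, rights):
    """Derive issue metadata from a path like './BLB/data/BLB/18451228_01'."""
    parts = os.path.normpath(path).split(os.sep)
    journal, issue_name = parts[-2], parts[-1]
    # Assumed folder name layout: <YYYYMMDD>_<edition number>.
    date_str, edition_number = issue_name.split("_")[:2]
    issue_date = datetime.strptime(date_str, "%Y%m%d").date()
    # Map edition 1 to 'a', 2 to 'b', and so on.
    edition = string.ascii_lowercase[int(edition_number) - 1]
    return ParsedIssueDir(journal, issue_date, edition, path, rights)


parsed = parse_issue_dir("./BLB/data/BLB/18451228_01", "open")
print(parsed.journal, parsed.date, parsed.edition, parsed.rights)
```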

text_preparation.importers.rero.detect.select_issues(base_dir: str, config: dict, access_rights: str) → list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues(); the only difference is that config specifies rules to filter the data to import. See the section on filter configuration for further details.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
access_rights (str) – Path to access_rights.json file.

Returns:

Rero2IssueDir instances to be imported.

Return type:

list[Rero2IssueDir] | None