Generic Mets/Alto importer

A backbone for any Mets/Alto importer.

Abstract classes

This module contains the definition of generic Mets/Alto importer classes.

The classes define newspaper Issue and Page objects, which convert OCR data in Mets/Alto format to a unified canonical format. The classes in this module are meant to be subclassed so that each version of the Mets/Alto format, with its specificities, is parsed independently.

class text_preparation.importers.mets_alto.classes.MetsAltoNewspaperIssue(issue_dir: IssueDir)

Newspaper issue in generic Mets/Alto format.

Note

New Mets/Alto importers should sub-class this class and implement its abstract methods (i.e. _find_pages(), _parse_mets()).

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.
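The sub-classing pattern described in the note can be sketched as follows. This is a minimal illustration, not the library's code: IssueSketch is a hypothetical stand-in for MetsAltoNewspaperIssue, and the constructor order, the method signatures and the contents of pages and issue_data are all assumptions.

```python
from abc import ABC, abstractmethod


class IssueSketch(ABC):
    """Hypothetical stand-in for MetsAltoNewspaperIssue (not the real class)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.pages: list = []
        self.issue_data: dict = {}
        # Assumed order: discover the pages first, then parse the Mets metadata.
        self._find_pages()
        self._parse_mets()

    @abstractmethod
    def _find_pages(self) -> None:
        """Locate the page files of the issue and fill self.pages."""

    @abstractmethod
    def _parse_mets(self) -> None:
        """Parse the Mets file of the issue and fill self.issue_data."""


class ProviderIssue(IssueSketch):
    """Hypothetical importer for one specific Mets/Alto flavour."""

    def _find_pages(self) -> None:
        # Placeholder file names; a real importer would scan self.path.
        self.pages = ["0001.xml", "0002.xml"]

    def _parse_mets(self) -> None:
        # Placeholder content; a real importer would read the Mets XML.
        self.issue_data = {"n_pages": len(self.pages)}


issue = ProviderIssue("/tmp/issue")
```

Because both methods are abstract, instantiating IssueSketch directly raises a TypeError, which is what forces each importer to supply its own parsing logic.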

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:: str

edition

Lower case letter ordering issues of the same day.

Type:: str

journal

Newspaper unique identifier or name.

Type:: str

path

Path to directory containing the issue’s OCR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical format.

Type:: dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:: list

rights

Access rights applicable to this issue.

Type:: str

image_properties

Metadata used to convert region OCR/OLR coordinates to IIIF-compliant ones.

Type:: dict[str, Any]

ark_id

Issue ARK identifier, used for the IIIF links of the issue’s pages.

Type:: int

property xml: BeautifulSoup

Read Mets XML file of the issue and create a BeautifulSoup object.

During processing, IO errors can occur at random when listing the contents of the directory or opening files, which prevents the issue from being parsed correctly. The error is raised after the third attempt. If the directory does not contain any Mets file, only one attempt is made.

Note

By default, the issue Mets file is the only file containing mets.xml in its file name and located in the directory self.path. Individual importers can override this behavior if necessary.

Returns:: BeautifulSoup object with Mets XML of the issue.
Return type:: BeautifulSoup
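The retry behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the property's actual code: read_mets, tries and delay are hypothetical names, and the standard-library ElementTree parser replaces BeautifulSoup so that the example runs without extra dependencies.

```python
import os
import tempfile
import time
import xml.etree.ElementTree as ET
from typing import Optional


def read_mets(path: str, tries: int = 3, delay: float = 0.0) -> ET.Element:
    """Parse the file whose name contains 'mets.xml', retrying on IO errors."""
    last_error: Optional[OSError] = None
    for _ in range(tries):
        try:
            names = [n for n in os.listdir(path) if "mets.xml" in n]
            if names:
                return ET.parse(os.path.join(path, names[0])).getroot()
        except OSError as err:
            # Transient IO problem: remember it and try again.
            last_error = err
            time.sleep(delay)
            continue
        # Listing worked but found no Mets file: retrying cannot help.
        raise FileNotFoundError(f"no Mets file in {path}")
    raise OSError(f"could not read the Mets file in {path}") from last_error


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "issue-mets.xml"), "w", encoding="utf-8") as f:
        f.write("<mets><dmdSec ID='MODSMD_ISSUE1'/></mets>")
    root = read_mets(tmp)
```

Raising outside the try block keeps the "no Mets file" case from being caught and retried, since FileNotFoundError is itself a subclass of OSError.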

class text_preparation.importers.mets_alto.classes.MetsAltoNewspaperPage(_id: str, number: int, filename: str, basedir: str, encoding: str = 'utf-8')

Newspaper page in generic Alto format.

Note

New Mets/Alto importers should sub-class this class and implement its abstract methods (i.e. add_issue()).

Parameters:

_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:: str

number

Page number.

Type:: int

page_data

Page data according to canonical format.

Type:: dict[str, Any]

issue

Issue this page is from.

Type:: NewspaperIssue

filename

Name of the Alto XML page file.

Type:: str

basedir

Base directory where Alto files are located.

Type:: str

encoding

Encoding of XML file.

Type:: str, optional

abstract add_issue(issue: NewspaperIssue) → None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:: issue (NewspaperIssue) – Newspaper issue containing this page.

parse() → None: Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
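A minimal sketch of this lazy pattern, assuming a hypothetical LazyPage class and a made-up page_data layout (neither is the real MetsAltoNewspaperPage), with the standard-library ElementTree parser used in place of BeautifulSoup:

```python
import xml.etree.ElementTree as ET
from typing import Optional


class LazyPage:
    """Hypothetical page object that defers all XML processing to parse()."""

    def __init__(self, raw_xml: str) -> None:
        self.raw_xml = raw_xml
        # Nothing is parsed when the object is created.
        self.page_data: Optional[dict] = None

    def parse(self) -> None:
        root = ET.fromstring(self.raw_xml)
        # Made-up summary; the real canonical page format is richer.
        self.page_data = {"n_lines": len(root.findall(".//TextLine"))}


page = LazyPage("<alto><TextLine/><TextLine/></alto>")
before = page.page_data  # still None: creation did not parse anything
page.parse()
after = page.page_data
```

Deferring the work this way keeps the creation of many page objects cheap, and the cost of reading the XML is paid only for the pages that are actually processed.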

property xml: BeautifulSoup

Read Alto XML file of the page and create a BeautifulSoup object.

Returns:: BeautifulSoup object with Alto XML of the page.
Return type:: BeautifulSoup

Mets parsing

Utility functions to parse Mets XML files.

text_preparation.importers.mets_alto.mets.get_dmd_sec(mets_doc: BeautifulSoup, _filter: str) → Tag

Extract the contents of a specific <dmdsec> from the Mets document.

The <dmdsec> section contains descriptive metadata. It is composed of several subsections, each identified by a string ID.

Parameters:

mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.
_filter (str) – ID of the subsection of interest to filter the search.

Returns:

Contents of the desired <dmdsec> of the Mets XML document.

Return type:

Tag
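The lookup can be illustrated with a short sketch. It is not the implementation of get_dmd_sec: find_dmd_sec is a hypothetical name, exact matching on the ID attribute is an assumption, the sample document omits the XML namespaces found in real Mets files, and ElementTree replaces BeautifulSoup to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Made-up Mets fragment with two descriptive metadata sections.
METS = """<mets>
  <dmdSec ID="MODSMD_PRINT"><title>Print</title></dmdSec>
  <dmdSec ID="MODSMD_ISSUE1"><title>Issue</title></dmdSec>
</mets>"""


def find_dmd_sec(root: ET.Element, section_id: str) -> Optional[ET.Element]:
    """Return the <dmdSec> whose ID equals section_id, or None if absent."""
    for section in root.iter("dmdSec"):
        if section.get("ID") == section_id:
            return section
    return None


root = ET.fromstring(METS)
section = find_dmd_sec(root, "MODSMD_ISSUE1")
```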

text_preparation.importers.mets_alto.mets.parse_mets_amdsec(mets_doc: BeautifulSoup, x_res: str, y_res: str, x_res_default: int = 300, y_res_default: int = 300) → dict

Parse the <amdsec> section of Mets XML to extract image properties.

The <amdsec> section contains administrative metadata about the OCR, in particular the image resolution needed to convert coordinates to IIIF format.

Parameters:

mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.
x_res (str) – Name of field representing the X resolution.
y_res (str) – Name of field representing the Y resolution.
x_res_default (int, optional) – Default X resolution. Defaults to 300.
y_res_default (int, optional) – Default Y resolution. Defaults to 300.

Returns:

Parsed image properties, with default values if the field was not found in the document.

Return type:

dict
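The fallback to default values can be sketched as follows. This is an assumption-laden illustration rather than the function's code: image_properties, the keys of the returned dict and the sample field names are hypothetical, and ElementTree replaces BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Made-up fragment in which only the X resolution is present.
METS = "<mets><amdSec><XRes>400</XRes></amdSec></mets>"


def image_properties(
    root: ET.Element,
    x_res: str,
    y_res: str,
    x_res_default: int = 300,
    y_res_default: int = 300,
) -> dict:
    """Read both resolutions, falling back to the defaults when missing."""

    def read(field: str, default: int) -> int:
        node = root.find(f".//{field}")
        if node is None or not (node.text or "").strip():
            return default
        return int(node.text)

    # Hypothetical key names; the real output layout may differ.
    return {
        "x_resolution": read(x_res, x_res_default),
        "y_resolution": read(y_res, y_res_default),
    }


props = image_properties(ET.fromstring(METS), "XRes", "YRes")
```

In this sample the X resolution comes from the document, while the absent Y field falls back to 300.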

text_preparation.importers.mets_alto.mets.parse_mets_filegroup(mets_doc: BeautifulSoup) → dict[int, str]

Parse the <fileGrp> section to extract the pages’ OCR image ids.

The <fileGrp> section contains the names and ids of the images and text files linked to the Mets XML file. Each page of the issue is associated with a scan image file and its ids.

Parameters:: mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.
Returns:: Mapping from page number to page image id.
Return type:: dict[int, str]
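A possible shape for this mapping is sketched below. The group id IMGGRP, the SEQ attribute used as page number and the helper name page_image_ids are assumptions made for illustration; actual Mets files and the real function may use different conventions. ElementTree is used in place of BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Made-up file section with one image group and one Alto group.
METS = """<mets><fileSec>
  <fileGrp ID="IMGGRP">
    <file ID="IMG1" SEQ="1"/>
    <file ID="IMG2" SEQ="2"/>
  </fileGrp>
  <fileGrp ID="ALTOGRP">
    <file ID="ALTO1" SEQ="1"/>
  </fileGrp>
</fileSec></mets>"""


def page_image_ids(root: ET.Element, group_id: str = "IMGGRP") -> dict:
    """Map each page number to the id of its scan image."""
    mapping = {}
    for group in root.iter("fileGrp"):
        if group.get("ID") != group_id:
            continue  # ignore groups that do not hold the page images
        for file_node in group.iter("file"):
            mapping[int(file_node.get("SEQ"))] = file_node.get("ID")
    return mapping


image_ids = page_image_ids(ET.fromstring(METS))
```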

Alto parsing

Utility functions to parse Alto XML files.

text_preparation.importers.mets_alto.alto.distill_coordinates(element: Tag) → list[int]

Extract image coordinates from any XML tag.

Note

This function assumes the following attributes to be present in the input XML element: HPOS, VPOS, WIDTH, HEIGHT.

Parameters:

element (Tag) – Input XML tag containing coordinates to distill.

Returns:

An ordered list of coordinates (x, y, width, height).

Return type:

list[int]
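As an illustration, the extraction can be sketched like this. The helper name coordinates is hypothetical, ElementTree replaces BeautifulSoup, and truncating decimal values through int(float(...)) is an assumption about how non-integer attributes are handled.

```python
import xml.etree.ElementTree as ET


def coordinates(element: ET.Element) -> list:
    """Return [x, y, width, height] read from the Alto position attributes."""
    keys = ("HPOS", "VPOS", "WIDTH", "HEIGHT")
    # Values may be written as decimals, so go through float first;
    # the fractional part is dropped (assumed behavior).
    return [int(float(element.attrib[key])) for key in keys]


token = ET.fromstring('<String HPOS="10" VPOS="20.0" WIDTH="30" HEIGHT="40.6"/>')
coords = coordinates(token)
```

A missing attribute raises a KeyError in this sketch, which matches the note above: the four attributes are expected to be present.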

text_preparation.importers.mets_alto.alto.parse_printspace(element: Tag, mappings: dict[str, str]) → tuple[list[dict], list[str]]

Parse the <PrintSpace> element of an ALTO XML document.

This element contains all the OCR information about the content items of a page, up to the lowest level of the hierarchy: the regions, paragraphs, lines and tokens, each with their corresponding coordinates.

Parameters:

element (Tag) – Input XML element (<PrintSpace>).
mappings (dict[str, str]) – Mapping from OCR component ids to their corresponding canonical Content Item ID.

Returns:

List of page regions in the canonical format and notes about potential parsing problems.

Return type:

tuple[list[dict], list[str]]

text_preparation.importers.mets_alto.alto.parse_style(style_div: Tag) → dict[str, float | str]

Parse the font-style information in the ALTO files (for BNL and BNF).

Parameters:: style_div (Tag) – Element of XML file containing font-style information.
Returns:: Parsed style for Issue canonical format.
Return type:: dict[str, float | str]

text_preparation.importers.mets_alto.alto.parse_textline(element: Tag) → tuple[dict, list[str]]

Parse the <TextLine> element of an ALTO XML document.

Parameters:

element (Tag) – Input XML element (<TextLine>).

Returns:

Parsed lines or text in the canonical format and notes about potential missing token coordinates.

Return type:

tuple[dict, list[str]]
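The pairing of a parsed line with notes about missing coordinates can be sketched as follows. The function name textline_to_dict, the output keys (tokens, text, coords) and the wording of the notes are hypothetical and do not reproduce the canonical format; ElementTree is used in place of BeautifulSoup.

```python
import xml.etree.ElementTree as ET

COORD_KEYS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")

# Made-up line whose second token lacks coordinates.
LINE = """<TextLine>
  <String CONTENT="Hello" HPOS="1" VPOS="2" WIDTH="3" HEIGHT="4"/>
  <SP/>
  <String CONTENT="world"/>
</TextLine>"""


def textline_to_dict(line: ET.Element) -> tuple:
    """Collect the tokens of a line and note those without coordinates."""
    notes = []
    tokens = []
    for child in line:
        if child.tag != "String":
            continue  # spaces and hyphens are skipped in this simplified sketch
        token = {"text": child.get("CONTENT", "")}
        try:
            token["coords"] = [int(float(child.attrib[k])) for k in COORD_KEYS]
        except (KeyError, ValueError):
            notes.append(f"missing coordinates for token {token['text']!r}")
        tokens.append(token)
    return {"tokens": tokens}, notes


parsed_line, notes = textline_to_dict(ET.fromstring(LINE))
```

Returning the notes next to the parsed data lets the caller keep a token whose coordinates are missing while still recording the problem.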