Generic Mets/Alto importer

A backbone for any Mets/Alto importer.

Abstract classes

This module contains the definition of generic Mets/Alto importer classes.

The classes define newspaper Issue and Page objects which convert OCR data in Mets/Alto format to a unified canonical format. The classes in this module are meant to be subclassed so that each version of the Mets/Alto format, with its own specificities, can be parsed independently.

class text_preparation.importers.mets_alto.classes.MetsAltoNewspaperIssue(issue_dir: IssueDir)

Newspaper issue in generic Mets/Alto format.

Note

New Mets/Alto importers should sub-class this class and implement its abstract methods (i.e. _find_pages(), _parse_mets()).

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:

str

edition

Lower case letter ordering issues of the same day.

Type:

str

journal

Newspaper unique identifier or name.

Type:

str

path

Path to directory containing the issue’s OCR data.

Type:

str

date

Publication date of issue.

Type:

datetime.date

issue_data

Issue data according to canonical format.

Type:

dict[str, Any]

pages

List of NewspaperPage instances from this issue.

Type:

list

rights

Access rights applicable to this issue.

Type:

str

image_properties

Metadata used to convert OCR/OLR region coordinates to IIIF-compliant ones.

Type:

dict[str, Any]

ark_id

Issue ARK identifier, used to build the IIIF links of the issue’s pages.

Type:

int

property xml: BeautifulSoup

Read Mets XML file of the issue and create a BeautifulSoup object.

During processing, IO errors can occur intermittently when listing the directory contents or opening files, which prevents the issue from being parsed correctly. Reading is therefore retried, and the error is raised only after the third failed attempt. If the directory does not contain any Mets file, only one attempt is made.

Note

By default, the issue Mets file is the only file located in the directory self.path whose file name contains mets.xml. Individual importers can override this behavior if necessary.

Returns:

BeautifulSoup object with Mets XML of the issue.

Return type:

BeautifulSoup
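
The retry behavior described for the xml property can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name read_with_retries, its parameters, and the optional pause between attempts are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_retries(read: Callable[[], T], tries: int = 3, delay: float = 0.0) -> T:
    """Call `read` until it succeeds; re-raise the OSError of the last attempt."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return read()
        except OSError:
            if attempt == tries:
                raise  # give up: surface the error from the final attempt
            time.sleep(delay)  # hypothetical pause before the next attempt
    raise AssertionError("unreachable")


# Simulated reader that fails twice with an IO error, then succeeds.
calls = {"count": 0}


def flaky_read() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient IO error")
    return "<mets/>"


result = read_with_retries(flaky_read)
```

A reader that keeps failing raises its OSError on the third attempt, which mirrors the "raised after the third try" behavior described above.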

class text_preparation.importers.mets_alto.classes.MetsAltoNewspaperPage(_id: str, number: int, filename: str, basedir: str, encoding: str = 'utf-8')

Newspaper page in generic Alto format.

Note

New Mets/Alto importers should sub-class this class and implement its abstract methods (i.e. add_issue()).

Parameters:
  • _id (str) – Canonical page ID.

  • number (int) – Page number.

  • filename (str) – Name of the Alto XML page file.

  • basedir (str) – Base directory where Alto files are located.

  • encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:

str

number

Page number.

Type:

int

page_data

Page data according to canonical format.

Type:

dict[str, Any]

issue

Issue this page is from.

Type:

NewspaperIssue

filename

Name of the Alto XML page file.

Type:

str

basedir

Base directory where Alto files are located.

Type:

str

encoding

Encoding of XML file.

Type:

str, optional

abstract add_issue(issue: NewspaperIssue) → None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

parse() → None

Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
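
The lazy behavior described in this note can be illustrated with a small stand-in class. LazyPage and its whitespace tokenization are hypothetical and only show the pattern: page_data stays empty until parse() is called explicitly.

```python
from typing import Any, Dict, Optional


class LazyPage:
    """Hypothetical stand-in for a page whose contents are parsed on demand."""

    def __init__(self, page_id: str, raw_text: str) -> None:
        self.id = page_id
        self._raw_text = raw_text
        # Nothing is processed at construction time.
        self.page_data: Optional[Dict[str, Any]] = None

    def parse(self) -> None:
        # The potentially expensive processing happens only here.
        self.page_data = {"id": self.id, "tokens": self._raw_text.split()}


page = LazyPage("GDL-1900-01-02-a-p0004", "Journal de Genève")
before = page.page_data  # still None: the page has not been parsed
page.parse()
after = page.page_data   # now populated
```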

property xml: BeautifulSoup

Read Alto XML file of the page and create a BeautifulSoup object.

Returns:

BeautifulSoup object with Alto XML of the page.

Return type:

BeautifulSoup
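
The canonical ID examples given for issues (GDL-1900-01-02-a) and pages (GDL-1900-01-02-a-p0004) suggest the layout sketched below. The helper names are hypothetical, and the four-digit zero padding of the page number is inferred from the example rather than confirmed by this documentation.

```python
from datetime import date


def issue_id(journal: str, day: date, edition: str) -> str:
    """Join journal, ISO date and edition letter, e.g. GDL-1900-01-02-a."""
    return f"{journal}-{day.isoformat()}-{edition}"


def page_id(issue: str, number: int) -> str:
    """Append a zero-padded page number (padding width is an assumption)."""
    return f"{issue}-p{number:04d}"


issue = issue_id("GDL", date(1900, 1, 2), "a")
page = page_id(issue, 4)
```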

Mets parsing

Utility functions to parse Mets XML files.

text_preparation.importers.mets_alto.mets.get_dmd_sec(mets_doc: BeautifulSoup, _filter: str) → Tag

Extract the contents of a specific <dmdsec> from the Mets document.

The <dmdsec> section contains descriptive metadata. It’s composed of several different subsections each identified with string IDs.

Parameters:
  • mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.

  • _filter (str) – ID of the subsection of interest to filter the search.

Returns:

Contents of the desired <dmdsec> of the Mets XML document.

Return type:

Tag
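
As a rough illustration of selecting a descriptive metadata section by its ID, the sketch below uses the standard library's ElementTree instead of BeautifulSoup so that it runs without extra dependencies. The function name find_dmd_sec, the sample IDs, and the exact-match rule are assumptions; the real function may filter differently.

```python
import xml.etree.ElementTree as ET
from typing import Optional

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Tiny made-up Mets document with two descriptive metadata sections.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:dmdSec ID="MODSMD_PRINT"><mets:mdWrap MDTYPE="MODS"/></mets:dmdSec>
  <mets:dmdSec ID="MODSMD_ARTICLE1"><mets:mdWrap MDTYPE="MODS"/></mets:dmdSec>
</mets:mets>"""


def find_dmd_sec(root: ET.Element, wanted_id: str) -> Optional[ET.Element]:
    """Return the first dmdSec whose ID equals `wanted_id`, or None."""
    for section in root.findall(".//mets:dmdSec", METS_NS):
        if section.get("ID") == wanted_id:
            return section
    return None


root = ET.fromstring(SAMPLE)
article_section = find_dmd_sec(root, "MODSMD_ARTICLE1")
missing_section = find_dmd_sec(root, "MODSMD_UNKNOWN")
```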

text_preparation.importers.mets_alto.mets.parse_mets_amdsec(mets_doc: BeautifulSoup, x_res: str, y_res: str, x_res_default: int = 300, y_res_default: int = 300) → dict

Parse the <amdsec> section of Mets XML to extract image properties.

The <amdsec> section contains administrative metadata about the OCR, in particular the image resolution, which is needed to convert coordinates to IIIF format.

Parameters:
  • mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.

  • x_res (str) – Name of field representing the X resolution.

  • y_res (str) – Name of field representing the Y resolution.

  • x_res_default (int, optional) – Default X resolution, used if the field is not found. Defaults to 300.

  • y_res_default (int, optional) – Default Y resolution, used if the field is not found. Defaults to 300.

Returns:

Parsed image properties with default values if the field was not found in the document.

Return type:

dict
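
The fallback to default resolutions can be sketched as below, again with ElementTree rather than BeautifulSoup. The function name read_resolution, the output keys, the sample field names, and the flat one-dict result are assumptions for illustration; the structure actually returned by parse_mets_amdsec is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Made-up sample: only the X resolution field is present.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/">
  <amdSec ID="IMGPARAM00001">
    <techMD ID="TECH1"><mdWrap><xmlData>
      <XphysScanResolution>400</XphysScanResolution>
    </xmlData></mdWrap></techMD>
  </amdSec>
</mets>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_resolution(
    root: ET.Element,
    x_res: str,
    y_res: str,
    x_res_default: int = 300,
    y_res_default: int = 300,
) -> Dict[str, int]:
    found: Dict[str, int] = {}
    for node in root.iter():
        name = local_name(node.tag)
        # Keep the first occurrence of each requested field.
        if name in (x_res, y_res) and node.text and name not in found:
            found[name] = int(node.text.strip())
    return {
        "x_resolution": found.get(x_res, x_res_default),  # hypothetical key
        "y_resolution": found.get(y_res, y_res_default),  # hypothetical key
    }


props = read_resolution(
    ET.fromstring(SAMPLE), "XphysScanResolution", "YphysScanResolution"
)
```

Here the Y resolution field is absent from the sample, so its default of 300 is used.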

text_preparation.importers.mets_alto.mets.parse_mets_filegroup(mets_doc: BeautifulSoup) → dict[int, str]

Parse the <fileGrp> section to extract the OCR image IDs of the pages.

The <fileGrp> section contains the names and IDs of the images and text files linked to the Mets XML file. Each page of the issue is associated with a scan image file and its ID.

Parameters:

mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.

Returns:

Mapping from page number to page image id.

Return type:

dict[int, str]
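
A possible shape for this mapping is sketched below with ElementTree. It assumes that the image group is marked with USE="Images" and that each file carries SEQ (page number) and ID attributes; which attributes the real function reads is not stated in this documentation, and page_image_ids is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Dict

NS = {"m": "http://www.loc.gov/METS/"}

# Made-up sample with one image group and one text group.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/">
  <fileSec>
    <fileGrp USE="Images">
      <file ID="IMG00001" SEQ="1"/>
      <file ID="IMG00002" SEQ="2"/>
    </fileGrp>
    <fileGrp USE="Text">
      <file ID="ALTO00001" SEQ="1"/>
    </fileGrp>
  </fileSec>
</mets>"""


def page_image_ids(root: ET.Element) -> Dict[int, str]:
    """Map page number (SEQ) to image ID for files in the image group."""
    mapping: Dict[int, str] = {}
    for group in root.findall(".//m:fileGrp", NS):
        if (group.get("USE") or "").lower() != "images":
            continue  # skip text and other non-image groups
        for file_node in group.findall("m:file", NS):
            mapping[int(file_node.get("SEQ"))] = file_node.get("ID")
    return mapping


image_ids = page_image_ids(ET.fromstring(SAMPLE))
```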

Alto parsing

Utility functions to parse Alto XML files.

text_preparation.importers.mets_alto.alto.distill_coordinates(element: Tag) → list[int]

Extract image coordinates from any XML tag.

Note

This function assumes the following attributes to be present in the input XML element: HPOS, VPOS, WIDTH, HEIGHT.

Parameters:

element (Tag) – Input XML tag containing coordinates to distill.

Returns:

An ordered list of coordinates (x, y, width, height).

Return type:

list[int]
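
The extraction of the four coordinate attributes can be sketched as follows, using an ElementTree element in place of a BeautifulSoup Tag. The function name coordinates_of and the truncation of decimal values to integers are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import List

COORD_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def coordinates_of(element: ET.Element) -> List[int]:
    """Return [x, y, width, height]; decimal values are truncated to int."""
    return [int(float(element.get(name))) for name in COORD_ATTRS]


token = ET.fromstring(
    '<String HPOS="120" VPOS="45.0" WIDTH="300" HEIGHT="28" CONTENT="Genève"/>'
)
coords = coordinates_of(token)
```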

text_preparation.importers.mets_alto.alto.parse_printspace(element: Tag, mappings: dict[str, str]) → tuple[list[dict], list[str]]

Parse the <PrintSpace> element of an ALTO XML document.

This element contains all the OCR information about the content items of a page, up to the lowest level of the hierarchy: the regions, paragraphs, lines and tokens, each with their corresponding coordinates.

Parameters:
  • element (Tag) – Input XML element (<PrintSpace>).

  • mappings (dict[str, str]) – Mapping from OCR component ids to their corresponding canonical Content Item ID.

Returns:

List of page regions in the canonical format and notes about potential parsing problems.

Return type:

tuple[list[dict], list[str]]
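
The traversal from blocks down to tokens, together with the role of mappings, might look roughly like the sketch below. The output keys, the wording of the notes, and the function name walk_printspace are hypothetical, and the sample omits the XML namespace and coordinates that real ALTO files carry.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Made-up sample: two blocks, only the first has a known content item.
SAMPLE = """<PrintSpace>
  <TextBlock ID="P1_TB00001">
    <TextLine><String CONTENT="Hello"/><String CONTENT="world"/></TextLine>
  </TextBlock>
  <TextBlock ID="P1_TB00002">
    <TextLine><String CONTENT="Unmapped"/></TextLine>
  </TextBlock>
</PrintSpace>"""


def walk_printspace(
    element: ET.Element, mappings: Dict[str, str]
) -> Tuple[List[dict], List[str]]:
    regions: List[dict] = []
    notes: List[str] = []
    for block in element.iter("TextBlock"):
        block_id = block.get("ID")
        # One list of token strings per text line of the block.
        lines = [
            [token.get("CONTENT", "") for token in line.iter("String")]
            for line in block.iter("TextLine")
        ]
        region = {"lines": lines}
        if block_id in mappings:
            region["content_item"] = mappings[block_id]  # hypothetical key
        else:
            notes.append(f"No content item found for block {block_id}")
        regions.append(region)
    return regions, notes


regions, notes = walk_printspace(
    ET.fromstring(SAMPLE), {"P1_TB00001": "GDL-1900-01-02-a-i0001"}
)
```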

text_preparation.importers.mets_alto.alto.parse_style(style_div: Tag) → dict[str, float | str]

Parse the font-style information in the ALTO files (for BNL and BNF).

Parameters:

style_div (Tag) – Element of XML file containing font-style information.

Returns:

Parsed style for Issue canonical format.

Return type:

dict[str, float | str]
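
For illustration, an ALTO <TextStyle> element could be turned into a flat dictionary as below. The attribute names (ID, FONTFAMILY, FONTSIZE, FONTSTYLE) come from the ALTO schema, while the output keys and the function name style_to_dict are hypothetical and need not match the canonical format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Union


def style_to_dict(style: ET.Element) -> Dict[str, Union[float, str]]:
    """Collect font information from a TextStyle element (keys are illustrative)."""
    return {
        "id": style.get("ID", ""),
        "font_family": style.get("FONTFAMILY", ""),
        "font_size": float(style.get("FONTSIZE", "0")),
        "font_style": style.get("FONTSTYLE", ""),  # optional in ALTO
    }


style = style_to_dict(
    ET.fromstring(
        '<TextStyle ID="TXT_1" FONTFAMILY="Times New Roman" '
        'FONTSIZE="9.5" FONTSTYLE="bold"/>'
    )
)
```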

text_preparation.importers.mets_alto.alto.parse_textline(element: Tag) → tuple[dict, list[str]]

Parse the <TextLine> element of an ALTO XML document.

Parameters:

element (Tag) – Input XML element (<TextLine>).

Returns:

Parsed line of text in the canonical format and notes about potential missing token coordinates.

Return type:

tuple[dict, list[str]]
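
A minimal sketch of how a text line and its tokens could be collected, with a note recorded for any token that lacks coordinates. The output keys, the note wording, and the function name line_to_dict are assumptions; ElementTree stands in for BeautifulSoup and the sample has no XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

COORD_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")

# Made-up sample: the second token has no coordinate attributes.
SAMPLE = (
    '<TextLine HPOS="10" VPOS="20" WIDTH="200" HEIGHT="30">'
    '<String HPOS="10" VPOS="20" WIDTH="90" HEIGHT="30" CONTENT="Bonjour"/>'
    '<String CONTENT="monde"/>'
    "</TextLine>"
)


def read_coords(element: ET.Element) -> List[int]:
    # Raises KeyError if one of the four attributes is missing.
    return [int(float(element.attrib[name])) for name in COORD_ATTRS]


def line_to_dict(element: ET.Element) -> Tuple[dict, List[str]]:
    tokens: List[dict] = []
    notes: List[str] = []
    for string in element.iter("String"):
        token = {"text": string.get("CONTENT", "")}
        try:
            token["coords"] = read_coords(string)
        except KeyError as missing:
            # Keep the token but record that its coordinates are absent.
            notes.append(f"Token {token['text']!r} lacks attribute {missing.args[0]}")
        tokens.append(token)
    return {"coords": read_coords(element), "tokens": tokens}, notes


line, notes = line_to_dict(ET.fromstring(SAMPLE))
```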