Generic Mets/Alto importer
A backbone for any Mets/Alto importer.
Abstract classes
This module contains the definition of generic Mets/Alto importer classes.
The classes define newspaper Issue and Page objects which convert OCR data in Mets/Alto format to a unified canonical format. The classes in this module are meant to be subclassed so that the parsing for each version of the Mets/Alto format, with its specificities, can be handled independently.
- class text_preparation.importers.mets_alto.classes.MetsAltoNewspaperIssue(issue_dir: IssueDir)
Newspaper issue in generic Mets/Alto format.
Note
New Mets/Alto importers should subclass this class and implement its abstract methods (i.e. _find_pages(), _parse_mets()).
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lowercase letter used to order issues published on the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- image_properties
Metadata needed to convert region OCR/OLR coordinates to IIIF-compliant ones.
- Type:
dict[str, Any]
- ark_id
Issue ARK identifier, used to build the IIIF links of the issue’s pages.
- Type:
int
- property xml: BeautifulSoup
Read Mets XML file of the issue and create a BeautifulSoup object.
During processing, IO errors can occur intermittently when listing the contents of the directory or opening files, which prevents the issue from being parsed correctly. The read is therefore retried, and the error is raised after the third failed try. If the directory does not contain any Mets file, only one try is made.
Note
By default, the issue Mets file is the only file located in the directory self.path whose file name contains mets.xml. Individual importers can override this behavior if necessary.
- Returns:
BeautifulSoup object with Mets XML of the issue.
- Return type:
BeautifulSoup
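The lookup and retry behavior described for this property can be pictured with a minimal sketch. It is not the library's implementation: find_mets_file and read_mets_with_retries are hypothetical helper names, and the file is returned as raw text instead of a BeautifulSoup object so that the example only needs the standard library.

```python
from __future__ import annotations

import os


def find_mets_file(issue_path: str) -> str | None:
    """Return the path of the first file whose name contains 'mets.xml', if any."""
    matches = [name for name in os.listdir(issue_path) if "mets.xml" in name]
    return os.path.join(issue_path, matches[0]) if matches else None


def read_mets_with_retries(issue_path: str, max_tries: int = 3) -> str:
    """Read the Mets file, retrying when an intermittent IO error occurs."""
    last_error: OSError | None = None
    for _ in range(max_tries):
        try:
            mets_path = find_mets_file(issue_path)
            if mets_path is None:
                # No Mets file at all: retrying cannot help, so fail immediately.
                raise FileNotFoundError(f"No Mets file found in {issue_path}")
            with open(mets_path, encoding="utf-8") as mets_file:
                return mets_file.read()
        except FileNotFoundError:
            raise
        except OSError as error:
            # Assumed to be a transient failure; remember it and try again.
            last_error = error
    # All tries failed (assumes max_tries >= 1): surface the last IO error.
    raise last_error
```

The missing-file case is re-raised before the generic OSError handler so that it is reported after a single try, matching the description above.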
- class text_preparation.importers.mets_alto.classes.MetsAltoNewspaperPage(_id: str, number: int, filename: str, basedir: str, encoding: str = 'utf-8')
Newspaper page in generic Alto format.
Note
New Mets/Alto importers should subclass this class and implement its abstract methods (i.e. add_issue()).
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
encoding (str, optional) – Encoding of XML file. Defaults to ‘utf-8’.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
- filename
Name of the Alto XML page file.
- Type:
str
- basedir
Base directory where Alto files are located.
- Type:
str
- encoding
Encoding of XML file.
- Type:
str, optional
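The two example IDs given in this reference (GDL-1900-01-02-a for an issue, GDL-1900-01-02-a-p0004 for a page) suggest how canonical IDs are composed. The sketch below reproduces that pattern; it is inferred from those two examples only, and the helper names canonical_issue_id and canonical_page_id are hypothetical.

```python
from datetime import date


def canonical_issue_id(journal: str, day: date, edition: str) -> str:
    """Join journal, ISO publication date and edition letter (assumed pattern)."""
    return f"{journal}-{day.isoformat()}-{edition}"


def canonical_page_id(issue_id: str, number: int) -> str:
    """Append the page number, zero-padded to four digits (assumed pattern)."""
    return f"{issue_id}-p{number:04d}"


issue_id = canonical_issue_id("GDL", date(1900, 1, 2), "a")
page_id = canonical_page_id(issue_id, 4)
print(issue_id)  # GDL-1900-01-02-a
print(page_id)   # GDL-1900-01-02-a-p0004
```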
- abstract add_issue(issue: NewspaperIssue) → None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- parse() → None
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
- property xml: BeautifulSoup
Read Alto XML file of the page and create a BeautifulSoup object.
- Returns:
BeautifulSoup object with Alto XML of the page.
- Return type:
BeautifulSoup
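The combination of an abstract add_issue() and a lazy parse() can be illustrated with a small stand-alone sketch. SketchPage and SketchProviderPage are hypothetical stand-ins written for this example, not the library's classes, and the "regions" key is an assumed placeholder for the parsed content.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class SketchPage(ABC):
    """Hypothetical stand-in for a generic page class."""

    def __init__(self, _id: str, number: int) -> None:
        self.id = _id
        self.number = number
        self.issue: Any = None
        # Only identifying fields are set here; no page content is processed yet.
        self.page_data: dict[str, Any] = {"id": _id}

    @abstractmethod
    def add_issue(self, issue: Any) -> None:
        """Attach the parent issue; each importer decides what to keep from it."""

    def parse(self) -> None:
        # Placeholder for the Alto processing, which runs only on demand.
        self.page_data["regions"] = []


class SketchProviderPage(SketchPage):
    """Concrete subclass: implementing add_issue() makes it instantiable."""

    def add_issue(self, issue: Any) -> None:
        self.issue = issue


page = SketchProviderPage("GDL-1900-01-02-a-p0004", 4)
print("regions" in page.page_data)  # False
page.parse()
print("regions" in page.page_data)  # True
```

Instantiating SketchPage directly raises a TypeError, which is how the abstract method forces each importer to provide its own add_issue().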
Mets parsing
Utility functions to parse Mets XML files.
- text_preparation.importers.mets_alto.mets.get_dmd_sec(mets_doc: BeautifulSoup, _filter: str) → Tag
Extract the contents of a specific <dmdsec> from the Mets document.
The <dmdsec> section contains descriptive metadata. It is composed of several subsections, each identified by a string ID.
- Parameters:
mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.
_filter (str) – ID of the subsection of interest to filter the search.
- Returns:
Contents of the desired <dmdsec> of the Mets XML document.
- Return type:
Tag
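As an illustration of selecting one descriptive metadata section by its ID, here is a minimal sketch. It uses xml.etree.ElementTree from the standard library instead of BeautifulSoup so that it runs without extra dependencies; get_dmd_sec_sketch, the sample document and its section IDs are assumptions made for the example.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical, heavily simplified Mets document with two descriptive sections.
SAMPLE_METS = f"""<mets xmlns="{METS_NS}">
  <dmdSec ID="MODSMD_PRINT"><mdWrap><xmlData><title>Printed issue</title></xmlData></mdWrap></dmdSec>
  <dmdSec ID="MODSMD_ARTICLE1"><mdWrap><xmlData><title>First article</title></xmlData></mdWrap></dmdSec>
</mets>"""


def get_dmd_sec_sketch(root: ET.Element, section_id: str) -> ET.Element | None:
    """Return the dmdSec element whose ID attribute equals section_id, if any."""
    for section in root.iter(f"{{{METS_NS}}}dmdSec"):
        if section.get("ID") == section_id:
            return section
    return None


root = ET.fromstring(SAMPLE_METS)
section = get_dmd_sec_sketch(root, "MODSMD_ARTICLE1")
print(section.find(f".//{{{METS_NS}}}title").text)  # First article
```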
- text_preparation.importers.mets_alto.mets.parse_mets_amdsec(mets_doc: BeautifulSoup, x_res: str, y_res: str, x_res_default: int = 300, y_res_default: int = 300) → dict
Parse the <amdsec> section of Mets XML to extract image properties.
The <amdsec> section contains administrative metadata about the OCR, in particular the image resolution, which allows coordinates to be converted to the IIIF format.
- Parameters:
mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.
x_res (str) – Name of field representing the X resolution.
y_res (str) – Name of field representing the Y resolution.
x_res_default (int, optional) – Default X resolution. Defaults to 300.
y_res_default (int, optional) – Default Y resolution. Defaults to 300.
- Returns:
Parsed image properties, with default values used for any field not found in the document.
- Return type:
dict
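The fallback to default resolutions can be sketched as follows. This is an illustration under assumptions, not the library's code: it uses xml.etree.ElementTree rather than BeautifulSoup, the field names XphysScanResolution and YphysScanResolution are sample values chosen for the example, and the output keys x_resolution and y_resolution are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical sample: the X resolution is present, the Y resolution is missing.
SAMPLE_METS = f"""<mets xmlns="{METS_NS}">
  <amdSec ID="IMGPARAM1">
    <techMD ID="TECH1"><mdWrap><xmlData>
      <XphysScanResolution>400</XphysScanResolution>
    </xmlData></mdWrap></techMD>
  </amdSec>
</mets>"""


def find_int_field(section: ET.Element, field: str) -> int | None:
    """Return the integer value of the first descendant named field, if any."""
    for element in section.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop the namespace part
        text = (element.text or "").strip()
        if local_name == field and text.isdigit():
            return int(text)
    return None


def parse_amdsec_sketch(
    root: ET.Element,
    x_res: str,
    y_res: str,
    x_res_default: int = 300,
    y_res_default: int = 300,
) -> dict[str, int]:
    """Read both resolutions from the first amdSec, using defaults when absent."""
    section = root.find(f"{{{METS_NS}}}amdSec")
    x = find_int_field(section, x_res) if section is not None else None
    y = find_int_field(section, y_res) if section is not None else None
    return {
        "x_resolution": x if x is not None else x_res_default,
        "y_resolution": y if y is not None else y_res_default,
    }


root = ET.fromstring(SAMPLE_METS)
props = parse_amdsec_sketch(root, "XphysScanResolution", "YphysScanResolution")
print(props)  # {'x_resolution': 400, 'y_resolution': 300}
```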
- text_preparation.importers.mets_alto.mets.parse_mets_filegroup(mets_doc: BeautifulSoup) → dict[int, str]
Parse the <fileGrp> section to extract the pages’ OCR image IDs.
The <fileGrp> section contains the names and IDs of the images and text files linked to the Mets XML file. Each page of the issue is associated with a scanned image file and its IDs.
- Parameters:
mets_doc (BeautifulSoup) – BeautifulSoup object of Mets XML document.
- Returns:
Mapping from page number to page image id.
- Return type:
dict[int, str]
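A mapping from page number to image ID could be built along the following lines. The sketch is based on assumptions: it uses xml.etree.ElementTree instead of BeautifulSoup, it assumes the image group is marked with USE="Images", and it assumes the page number is carried by the SEQ attribute of each file element. Real providers may encode both differently.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical sample with one group of images and one group of text files.
SAMPLE_METS = f"""<mets xmlns="{METS_NS}">
  <fileSec>
    <fileGrp USE="Images">
      <file ID="IMG00001" SEQ="1"/>
      <file ID="IMG00002" SEQ="2"/>
    </fileGrp>
    <fileGrp USE="Text">
      <file ID="ALTO00001" SEQ="1"/>
    </fileGrp>
  </fileSec>
</mets>"""


def parse_filegroup_sketch(root: ET.Element, use: str = "Images") -> dict[int, str]:
    """Map each page number (SEQ) to the ID of its image file."""
    image_ids: dict[int, str] = {}
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        if group.get("USE") != use:
            continue  # skip groups that do not hold page images
        for file_el in group.iter(f"{{{METS_NS}}}file"):
            image_ids[int(file_el.get("SEQ"))] = file_el.get("ID")
    return image_ids


root = ET.fromstring(SAMPLE_METS)
print(parse_filegroup_sketch(root))  # {1: 'IMG00001', 2: 'IMG00002'}
```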
Alto parsing
Utility functions to parse Alto XML files.
- text_preparation.importers.mets_alto.alto.distill_coordinates(element: Tag) → list[int]
Extract image coordinates from any XML tag.
Note
This function assumes the following attributes are present in the input XML element: HPOS, VPOS, WIDTH, HEIGHT.
- Parameters:
element (Tag) – Input XML tag containing coordinates to distill.
- Returns:
An ordered list of coordinates (x, y, width, height).
- Return type:
list[int]
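Reading the four positional attributes into an ordered list can be sketched as below. This is an illustration using xml.etree.ElementTree rather than BeautifulSoup; distill_coordinates_sketch is a hypothetical name, and the conversion through float (to accept values such as "80.0") is an assumption of the example.

```python
import xml.etree.ElementTree as ET

COORDINATE_ATTRIBUTES = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def distill_coordinates_sketch(element: ET.Element) -> list[int]:
    """Return [x, y, width, height]; raises KeyError if an attribute is missing."""
    return [int(float(element.attrib[name])) for name in COORDINATE_ATTRIBUTES]


line = ET.fromstring('<TextLine HPOS="120" VPOS="80.0" WIDTH="900" HEIGHT="42"/>')
print(distill_coordinates_sketch(line))  # [120, 80, 900, 42]
```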
- text_preparation.importers.mets_alto.alto.parse_printspace(element: Tag, mappings: dict[str, str]) → tuple[list[dict], list[str]]
Parse the <PrintSpace> element of an ALTO XML document.
This element contains all the OCR information about the content items of a page, down to the lowest level of the hierarchy: the regions, paragraphs, lines and tokens, each with their corresponding coordinates.
- Parameters:
element (Tag) – Input XML element (<PrintSpace>).
mappings (dict[str, str]) – Mapping from OCR component IDs to their corresponding canonical Content Item ID.
- Returns:
List of page regions in the canonical format and notes about potential parsing problems.
- Return type:
tuple[list[dict], list[str]]
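The role of the mappings argument and of the returned notes can be pictured with a simplified sketch. It is not the library's implementation: it uses xml.etree.ElementTree, the sample omits the XML namespace that real ALTO files declare, and the output keys (coords, part_of, lines, tokens, text) as well as the content item ID are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample with one mapped and one unmapped block.
SAMPLE_PRINTSPACE = """<PrintSpace>
  <TextBlock ID="TB1" HPOS="10" VPOS="20" WIDTH="300" HEIGHT="100">
    <TextLine HPOS="10" VPOS="20" WIDTH="300" HEIGHT="40">
      <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="120" HEIGHT="40"/>
    </TextLine>
  </TextBlock>
  <TextBlock ID="TB2" HPOS="10" VPOS="140" WIDTH="300" HEIGHT="100"/>
</PrintSpace>"""


def coords(element: ET.Element) -> list[int]:
    """Return [x, y, width, height] from the positional attributes."""
    return [int(float(element.attrib[a])) for a in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]


def parse_printspace_sketch(
    printspace: ET.Element, mappings: dict[str, str]
) -> tuple[list[dict], list[str]]:
    """Build one region per TextBlock and collect notes about unmapped blocks."""
    regions: list[dict] = []
    notes: list[str] = []
    for block in printspace.iter("TextBlock"):
        region: dict = {"coords": coords(block), "lines": []}
        block_id = block.get("ID")
        if block_id in mappings:
            region["part_of"] = mappings[block_id]
        else:
            notes.append(f"No content item found for block {block_id}")
        for line in block.iter("TextLine"):
            tokens = [
                {"text": s.get("CONTENT"), "coords": coords(s)}
                for s in line.iter("String")
            ]
            region["lines"].append({"coords": coords(line), "tokens": tokens})
        regions.append(region)
    return regions, notes


printspace = ET.fromstring(SAMPLE_PRINTSPACE)
regions, notes = parse_printspace_sketch(printspace, {"TB1": "GDL-1900-01-02-a-i0001"})
print(len(regions), notes)
```

Returning the notes alongside the regions lets the caller log parsing problems without interrupting the processing of the page.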
- text_preparation.importers.mets_alto.alto.parse_style(style_div: Tag) → dict[str, float | str]
Parse the font-style information in the ALTO files (for BNL and BNF).
- Parameters:
style_div (Tag) – Element of XML file containing font-style information.
- Returns:
Parsed style for Issue canonical format.
- Return type:
dict[str, float | str]
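As a rough illustration of turning font-style information into a dictionary of floats and strings, the sketch below reads an ALTO TextStyle element. It uses xml.etree.ElementTree rather than BeautifulSoup, and the output keys (id, font, size, style) are hypothetical; the keys of the actual canonical format may differ.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def parse_style_sketch(style: ET.Element) -> dict[str, float | str]:
    """Collect the style ID, font family and size, plus the font style if present."""
    parsed: dict[str, float | str] = {
        "id": style.attrib["ID"],
        "font": style.attrib["FONTFAMILY"],
        "size": float(style.attrib["FONTSIZE"]),
    }
    if "FONTSTYLE" in style.attrib:
        parsed["style"] = style.attrib["FONTSTYLE"]
    return parsed


style = ET.fromstring(
    '<TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="9.5" FONTSTYLE="bold"/>'
)
print(parse_style_sketch(style))
```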
- text_preparation.importers.mets_alto.alto.parse_textline(element: Tag) → tuple[dict, list[str]]
Parse the <TextLine> element of an ALTO XML document.
- Parameters:
element (Tag) – Input XML element (<TextLine>).
- Returns:
Parsed lines or text in the canonical format and notes about potential missing token coordinates.
- Return type:
tuple[dict, list[str]]
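How a note about missing token coordinates might be produced is sketched below. The example is an assumption-laden illustration: it uses xml.etree.ElementTree, a namespace-free sample line, and hypothetical output keys (coords, tokens, text).

```python
import xml.etree.ElementTree as ET

COORDINATE_ATTRIBUTES = ("HPOS", "VPOS", "WIDTH", "HEIGHT")

# Hypothetical sample line: the second token lacks positional attributes.
SAMPLE_LINE = """<TextLine HPOS="10" VPOS="20" WIDTH="300" HEIGHT="40">
  <String CONTENT="Journal" HPOS="10" VPOS="20" WIDTH="140" HEIGHT="40"/>
  <String CONTENT="de"/>
</TextLine>"""


def parse_textline_sketch(line: ET.Element) -> tuple[dict, list[str]]:
    """Return the parsed line and notes for tokens without coordinates."""
    tokens: list[dict] = []
    notes: list[str] = []
    for string in line.iter("String"):
        token: dict = {"text": string.get("CONTENT", "")}
        if all(name in string.attrib for name in COORDINATE_ATTRIBUTES):
            token["coords"] = [int(float(string.attrib[n])) for n in COORDINATE_ATTRIBUTES]
        else:
            notes.append(f"Token {token['text']!r} has no coordinates")
        tokens.append(token)
    line_coords = [int(float(line.attrib[n])) for n in COORDINATE_ATTRIBUTES]
    return {"coords": line_coords, "tokens": tokens}, notes


parsed_line, line_notes = parse_textline_sketch(ET.fromstring(SAMPLE_LINE))
print(parsed_line["tokens"])
print(line_notes)
```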