BNF-EN Mets/Alto importer
This importer extends the generic Mets/Alto importer. It was developed to handle OCR newspaper data provided by the BNF as part of the Europeana project.
BNF-EN Custom classes
This module contains the definition of BNF-EN importer classes.
The classes define newspaper Issue and Page objects which convert OCR data in the BNF-Europeana version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.
- class text_preparation.importers.bnf_en.classes.BnfEnNewspaperIssue(issue_dir: IssueDir)
Bases:
MetsAltoNewspaperIssue
Newspaper Issue in BNF-EN (Mets/Alto) format.
All functions defined in this child class are specific to parsing BNF-Europeana Mets/Alto format.
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- image_properties
Metadata used to convert region OCR/OLR coordinates to IIIF-compliant ones.
- Type:
dict[str, Any]
- ark_link
IIIF Ark ID for this issue, fetched from the Gallica API.
- Type:
str
- class text_preparation.importers.bnf_en.classes.BnfEnNewspaperPage(_id: str, number: int, filename: str, basedir: str)
Bases:
MetsAltoNewspaperPage
Newspaper page in BNF-EN (Mets/Alto) format.
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
- filename
Name of the Alto XML page file.
- Type:
str
- basedir
Base directory where Alto files are located.
- Type:
str
- encoding
Encoding of XML file. Defaults to ‘utf-8’.
- Type:
str, optional
- is_gzip
Whether the page’s corresponding file is gzip-compressed.
- Type:
bool
- ark_link
IIIF Ark identifier for this page.
- Type:
str
- add_issue(issue: MetsAltoNewspaperIssue) None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
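The page ID format shown for the id attribute (for example GDL-1900-01-02-a-p0004) and the parent link set by add_issue() can be sketched as follows. This is an illustration only: PageSketch and build_page_id are hypothetical names, and the four-digit zero padding is inferred from the example ID rather than taken from the importer's code.

```python
def build_page_id(issue_id: str, number: int) -> str:
    """Append a zero-padded page number to an issue ID (padding width assumed)."""
    return f"{issue_id}-p{number:04d}"


class PageSketch:
    """Hypothetical stand-in for a page object; not the importer's class."""

    def __init__(self, issue_id: str, number: int):
        self.number = number
        self.id = build_page_id(issue_id, number)
        self.issue = None  # parent issue, unset until add_issue() is called

    def add_issue(self, issue) -> None:
        # Keep a reference to the parent so issue-level context stays reachable.
        self.issue = issue


page = PageSketch("GDL-1900-01-02-a", 4)
page.add_issue({"id": "GDL-1900-01-02-a"})  # any object can stand in for the issue here
```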
BNF-EN Detect functions
This module contains helper functions to find BNF-EN OCR data to import.
- text_preparation.importers.bnf_en.detect.BnfEnIssueDir
A light-weight data structure to represent a newspaper issue in BNF Europeana.
This named tuple contains basic metadata about a newspaper issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
For newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
journal (str) – Newspaper ID.
date (datetime.date) – Publication date of the issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
rights (str) – Access rights on the data (open, closed, etc.).
ark_link (str) – Unique IIIF identifier associated with this issue.
>>> from datetime import date
>>> i = BnfEnIssueDir('BLB', date(1845,12,28), 'a', './Le-Gaulois/18820208_1', 'open')
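As a rough illustration of the structure described above, a named tuple with the six listed fields could be declared as follows. IssueDirSketch and edition_letter are hypothetical names used only for this sketch, and the ark_link value is a placeholder, not a real identifier.

```python
import string
from collections import namedtuple
from datetime import date

# Hypothetical stand-in with the six documented fields.
IssueDirSketch = namedtuple(
    "IssueDirSketch",
    ["journal", "date", "edition", "path", "rights", "ark_link"],
)


def edition_letter(index: int) -> str:
    """Map a 0-based edition index to its letter: 0 -> 'a', 1 -> 'b', ..."""
    return string.ascii_lowercase[index]


issue = IssueDirSketch(
    journal="BLB",
    date=date(1845, 12, 28),
    edition=edition_letter(0),  # first edition of the day
    path="./Le-Gaulois/18820208_1",
    rights="open",
    ark_link="placeholder-ark-id",  # placeholder value, not a real Ark
)
```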
- text_preparation.importers.bnf_en.detect.construct_iiif_arks() dict[str, str]
Fetch the IIIF Ark ID of each issue and map each canonical issue ID to it.
- Returns:
Mapping from issue canonical id to IIIF Ark id.
- Return type:
dict[str, str]
- text_preparation.importers.bnf_en.detect.detect_issues(base_dir: str, access_rights: str) list[IssueDirectory]
Detect newspaper issues to import within the filesystem.
This function expects the directory structure that BNF-EN used to organize the dump of Mets/Alto OCR data.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Not used for this importer (kept for conformity).
- Returns:
List of BnfEnIssueDir instances to import.
- Return type:
list[BnfEnIssueDir]
- text_preparation.importers.bnf_en.detect.dir2issue(path: str, access_rights: dict, iiif_arks: dict[str, str]) IssueDirectory | None
Create a BnfEnIssueDir object from a directory path.
Note
This function is called internally by detect_issues().
- Parameters:
path (str) – Path of issue.
access_rights (dict) – Access rights (for conformity).
iiif_arks (dict) – Mapping from issue canonical ids to iiif ark ids.
- Returns:
BnfEnIssueDir for the given issue if the Ark ID was found on the Gallica API, None otherwise.
- Return type:
BnfEnIssueDir | None
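The return behavior described above (a result when an Ark ID is known, None otherwise) amounts to a dictionary lookup. A minimal sketch, assuming the canonical ID has already been derived from the path; lookup_ark is a hypothetical helper and the Ark value is a placeholder.

```python
from typing import Optional


def lookup_ark(issue_id: str, iiif_arks: dict[str, str]) -> Optional[str]:
    """Return the Ark ID mapped to issue_id, or None when the mapping has no entry."""
    return iiif_arks.get(issue_id)


# Placeholder mapping from canonical issue IDs to Ark IDs.
iiif_arks = {"GDL-1900-01-02-a": "placeholder-ark-id"}

found = lookup_ark("GDL-1900-01-02-a", iiif_arks)
missing = lookup_ark("GDL-1900-01-03-a", iiif_arks)
```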
- text_preparation.importers.bnf_en.detect.fix_api_year_mismatch(journal: str, year: int, api_issues: list[Tag], last_i: list[Tag | None]) tuple[list[Tag], list[Tag | None]]
Modify the provided list of issues fetched from the API to correct known errors.
The API currently stores the issues for December 31st of some years incorrectly, with some issues shifted by one year. This does not affect all years, and the correct issue may or may not be present. This function aims to rectify the problem and fetch the correct IIIF Ark IDs.
- Parameters:
journal (str) – Alias of the journal currently under processing.
year (int) – Year for which the API was queried.
api_issues (list[Tag]) – List of issues as returned from the API.
last_i (list[Tag | None]) – Last December 31st issue entry, returned for the wrong year.
- Returns:
Corrected issue list and next December 31st issue(s) if the error was present again, None otherwise.
- Return type:
tuple[list[Tag], list[Tag | None]]
- text_preparation.importers.bnf_en.detect.get_api_id(journal: str, api_issue: tuple[str, ~.datetime.date], edition: str) str
Construct an ID given a journal name, date and edition.
- Parameters:
journal (str) – Journal name.
api_issue (tuple[str, datetime.date]) – Tuple of information fetched from the Gallica API.
edition (str) – Edition of the issue.
- Returns:
Canonical issue Id composed of journal name, date and edition.
- Return type:
str
- text_preparation.importers.bnf_en.detect.get_id(journal: str, date: ~.datetime.date, edition: str) str
Construct the canonical issue ID given the necessary information.
- Parameters:
journal (str) – Journal name.
date (datetime.date) – Publication date.
edition (str) – Edition of the issue.
- Returns:
Resulting issue canonical Id.
- Return type:
str
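The canonical issue ID format seen elsewhere in this documentation (for example GDL-1900-01-02-a) can be sketched as below. build_issue_id is a hypothetical name, and the layout is inferred from that example rather than copied from get_id().

```python
from datetime import date


def build_issue_id(journal: str, day: date, edition: str) -> str:
    """Join journal, ISO-formatted date and edition letter with hyphens."""
    return f"{journal}-{day.year}-{day.month:02d}-{day.day:02d}-{edition}"


issue_id = build_issue_id("GDL", date(1900, 1, 2), "a")
```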
- text_preparation.importers.bnf_en.detect.get_issues_iiif_arks(journal_ark: tuple[str, str]) list[tuple[str, str]]
Given a journal name and Ark, fetch its issues’ Ark in the Gallica API.
Each of the Europeana journals has a journal-level Ark ID, as well as issue-level IIIF Ark IDs that can be fetched from the Gallica API using the journal Ark. The API also provides the day of the year for the corresponding issue. Using both pieces of information, this function recreates the canonical issue IDs for each collection and maps them to their respective issue IIIF Ark IDs.
- Parameters:
journal_ark (tuple[str, str]) – Pair of journal and associated Ark id.
- Returns:
Pairs of issue canonical Ids and IIIF Ark Ids.
- Return type:
list[tuple[str, str]]
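Turning a year and a day of the year into a calendar date, as needed to rebuild canonical IDs from the API data, can be sketched as follows. This assumes the day of the year is 1-based (1 for January 1st); day_of_year_to_date is a hypothetical helper, not a function of this module.

```python
from datetime import date, timedelta


def day_of_year_to_date(year: int, day_of_year: int) -> date:
    """Convert a 1-based day of the year into a date (1-based indexing assumed)."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


second_of_january = day_of_year_to_date(1900, 2)
last_day_leap_year = day_of_year_to_date(1904, 366)
```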
- text_preparation.importers.bnf_en.detect.parse_dir(_dir: str, journal: str) str
Parse a directory and return the corresponding ID.
- Parameters:
_dir (str) – The directory (in Windows FS).
journal (str) – Journal name to construct ID.
- Returns:
Issue canonical id.
- Return type:
str
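A possible shape for this kind of parsing is sketched below. It rests on assumptions that the documentation does not confirm: that the last path component looks like YYYYMMDD_N (as in the example path ./Le-Gaulois/18820208_1), and that N is a 1-based edition number mapped to a letter. parse_dir_sketch is a hypothetical name.

```python
import string
from datetime import datetime
from pathlib import PureWindowsPath


def parse_dir_sketch(_dir: str, journal: str) -> str:
    """Derive an issue ID from a Windows-style path (directory layout assumed)."""
    # PureWindowsPath accepts both backslashes and forward slashes as separators.
    name = PureWindowsPath(_dir).name
    date_part, _, edition_part = name.partition("_")
    day = datetime.strptime(date_part, "%Y%m%d").date()
    # Assumption: the numeric suffix is a 1-based edition number (1 -> 'a').
    edition = string.ascii_lowercase[int(edition_part) - 1]
    return f"{journal}-{day.isoformat()}-{edition}"


issue_id = parse_dir_sketch(r"C:\dump\Le-Gaulois\18820208_1", "legaulois")
```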
- text_preparation.importers.bnf_en.detect.select_issues(base_dir: str, config: dict, access_rights: str) list[IssueDirectory] | None
Selectively detect newspaper issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
access_rights (str) – Not used for this importer (kept for conformity).
- Returns:
BnfEnIssueDir instances to import.
- Return type:
list[BnfEnIssueDir] | None