BNF Mets/Alto importer
This importer extends the generic Mets/Alto importer, and it was developed to handle OCR newspaper data provided by the BNF.
BNF Custom classes
This module contains the definition of BNF importer classes.
The classes define newspaper Issue and Page objects which convert OCR data in the BNF version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.
- class text_preparation.importers.bnf.classes.BnfNewspaperIssue(issue_dir: IssueDir)
Newspaper Issue in BNF (Mets/Alto) format.
All functions defined in this child class are specific to parsing BNF Mets/Alto format.
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- image_properties
Metadata used to convert region OCR/OLR coordinates into IIIF-compliant ones.
- Type:
dict[str, Any]
- ark_id
ARK identifier of the issue, used to build the IIIF links of its pages.
- Type:
int
- issue_uid
Basename of the Mets XML file of this issue.
- Type:
str
- secondary_date
Potential secondary date of issue.
- Type:
datetime.date
- property xml: BeautifulSoup
Read Mets XML file of the issue and create a BeautifulSoup object.
- Returns:
BeautifulSoup object with Mets XML of the issue.
- Return type:
BeautifulSoup
- class text_preparation.importers.bnf.classes.BnfNewspaperPage(_id: str, number: int, filename: str, basedir: str)
Newspaper page in BNF (Mets/Alto) format.
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
filename (str) – Name of the Alto XML page file.
basedir (str) – Base directory where Alto files are located.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
- filename
Name of the Alto XML page file.
- Type:
str
- basedir
Base directory where Alto files are located.
- Type:
str
- encoding
Encoding of XML file. Defaults to ‘utf-8’.
- Type:
str, optional
- is_gzip
Whether the page’s corresponding file is gzip-compressed.
- Type:
bool
- ark_link
IIIF Ark identifier for this page.
- Type:
str
- add_issue(issue: MetsAltoNewspaperIssue) → None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- parse() → None
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
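The lazy behavior described in the note can be illustrated with a minimal sketch. The `LazyPage` class below is a hypothetical stand-in and not the importer's actual code: the raw text plays the role of the Alto XML file, and splitting it into tokens plays the role of the conversion to the canonical format.

```python
class LazyPage:
    """Hypothetical page object that defers processing until parse() is called."""

    def __init__(self, page_id: str, raw_text: str):
        self.id = page_id
        self._raw_text = raw_text  # stand-in for the Alto XML file on disk
        self.page_data = None      # nothing is processed at construction time

    def parse(self) -> None:
        # The (potentially expensive) conversion happens only here.
        self.page_data = {"id": self.id, "tokens": self._raw_text.split()}


page = LazyPage("GDL-1900-01-02-a-p0004", "Journal de Geneve")
assert page.page_data is None  # creating the object does not parse anything
page.parse()
print(page.page_data["tokens"])
```

Deferring the work this way keeps the creation of many page objects cheap, since the cost of reading and converting a file is paid only for pages that are actually parsed.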
- property xml: BeautifulSoup
Read Alto XML file of the page and create a BeautifulSoup object.
This property is redefined because, for some issues, the page files are gzip-compressed.
- Returns:
BeautifulSoup object with Alto XML of the page.
- Return type:
BeautifulSoup
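As a rough illustration of reading a page file that may or may not be compressed, the sketch below returns the raw XML bytes in both cases. The function name `read_alto_bytes` is hypothetical, and detecting compression through the gzip magic number (rather than the `is_gzip` attribute or the file extension) is an assumption made for this example only.

```python
import gzip
from pathlib import Path


def read_alto_bytes(path: str) -> bytes:
    """Return the raw XML bytes of a page file, decompressing it if needed.

    Assumption: compressed files are recognised by the gzip magic number
    (0x1f 0x8b) found in their first two bytes.
    """
    raw = Path(path).read_bytes()
    if raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    return raw
```

The returned bytes could then be handed to an XML parser of choice, for example `BeautifulSoup(raw, "xml")` when the `bs4` and `lxml` packages are installed.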
BNF Detect functions
This module contains helper functions to find BNF OCR data to import.
- text_preparation.importers.bnf.detect.BnfIssueDir
A light-weight data structure to represent a newspaper issue.
This named tuple contains basic metadata about a newspaper issue. They can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
In case of newspaper published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
Note
In BNF data, dates can be given in two different formats (separated with - or /). Also, an issue can have two dates, separated by either - or /.
- Parameters:
journal (str) – Newspaper ID.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
rights (str) – Access rights on the data (open, closed, etc.).
secondary_date (datetime.date) – Secondary publication date of issue.
>>> import datetime
>>> i = BnfIssueDir(
...     journal='Marie-Claire',
...     date=datetime.date(1938, 3, 11),
...     edition='a',
...     path='./BNF/files/4701034.zip',
...     rights='open_public',
...     secondary_date=None,
... )
- text_preparation.importers.bnf.detect.assign_editions(issues: list[IssueDirectory]) → list[IssueDirectory]
Assign updated edition numbers to each issue of a given day.
For BNF, the issues are not organized by date or edition in the file system. Hence, when multiple issues exist for a given day, they must be indexed in order to assign edition numbers.
- Parameters:
issues (list[BnfIssueDir]) – List of issues for a given day.
- Returns:
List of issues with updated editions.
- Return type:
list[BnfIssueDir]
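The indexing step can be sketched as follows. This is only an illustration under stated assumptions: `Issue` is a hypothetical, reduced stand-in for `BnfIssueDir`, and ordering the issues by path is an assumed tie-breaker, since the source does not say how same-day issues are ordered.

```python
from collections import namedtuple
from datetime import date
from string import ascii_lowercase

# Hypothetical stand-in for BnfIssueDir, reduced to the fields used here.
Issue = namedtuple("Issue", ["journal", "date", "edition", "path"])


def assign_edition_letters(issues: list) -> list:
    """Give issues of one day the editions 'a', 'b', ... in path order.

    Assumption: sorting by path yields a stable, reproducible order.
    Supports at most 26 issues per day (one per lowercase letter).
    """
    ordered = sorted(issues, key=lambda issue: issue.path)
    return [
        issue._replace(edition=ascii_lowercase[index])
        for index, issue in enumerate(ordered)
    ]


same_day = [
    Issue("marie-claire", date(1938, 3, 11), "a", "./BNF/files/4701035.zip"),
    Issue("marie-claire", date(1938, 3, 11), "a", "./BNF/files/4701034.zip"),
]
print([issue.edition for issue in assign_edition_letters(same_day)])
```

Because named tuples are immutable, `_replace` returns a new tuple with the updated edition instead of modifying the input.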
- text_preparation.importers.bnf.detect.detect_issues(base_dir: str, access_rights: str = None) → list[IssueDirectory]
Detect BNF issues to import within the filesystem.
This function expects the directory structure used by BNF (one subdirectory per journal).
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str, optional) – Not used for this importer, but the argument is kept for uniformity with other importers. Defaults to None.
- Returns:
List of BnfIssueDir instances, to be imported.
- Return type:
list[BnfIssueDir]
- text_preparation.importers.bnf.detect.dir2issue(issue_path: str, access_rights_dict: dict) → IssueDirectory
Create a BnfIssueDir object from an archive path.
Note
This function is called internally by detect_issues().
- Parameters:
issue_path (str) – The path of the issue within the archive.
access_rights_dict (dict) – Access rights for this issue.
- Returns:
New BnfIssueDir object
- Return type:
BnfIssueDir
- text_preparation.importers.bnf.detect.get_id(issue: IssueDirectory) → str
Return an issue’s canonical ID given its IssueDir.
- Parameters:
issue (BnfIssueDir) – IssueDir of issue.
- Returns:
Canonical ID of issue.
- Return type:
str
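Judging from the example IDs given in this documentation (`GDL-1900-01-02-a` for an issue, `GDL-1900-01-02-a-p0004` for a page), canonical IDs appear to combine the journal, the ISO date, the edition letter and, for pages, a zero-padded page number. The sketch below reproduces that shape; both function names are hypothetical and the exact rules used by `get_id` are not confirmed by the source.

```python
from datetime import date


def canonical_issue_id(journal: str, day: date, edition: str) -> str:
    """Build an ID shaped like the documented example 'GDL-1900-01-02-a'."""
    return f"{journal}-{day:%Y-%m-%d}-{edition}"


def canonical_page_id(issue_id: str, number: int) -> str:
    """Append a zero-padded page suffix, as in 'GDL-1900-01-02-a-p0004'."""
    return f"{issue_id}-p{number:04d}"


issue_id = canonical_issue_id("GDL", date(1900, 1, 2), "a")
print(issue_id)                        # GDL-1900-01-02-a
print(canonical_page_id(issue_id, 4))  # GDL-1900-01-02-a-p0004
```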
- text_preparation.importers.bnf.detect.get_number(issue: IssueDirectory) → str
Return an issue’s original identifying number given its IssueDir.
- Parameters:
issue (BnfIssueDir) – IssueDir of issue.
- Returns:
Identifying number in BNF’s original file structure.
- Return type:
str
- text_preparation.importers.bnf.detect.select_issues(base_dir: str, config: dict, access_rights: str) → List[IssueDirectory] | None
Selectively detect newspaper issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
access_rights (str) – Not used for this importer, but the argument is kept for uniformity with other importers.
- Returns:
List of BnfIssueDir instances, to be imported.
- Return type:
Optional[List[BnfIssueDir]]
BNF Helper and Parser methods
Set of helper functions for the BNF importer.
- text_preparation.importers.bnf.helpers.SECTION = 'section'
Content types as defined in BNF Mets flavour. These are the ones we are interested in parsing. The SECTION type should be flattened, and shouldn’t be part of content items, but it is needed to parse what’s inside.
- text_preparation.importers.bnf.helpers.add_div(_dict: dict[str, tuple[str, str]], _type: str, div_id: str, label: str) → dict[str, tuple[str, str]]
Adds a div item to the given dictionary (sorted by type).
The types used as keys should be in BNF_CONTENT_TYPES or SECTION.
- Parameters:
_dict (dict[str, tuple[str, str]]) – The dictionary where to add the div
_type (str) – The type of the new div to add.
div_id (str) – The div ID to add.
label (str) – The label of the div to add.
- Returns:
The updated dictionary.
- Return type:
dict[str, tuple[str, str]]
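A grouping of divs by type could look like the sketch below. It is a hypothetical re-implementation for illustration only: it assumes that each type maps to a list of `(div_id, label)` pairs, which the documented type annotation does not spell out, and the div IDs and labels are invented sample values.

```python
def add_div_sketch(divs: dict, div_type: str, div_id: str, label: str) -> dict:
    """Record (div_id, label) under its type and return the same dictionary.

    Assumption: each type maps to a list of (div_id, label) pairs.
    """
    divs.setdefault(div_type, []).append((div_id, label))
    return divs


divs = {}
add_div_sketch(divs, "article", "DIVL12", "Faits divers")
add_div_sketch(divs, "article", "DIVL15", "Chronique")
add_div_sketch(divs, "section", "DIVL3", "Sports")
print(divs["article"])
```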
- text_preparation.importers.bnf.helpers.get_dates(date_string: str, separators: list[str]) → list[str | None]
Extract dates from the given string using a list of possible separators.
Assumes that the given date string represents exactly two dates. Tries to separate them using the given separators and returns as soon as two dates are found; otherwise, a list of None is returned.
- Parameters:
date_string (str) – The date string to separate.
separators (list[str]) – The list of potential separators.
- Returns:
The two separated dates, or a pair of None.
- Return type:
list[Optional[str]]
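The separator-based splitting can be sketched as below. The function name `split_two_dates` is hypothetical, and the logic is a simplified guess at the approach described: it only succeeds when one separator cuts the string into exactly two parts, so a string that uses the same character inside the dates and between them is not split.

```python
def split_two_dates(date_string: str, separators: list) -> list:
    """Try each separator until the string splits into exactly two parts."""
    for separator in separators:
        parts = date_string.split(separator)
        if len(parts) == 2:
            return [part.strip() for part in parts]
    # No separator produced exactly two parts.
    return [None, None]


print(split_two_dates("1938-03-11/1938-03-18", ["-", "/"]))
print(split_two_dates("11/03/1938-18/03/1938", ["-", "/"]))
print(split_two_dates("1938-03-11", ["-", "/"]))
```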
- text_preparation.importers.bnf.helpers.get_journal_name(archive_path: str) → str
Return the Journal name from the path of the issue.
It assumes the journal name is one directory above the issue.
- Parameters:
archive_path (str) – Path to the issue’s archive
- Returns:
Extracted journal name in lowercase.
- Return type:
str
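Under the stated assumption that the journal name is the directory directly above the issue, the extraction can be sketched with the standard library. The function name and the example path are hypothetical.

```python
import os


def journal_name_from_path(archive_path: str) -> str:
    """Return the lowercased name of the directory that holds the archive."""
    parent_dir = os.path.dirname(os.path.normpath(archive_path))
    return os.path.basename(parent_dir).lower()


print(journal_name_from_path("/data/BNF/Marie-Claire/4701034.zip"))  # marie-claire
```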
- text_preparation.importers.bnf.helpers.is_multi_date(date_string: str) → bool
Check whether a given date string is composed of more than one date.
This check is based on the assumption that a full date is 10 chars long.
- Parameters:
date_string (str) – Date to check for
- Returns:
True if the string represents multiple dates
- Return type:
bool
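The length-based check amounts to a single comparison. The sketch below uses a hypothetical function name and assumes that surrounding whitespace should be ignored, which the source does not state.

```python
FULL_DATE_LENGTH = 10  # e.g. '1938-03-11' or '11/03/1938'


def looks_like_multiple_dates(date_string: str) -> bool:
    """Return True when the string is longer than one full date."""
    return len(date_string.strip()) > FULL_DATE_LENGTH


print(looks_like_multiple_dates("1938-03-11"))
print(looks_like_multiple_dates("1938-03-11/1938-03-18"))
```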
- text_preparation.importers.bnf.helpers.parse_date(date_string: str, formats: list[str], separators: list[str]) → tuple[datetime.date, datetime.date | None]
Parse a date given a list of formats.
The input string can sometimes represent a pair of dates, in which case they are both parsed if possible.
- Parameters:
date_string (str) – Date string to parse.
formats (list[str]) – Possible dates formats.
separators (list[str]) – List of possible date separators.
- Raises:
ValueError – The input date string is too short to be a full date.
ValueError – The string contains two dates that could not be split correctly.
ValueError – The (first) date could not be parsed correctly.
- Returns:
Parsed date, potentially a parsed pair of dates.
- Return type:
tuple[datetime.date, Optional[datetime.date]]
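The overall flow (length check, optional split into two dates, then trying each format) can be sketched as follows. This is a hypothetical re-implementation, not the importer's code: the function names, the `strptime` format strings in the usage example, and the choice to return None for an unparseable second date are all assumptions.

```python
from datetime import date, datetime


def _parse_single(text: str, formats: list) -> date:
    """Try every format in turn and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Could not parse date: {text!r}")


def parse_date_sketch(date_string: str, formats: list, separators: list) -> tuple:
    """Parse one date, or a pair of dates, from the given string."""
    text = date_string.strip()
    if len(text) < 10:
        raise ValueError(f"Date string too short: {text!r}")
    if len(text) == 10:
        return _parse_single(text, formats), None
    for separator in separators:
        parts = text.split(separator)
        if len(parts) == 2:
            first = _parse_single(parts[0].strip(), formats)
            try:
                second = _parse_single(parts[1].strip(), formats)
            except ValueError:
                second = None  # assumption: the second date is best-effort
            return first, second
    raise ValueError(f"Could not split into two dates: {text!r}")


formats = ["%Y-%m-%d", "%d/%m/%Y"]
print(parse_date_sketch("11/03/1938", formats, ["-", "/"]))
print(parse_date_sketch("1938-03-11/1938-03-18", formats, ["-", "/"]))
```

The three failure paths mirror the three `ValueError` cases listed above: a string that is too short, a pair that cannot be split, and a first date that matches none of the formats.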
Utility functions to parse BNF ALTO files.
- text_preparation.importers.bnf.parsers.parse_div_parts(div: Tag) → list[dict[str, str | int]]
Parse the parts of a given div element.
Typically, any div of type in BNF_CONTENT_TYPES is composed of child divs. This is what this function parses. Each element of the output contains keys {‘comp_role’, ‘comp_id’, ‘comp_fileid’, ‘comp_page_no’}.
- Parameters:
div (Tag) – Child div to parse.
- Returns:
The list of parts of this Tag.
- Return type:
list[dict[str, str | int]]
- text_preparation.importers.bnf.parsers.parse_embedded_cis(div: Tag, label: str, issue_id: str, parent_id: str | None, counter: int) → tuple[list[dict], int]
Parse the div Tags embedded in the given one.
The input div should be of type in BNF_CONTENT_TYPES and should have children of types also in that category. Each child tag represents separate content items, which should thus be processed separately.
- Parameters:
div (Tag) – The parent tag.
label (str) – The label of the parent tag.
issue_id (str) – The ID of the issue.
parent_id (str | None) – The ID of the parent tag (to put into pOf).
counter (int) – Counter for content items.
- Returns:
The embedded CIs and resulting updated counter.
- Return type:
tuple[list[dict], int]
- text_preparation.importers.bnf.parsers.parse_printspace(element: Tag, mappings: dict[str, str]) → tuple[list[dict], list[str] | None]
Parse the <PrintSpace> element of an ALTO XML document for BNF.
- Parameters:
element (Tag) – Input XML element (<PrintSpace>).
mappings (Dict[str, str]) – Description of parameter mappings.
- Returns:
Parsed regions and paragraphs, and potential notes on issues encountered during the parsing.
- Return type:
tuple[list[dict], list[str] | None]