Olive XML importer
Olive Custom classes
This module contains the definition of the Olive importer classes.
The classes define newspaper Issue and Page objects which convert OCR data in the Olive XML format to a unified canonical format.
- class text_preparation.importers.olive.classes.OliveNewspaperIssue(issue_dir: IssueDir, image_dirs: str, temp_dir: str)
Newspaper Issue in Olive format.
- Upon object initialization the following things happen:
The Zip archive containing the issue is uncompressed.
The ToC file is parsed to determine the logical structure of the issue.
Page objects (instances of OliveNewspaperPage) are initialized.
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
image_dirs (str) – Path to the directory with the page images. Multiple paths should be separated by a comma (“,”).
temp_dir (str) – Temporary directory to unpack ZipArchive objects.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- image_dirs
Path to the directory with the page images. Multiple paths should be separated by a comma (“,”).
- Type:
str
- archive
ZipArchive for this issue.
- Type:
ZipArchive
- toc_data
Table of contents (ToC) data for this issue.
- Type:
dict
- content_elements
All content elements detected.
- Type:
list[dict[str, Any]]
- content_items
Issue’s recomposed content items.
- Type:
list[dict[str, Any]]
- clusters
Inverted index of legacy ids; values are clusters of articles, each indexed by one member.
- Type:
dict[str, list[str]]
- class text_preparation.importers.olive.classes.OliveNewspaperPage(_id: str, number: int, toc_data: dict, image_info: dict, page_xml: str)
Newspaper page in Olive format.
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
toc_data (dict) – Metadata about content items in the newspaper issue.
image_info (dict) – Metadata about the page image.
page_xml (str) – Path to the Olive XML file of the page.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
NewspaperIssue | None
- toc_data
Metadata about content items in the newspaper issue.
- Type:
dict
- image_info
Metadata about the page image.
- Type:
dict
- page_xml
Path to the Olive XML file of the page.
- Type:
str
- archive
Archive of the issue this page is from.
- Type:
ZipArchive
- add_issue(issue: NewspaperIssue) None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- parse() None
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
- Raises:
ValueError – No Newspaper issue has been added to this page.
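The lazy pattern described in the note can be illustrated with a toy class. This is a sketch only, not the importer's implementation: `LazyPage` and the content it stores in `page_data` are hypothetical; only the ordering (attach an issue first, then call `parse()`, with a `ValueError` otherwise) comes from the description above.

```python
class LazyPage:
    """Toy page (hypothetical) whose content is built only when parse() is called."""

    def __init__(self, page_id: str, number: int):
        self.id = page_id
        self.number = number
        self.issue = None       # parent issue, attached later via add_issue()
        self.page_data = None   # stays empty until parse() runs

    def add_issue(self, issue) -> None:
        # Keep a reference to the parent issue for contextual information.
        self.issue = issue

    def parse(self) -> None:
        if self.issue is None:
            raise ValueError(f"No issue has been added to page {self.id}")
        # Placeholder for the real XML processing step.
        self.page_data = {"id": self.id}


page = LazyPage("GDL-1900-01-02-a-p0004", 4)

# Calling parse() before add_issue() is expected to fail.
try:
    page.parse()
    parse_failed = False
except ValueError:
    parse_failed = True

# Once the issue is attached, parsing populates page_data.
page.add_issue({"id": "GDL-1900-01-02-a"})
page.parse()
```

Creating the page object stays cheap this way; the cost of reading the XML is paid only for pages that are actually parsed.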
Olive Detect functions
This module contains functions to detect Olive OCR data to be imported.
- text_preparation.importers.olive.detect.OliveIssueDir
A light-weight data structure to represent a newspaper issue.
This named tuple contains basic metadata about a newspaper issue. These can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
For newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
journal (str) – Newspaper ID.
date (datetime.date) – Publication date of the issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
rights (str) – Access rights on the data (open, closed, etc.).
>>> from datetime import date
>>> i = OliveIssueDir('GDL', date(1900,1,1), 'a', './GDL-1900-01-01/', 'open')
- text_preparation.importers.olive.detect.dir2olivedir(issue_dir: IssueDir, access_rights: dict[str, dict[str, str]]) OliveIssueDirectory
Helper function that injects access rights info into an IssueDir.
Note
This function is called internally by olive_detect_issues().
- Parameters:
issue_dir (IssueDir) – Input IssueDir object.
access_rights (dict[str, dict[str, str]]) – Access rights information.
- Returns:
New OliveIssueDir object.
- Return type:
OliveIssueDir
- text_preparation.importers.olive.detect.olive_detect_issues(base_dir: str, access_rights: str, journal_filter: set | None = None, exclude: bool = False) list[OliveIssueDirectory]
Detect newspaper issues to import within the filesystem.
This function expects the directory structure that RERO used to organize the dump of Olive OCR data.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Path to access_rights.json file.
journal_filter (set | None, optional) – IDs of newspapers to consider. Defaults to None.
exclude (bool, optional) – Whether journal_filter lists newspapers to exclude rather than to include. Defaults to False.
- Returns:
List of OliveIssueDir instances, to be imported.
- Return type:
list[OliveIssueDir]
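How `journal_filter` and `exclude` interact can be sketched as below. This is an assumption about the intended semantics, not the function's actual code: `apply_journal_filter` is a hypothetical helper, and treating `None` as "no filtering" is inferred from the default value.

```python
def apply_journal_filter(journals, journal_filter=None, exclude=False):
    """Keep or drop newspaper IDs according to journal_filter (hypothetical helper)."""
    if journal_filter is None:
        # Assumed: no filter means every newspaper is considered.
        return list(journals)
    if exclude:
        # The filter lists the newspapers to leave out.
        return [j for j in journals if j not in journal_filter]
    # The filter lists the only newspapers to keep.
    return [j for j in journals if j in journal_filter]


detected = ["GDL", "JDG", "LCE"]
kept = apply_journal_filter(detected, {"GDL"})
dropped = apply_journal_filter(detected, {"GDL"}, exclude=True)
```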
- text_preparation.importers.olive.detect.olive_select_issues(base_dir: str, config: dict[str, Any], access_rights: str) list[OliveIssueDirectory]
Selectively detect newspaper issues to import.
The behavior is very similar to olive_detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict[str, Any]) – Config dictionary for filtering.
access_rights (str) – Path to access_rights.json file.
- Returns:
List of OliveIssueDir instances, to be imported.
- Return type:
list[OliveIssueDir]
Olive parsers
Functions to parse Olive XML data.
- text_preparation.importers.olive.parsers.olive_image_parser(text: bytes) dict[str, str | list] | None
Parse the Olive XML file containing image metadata.
- Parameters:
text (bytes) – Content of the XML file to parse.
- Returns:
Dictionary of image metadata.
- Return type:
dict[str, str | list] | None
- text_preparation.importers.olive.parsers.olive_parser(text: str) dict[str, dict | list]
Parse an Olive XML file (e.g. from Le Temps corpus).
The main logic implemented here was derived from <https://github.com/dhlab-epfl/LeTemps-preprocessing/>. Each XML file corresponds to one article, as detected by Olive. The final dictionary has keys meta, r, stats and legacy, each mapping to dictionaries or lists with the file’s parsed contents.
- Parameters:
text (str) – Contents of the XML file to parse.
- Returns:
Dictionary with parsed contents.
- Return type:
dict[str, dict | list]
- text_preparation.importers.olive.parsers.olive_toc_parser(toc_path: str, issue_dir: IssueDir, encoding: str = 'windows-1252') dict[int, dict[str, dict]]
Parse the TOC.xml file (Olive format).
For each page, a dict containing page data is created, mapping content item legacy IDs to their metadata.
- Parameters:
toc_path (str) – Path to the ToC XML file.
issue_dir (IssueDir) – Corresponding IssueDir object.
encoding (str, optional) – File’s encoding. Defaults to “windows-1252”.
- Returns:
Dictionary where keys are page numbers and values the corresponding page data dictionary.
- Return type:
dict[int, dict[str, dict]]
- text_preparation.importers.olive.parsers.parse_styles(text: str) list[dict[str, Any]]
Turn Olive styleGallery.txt file into a list of style dictionaries.
Style IDs may be referred to within the s property of token elements as defined in the impresso JSON schema for newspaper pages (see documentation). Each style has an ID, font, font size and color (RGB).
- Parameters:
text (str) – Textual content of the styleGallery.txt file.
- Returns:
List of styles according to the impresso schema.
- Return type:
list[dict[str, Any]]
Olive Helper methods
Helper functions used by the Olive Importer.
These functions are mainly used within (i.e. called by) the classes OliveNewspaperIssue and OliveNewspaperPage.
- class text_preparation.importers.olive.helpers.BoxStrategy(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
- text_preparation.importers.olive.helpers.combine_article_parts(article_parts: list[dict[str, Any]]) dict[str, Any]
Merge article parts into a single element.
The Olive format splits an article into multiple components whenever it spans multiple pages. Thus, it is necessary to recompose the article from its parts.
- Parameters:
article_parts (list[dict[str, Any]]) – One or more article parts.
- Returns:
Dict with keys meta, fulltext, stats and legacy.
- Return type:
dict[str, Any]
- text_preparation.importers.olive.helpers.compute_box(scale_factor: float, input_box: str) str | None
Compute IIIF box coordinates of input_box relative to scale_factor.
- Parameters:
scale_factor (float) – Ratio between 2 images with different dimensions.
input_box (str) – String with 4 values separated by spaces.
- Returns:
New box coordinates or None if the string had the wrong format.
- Return type:
str | None
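The documented contract (a string of four space-separated values in, a rescaled string or None out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the helper's actual code: `scale_box` is a hypothetical name, each value is simply multiplied by the scale factor and truncated to an integer, and the order of the four values is left unchanged. The real function may also convert between coordinate conventions.

```python
def scale_box(scale_factor: float, input_box: str):
    """Rescale a 'v1 v2 v3 v4' box string; return None on malformed input (sketch)."""
    parts = input_box.split()
    if len(parts) != 4:
        return None  # wrong number of values
    try:
        values = [int(p) for p in parts]
    except ValueError:
        return None  # non-numeric value
    # Assumed: plain multiplication, truncated to integers.
    return " ".join(str(int(v * scale_factor)) for v in values)


doubled = scale_box(2.0, "10 20 30 40")
malformed = scale_box(2.0, "10 20 30")
```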
- text_preparation.importers.olive.helpers.compute_scale_factor(img_source_path: str, img_dest_path: str) float
Compute the x scale factor between two images.
- Parameters:
img_source_path (str) – Full path to the source image.
img_dest_path (str) – Full path to the destination image.
- Returns:
X Scale factor between the two.
- Return type:
float
- text_preparation.importers.olive.helpers.convert_box(coords: list[int], scale_factor: float) list[int]
Rescale IIIF box coordinates relative to the given scale factor.
- Parameters:
coords (list[int]) – Original box coordinates.
scale_factor (float) – Scale factor required by the image conversion.
- Returns:
Rescaled box coordinates.
- Return type:
list[int]
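The relationship between an x scale factor and rescaled coordinates can be sketched with plain numbers. The names `x_scale_factor` and `rescale_coords` are hypothetical, and two points are assumptions rather than documented behavior: the factor is taken as destination width divided by source width, and rescaled values are truncated to integers.

```python
def x_scale_factor(source_width: int, dest_width: int) -> float:
    """Ratio between image widths (assumed direction: destination / source)."""
    return dest_width / source_width


def rescale_coords(coords, scale_factor):
    """Multiply every coordinate by the scale factor (assumed: truncate to int)."""
    return [int(c * scale_factor) for c in coords]


factor = x_scale_factor(source_width=1000, dest_width=2000)
rescaled = rescale_coords([100, 200, 50, 25], factor)
```

Working from widths alone mirrors the "x scale factor" wording above and presumes that both images share the same aspect ratio.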
- text_preparation.importers.olive.helpers.convert_image_coordinates(image: dict[str, Any], page_xml: str, page_image_name: str, zip_archive: ZipArchive, box_strategy: str, issue: IssueDir) dict[str, Any]
Convert coordinates of an Olive image element.
Note
This conversion is necessary since the coordinates recorded in the XML file were computed on a different image than the one used for display in the impresso interface.
- Parameters:
image (dict[str, Any]) – Image metadata.
page_xml (str) – Content of Olive page XML.
page_image_name (str) – Name of page image file.
zip_archive (ZipArchive) – Olive Zip archive.
box_strategy (str) – Conversion strategy to apply.
issue (IssueDir) – IssueDir of the newspaper issue the page belongs to.
- Returns:
Updated image metadata based on the conversion.
- Return type:
dict[str, Any]
- text_preparation.importers.olive.helpers.convert_page_coordinates(page: dict[str, Any], page_xml: str, page_image_name: str, zip_archive: ZipArchive, box_strategy: str, issue: NewspaperIssue) bool
Convert coordinates of all elements in a page that have coordinates.
Note
This conversion is necessary since the coordinates recorded in the XML file were computed on a different image than the one used for display in the impresso interface.
- Parameters:
page (dict[str, Any]) – Page data where coordinates should be converted.
page_xml (str) – Content of Olive page XML.
page_image_name (str) – Name of page image file.
zip_archive (ZipArchive) – Olive Zip archive.
box_strategy (str) – Conversion strategy to apply.
issue (NewspaperIssue) – Newspaper issue the page belongs to.
- Returns:
Whether the coordinate conversion was successful or not.
- Return type:
bool
- text_preparation.importers.olive.helpers.get_clusters(articles: list[dict[str, Any]]) dict[str, list[str]]
Create an inverted index of legacy ids to article clusters.
Each cluster of articles is indexed by the legacy id of one of its members. If a cluster contains only one element, its id will be in the keys.
- Parameters:
articles (list[dict[str, Any]]) – Articles to cluster by legacy ids.
- Returns:
Article clusters dictionary.
- Return type:
dict[str, list[str]]
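The shape of such an inverted index can be sketched as below. The input layout is an assumption made for illustration: each article is given a `legacy` dict with an `id` and an optional `continuation_from` pointer, and parts are assumed to arrive in reading order. `build_clusters` is a hypothetical name and the real function may link articles differently.

```python
def build_clusters(articles):
    """Group article legacy ids into clusters keyed by their first member (sketch)."""
    clusters = {}
    key_of = {}  # maps every seen article id to the key of its cluster
    for article in articles:
        legacy = article["legacy"]
        art_id = legacy["id"]
        parent = legacy.get("continuation_from")  # hypothetical field
        if parent in key_of:
            # Continuation of an article already seen: join its cluster.
            key = key_of[parent]
        else:
            # First part (or standalone article): start a new cluster.
            key = art_id
            clusters[key] = []
        clusters[key].append(art_id)
        key_of[art_id] = key
    return clusters


articles = [
    {"legacy": {"id": "Ar00100"}},
    {"legacy": {"id": "Ar00101", "continuation_from": "Ar00100"}},
    {"legacy": {"id": "Ar00200"}},
]
clusters = build_clusters(articles)
```

A standalone article ends up as a one-element cluster under its own id, which matches the description above.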
- text_preparation.importers.olive.helpers.get_scale_factor(issue_dir_path, archive, page_xml, box_strategy, img_source_name)
Returns the scale factor in Olive context, given a strategy to choose the source image.
- Parameters:
issue_dir_path (str) – Path of the issue.
archive (zipfile.ZipFile) – The zip archive.
page_xml (bytes) – The XML handler of the page.
box_strategy (str) – The box strategy, as found in the info.txt from the jp2 folder.
img_source_name – Name of the source image, as found in the info.txt from the jp2 folder.
- Returns:
the hopefully correct scale factor
- Return type:
float
Background information
Impresso converts library images to JP2, taking the best image available: tif > highest png > jpg. Olive box coordinates were computed according to an image source which we have to identify among several. Image format coverage is different from issue to issue, and we have to devise strategies.
Case 1: tif
The tif is present and is the file from which the jp2 was converted. Dest: Tif dimensions can therefore be used as jp2 dimensions, no need to read the jp2 file. Source: Image source dimension is present in the page.xml (normally).
Case 2: several png
In this case the jp2 was acquired using the png with the highest dimension. Dest: It appears that, when several png files exist, Olive also took the highest one for the OCR. It is therefore possible to rely on the resolution indicated in the page XML, which should be the same as that of our jp2. N.B.: the page width and height indicated in the XML usually do not correspond to the highest-resolution png (there is therefore a discrepancy in the Olive file between the tag ‘images_resolution’ on the one hand and ‘page_width|height’ on the other). It seems we can ignore this and rely on the resolution only in the current case. Source: the highest png. Here source and dest dimensions are equal, so the function returns 1.
Case 3: one png only
To be checked if it happens. In this case, there is no choice and Olive OCR and JP2 acquisition should be from the same source => scale factor of 1. Here we do an additional check to see if the page_width|height are the same as the image ones. The only danger is if Olive used another image file and did not provide it.
Case 4: one jpg only
Same as Case 3, scale factor of 1. Here we do an additional check to see if the page_width|height are the same as the image ones. (there is only one image and things should fit, not like in case 2)
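The four cases above can be summarized in a small decision sketch. Everything named here is hypothetical: the case labels are illustrative and are not the values of BoxStrategy, the inputs are simple file counts and widths rather than the real archive contents, and the additional dimension checks mentioned for cases 3 and 4 are left out. The tif ratio is assumed to be destination width over source width.

```python
def choose_case(n_tif: int, n_png: int, n_jpg: int) -> str:
    """Pick the applicable case following the priority tif > png > jpg."""
    if n_tif > 0:
        return "tif"
    if n_png > 1:
        return "several_png"
    if n_png == 1:
        return "one_png"
    if n_jpg >= 1:
        return "one_jpg"
    raise ValueError("No usable source image found")


def scale_factor_for(case: str, source_width: int, dest_width: int) -> float:
    """Return the scale factor implied by each case (sketch)."""
    if case == "tif":
        # Case 1: the source size comes from the page XML and the destination
        # size from the tif, so the two may differ.
        return dest_width / source_width
    # Cases 2 to 4: OCR source and JP2 source are taken to be the same image.
    return 1.0


case = choose_case(n_tif=0, n_png=3, n_jpg=1)
factor = scale_factor_for(case, source_width=1000, dest_width=2500)
```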
- text_preparation.importers.olive.helpers.keep_title(title: str) bool
Whether an element’s title should be kept.
The title should not be kept if it is one of “untitled article”, “untitled ad”, and “untitled picture”.
- Parameters:
title (str) – Title to verify
- Returns:
False if given title is in the black list, True otherwise.
- Return type:
bool
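A minimal sketch of this check, using the three titles listed above. `should_keep_title` is a hypothetical name, and comparing case-insensitively after stripping whitespace is an assumption; the actual helper may compare titles differently.

```python
# Placeholder titles named in the description above.
UNTITLED = {"untitled article", "untitled ad", "untitled picture"}


def should_keep_title(title: str) -> bool:
    """Return False for placeholder titles, True otherwise (sketch)."""
    # Assumed: ignore surrounding whitespace and letter case.
    return title.strip().lower() not in UNTITLED


keep_real = should_keep_title("Nouvelles de Genève")
keep_placeholder = should_keep_title("Untitled Article")
```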
- text_preparation.importers.olive.helpers.merge_pseudo_tokens(line: dict[str, list[Any]]) dict[str, list[Any]]
Remove pseudo tokens from a line by merging them.
- Parameters:
line (dict[str, list[Any]]) – A line of OCR in JSON format.
- Returns:
A new line object (with some merged tokens).
- Return type:
dict[str, list[Any]]
- text_preparation.importers.olive.helpers.merge_tokens(tokens: list[dict[str, Any]], line: str) dict[str, Any]
Merge two or more tokens for the same line into one.
The resulting (merged) token will have new coordinates corresponding to the combination of coordinates of the input tokens.
- Parameters:
tokens (list[dict[str, Any]]) – Tokens to merge.
line (str) – The line of text to which the input tokens belong.
- Returns:
The new (merged) token.
- Return type:
dict[str, Any]
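Combining coordinates amounts to taking the bounding box that encloses all input tokens. The sketch below assumes a token layout with a `c` key holding `[x, y, width, height]` and a `tx` key holding the text; this layout and the name `merge_token_boxes` are assumptions for illustration, and the `line` argument of the real function is not modeled.

```python
def merge_token_boxes(tokens):
    """Merge tokens into one whose box encloses all input boxes (sketch)."""
    # Top-left corner: smallest x and y among the tokens.
    x1 = min(t["c"][0] for t in tokens)
    y1 = min(t["c"][1] for t in tokens)
    # Bottom-right corner: largest x + width and y + height.
    x2 = max(t["c"][0] + t["c"][2] for t in tokens)
    y2 = max(t["c"][1] + t["c"][3] for t in tokens)
    return {
        "c": [x1, y1, x2 - x1, y2 - y1],
        "tx": "".join(t["tx"] for t in tokens),
    }


merged = merge_token_boxes([
    {"c": [10, 20, 30, 10], "tx": "news"},
    {"c": [40, 18, 25, 12], "tx": "paper"},
])
```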
- text_preparation.importers.olive.helpers.normalize_hyphenation(line: dict[str, list[Any]]) dict[str, list[Any]]
Normalize end-of-line hyphenated words.
- Parameters:
line (dict[str, list[Any]]) – A line of OCR.
- Returns:
A new line element.
- Return type:
dict[str, list[Any]]
- text_preparation.importers.olive.helpers.normalize_language(language: str) str
Normalize the language’s string representation.
- Parameters:
language (str) – Language to normalize.
- Returns:
Normalized language, one of “fr”, “en” and “de”.
- Return type:
str
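A lookup-table sketch of this normalization. The accepted input spellings are not documented here, so the table below is hypothetical; only the three output codes come from the description above. `to_language_code` is likewise an illustrative name.

```python
# Hypothetical input spellings mapped to the three documented codes.
_LANGUAGES = {"french": "fr", "english": "en", "german": "de"}


def to_language_code(language: str) -> str:
    """Map a language name or code to 'fr', 'en' or 'de' (sketch)."""
    key = language.strip().lower()
    if key in _LANGUAGES.values():
        return key  # already a normalized code
    # Raises KeyError for languages outside the table.
    return _LANGUAGES[key]


code = to_language_code("French")
```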
- text_preparation.importers.olive.helpers.normalize_line(line: dict[str, list[Any]], lang: str) dict[str, list[Any]]
Apply normalization rules to a line of OCR.
The normalization rules that are applied depend on the language in which the text is written. This normalization is necessary because Olive, unlike e.g. METS, does not explicitly encode the presence or absence of whitespace.
- Parameters:
line (dict[str, list[Any]]) – A line of OCR text.
lang (str) – Language of the text.
- Returns:
The new normalized line of text.
- Return type:
dict[str, list[Any]]
- text_preparation.importers.olive.helpers.recompose_ToC(original_toc_data: dict[int, dict[str, dict]], articles: list[dict[str, Any]], images: list[dict[str, str]]) list[dict[str, Any]]
Recompose the ToC of a newspaper issue.
Function used by OliveNewspaperIssue.
- Parameters:
original_toc_data (dict[int, dict[str, dict]]) – ToC data.
articles (list[dict[str, Any]]) – List of articles in the issue.
images (list[dict[str, str]]) – List of images in the issue.
- Returns:
List of final content items in the issue.
- Return type:
list[dict[str, Any]]
- text_preparation.importers.olive.helpers.recompose_page(page_id: str, info_from_toc: dict[str, dict], page_elements: dict[str, dict], clusters: dict[str, list[str]]) dict[str, Any]
Merge a list of page elements into a single one.
Note
It is here that an n attribute is assigned to each region/paragraph/line/token.
- Parameters:
page_id (str) – Page canonical id.
info_from_toc (dict[str, dict]) – Dictionary with page element IDs (articles, ads.) as keys, and dictionaries as values.
page_elements (dict[str, dict]) – Page’s articles or advertisements.
clusters (dict[str, list[str]]) – Inverted index of legacy ids; values are clusters of articles, each indexed by one member.
- Returns:
Page data according to impresso canonical format.
- Return type:
dict[str, Any]