BCUL ABBYY importer

This importer is written to accommodate the ABBYY OCR format. It was developed to handle OCR newspaper data provided by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL, the Lausanne Cantonal University Library), which are part of the Scriptorium interface and collection.

BCUL Custom classes

This module contains the definition of the BCUL importer classes.

The classes define newspaper Issue and Page objects which convert OCR data in the ABBYY format to a unified canonical format.

class text_importer.importers.bcul.classes.BculNewspaperIssue(issue_dir)

Bases: NewspaperIssue

Newspaper Issue in BCUL (Abbyy) format.

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:

str

edition

Lowercase letter used to order issues published on the same day.

Type:

str

journal

Newspaper unique identifier or name.

Type:

str

path

Path to directory containing the issue’s OCR data.

Type:

str

date

Publication date of issue.

Type:

datetime.date

issue_data

Issue data according to canonical format.

Type:

dict[str, Any]

pages

list of NewspaperPage instances from this issue.

Type:

list

rights

Access rights applicable to this issue.

Type:

str

mit_file

Path to the ABBYY ‘mit’ file that contains the OLR.

Type:

str

is_json

Whether the mit_file has the json file extension.

Type:

bool

is_xml

Whether the mit_file has the xml file extension.

Type:

bool

iiif_manifest

IIIF presentation manifest of this issue.

Type:

str

content_items

List of content items in this issue.

Type:

list[dict]

query_iiif_api(num_tries: int = 0, max_retries: int = 3) dict[str, Any]

Query the Scriptorium IIIF API for the issue’s manifest data.

TODO: implement the retry approach with celery package or similar.

Parameters:
  • num_tries (int, optional) – Number of retry attempts. Defaults to 0.

  • max_retries (int, optional) – Maximum number of attempts. Defaults to 3.

Returns:

Issue’s IIIF “canvases” for each page.

Return type:

dict[str, Any]

Raises:

Exception – If the maximum number of retry attempts is reached.
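The retry logic can be sketched with a simple recursive helper. Here `fetch` is a hypothetical stand-in for the actual HTTP request to the Scriptorium IIIF API, so the control flow can be shown without network access:

```python
from typing import Any, Callable

def query_with_retries(
    fetch: Callable[[], dict[str, Any]],
    num_tries: int = 0,
    max_retries: int = 3,
) -> dict[str, Any]:
    """Call `fetch` until it succeeds or `max_retries` is reached."""
    if num_tries >= max_retries:
        raise Exception("Maximum number of retry attempts reached.")
    try:
        return fetch()
    except OSError:
        # On a transient failure, retry with an incremented counter.
        return query_with_retries(fetch, num_tries + 1, max_retries)
```

This is only a sketch of the retry approach; as the TODO notes, the library may eventually delegate retries to a task queue such as celery.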

class text_importer.importers.bcul.classes.BculNewspaperPage(_id: str, number: int, page_path: str, iiif_uri: str)

Bases: NewspaperPage

Newspaper page in BCUL (Abbyy) format.

Parameters:
  • _id (str) – Canonical page ID.

  • number (int) – Page number.

  • page_path (str) – Path to the Abbyy XML page file.

  • iiif_uri (str) – IIIF image URI of this page.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:

str

number

Page number.

Type:

int

page_data

Page data according to canonical format.

Type:

dict[str, Any]

issue

Issue this page is from.

Type:

NewspaperIssue

path

Path to the Abbyy XML page file.

Type:

str

iiif_base_uri

IIIF image URI of this page.

Type:

str

add_issue(issue: NewspaperIssue) None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

property ci_id: str

Create and return the content item ID of the page.

Given that BCUL data do not entail article-level segmentation, each page is considered as a content item. Thus, to mint the content item ID we take the canonical page ID and simply replace the “p” prefix with “i”.

Returns:

Content item id.

Return type:

str
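For instance, minting the content item ID from a canonical page ID amounts to a prefix substitution (a simple illustration, not necessarily the library's exact implementation):

```python
page_id = "GDL-1900-01-02-a-p0004"
# Swap the "p" page prefix for "i" to obtain the content item ID.
ci_id = page_id.replace("-p", "-i")
print(ci_id)  # GDL-1900-01-02-a-i0004
```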

get_ci_divs() list[Tag]

Fetch and return the divs of tables and pictures from this page.

While BCUL does not entail article-level segmentation, tables and pictures are still segmented. They can thus have their own content item objects.

Returns:

List of segmented table and picture elements.

Return type:

list[Tag]

parse() None

Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
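The lazy behavior described above follows a common pattern, sketched generically below (illustrative only; the real class reads and transforms the page XML):

```python
class LazyPage:
    """Minimal sketch of lazy parsing: contents are only materialized
    once parse() is explicitly called, not at construction time."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.page_data: dict | None = None  # nothing parsed yet

    def parse(self) -> None:
        # The real implementation would read the XML file at self.path
        # and convert it to the canonical page format.
        self.page_data = {"id": self.path, "regions": []}
```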

property xml: BeautifulSoup

BCUL Detect functions

This module contains helper functions to find BCUL OCR data to import.

text_importer.importers.bcul.detect.BculIssueDir

A lightweight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. They can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

In the case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:
  • journal (str) – Newspaper ID.

  • date (datetime.date) – Publication date of the issue.

  • edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).

  • path (str) – Path to the directory containing the issue’s OCR data.

  • rights (str) – Access rights on the data (open, closed, etc.).

  • mit_file_type (str) – Type of mit file for this issue (json or xml).

>>> from datetime import date
>>> i = BculIssueDir(
...     journal='FAL',
...     date=date(1762, 12, 7),
...     edition='a',
...     path='./BCUL/46165',
...     rights='open_public',
...     mit_file_type='json',
... )
text_importer.importers.bcul.detect.detect_issues(base_dir: str, access_rights: str) list[IssueDirectory]

Detect BCUL newspaper issues to import within the filesystem.

This function expects the directory structure that BCUL used to organize the dump of Abbyy files.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • access_rights (str) – Path to access_rights_and_aliases.json file.

Returns:

List of BculIssueDir instances to be imported.

Return type:

list[BculIssueDir]

text_importer.importers.bcul.detect.dir2issue(path: str, journal_info: dict[str, str]) IssueDirectory | None

Create a BculIssueDir object from a directory.

Note

This function is called internally by detect_issues().

Parameters:
  • path (str) – The path of the issue.

  • journal_info (dict[str, str]) – Dictionary of information about the journal, including access rights.

Returns:

New BculIssueDir object.

Return type:

BculIssueDir | None

text_importer.importers.bcul.detect.select_issues(base_dir: str, config: dict, access_rights: str) list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues() with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • config (dict) – Config dictionary for filtering.

  • access_rights (str) – Not used for this importer; the argument is kept for uniformity.

Returns:

List of BculIssueDir to import.

Return type:

list[BculIssueDir] | None

BCUL Helper functions

Helper functions to parse BCUL OCR files.

text_importer.importers.bcul.helpers.find_mit_file(_dir: str) str

Given a directory, search for a file with a name ending with mit.

Parameters:

_dir (str) – Directory to look into.

Returns:

Path to the mit file once found.

Return type:

str

text_importer.importers.bcul.helpers.find_page_file_in_dir(base_path: str, file_id: str) str | None

Find the page file in a directory given the name it should have.

Parameters:
  • base_path (str) – The base path of the directory.

  • file_id (str) – The name of the page file if present.

Returns:

The path to the page file if found, otherwise None.

Return type:

str | None

text_importer.importers.bcul.helpers.get_div_coords(div: Tag) list[int]

Extract the coordinates from the given element and format them for iiif.

In the ABBYY format, coordinates are denoted by bottom and top (y-axis) and left and right (x-axis) values. IIIF coordinates, however, should be formatted as [x, y, width, height], where (x, y) denotes the box’s top-left corner (l, t). They thus need to be converted.

Parameters:

div (Tag) – Element to extract the coordinates from.

Returns:

Coordinates converted to the iiif format.

Return type:

list[int]
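The conversion described above can be sketched as follows, with the four bounds passed in directly (attribute extraction from the Tag is omitted, since the attribute names depend on the ABBYY flavour):

```python
def to_iiif_coords(left: int, top: int, right: int, bottom: int) -> list[int]:
    """Convert ABBYY box bounds (l, t, r, b) to iiif [x, y, width, height]."""
    return [left, top, right - left, bottom - top]

# A 100x50 box whose top-left corner is at (10, 20):
print(to_iiif_coords(10, 20, 110, 70))  # [10, 20, 100, 50]
```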

text_importer.importers.bcul.helpers.get_page_number(exif_file: str) int

Given an exif file, look for the page number inside.

This is for the JSON ‘flavour’ of BCUL, in which metadata about the pages are stored in JSON files whose names contain the substring exif.

Parameters:

exif_file (str) – Path to the exif file.

Raises:

ValueError – The page number could not be extracted from the file.

Returns:

Page number extracted from the file.

Return type:

int

text_importer.importers.bcul.helpers.parse_char_tokens(char_tokens: list[Tag]) list[dict[str, list[int] | str]]

Parse a list of div Tags to extract the tokens and their coordinates within a line.

Parameters:

char_tokens (list[Tag]) – div Tags corresponding to a line of tokens to parse.

Returns:

List of reconstructed parsed tokens.

Return type:

list[dict[str, list[int] | str]]

text_importer.importers.bcul.helpers.parse_date(mit_filename: str) date

Given the ‘mit’ filename, parse the date and ensure it is valid.

Parameters:

mit_filename (str) – Filename of the ‘mit’ file.

Returns:

Publication date of the issue.

Return type:

date

text_importer.importers.bcul.helpers.parse_textblock(block: Tag, page_ci_id: str) dict[str, Any]

Parse the given textblock element into a canonical region element.

Parameters:
  • block (Tag) – Text block div element to parse.

  • page_ci_id (str) – Canonical ID of the CI corresponding to this page.

Returns:

Parsed region object in canonical format.

Return type:

dict[str, Any]

text_importer.importers.bcul.helpers.parse_textline(line: Tag) dict[str, list[Any]]

Parse the div element corresponding to a textline.

Parameters:

line (Tag) – Textline div element Tag.

Returns:

Parsed line of text.

Return type:

dict[str, list[Any]]

text_importer.importers.bcul.helpers.verify_issue_has_ocr_files(path: str) None

Ensure the path to the issue considered contains XML files.

Parameters:

path (str) – Path to the issue considered.

Raises:

FileNotFoundError – No XML OCR files were found in the path.