BCUL ABBYY importer

This importer is written to accommodate the ABBYY OCR format. It was developed to handle OCR newspaper data provided by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL, the Lausanne Cantonal University Library), which are part of the Scriptorium interface and collection.

BCUL Custom classes

This module contains the definition of the BCUL importer classes.

The classes define newspaper Issue and Page objects which convert OCR data in the ABBYY format to a unified canonical format.

class text_importer.importers.bcul.classes.BculNewspaperIssue(issue_dir)

Bases: NewspaperIssue

Newspaper Issue in BCUL (Abbyy) format.

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. GDL-1900-01-02-a).

Type:

str

edition

Lowercase letter used to order issues published on the same day.

Type:

str

journal

Newspaper unique identifier or name.

Type:

str

path

Path to directory containing the issue’s OCR data.

Type:

str

date

Publication date of issue.

Type:

datetime.date

issue_data

Issue data according to canonical format.

Type:

dict[str, Any]

pages

list of NewspaperPage instances from this issue.

Type:

list

rights

Access rights applicable to this issue.

Type:

str

mit_file

Path to the ABBYY ‘mit’ file that contains the OLR.

Type:

str

is_json

Whether the mit_file has the json file extension.

Type:

bool

is_xml

Whether the mit_file has the xml file extension.

Type:

bool

iiif_manifest

IIIF presentation manifest of this issue.

Type:

str

content_items

List of content items in this issue.

Type:

list[dict]

query_iiif_api(num_tries: int = 0, max_retries: int = 3) dict[str, Any]

Query the Scriptorium IIIF API for the issue’s manifest data.

TODO: implement the retry approach with celery package or similar.

Parameters:
  • num_tries (int, optional) – Number of retry attempts. Defaults to 0.

  • max_retries (int, optional) – Maximum number of attempts. Defaults to 3.

Returns:

Issue’s IIIF “canvases” for each page.

Return type:

dict[str, Any]

Raises:

Exception – If the maximum number of retry attempts is reached.
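The retry logic can be sketched with a simple recursive helper. Here `fetch` is a hypothetical stand-in for the actual HTTP request to the Scriptorium IIIF API, so the control flow can be shown without network access:

```python
from typing import Any, Callable

def query_with_retries(
    fetch: Callable[[], dict[str, Any]],
    num_tries: int = 0,
    max_retries: int = 3,
) -> dict[str, Any]:
    """Call `fetch` until it succeeds or `max_retries` is reached."""
    if num_tries >= max_retries:
        raise Exception("Maximum number of retry attempts reached.")
    try:
        return fetch()
    except OSError:
        # On a transient failure, retry with an incremented counter.
        return query_with_retries(fetch, num_tries + 1, max_retries)
```

This is only a sketch of the retry approach; as the TODO notes, the library may eventually delegate retries to a task queue such as celery.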

class text_importer.importers.bcul.classes.BculNewspaperPage(_id: str, number: int, page_path: str, iiif_uri: str)

Bases: NewspaperPage

Newspaper page in BCUL (Abbyy) format.

Parameters:
  • _id (str) – Canonical page ID.

  • number (int) – Page number.

  • page_path (str) – Path to the Abbyy XML page file.

  • iiif_uri (str) – IIIF image URI of this page.

id

Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).

Type:

str

number

Page number.

Type:

int

page_data

Page data according to canonical format.

Type:

dict[str, Any]

issue

Issue this page is from.

Type:

NewspaperIssue

path

Path to the Abbyy XML page file.

Type:

str

iiif_base_uri

IIIF image URI of this page.

Type:

str

add_issue(issue: NewspaperIssue) None

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

property ci_id: str

Create and return the content item ID of the page.

Given that BCUL data do not entail article-level segmentation, each page is considered as a content item. Thus, to mint the content item ID we take the canonical page ID and simply replace the “p” prefix with “i”.

Returns:

Content item id.

Return type:

str
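For instance, minting the content item ID from a canonical page ID amounts to a prefix substitution (a simple illustration, not necessarily the library's exact implementation):

```python
page_id = "GDL-1900-01-02-a-p0004"
# Swap the "p" page prefix for "i" to obtain the content item ID.
ci_id = page_id.replace("-p", "-i")
print(ci_id)  # GDL-1900-01-02-a-i0004
```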

get_ci_divs() list[Tag]

Fetch and return the divs of tables and pictures from this page.

While BCUL does not entail article-level segmentation, tables and pictures are still segmented. They can thus have their own content item objects.

Returns:

List of segmented table and picture elements.

Return type:

list[Tag]

parse() None

Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
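The lazy behavior described above follows a common pattern, sketched generically below (illustrative only; the real class reads and transforms the page XML):

```python
class LazyPage:
    """Minimal sketch of lazy parsing: contents are only materialized
    once parse() is explicitly called, not at construction time."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.page_data: dict | None = None  # nothing parsed yet

    def parse(self) -> None:
        # The real implementation would read the XML file at self.path
        # and convert it to the canonical page format.
        self.page_data = {"id": self.path, "regions": []}
```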

property xml: BeautifulSoup

BCUL Detect functions

This module contains helper functions to find BCUL OCR data to import.

text_importer.importers.bcul.detect.BculIssueDir

A lightweight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. They can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

In the case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:
  • journal (str) – Newspaper ID.

  • date (datetime.date) – Publication date of the issue.

  • edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).

  • path (str) – Path to the directory containing the issue’s OCR data.

  • rights (str) – Access rights on the data (open, closed, etc.).

  • mit_file_type (str) – Type of mit file for this issue (json or xml).

>>> from datetime import date
>>> i = BculIssueDir(
...     journal='FAL',
...     date=date(1762, 12, 7),
...     edition='a',
...     path='./BCUL/46165',
...     rights='open_public',
...     mit_file_type='json',
... )
text_importer.importers.bcul.detect.detect_issues(base_dir: str, access_rights: str) list[IssueDirectory]

Detect BCUL newspaper issues to import within the filesystem.

This function expects the directory structure that BCUL used to organize the dump of Abbyy files.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • access_rights (str) – Path to access_rights_and_aliases.json file.

Returns:

List of BculIssueDir instances to be imported.

Return type:

list[BculIssueDir]

text_importer.importers.bcul.detect.dir2issue(path: str, journal_info: dict[str, str]) IssueDirectory | None

Create a BculIssueDir object from a directory.

Note

This function is called internally by detect_issues().

Parameters:
  • path (str) – The path of the issue.

  • journal_info (dict[str, str]) – Dictionary of information about the journal, including access rights.

Returns:

New BculIssueDir object.

Return type:

BculIssueDir | None

text_importer.importers.bcul.detect.select_issues(base_dir: str, config: dict, access_rights: str) list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues() with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • config (dict) – Config dictionary for filtering.

  • access_rights (str) – Not used for this importer; the argument is kept for uniformity.

Returns:

List of BculIssueDir to import.

Return type:

list[BculIssueDir] | None

BCUL Helper functions

Helper functions to parse BCUL OCR files.

text_importer.importers.bcul.helpers.find_mit_file(_dir: str) str

Given a directory, search for a file with a name ending with mit.

Parameters:

_dir (str) – Directory to look into.

Returns:

Path to the mit file once found.

Return type:

str

text_importer.importers.bcul.helpers.find_page_file_in_dir(base_path: str, file_id: str) str | None

Find the page file in a directory given the name it should have.

Parameters:
  • base_path (str) – The base path of the directory.

  • file_id (str) – The name of the page file if present.

Returns:

The path to the page file if found, otherwise None.

Return type:

str | None

text_importer.importers.bcul.helpers.get_div_coords(div: Tag) list[int]

Extract the coordinates from the given element and format them for iiif.

In the ABBYY format, coordinates are denoted by bottom and top (y-axis) and left and right (x-axis) values. IIIF coordinates, however, should be formatted as [x, y, width, height], where (x, y) denotes the box’s top-left corner (l, t). They thus need to be converted.

Parameters:

div (Tag) – Element to extract the coordinates from.

Returns:

Coordinates converted to the iiif format.

Return type:

list[int]
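The conversion described above can be sketched as follows, with the four bounds passed in directly (attribute extraction from the Tag is omitted, since the attribute names depend on the ABBYY flavour):

```python
def to_iiif_coords(left: int, top: int, right: int, bottom: int) -> list[int]:
    """Convert ABBYY box bounds (l, t, r, b) to iiif [x, y, width, height]."""
    return [left, top, right - left, bottom - top]

# A 100x50 box whose top-left corner is at (10, 20):
print(to_iiif_coords(10, 20, 110, 70))  # [10, 20, 100, 50]
```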

text_importer.importers.bcul.helpers.get_page_number(exif_file: str) int

Given an exif file, look for the page number inside.

This is for the JSON ‘flavour’ of BCUL, in which metadata about the pages are stored in JSON files whose names contain the substring exif.

Parameters:

exif_file (str) – Path to the exif file.

Raises:

ValueError – The page number could not be extracted from the file.

Returns:

Page number extracted from the file.

Return type:

int

text_importer.importers.bcul.helpers.parse_char_tokens(char_tokens: list[Tag]) list[dict[str, list[int] | str]]

Parse a list of div Tags to extract the tokens and their coordinates within a line.

Parameters:

char_tokens (list[Tag]) – div Tags corresponding to a line of tokens to parse.

Returns:

List of reconstructed parsed tokens.

Return type:

list[dict[str, list[int] | str]]

text_importer.importers.bcul.helpers.parse_date(mit_filename: str) date

Given the ‘mit’ filename, parse the date and ensure it is valid.

Parameters:

mit_filename (str) – Filename of the ‘mit’ file.

Returns:

Publication date of the issue.

Return type:

date

text_importer.importers.bcul.helpers.parse_textblock(block: Tag, page_ci_id: str) dict[str, Any]

Parse the given textblock element into a canonical region element.

Parameters:
  • block (Tag) – Text block div element to parse.

  • page_ci_id (str) – Canonical ID of the CI corresponding to this page.

Returns:

Parsed region object in canonical format.

Return type:

dict[str, Any]

text_importer.importers.bcul.helpers.parse_textline(line: Tag) dict[str, list[Any]]

Parse the div element corresponding to a textline.

Parameters:

line (Tag) – Textline div element Tag.

Returns:

Parsed line of text.

Return type:

dict[str, list[Any]]

text_importer.importers.bcul.helpers.verify_issue_has_ocr_files(path: str) None

Ensure the path to the issue considered contains XML files.

Parameters:

path (str) – Path to the issue considered.

Raises:

FileNotFoundError – No XML OCR files were found in the path.