BCUL ABBYY importer
This importer is written to accommodate the ABBYY OCR format. It was developed to handle OCR newspaper data provided by the Bibliothèque Cantonale Universitaire de Lausanne (BCUL - Lausanne Cantonal University Library), which are part of the Scriptorium interface and collection.
BCUL Custom classes
This module contains the definition of the BCUL importer classes.
The classes define newspaper Issue and Page objects which convert OCR data in the ABBYY format to a unified canonical format.
- class text_preparation.importers.bcul.classes.BculNewspaperIssue(issue_dir)
Bases:
NewspaperIssue
Newspaper Issue in BCUL (ABBYY) format.
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- mit_file
Path to the ABBYY ‘mit’ file that contains the OLR.
- Type:
str
- is_json
Whether the mit_file has the json file extension.
- Type:
bool
- is_xml
Whether the mit_file has the xml file extension.
- Type:
bool
- iiif_manifest
Presentation iiif manifest for this issue.
- Type:
str
- content_items
List of content items in this issue.
- Type:
list[dict]
- query_iiif_api(num_tries: int = 0, max_retries: int = 3) dict[str, Any]
Query the Scriptorium IIIF API for the issue’s manifest data.
TODO: implement the retry approach with celery package or similar.
- Parameters:
num_tries (int, optional) – Number of retry attempts. Defaults to 0.
max_retries (int, optional) – Maximum number of attempts. Defaults to 3.
- Returns:
Issue’s IIIF “canvases” for each page.
- Return type:
dict[str, Any]
- Raises:
Exception – If the maximum number of retry attempts is reached.
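The num_tries/max_retries pattern described above can be sketched as a small recursive helper. This is an illustrative sketch only, not the importer's actual code: the `fetch` callable stands in for the real Scriptorium IIIF API request, and the backoff strategy is an assumption.

```python
import time
from typing import Any, Callable


def fetch_with_retries(
    fetch: Callable[[], dict[str, Any]],
    num_tries: int = 0,
    max_retries: int = 3,
) -> dict[str, Any]:
    """Recursively retry `fetch` until it succeeds or retries run out.

    `fetch` is a stand-in for the actual IIIF manifest request; only the
    retry/raise pattern mirrors the documented behavior above.
    """
    try:
        return fetch()
    except Exception:
        if num_tries >= max_retries:
            # Mirrors the documented behavior: raise once retries are exhausted.
            raise Exception(f"Maximum number of retries ({max_retries}) reached.")
        time.sleep(2**num_tries)  # simple exponential backoff between attempts
        return fetch_with_retries(fetch, num_tries + 1, max_retries)
```

As the TODO above notes, a production version would likely delegate retrying to a task queue such as celery rather than sleeping in-process.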
- class text_preparation.importers.bcul.classes.BculNewspaperPage(_id: str, number: int, page_path: str, iiif_uri: str)
Bases:
NewspaperPage
Newspaper page in BCUL (Abbyy) format.
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
page_path (str) – Path to the ABBYY XML page file.
iiif_uri (str) – URI to image IIIF of this page.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
NewspaperIssue
- path
Path to the ABBYY XML page file.
- Type:
str
- iiif_base_uri
URI to image IIIF of this page.
- Type:
str
- add_issue(issue: NewspaperIssue) None
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- property ci_id: str
Create and return the content item ID of the page.
Given that BCUL data do not entail article-level segmentation, each page is considered as a content item. Thus, to mint the content item ID we take the canonical page ID and simply replace the “p” prefix with “i”.
- Returns:
Content item id.
- Return type:
str
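Since the content item ID is minted by swapping the page ID's "p" prefix for "i", the transformation can be sketched as follows (a standalone sketch with a hypothetical function name, not the property's actual implementation):

```python
def page_id_to_ci_id(page_id: str) -> str:
    """Mint a content item ID from a canonical page ID.

    The page suffix "pNNNN" becomes "iNNNN"; the rest of the ID is kept.
    """
    # e.g. "GDL-1900-01-02-a-p0004" -> ("GDL-1900-01-02-a", "p0004")
    base, page_suffix = page_id.rsplit("-", 1)
    return f"{base}-i{page_suffix[1:]}"
```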
- get_ci_divs() list[Tag]
Fetch and return the divs of tables and pictures from this page.
While BCUL does not entail article-level segmentation, tables and pictures are still segmented. They can thus have their own content item objects.
- Returns:
List of segmented table and picture elements.
- Return type:
list[Tag]
- parse() None
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
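The lazy behavior described in the note can be illustrated with a minimal stub class (hypothetical, not the importer's BculNewspaperPage): construction only records the path, and the canonical data is populated when parse() runs.

```python
class LazyPage:
    """Minimal sketch of the lazy-parsing pattern used by page objects."""

    def __init__(self, path: str):
        self.path = path
        self.page_data: dict | None = None  # nothing is parsed at creation time

    def parse(self) -> None:
        # The real importer reads the ABBYY XML here and converts it to the
        # canonical format; this stub only records that parsing happened.
        self.page_data = {"id": self.path, "regions": []}
```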
- property xml: BeautifulSoup
Read the page's ABBYY XML file and return its contents as a BeautifulSoup object.
BCUL Detect functions
This module contains helper functions to find BCUL OCR data to import.
- text_preparation.importers.bcul.detect.BculIssueDir
A light-weight data structure to represent a newspaper issue.
This named tuple contains basic metadata about a newspaper issue. They can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
In case of newspaper published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
journal (str) – Newspaper ID.
date (datetime.date) – Publication date of the issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
rights (str) – Access rights on the data (open, closed, etc.).
mit_file_type (str) – Type of mit file for this issue (json or xml).
>>> from datetime import date
>>> i = BculIssueDir(
...     journal='FAL',
...     date=date(1762, 12, 7),
...     edition='a',
...     path='./BCUL/46165',
...     rights='open_public',
...     mit_file_type='json',
... )
- text_preparation.importers.bcul.detect.detect_issues(base_dir: str, access_rights: str) list[IssueDirectory]
Detect BCUL newspaper issues to import within the filesystem.
This function expects the directory structure that BCUL used to organize the dump of Abbyy files.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Path to access_rights_and_aliases.json file.
- Returns:
List of BculIssueDir instances to be imported.
- Return type:
list[BculIssueDir]
- text_preparation.importers.bcul.detect.dir2issue(path: str, journal_info: dict[str, str]) IssueDirectory | None
Create a BculIssueDir object from a directory.
Note
This function is called internally by detect_issues().
- Parameters:
path (str) – The path of the issue.
journal_info (dict[str, str]) – Dictionary containing the journal's information (access rights and aliases).
- Returns:
New BculIssueDir object.
- Return type:
BculIssueDir | None
- text_preparation.importers.bcul.detect.select_issues(base_dir: str, config: dict, access_rights: str) list[IssueDirectory] | None
Detect selectively newspaper issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
access_rights (str) – Not used for this importer, but the argument is kept for uniformity.
- Returns:
List of BculIssueDir to import.
- Return type:
list[BculIssueDir] | None
BCUL Helper functions
Helper functions to parse BCUL OCR files.
- text_preparation.importers.bcul.helpers.find_mit_file(_dir: str) str
Given a directory, search for a file with a name ending in ‘mit’.
- Parameters:
_dir (str) – Directory to look into.
- Returns:
Path to the mit file once found.
- Return type:
str
- text_preparation.importers.bcul.helpers.find_page_file_in_dir(base_path: str, file_id: str) str | None
Find the page file in a directory given the name it should have.
- Parameters:
base_path (str) – The base path of the directory.
file_id (str) – The name of the page file if present.
- Returns:
The path to the page file if found, otherwise None.
- Return type:
str | None
- text_preparation.importers.bcul.helpers.get_div_coords(div: Tag) list[int]
Extract the coordinates from the given element and format them for iiif.
In Abbyy format, the coordinates are denoted by the bottom, top (y-axis), left and right (x-axis) values. But iiif coordinates should be formatted as [x, y, width, height], where (x,y) denotes the box’s top left corner: (l, t). Thus they need conversion.
- Parameters:
div (Tag) – Element to extract the coordinates from
- Returns:
Coordinates converted to the iiif format.
- Return type:
list[int]
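The conversion described above is a simple arithmetic mapping from ABBYY's left/top/right/bottom bounds to IIIF's [x, y, width, height] form. A minimal sketch (hypothetical function name, not the helper's actual implementation):

```python
def abbyy_box_to_iiif(left: int, top: int, right: int, bottom: int) -> list[int]:
    """Convert ABBYY l/t/r/b bounds to IIIF [x, y, width, height].

    (x, y) is the box's top-left corner, i.e. (left, top); width and
    height follow from the horizontal and vertical extents.
    """
    return [left, top, right - left, bottom - top]
```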
- text_preparation.importers.bcul.helpers.get_page_number(exif_file: str) int
Given an exif file, look for the page number inside.
This is for the JSON ‘flavour’ of BCUL, in which metadata about the pages are in JSON files which contain the substring exif.
- Parameters:
exif_file (str) – Path to the exif file.
- Raises:
ValueError – The page number could not be extracted from the file.
- Returns:
Page number extracted from the file.
- Return type:
int
- text_preparation.importers.bcul.helpers.parse_char_tokens(char_tokens: list[Tag]) list[dict[str, list[int] | str]]
Parse a list of div Tag to extract the tokens and coordinates within a line.
- Parameters:
char_tokens (list[Tag]) – div Tags corresponding to a line of tokens to parse.
- Returns:
List of reconstructed parsed tokens.
- Return type:
list[dict[str, list[int] | str]]
- text_preparation.importers.bcul.helpers.parse_date(mit_filename: str) date
Given the ‘mit’ filename, parse the date and ensure it is valid.
- Parameters:
mit_filename (str) – Filename of the ‘mit’ file.
- Returns:
Publication date of the issue.
- Return type:
date
- text_preparation.importers.bcul.helpers.parse_textblock(block: Tag, page_ci_id: str) dict[str, Any]
Parse the given textblock element into a canonical region element.
- Parameters:
block (Tag) – Text block div element to parse.
page_ci_id (str) – Canonical ID of the CI corresponding to this page.
- Returns:
Parsed region object in canonical format.
- Return type:
dict[str, Any]
- text_preparation.importers.bcul.helpers.parse_textline(line: Tag) dict[str, list[Any]]
Parse the div element corresponding to a textline.
- Parameters:
line (Tag) – Textline div element Tag.
- Returns:
Parsed line of text.
- Return type:
dict[str, list]
- text_preparation.importers.bcul.helpers.verify_issue_has_ocr_files(path: str) None
Ensure the path to the issue considered contains XML files.
- Parameters:
path (str) – Path to the issue considered
- Raises:
FileNotFoundError – No XML OCR files were found in the path.