Generic TETML importer

This generic importer parses OCR document data produced by PDFlib TET.

Tetml Custom classes

Classes to handle the TETML OCR format.

class text_preparation.importers.tetml.classes.TetmlNewspaperIssue(issue_dir: IssueDir)

Class representing a newspaper issue in TETML format.

Upon object initialization the following things happen:

index all the tetml documents
parse the tetml file that contains the actual content and some metadata
initialize page objects (instances of TetmlNewspaperPage).

Parameters: issue_dir (IssueDir) – Newspaper issue with relevant information.

parse_articles(): Parse all articles of this issue

class text_preparation.importers.tetml.classes.TetmlNewspaperPage(_id: str, number: int, page_content: dict, page_xml)

Generic class representing a page in Tetml format.

Parameters:

number (int) – Page number.
page_content (dict) – Nested article content of a single page
page_xml (str) – Path to the Tetml file of the page.

add_issue(issue: CanonicalIssue)

Add to a page object its parent, i.e. the canonical issue.

This allows each page to preserve contextual information coming from the canonical issue.

Parameters: issue (CanonicalIssue) – Canonical issue containing this page.

parse(): Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
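
The lazy behavior described in the note can be illustrated with a minimal sketch. The class LazyPage, its attributes, the "regions" key and the page identifier below are hypothetical stand-ins, not the importer's actual implementation; the point is only that construction stores the input and parse() does the work.

```python
class LazyPage:
    """Hypothetical stand-in for a page object that defers parsing."""

    def __init__(self, page_id: str, number: int, raw_content: dict):
        self.id = page_id
        self.number = number
        self._raw_content = raw_content  # stored as-is, not processed yet
        self.page_data = None            # filled in only by parse()

    def parse(self) -> None:
        # The (potentially expensive) transformation happens only here.
        self.page_data = {
            "id": self.id,
            "n_regions": len(self._raw_content.get("regions", [])),
        }


page = LazyPage("NZZ-1900-01-01-a-p0001", 1, {"regions": [{}, {}]})
print(page.page_data)  # nothing has been parsed at construction time
page.parse()
print(page.page_data["n_regions"])
```

Deferring the work this way keeps issue-level indexing cheap, since only the pages that are actually needed get processed.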

Tetml Detect functions

This module contains functions to detect Tetml OCR data to be imported.

text_preparation.importers.tetml.detect.TetmlIssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. This metadata can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

For newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:

provider (str) – Provider for this alias, here always “NZZ” or “FedGaz”
alias (str) – Newspaper alias
date (datetime.date) – Publication date
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.)
path (str) – Path to the directory containing OCR data

>>> from datetime import date
>>> i = TetmlIssueDir('NZZ', 'NZZ', date(1900,1,1), 'a', './NZZ-1900-01-01/')
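
As a rough illustration of how these fields could be combined into an identifier, the sketch below rebuilds a tuple with the same five fields and joins alias, date and edition. IssueDirSketch and canonical_issue_id are hypothetical names, and the "alias-date-edition" layout is an assumption made for the example rather than something stated in this section.

```python
from collections import namedtuple
from datetime import date

# Stand-in with the same fields as the documented named tuple.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)


def canonical_issue_id(issue) -> str:
    """Join alias, ISO date and edition letter (assumed ID layout)."""
    return f"{issue.alias}-{issue.date.isoformat()}-{issue.edition}"


i = IssueDirSketch("NZZ", "NZZ", date(1900, 1, 1), "a", "./NZZ-1900-01-01/")
print(canonical_issue_id(i))  # NZZ-1900-01-01-a
```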

text_preparation.importers.tetml.detect.tetml_detect_issues(base_dir: str, alias_filter: set = None, exclude: bool = False) → List[TetmlIssueDirectory]

Detect newspaper issues to import within the filesystem.

This function expects the directory structure that RERO used to organize the dump of Tetml OCR data.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
alias_filter (set) – IDs of newspapers to consider.
exclude (bool) – Whether alias_filter should determine exclusion.

Returns:

List of TetmlIssueDir instances, to be imported.
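
One plausible reading of how alias_filter and exclude interact is sketched below. The helper keep_alias is hypothetical and its handling of an empty filter is an assumption; the actual function may behave differently in edge cases.

```python
def keep_alias(alias: str, alias_filter: set = None, exclude: bool = False) -> bool:
    """Decide whether a newspaper alias passes the filter (assumed semantics)."""
    if not alias_filter:
        # No filter given: keep everything.
        return True
    if exclude:
        # The filter lists aliases to leave out.
        return alias not in alias_filter
    # The filter lists the only aliases to keep.
    return alias in alias_filter


aliases = ["NZZ", "FedGazDe", "FedGazFr"]
print([a for a in aliases if keep_alias(a, {"NZZ"})])
print([a for a in aliases if keep_alias(a, {"NZZ"}, exclude=True)])
```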

text_preparation.importers.tetml.detect.tetml_select_issues(base_dir: str, config: dict) → List[TetmlIssueDirectory]

Selectively detect newspaper issues to import.

The behavior is very similar to tetml_detect_issues(); the only difference is that config specifies rules to filter the data to import. See the documentation on filtering configuration for further details.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.

Returns:

List of TetmlIssueDir instances, to be imported.

Tetml parsers

Functions to parse TETML data.

text_preparation.importers.tetml.parsers.tetml_parser(tetml: str, filtering: bool = True, ignore_page_number: bool = True, language='de') → dict

Parse a TETML file (e.g. from Swiss Federal Archive).

The main logic implemented here was derived from https://github.com/impresso/nzz/. A TETML file corresponds loosely to one article, delimited by the boundaries of the underlying PDF.

Parameters:

tetml (str) – Path to the TETML file to parse.
filtering (bool) – Whether to filter out pre-defined tokens.

Returns:

A dictionary with keys: metadata, pages (content), meta (additional metadata).

Return type:

dict

Tetml Helper methods

Helper functions used by the Tetml Importer.

These functions are mainly used within (i.e. called by) the classes TetmlNewspaperIssue and TetmlNewspaperPage.

text_preparation.importers.tetml.helpers.add_gn_property(tokens: [dict], language: str) → None

Set a property on each token indicating whether whitespace follows it.

Parameters:

tokens (list) – List of token dictionaries.
language (str) – Language abbreviation (de, fr, eng, etc.).

Returns:

None

text_preparation.importers.tetml.helpers.compute_bb(innerbbs: list) → list

Compute coordinates of the bounding box from multiple boxes.

Parameters: innerbbs (list) – List of multiple inner boxes (x,y,w,h).
Returns: List of coordinates from the bounding box (x,y,w,h).
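
A minimal sketch of such a computation, assuming the result is the smallest box enclosing all inner boxes. The function name union_box is hypothetical and this is not necessarily how compute_bb is written.

```python
def union_box(boxes: list) -> list:
    """Return the smallest (x, y, w, h) box enclosing all given boxes."""
    x1 = min(x for x, y, w, h in boxes)      # leftmost edge
    y1 = min(y for x, y, w, h in boxes)      # topmost edge
    x2 = max(x + w for x, y, w, h in boxes)  # rightmost edge
    y2 = max(y + h for x, y, w, h in boxes)  # bottommost edge
    return [x1, y1, x2 - x1, y2 - y1]


print(union_box([[10, 10, 20, 5], [25, 12, 30, 10]]))  # [10, 10, 45, 12]
```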

text_preparation.importers.tetml.helpers.compute_box(llx: float, lly: float, urx: float, ury: float, pageheight: float, pagewidth: float, imageheight: float, imagewidth: float, placedimage_attribs: dict) → list

Compute IIIF box coordinates of input_box.

New box coordinates [x,y,w,h] are in IIIF coordinate system https://iiif.io/api/image/2.0/#region

(x, y)
*--------------
|             |
|             |
|             |
|             |
|             |
|             |
--------------*
              (x2, y2)

w = x2 - x
h = y2 - y

Parameters:

pageheight (float)
pagewidth (float)
imageheight (float)
imagewidth (float)
llx (float) – lower left x coordinate (lower=smaller)
lly (float) – lower left y coordinate (lower=smaller)
urx (float) – upper right x coordinate (upper=bigger)
ury (float) – upper right y coordinate (upper=bigger)
placedimage_attribs (dict) – all attributes of the placed image

Returns:

list with new box coordinates

Return type:

list
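
The conversion can be sketched under simplifying assumptions: the input box uses a coordinate system with its origin at the lower left (y growing upward), the image covers the whole page, and the placed-image attributes are ignored. The function pdf_to_iiif_box is a hypothetical illustration, not the actual compute_box.

```python
def pdf_to_iiif_box(llx, lly, urx, ury, pageheight, pagewidth,
                    imageheight, imagewidth) -> list:
    """Convert a lower-left-origin box to an [x, y, w, h] box with the
    origin at the top left, scaled to image pixels (simplified sketch)."""
    sx = imagewidth / pagewidth    # horizontal scale, page units -> pixels
    sy = imageheight / pageheight  # vertical scale, page units -> pixels
    x = llx * sx
    y = (pageheight - ury) * sy    # flip the y axis: top edge becomes y
    w = (urx - llx) * sx
    h = (ury - lly) * sy
    return [round(x), round(y), round(w), round(h)]


# Page of 100 x 200 units rendered as an image of 200 x 400 pixels.
print(pdf_to_iiif_box(10, 150, 60, 190, 200, 100, 400, 200))  # [20, 20, 100, 80]
```

The y flip is the step that is easy to get wrong: the upper edge of the input box (ury) determines the y of the resulting top-left corner.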

text_preparation.importers.tetml.helpers.filter_special_symbols(jtoken: dict) → bool

Check if token needs to be filtered out as it is a non-content word

Parameters: jtoken (dict) – Token text and coordinates.
Returns: bool to indicate stop or content word

text_preparation.importers.tetml.helpers.get_metadata(root: Element) → dict

Return dict with relevant metadata from page file

Parameters: root – etree.Element of tetml page file
Returns: A dictionary with keys: tetcdt, pdfpath, pdfcdt, npages.

text_preparation.importers.tetml.helpers.get_placed_image(root: Element) → dict

Return dimensions of the placed image

<PlacedImage image="I0" x="0.00" y="0.00" width="588.84" height="842.00" /> => {"image": "I0", …}

Parameters: root (etree.Element) – TETML document.
Returns: dict with all attributes of the image XML element.

text_preparation.importers.tetml.helpers.get_tif_shape(root: Element, id_image: str) → tuple

Return original tiff dimensions stored in tetml

<Image id="I0" extractedAs=".tif" width="1404" height="2367" colorspace="CS0" bitsPerComponent="1"/>

Parameters: root – etree.Element
Returns: width and height of tiff image.
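
Reading these dimensions can be sketched with the standard library. The snippet below uses xml.etree.ElementTree instead of lxml, a namespace-free fragment, and hypothetical wrapper elements around the Image element, so it is a simplified illustration rather than the helper's actual code.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: the wrapper elements are illustrative and XML
# namespaces are omitted.
SAMPLE = (
    '<Resources><Images>'
    '<Image id="I0" extractedAs=".tif" width="1404" height="2367" '
    'colorspace="CS0" bitsPerComponent="1"/>'
    '</Images></Resources>'
)


def tif_shape(root, id_image: str) -> tuple:
    """Return (width, height) of the Image element with the given id."""
    for image in root.iter("Image"):
        if image.get("id") == id_image:
            return int(image.get("width")), int(image.get("height"))
    raise ValueError(f"no Image element with id {id_image!r}")


root = ET.fromstring(SAMPLE)
print(tif_shape(root, "I0"))  # (1404, 2367)
```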

text_preparation.importers.tetml.helpers.remove_page_number(jtoken: dict, i_line: int, i_word: int) → bool

Check whether a token is a page number in the header, i.e. it appears within the first 3 tokens of the first line and is not longer than 3 digits.

Parameters:

jtoken (dict) – Token text and coordinates.
i_line (int) – Line number.
i_word (int) – Word number in line.

Returns:

bool to indicate page number.
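
The rule stated above can be sketched as follows. The function looks_like_page_number is hypothetical, and the sketch assumes that line and word indices start at 0 and that the token text is stored under the "tx" key.

```python
def looks_like_page_number(jtoken: dict, i_line: int, i_word: int) -> bool:
    """Apply the header page-number heuristic (assumes 0-based indices)."""
    text = jtoken.get("tx", "")
    return (
        i_line == 0            # first line of the page
        and i_word < 3         # among the first three tokens
        and text.isdigit()     # consists of digits only
        and len(text) <= 3     # at most three digits
    )


print(looks_like_page_number({"tx": "12"}, 0, 1))    # True
print(looks_like_page_number({"tx": "1900"}, 0, 0))  # False
```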

text_preparation.importers.tetml.helpers.word2json(word: Element, pageheight: float, pagewidth: float, imageheight: float, imagewidth: float, placed_image_attribs: dict, filename: str = None) → dict

Return dict with all information about the (hyphenated) TETML word element

{"tx": Text, "c": coords, "hy" : Bool, "hyt": {"nf": Text, "c":coords, "tx":coords}}

“hyt” is {} if word is not hyphenated

Parameters:

pageheight (float)
pagewidth (float)
imageheight (float)
imagewidth (float)
placed_image_attribs (dict)
filename (str)
word (lxml.etree.Element)

Returns:

dictionary with token text and metadata