Generic TETML importer

This generic importer parses OCR document data produced by PDFlib TET.

Tetml Custom classes

Classes to handle the TETML OCR format.

class text_importer.importers.tetml.classes.TetmlNewspaperIssue(issue_dir: IssueDir)

Class representing a newspaper issue in TETML format.

Upon object initialization the following things happen:

  • index all the tetml documents

  • parse the tetml file that contains the actual content and some metadata

  • initialize page objects (instances of TetmlNewspaperPage).

Parameters:

issue_dir (IssueDir) – Newspaper issue with relevant information.

parse_articles()

Parse all articles of this issue

class text_importer.importers.tetml.classes.TetmlNewspaperPage(_id: str, number: int, page_content: dict, page_xml)

Generic class representing a page in Tetml format.

Parameters:
  • _id (str) – Canonical page ID.

  • number (int) – Page number.

  • page_content (dict) – Nested article content of a single page

  • page_xml (str) – Path to the Tetml file of the page.

add_issue(issue: NewspaperIssue)

Add to a page object its parent, i.e. the newspaper issue.

This allows each page to preserve contextual information coming from the newspaper issue.

Parameters:

issue (NewspaperIssue) – Newspaper issue containing this page.

parse()

Process the page XML file and transform into canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
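The lazy behavior described in this note can be illustrated with a minimal sketch. The class LazyPage, its attributes, and the example page ID are hypothetical stand-ins chosen for illustration; they are not the actual TetmlNewspaperPage implementation.

```python
class LazyPage:
    """Hypothetical stand-in for a page class with deferred parsing."""

    def __init__(self, page_id: str, number: int, raw_content: dict):
        self.id = page_id
        self.number = number
        self._raw_content = raw_content  # stored as-is, not processed yet
        self.page_data = None  # filled only once parse() is called

    def parse(self) -> None:
        # The potentially expensive transformation happens here, on demand.
        self.page_data = {
            "id": self.id,
            "regions": list(self._raw_content.get("regions", [])),
        }


page = LazyPage("GDL-1900-01-01-a-p0001", 1, {"regions": [{"c": [0, 0, 10, 10]}]})
print(page.page_data)  # None: nothing has been parsed yet
page.parse()
print(page.page_data["id"])
```

Deferring the work to parse() keeps object creation cheap when many pages are instantiated but only some are processed.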

Tetml Detect functions

text_importer.importers.tetml.detect.TetmlIssueDir

A light-weight data structure to represent a newspaper issue.

This named tuple contains basic metadata about a newspaper issue. This metadata can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

For newspapers published multiple times per day, a lowercase letter indicates the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:
  • journal (str) – Newspaper ID

  • date (datetime.date) – Publication date

  • edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.)

  • path (str) – Path to the directory containing OCR data

  • rights (str) – Access rights on the data (open, closed, etc.)

>>> from datetime import date
>>> i = TetmlIssueDir('GDL', date(1900,1,1), 'a', './GDL-1900-01-01/', 'open')

text_importer.importers.tetml.detect.dir2tetmldir(issue_dir: IssueDir, access_rights: dict) → TetmlIssueDirectory

Helper function that injects access rights info into an IssueDir.

Note

This function is called internally by tetml_detect_issues().

Parameters:
  • issue_dir (IssueDir) – Input IssueDir object.

  • access_rights (dict) – Access rights information.

Returns:

New TetmlIssueDir object.
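A minimal sketch of how access rights might be injected into an issue tuple. The named tuples below are hypothetical stand-ins for IssueDir and TetmlIssueDir, the function name inject_rights is invented for illustration, and the layout of access_rights (a plain mapping from newspaper ID to a rights string) is an assumption; the real access_rights.json may be structured differently.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-ins for IssueDir and TetmlIssueDir.
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])
TetmlIssueDir = namedtuple(
    "TetmlIssueDir", ["journal", "date", "edition", "path", "rights"]
)


def inject_rights(issue_dir, access_rights: dict, default: str = "closed"):
    """Return a new tuple extended with the rights of the issue's journal.

    Assumes access_rights maps a newspaper ID directly to a rights string.
    """
    rights = access_rights.get(issue_dir.journal, default)
    # Named tuples are immutable, so a new object is built.
    return TetmlIssueDir(*issue_dir, rights)


issue = IssueDir("GDL", date(1900, 1, 1), "a", "./GDL-1900-01-01/")
print(inject_rights(issue, {"GDL": "open"}))
```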

text_importer.importers.tetml.detect.tetml_detect_issues(base_dir: str, access_rights: str, journal_filter: set = None, exclude: bool = False) → List[TetmlIssueDirectory]

Detect newspaper issues to import within the filesystem.

This function expects the directory structure that RERO used to organize the dump of Tetml OCR data.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • access_rights (str) – Path to access_rights.json file.

  • journal_filter (set) – IDs of newspapers to consider.

  • exclude (bool) – Whether journal_filter should determine exclusion.

Returns:

List of TetmlIssueDir instances, to be imported.
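The interplay of journal_filter and exclude can be sketched as follows. This is an illustration of the documented semantics only, assuming that exclude=False means "keep only the listed newspapers"; the function name apply_journal_filter is hypothetical and the sketch operates on plain newspaper IDs rather than on the filesystem.

```python
def apply_journal_filter(journals, journal_filter=None, exclude=False):
    """Keep or drop newspaper IDs depending on journal_filter and exclude."""
    if journal_filter is None:
        return list(journals)  # no filter given: consider everything
    if exclude:
        # journal_filter lists the newspapers to leave out
        return [j for j in journals if j not in journal_filter]
    # journal_filter lists the only newspapers to keep
    return [j for j in journals if j in journal_filter]


journals = ["GDL", "JDG", "NZZ"]
print(apply_journal_filter(journals, {"GDL"}))                # ['GDL']
print(apply_journal_filter(journals, {"GDL"}, exclude=True))  # ['JDG', 'NZZ']
```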

text_importer.importers.tetml.detect.tetml_select_issues(base_dir: str, config: dict, access_rights: str) → List[TetmlIssueDirectory]

Selectively detect newspaper issues to import.

The behavior is very similar to tetml_detect_issues(); the only difference is that config specifies rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:
  • base_dir (str) – Path to the base directory of newspaper data.

  • config (dict) – Config dictionary for filtering.

  • access_rights (str) – Path to access_rights.json file.

Returns:

List of TetmlIssueDir instances, to be imported.

Tetml parsers

Functions to parse TETML data.

text_importer.importers.tetml.parsers.tetml_parser(tetml: str, filtering: bool = True, ignore_page_number: bool = True, language='de') → dict

Parse a TETML file (e.g. from the Swiss Federal Archives).

The main logic implemented here was derived from https://github.com/impresso/nzz/. A TETML file corresponds loosely to one article, delimited by the boundaries of the source PDF.

Parameters:
  • tetml (str) – Path to the TETML file to be parsed.

  • filtering (bool) – Whether to filter out pre-defined tokens.

Returns:

A dictionary with keys: metadata, pages (content), meta (additional metadata).

Return type:

dict

Tetml Helper methods

Helper functions used by the Tetml Importer.

These functions are mainly used within (i.e. called by) the classes TetmlNewspaperIssue and TetmlNewspaperPage.

text_importer.importers.tetml.helpers.add_gn_property(tokens: [<class 'dict'>], language: str) → None

Set property to indicate the use of whitespace following a token.

Parameters:
  • tokens (list) – List of token dictionaries.

  • language (str) – Language abbreviation (de, fr, eng, etc.).

Returns:

None

text_importer.importers.tetml.helpers.compute_bb(innerbbs: list) → list

Compute coordinates of the bounding box from multiple boxes.

Parameters:

innerbbs (list) – List of multiple inner boxes (x,y,w,h).

Returns:

List of coordinates from the bounding box (x,y,w,h).
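As an illustration, an enclosing box can be derived from several (x, y, w, h) boxes by taking the minimum of the top-left corners and the maximum of the bottom-right corners. This is a sketch of that general technique under the name enclosing_box (hypothetical), not necessarily how compute_bb is implemented.

```python
def enclosing_box(inner_boxes):
    """Return [x, y, w, h] of the smallest box enclosing all inner boxes."""
    x1 = min(x for x, y, w, h in inner_boxes)
    y1 = min(y for x, y, w, h in inner_boxes)
    x2 = max(x + w for x, y, w, h in inner_boxes)
    y2 = max(y + h for x, y, w, h in inner_boxes)
    # Convert the two corners back to the (x, y, w, h) representation.
    return [x1, y1, x2 - x1, y2 - y1]


print(enclosing_box([(10, 20, 30, 40), (25, 10, 50, 20)]))  # [10, 10, 65, 50]
```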

text_importer.importers.tetml.helpers.compute_box(llx: float, lly: float, urx: float, ury: float, pageheight: float, pagewidth: float, imageheight: float, imagewidth: float, placedimage_attribs: dict) → list

Compute IIIF box coordinates of input_box.

New box coordinates [x,y,w,h] are in IIIF coordinate system https://iiif.io/api/image/2.0/#region

(x, y)
*--------------
|             |
|             |
|             |
|             |
|             |
|             |
--------------*
              (x2, y2)

w = x2 - x
h = y2 - y

Parameters:
  • pageheight (float) –

  • pagewidth (float) –

  • imageheight (float) –

  • imagewidth (float) –

  • llx (float) – lower left x coordinate (lower=smaller)

  • lly (float) – lower left y coordinate (lower=smaller)

  • urx (float) – upper right x coordinate (upper=bigger)

  • ury (float) – upper right y coordinate (upper=bigger)

  • placedimage_attribs (dict) – all attributes of the placed image

Returns:

list with new box coordinates

Return type:

list
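The conversion can be sketched under simplifying assumptions: the input box uses PDF-style coordinates with the origin in the lower-left corner, the image covers the whole page, and any offset of the placed image is ignored (so placedimage_attribs plays no role here). The function name pdf_box_to_iiif is hypothetical; the actual compute_box may handle more cases.

```python
def pdf_box_to_iiif(llx, lly, urx, ury, pageheight, pagewidth,
                    imageheight, imagewidth):
    """Convert a lower-left-origin box to an IIIF-style [x, y, w, h] box."""
    # Scale factors from page units to image pixels.
    sx = imagewidth / pagewidth
    sy = imageheight / pageheight
    x = llx * sx
    # Flip the y axis: IIIF measures y downwards from the top edge.
    y = (pageheight - ury) * sy
    w = (urx - llx) * sx
    h = (ury - lly) * sy
    return [round(x), round(y), round(w), round(h)]


# Page of 100 x 200 units rendered as an image of 200 x 400 pixels.
print(pdf_box_to_iiif(10, 20, 30, 50, 200, 100, 400, 200))  # [20, 300, 40, 60]
```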

text_importer.importers.tetml.helpers.filter_special_symbols(jtoken: dict) → bool

Check if token needs to be filtered out as it is a non-content word.

Parameters:

jtoken (dict) – Token text and coordinates.

Returns:

bool to indicate stop or content word

text_importer.importers.tetml.helpers.get_metadata(root: Element) → dict

Return dict with relevant metadata from page file.

Parameters:

root – etree.Element of tetml page file

Returns:

A dictionary with keys: tetcdt, pdfpath, pdfcdt, npages.

text_importer.importers.tetml.helpers.get_placed_image(root: Element) → dict

Return dimensions of the placed image.

<PlacedImage image="I0" x="0.00" y="0.00" width="588.84" height="842.00" /> => {"image": "I0", …}

Parameters:

root (etree.Element) – TETML document.

Returns:

dict with all attributes of the image XML element

text_importer.importers.tetml.helpers.get_tif_shape(root: Element, id_image: str) → tuple

Return original tiff dimensions stored in tetml.

<Image id="I0" extractedAs=".tif" width="1404" height="2367" colorspace="CS0" bitsPerComponent="1"/>

Parameters:

root – etree.Element

Returns:

width and height of tiff image.
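Looking up the dimensions of an Image element by its ID can be sketched with the standard library. The sketch uses xml.etree.ElementTree and an un-namespaced snippet wrapped in a hypothetical Resources element for brevity; real TETML files declare an XML namespace, so element lookups there would need to account for it. The function name tif_shape is hypothetical.

```python
import xml.etree.ElementTree as ET


def tif_shape(root, id_image):
    """Return (width, height) of the Image element with the given id."""
    for image in root.iter("Image"):
        if image.get("id") == id_image:
            return int(image.get("width")), int(image.get("height"))
    return None  # no image with that id


snippet = (
    '<Resources><Image id="I0" extractedAs=".tif" width="1404" '
    'height="2367" colorspace="CS0" bitsPerComponent="1"/></Resources>'
)
print(tif_shape(ET.fromstring(snippet), "I0"))  # (1404, 2367)
```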

text_importer.importers.tetml.helpers.remove_page_number(jtoken: dict, i_line: int, i_word: int) → bool

Check if page number in the header appears within the first 3 tokens of the first line and is not longer than 3 digits.

Parameters:
  • jtoken (dict) – Token text and coordinates.

  • i_line (int) – Line number.

  • i_word (int) – Word number in line.

Returns:

bool to indicate page number.
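The described check can be sketched as follows. The sketch assumes zero-based line and word indices and that the token text is stored under the key "tx" (as in the word2json output format); both are assumptions, and looks_like_page_number is a hypothetical name rather than the library function.

```python
def looks_like_page_number(jtoken: dict, i_line: int, i_word: int) -> bool:
    """Return True for a short numeric token near the start of the first line."""
    text = jtoken.get("tx", "")
    # Only the first three tokens of the first line are candidates.
    in_header = i_line == 0 and i_word < 3
    # A page number consists of at most three digits.
    return in_header and text.isdigit() and len(text) <= 3


print(looks_like_page_number({"tx": "12"}, 0, 1))    # True
print(looks_like_page_number({"tx": "1900"}, 0, 0))  # False: four digits
print(looks_like_page_number({"tx": "12"}, 2, 0))    # False: not the first line
```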

text_importer.importers.tetml.helpers.word2json(word: Element, pageheight: float, pagewidth: float, imageheight: float, imagewidth: float, placed_image_attribs: dict, filename: str = None) → dict

Return dict with all information about the (hyphenated) TETML word element.

{"tx": Text, "c": coords, "hy" : Bool, "hyt": {"nf": Text, "c":coords, "tx":coords}}

“hyt” is {} if word is not hyphenated

Parameters:
  • pageheight (float) –

  • pagewidth (float) –

  • imageheight (float) –

  • imagewidth (float) –

  • placed_image_attribs (dict) –

  • filename (str) –

  • word (lxml.etree.Element) –

Returns:

dictionary with token text and metadata