Generic TETML importer
This generic importer was developed to parse the OCR document data produced by PDFlib TET.
Tetml Custom classes
Classes to handle the TETML OCR format.
- class text_preparation.importers.tetml.classes.TetmlNewspaperIssue(issue_dir: IssueDir)
Class representing a newspaper issue in TETML format.
Upon object initialization the following things happen:
index all the tetml documents
parse the tetml file that contains the actual content and some metadata
initialize page objects (instances of TetmlNewspaperPage).
- Parameters:
issue_dir (IssueDir) – Newspaper issue with relevant information.
- parse_articles()
Parse all articles of this issue.
- class text_preparation.importers.tetml.classes.TetmlNewspaperPage(_id: str, number: int, page_content: dict, page_xml)
Generic class representing a page in Tetml format.
- Parameters:
number (int) – Page number.
page_content (dict) – Nested article content of a single page
page_xml (str) – Path to the Tetml file of the page.
- add_issue(issue: NewspaperIssue)
Add to a page object its parent, i.e. the newspaper issue.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- parse()
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
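The lazy behavior described in the note can be illustrated with a minimal sketch. The `LazyPage` class, its `loader` argument, the page ID, and the returned dictionary are all hypothetical and stand in for the real page class and its XML processing; only the pattern of deferring work until `parse()` is called comes from the text above.

```python
from typing import Callable, Optional


class LazyPage:
    """Hypothetical stand-in for a page object with deferred parsing."""

    def __init__(self, page_id: str, loader: Callable[[], dict]):
        self.id = page_id
        self._loader = loader  # stored only; nothing is read or parsed here
        self.page_data: Optional[dict] = None

    def parse(self) -> None:
        # The expensive work happens only when parse() is called.
        self.page_data = self._loader()


calls = []


def fake_loader() -> dict:
    # Placeholder for reading and converting the page XML.
    calls.append("loaded")
    return {"content": "placeholder"}


page = LazyPage("hypothetical-page-id", fake_loader)
created_without_parsing = page.page_data is None and calls == []
page.parse()
```

Creating the object is cheap, so many pages can be instantiated up front and parsed one at a time.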
Tetml Detect functions
- text_preparation.importers.tetml.detect.TetmlIssueDir
A light-weight data structure to represent a newspaper issue.
This named tuple contains basic metadata about a newspaper issue. This metadata can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
In the case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
journal (str) – Newspaper ID
date (datetime.date) – Publication date
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.)
path (str) – Path to the directory containing OCR data
rights (str) – Access rights on the data (open, closed, etc.)
>>> from datetime import date
>>> i = TetmlIssueDir('GDL', date(1900,1,1), 'a', './GDL-1900-01-01/', 'open')
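A minimal sketch of how such a named tuple can be built and used, assuming the five fields listed above. The `IssueDirSketch` type and the `issue_id` helper, including the identifier layout it produces, are hypothetical illustrations and not the library's own definitions.

```python
from collections import namedtuple
from datetime import date

# Hypothetical re-creation of the structure; field names follow the parameter list.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["journal", "date", "edition", "path", "rights"]
)


def issue_id(issue: IssueDirSketch) -> str:
    """Build an identifier from the metadata (the layout is an assumption)."""
    return f"{issue.journal}-{issue.date.isoformat()}-{issue.edition}"


issue = IssueDirSketch("GDL", date(1900, 1, 1), "a", "./GDL-1900-01-01/", "open")
identifier = issue_id(issue)
```

Because the tuple is immutable and hashable, instances can be collected in sets or used as dictionary keys during detection.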
- text_preparation.importers.tetml.detect.dir2tetmldir(issue_dir: IssueDir, access_rights: dict) → TetmlIssueDirectory
Helper function that injects access rights info into an IssueDir.
Note
This function is called internally by tetml_detect_issues().
- Parameters:
issue_dir (IssueDir) – Input IssueDir object.
access_rights (dict) – Access rights information.
- Returns:
New TetmlIssueDir object.
- text_preparation.importers.tetml.detect.tetml_detect_issues(base_dir: str, access_rights: str, journal_filter: set = None, exclude: bool = False) → List[TetmlIssueDirectory]
Detect newspaper issues to import within the filesystem.
This function expects the directory structure that RERO used to organize the dump of Tetml OCR data.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Path to access_rights.json file.
journal_filter (set) – IDs of newspapers to consider.
exclude (bool) – Whether journal_filter should determine exclusion.
- Returns:
List of TetmlIssueDir instances, to be imported.
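How `journal_filter` and `exclude` might interact can be sketched as follows. This is an assumption based on the parameter descriptions, not the function's actual code: with `exclude=False` only the listed newspapers are kept, with `exclude=True` they are dropped, and without a filter everything is kept. The helper name `select_journals` is hypothetical.

```python
from typing import Iterable, List, Optional, Set


def select_journals(
    journals: Iterable[str],
    journal_filter: Optional[Set[str]] = None,
    exclude: bool = False,
) -> List[str]:
    """Apply an include or exclude filter to newspaper IDs (assumed semantics)."""
    if journal_filter is None:
        return list(journals)
    if exclude:
        return [j for j in journals if j not in journal_filter]
    return [j for j in journals if j in journal_filter]


found = ["GDL", "JDG", "NZZ"]
kept = select_journals(found, {"GDL", "JDG"})
dropped = select_journals(found, {"GDL", "JDG"}, exclude=True)
```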
- text_preparation.importers.tetml.detect.tetml_select_issues(base_dir: str, config: dict, access_rights: str) → List[TetmlIssueDirectory]
Selectively detect newspaper issues to import.
The behavior is very similar to tetml_detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
access_rights (str) – Path to access_rights.json file.
- Returns:
List of TetmlIssueDir instances, to be imported.
Tetml parsers
Functions to parse TETML data.
- text_preparation.importers.tetml.parsers.tetml_parser(tetml: str, filtering: bool = True, ignore_page_number: bool = True, language='de') → dict
Parse a TETML file (e.g. from Swiss Federal Archive).
The main logic implemented here was derived from https://github.com/impresso/nzz/. A TETML file corresponds loosely to one article, as delimited by the boundaries of the underlying PDF.
- Parameters:
tetml (str) – Path to the TETML file that needs to be parsed.
filtering (bool) – Whether to call the method that filters out pre-defined tokens.
- Returns:
A dictionary with keys: metadata, pages (content), meta (additional metadata).
- Return type:
dict
Tetml Helper methods
Helper functions used by the Tetml Importer.
These functions are mainly used within (i.e. called by) the classes TetmlNewspaperIssue and TetmlNewspaperPage.
- text_preparation.importers.tetml.helpers.add_gn_property(tokens: [<class 'dict'>], language: str) → None
Set property to indicate the use of whitespace following a token
- Parameters:
tokens (list) – list of token dictionaries.
language (str) – Language abbreviation (de, fr, eng, etc.).
- Returns:
None
- text_preparation.importers.tetml.helpers.compute_bb(innerbbs: list) → list
Compute coordinates of the bounding box from multiple boxes.
- Parameters:
innerbbs (list) – List of multiple inner boxes (x,y,w,h).
- Returns:
List of coordinates from the bounding box (x,y,w,h).
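One common way to compute such an enclosing box is to take the extreme corners of all inner boxes. The sketch below assumes each box is `[x, y, w, h]` with `(x, y)` as the top-left corner. The function name `enclosing_box` is hypothetical, and the library's own implementation may differ.

```python
from typing import List, Sequence


def enclosing_box(innerbbs: Sequence[Sequence[float]]) -> List[float]:
    """Return the smallest [x, y, w, h] box containing all input boxes."""
    if not innerbbs:
        raise ValueError("at least one box is required")
    x1 = min(b[0] for b in innerbbs)          # leftmost edge
    y1 = min(b[1] for b in innerbbs)          # topmost edge
    x2 = max(b[0] + b[2] for b in innerbbs)   # rightmost edge
    y2 = max(b[1] + b[3] for b in innerbbs)   # bottommost edge
    return [x1, y1, x2 - x1, y2 - y1]


box = enclosing_box([[10, 20, 30, 10], [35, 25, 20, 15]])
```

For the two boxes above, the result spans from `(10, 20)` to `(55, 40)`, which gives `[10, 20, 45, 20]`.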
- text_preparation.importers.tetml.helpers.compute_box(llx: float, lly: float, urx: float, ury: float, pageheight: float, pagewidth: float, imageheight: float, imagewidth: float, placedimage_attribs: dict) → list
Compute IIIF box coordinates of input_box.
New box coordinates [x,y,w,h] are in IIIF coordinate system https://iiif.io/api/image/2.0/#region
(x, y)
   *--------------
   |             |
   |             |
   |             |
   |             |
   |             |
   |             |
   --------------*
                 (x2, y2)

w = x2 - x
h = y2 - y
- Parameters:
pageheight (float)
pagewidth (float)
imageheight (float)
imagewidth (float)
llx (float) – lower left x coordinate (lower=smaller)
lly (float) – lower left y coordinate (lower=smaller)
urx (float) – upper right x coordinate (upper=bigger)
ury (float) – upper right y coordinate (upper=bigger)
placedimage_attribs (dict) – all attributes of the placed image
- Returns:
list with new box coordinates
- Return type:
list
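The conversion can be sketched under simplifying assumptions: the input box uses page coordinates with the origin at the lower-left corner, the image covers the whole page, and any offset stored in `placedimage_attribs` is ignored. The function name `to_iiif_box` and the rounding to integers are hypothetical choices, so this is not the library's implementation.

```python
from typing import List


def to_iiif_box(
    llx: float, lly: float, urx: float, ury: float,
    pageheight: float, pagewidth: float,
    imageheight: float, imagewidth: float,
) -> List[int]:
    """Map a lower-left-origin page box to a top-left-origin pixel box [x, y, w, h]."""
    sx = imagewidth / pagewidth      # horizontal scale, page units to pixels
    sy = imageheight / pageheight    # vertical scale, page units to pixels
    x = llx * sx
    y = (pageheight - ury) * sy      # flip the y axis: top edge becomes the origin
    w = (urx - llx) * sx
    h = (ury - lly) * sy
    return [round(x), round(y), round(w), round(h)]


box = to_iiif_box(
    llx=100.0, lly=700.0, urx=200.0, ury=750.0,
    pageheight=800.0, pagewidth=600.0,
    imageheight=1600.0, imagewidth=1200.0,
)
```

With a scale factor of 2 on both axes, the box near the top of the page maps to `[200, 100, 200, 100]`.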
- text_preparation.importers.tetml.helpers.filter_special_symbols(jtoken: dict) → bool
Check if token needs to be filtered out as it is a non-content word
- Parameters:
jtoken (dict) – Token text and coordinates.
- Returns:
bool to indicate stop or content word
- text_preparation.importers.tetml.helpers.get_metadata(root: Element) → dict
Return dict with relevant metadata from page file
- Parameters:
root – etree.Element of tetml page file
- Returns:
A dictionary with keys: tetcdt, pdfpath, pdfcdt, npages.
- text_preparation.importers.tetml.helpers.get_placed_image(root: Element) → dict
Return dimensions of the placed image
<PlacedImage image="I0" x="0.00" y="0.00" width="588.84" height="842.00" /> => {"image": "I0", …}
- Parameters:
root (etree.Element) – TETML document.
- Returns:
dict with all attributes of the image XML element
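Reading the attributes of a `PlacedImage` element into a dictionary can be sketched with the standard library. This example uses `xml.etree.ElementTree` rather than lxml and a namespace-free snippet wrapped in a hypothetical `Page` element; real TETML files may declare an XML namespace that would have to be included in the search path.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free snippet; the wrapping <Page> element is an assumption.
snippet = (
    '<Page>'
    '<PlacedImage image="I0" x="0.00" y="0.00" width="588.84" height="842.00" />'
    '</Page>'
)
root = ET.fromstring(snippet)

# Find the first PlacedImage element anywhere below the root.
placed = root.find(".//PlacedImage")

# Copy its attributes into a plain dict; all values remain strings.
attribs = dict(placed.attrib)
```

The values stay strings, so numeric fields such as `width` need an explicit `float()` conversion before any arithmetic.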
- text_preparation.importers.tetml.helpers.get_tif_shape(root: Element, id_image: str) → tuple
Return original tiff dimensions stored in tetml
<Image id="I0" extractedAs=".tif" width="1404" height="2367" colorspace="CS0" bitsPerComponent="1"/>
- Parameters:
root – etree.Element
- Returns:
width and height of tiff image.
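Looking up the dimensions of an image by its ID could look like the sketch below. It uses the standard library's `xml.etree.ElementTree` and a namespace-free snippet; the function name `tif_shape`, the wrapping `Images` element, and the `KeyError` on a missing ID are hypothetical choices rather than the library's behavior.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def tif_shape(root: ET.Element, id_image: str) -> Tuple[int, int]:
    """Return (width, height) of the Image element whose id matches id_image."""
    for image in root.iter("Image"):
        if image.get("id") == id_image:
            return int(image.get("width")), int(image.get("height"))
    raise KeyError(id_image)


# Simplified snippet; the wrapping <Images> element is an assumption.
snippet = (
    '<Images>'
    '<Image id="I0" extractedAs=".tif" width="1404" height="2367" '
    'colorspace="CS0" bitsPerComponent="1"/>'
    '</Images>'
)
shape = tif_shape(ET.fromstring(snippet), "I0")
```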
- text_preparation.importers.tetml.helpers.remove_page_number(jtoken: dict, i_line: int, i_word: int) → bool
Check if page number in the header appears within the first 3 tokens of the first line and is not longer than 3 digits.
- Parameters:
jtoken (dict) – Token text and coordinates.
i_line (int) – Line number.
i_word (int) – Word number in line.
- Returns:
bool to indicate page number.
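The check described above can be approximated as follows. The sketch assumes zero-based line and word indices and that the token text is stored under the `"tx"` key, as in the token dictionaries described for `word2json`; the function name `looks_like_page_number` is hypothetical and the real implementation may apply different rules.

```python
def looks_like_page_number(jtoken: dict, i_line: int, i_word: int) -> bool:
    """Heuristic: a short all-digit token among the first three tokens of the first line."""
    text = jtoken.get("tx", "")
    in_header_position = i_line == 0 and i_word < 3   # assumes zero-based indices
    short_number = text.isdigit() and len(text) <= 3  # at most three digits
    return in_header_position and short_number


header_hit = looks_like_page_number({"tx": "12"}, 0, 1)
year_token = looks_like_page_number({"tx": "1900"}, 0, 0)
later_line = looks_like_page_number({"tx": "12"}, 2, 0)
word_token = looks_like_page_number({"tx": "Zürich"}, 0, 0)
```

Only the first call returns `True`: the other tokens are too long, too far down the page, or not numeric.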
- text_preparation.importers.tetml.helpers.word2json(word: Element, pageheight: float, pagewidth: float, imageheight: float, imagewidth: float, placed_image_attribs: dict, filename: str = None) → dict
Return dict with all information about the (hyphenated) TETML word element
{"tx": Text, "c": coords, "hy" : Bool, "hyt": {"nf": Text, "c":coords, "tx":coords}}
“hyt” is {} if word is not hyphenated
- Parameters:
pageheight (float)
pagewidth (float)
imageheight (float)
imagewidth (float)
placed_image_attribs (dict)
filename (str)
word (lxml.etree.Element)
- Returns:
dictionary with token text and metadata