SWISSINFO PDF-embedded importer

This importer handles a special case of PDF-embedded OCR, in which the OCR has been extracted into a custom JSON format. It was developed to handle OCR radio bulletin data in PDF format, provided by Memoriav, from the Swissinfo collection of World War II radio bulletins.

SWISSINFO Custom classes

This module contains the definition of the SWISSINFO importer classes.

class text_preparation.importers.swissinfo.classes.SwissInfoRadioBulletinIssue(issue_dir: IssueDir)

Radio-Bulletin Issue for SWISSINFO’s OCR format.

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. SOC_CJ-1940-01-05-a).

Type:: str

edition

Lower case letter ordering issues of the same day.

Type:: str

alias

Media title unique alias (identifier or name).

Type:: str

path

Path to directory containing the issue’s OCR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical format.

Type:: dict[str, Any]

pages

List of SwissInfoRadioBulletinPage instances from this issue.

Type:: list

class text_preparation.importers.swissinfo.classes.SwissInfoRadioBulletinPage(_id: str, number: int)

Radio-Bulletin Page for SWISSINFO’s OCR format.

Parameters:

_id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
number (int) – Page number.

id

Canonical Page ID (e.g. SOC_CJ-1940-01-05-a-p0001).

Type:: str

number

Page number.

Type:: int

page_data

Page data according to canonical format.

Type:: dict[str, Any]

issue

Radio bulletin issue this page is from.

Type:: CanonicalIssue

path

Path to the jp2 page file.

Type:: str

add_issue(issue: CanonicalIssue) → None

Add to a page object its parent, i.e. the canonical issue.

This allows each page to preserve contextual information coming from the canonical issue.

Parameters:: issue (CanonicalIssue) – Canonical issue containing this page.

parse() → None

Process the page XML file and transform it into the canonical Page format.

Note

This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
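The lazy behavior described in this note can be illustrated with a minimal sketch. The class `LazyPage`, the `parsed` flag, and the `"regions"` key are hypothetical names chosen for illustration; they are not taken from the importer, and the body of `parse()` is only a placeholder for the real OCR processing.

```python
from typing import Any


class LazyPage:
    """Hypothetical stand-in for a page class; not the importer's actual code."""

    def __init__(self, _id: str, number: int) -> None:
        self.id = _id
        self.number = number
        # Only lightweight metadata is set here; no OCR content is read yet.
        self.page_data: dict[str, Any] = {"id": _id, "regions": []}
        self.parsed = False

    def parse(self) -> None:
        # Placeholder for the expensive step (reading and converting the OCR).
        self.page_data["regions"] = [{"coords": [0, 0, 10, 10]}]
        self.parsed = True


page = LazyPage("SOC_CJ-1940-01-05-a-p0001", 1)
print(page.parsed)  # False: creating the object did not process anything
page.parse()
print(page.parsed)  # True: contents are filled in only after parse()
```

Deferring the work this way keeps object creation cheap when many pages are instantiated but only some are actually processed.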

SWISSINFO Detect functions

This module contains helper functions to find SWISSINFO OCR data to be imported.

text_preparation.importers.swissinfo.detect.SwissInfoIssueDir

A light-weight data structure to represent a radio bulletin issue.

This named tuple contains basic metadata about an issue. This metadata can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.

Note

In case of bulletins published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:

alias (str) – Bulletin alias.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the bulletin issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.

>>> from datetime import date
>>> i = SwissInfoIssueDir(
...     alias='SOC_CJ',
...     date=date(1940, 7, 22),
...     edition='a',
...     path='./SOC_CJ/1940/07/22/a',
...     metadata_file='../data/sample_data/SWISSINFO/bulletins_metadata.json'
... )
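As a rough illustration of how such a tuple can yield canonical identifiers, the sketch below rebuilds the ID pattern visible in the examples on this page (`SOC_CJ-1940-01-05-a` and `SOC_CJ-1940-01-05-a-p0001`). `IssueDirSketch`, `canonical_issue_id`, and `canonical_page_id` are hypothetical names, and the format is inferred from those examples rather than from the importer's code.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in carrying only the four documented fields.
IssueDirSketch = namedtuple("IssueDirSketch", ["alias", "date", "edition", "path"])


def canonical_issue_id(issue_dir) -> str:
    # Assumed pattern: <alias>-<YYYY-MM-DD>-<edition>
    return f"{issue_dir.alias}-{issue_dir.date.isoformat()}-{issue_dir.edition}"


def canonical_page_id(issue_id: str, number: int) -> str:
    # Assumed pattern: <issue id>-p<page number zero-padded to 4 digits>
    return f"{issue_id}-p{number:04d}"


i = IssueDirSketch("SOC_CJ", date(1940, 7, 22), "a", "./SOC_CJ/1940/07/22/a")
issue_id = canonical_issue_id(i)
print(issue_id)                        # SOC_CJ-1940-07-22-a
print(canonical_page_id(issue_id, 1))  # SOC_CJ-1940-07-22-a-p0001
```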

text_preparation.importers.swissinfo.detect.detect_issues(base_dir: str) → list[IssueDirectory]

Detect SWISSINFO Radio bulletins to import within the filesystem.

This function expects the directory structure that we created for Swissinfo.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – unused argument kept for conformity for now.

Returns:

List of SwissInfoIssueDir instances, to be imported.

Return type:

list[SwissInfoIssueDir]

text_preparation.importers.swissinfo.detect.dir2issue(path: str, metadata_file_path: str) → IssueDirectory | None

Create a SwissInfoIssueDir object from a directory.

Note

This function is called internally by detect_issues().

Parameters:

path (str) – The path of the issue.
access_rights (dict) – Dictionary for access rights.

Returns:

New SwissInfoIssueDir object.

Return type:

SwissInfoIssueDir | None
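A minimal sketch of the directory-to-issue step, assuming the layout `<alias>/<year>/<month>/<day>/<edition>` suggested by the example path `./SOC_CJ/1940/07/22/a`. The names `IssueDirSketch` and `dir_to_issue_sketch` are hypothetical, and the metadata file that the real function receives is deliberately left out.

```python
from __future__ import annotations

from collections import namedtuple
from datetime import date
from pathlib import PurePosixPath

# Hypothetical stand-in carrying only the four documented fields.
IssueDirSketch = namedtuple("IssueDirSketch", ["alias", "date", "edition", "path"])


def dir_to_issue_sketch(path: str) -> IssueDirSketch | None:
    """Parse an issue directory path, returning None if it does not fit the layout."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        return None
    # Assumed layout: the last five components identify the issue.
    alias, year, month, day, edition = parts[-5:]
    try:
        issue_date = date(int(year), int(month), int(day))
    except ValueError:
        # Non-numeric components or an impossible calendar date.
        return None
    return IssueDirSketch(alias, issue_date, edition, path)


print(dir_to_issue_sketch("./SOC_CJ/1940/07/22/a"))
print(dir_to_issue_sketch("./SOC_CJ/1940/13/22/a"))  # None: month 13 is invalid
```

Returning `None` for directories that do not match mirrors the documented `SwissInfoIssueDir | None` return type, so a caller can simply skip unusable directories.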

text_preparation.importers.swissinfo.detect.select_issues(base_dir: str, config: dict) → list[IssueDirectory] | None

Selectively detect newspaper issues to import.

The behavior is very similar to detect_issues(); the only difference is that config specifies rules to filter the data to import. See the documentation on configuring filtering for further details.

Parameters:

base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.

Returns:

List of SwissInfoIssueDir to import.

Return type:

list[SwissInfoIssueDir] | None

SWISSINFO helper functions

Helper functions to parse SWISSINFO OCR files.

text_preparation.importers.swissinfo.helpers.compute_agg_coords(all_coords: list[list[int]]) → list[int]

Compute the coordinates of a paragraph from the coordinates of its lines.

Parameters:: all_coords (list[list[int]]) – All line coordinates to merge into one block.
Returns:: Line coordinates merged into one region block.
Return type:: list[int]
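One way to merge line boxes into a single block is to take their bounding rectangle. The sketch below assumes each coordinate list has the form `[x, y, width, height]`; that layout and the name `merge_boxes` are assumptions made for illustration, not a description of compute_agg_coords itself.

```python
def merge_boxes(all_coords: list[list[int]]) -> list[int]:
    """Return the smallest [x, y, width, height] box enclosing all input boxes.

    Raises ValueError if all_coords is empty.
    """
    x1 = min(x for x, _, _, _ in all_coords)      # leftmost edge
    y1 = min(y for _, y, _, _ in all_coords)      # topmost edge
    x2 = max(x + w for x, _, w, _ in all_coords)  # rightmost edge
    y2 = max(y + h for _, y, _, h in all_coords)  # bottommost edge
    return [x1, y1, x2 - x1, y2 - y1]


lines = [[10, 20, 100, 12], [10, 34, 80, 12], [12, 48, 110, 12]]
print(merge_boxes(lines))  # [10, 20, 112, 40]
```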

text_preparation.importers.swissinfo.helpers.parse_lines(blocks_with_lines: dict, pg_id: str, pg_notes: list[str]) → tuple[list[list[int]], list[dict]]

Parse the blocks from the OCR to extract the lines of text.

Parameters:

blocks_with_lines (dict) – All blocks with text lines extracted from the PDF OCR.
pg_id (str) – Canonical ID of the page the text is on.
pg_notes (list[str]) – Notes of the page, to store potential issues found.

Returns:

Parsed text lines corresponding to canonical format.

Return type:

tuple[list[list[int]], list[dict]]