SWISSINFO PDF-embedded importer
This importer handles a special case of PDF-embedded OCR that has been extracted into a custom JSON format. It was developed for OCR radio bulletin data in PDF format, provided by Memoriav from the Swissinfo collection of World War II radio bulletins.
SWISSINFO Custom classes
This module contains the definition of the SWISSINFO importer classes.
- class text_preparation.importers.swissinfo.classes.SwissInfoRadioBulletinIssue(issue_dir: IssueDir)
Radio-Bulletin Issue for SWISSINFO’s OCR format.
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. SOC_CJ-1940-01-05-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- alias
Media title unique alias (identifier or name).
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of SwissInfoRadioBulletinPage instances from this issue.
- Type:
list
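The example ID above suggests that a canonical issue ID joins the alias, the publication date and the edition letter. The following is a minimal sketch assuming that `alias-YYYY-MM-DD-edition` pattern holds in general; `build_issue_id` and `edition_letter` are hypothetical helpers for illustration, not part of the importer.

```python
from datetime import date


def edition_letter(index: int) -> str:
    """Hypothetical helper: map 0 -> 'a', 1 -> 'b', and so on.

    Assumes at most 26 editions per day.
    """
    if not 0 <= index < 26:
        raise ValueError(f"edition index out of range: {index}")
    return chr(ord("a") + index)


def build_issue_id(alias: str, pub_date: date, edition: str) -> str:
    """Hypothetical helper: join alias, ISO date and edition with hyphens."""
    return f"{alias}-{pub_date.isoformat()}-{edition}"


# First edition of the SOC_CJ bulletin published on 5 January 1940.
issue_id = build_issue_id("SOC_CJ", date(1940, 1, 5), edition_letter(0))
print(issue_id)  # SOC_CJ-1940-01-05-a
```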
- class text_preparation.importers.swissinfo.classes.SwissInfoRadioBulletinPage(_id: str, number: int)
Radio-Bulletin Page for SWISSINFO’s OCR format.
- Parameters:
_id (str) – Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
number (int) – Page number.
- id
Canonical Page ID (e.g. SOC_CJ-1940-01-05-a-p0001).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Radio bulletin issue this page is from.
- Type:
- path
Path to the jp2 page file.
- Type:
str
- add_issue(issue: CanonicalIssue) -> None
Add to a page object its parent, i.e. the canonical issue.
This allows each page to preserve contextual information coming from the canonical issue.
- Parameters:
issue (CanonicalIssue) – Canonical issue containing this page.
- parse() -> None
Process the page XML file and transform it into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
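The lazy behavior described in the note can be illustrated with a small stand-in class. This is only a sketch: `LazyPage`, `page_id`, the `parsed` flag and the `regions` key are hypothetical names chosen for the example, and the body of `parse()` is a placeholder for the real OCR processing. The page ID format is assumed from the example IDs above (issue ID plus `-p` and a four-digit page number).

```python
from typing import Any


def page_id(issue_id: str, number: int) -> str:
    """Hypothetical helper: append a zero-padded page number to an issue ID."""
    return f"{issue_id}-p{number:04d}"


class LazyPage:
    """Hypothetical stand-in for a page class; not the importer's actual code."""

    def __init__(self, _id: str, number: int) -> None:
        self.id = _id
        self.number = number
        # Only a skeleton is created here; no OCR content is read yet.
        # "regions" is a placeholder key, not the canonical schema.
        self.page_data: dict[str, Any] = {"id": _id, "regions": []}
        self.parsed = False

    def parse(self) -> None:
        # Placeholder for the real work of reading the page's OCR file.
        self.page_data["regions"] = [{"coords": [0, 0, 10, 10]}]
        self.parsed = True


page = LazyPage(page_id("SOC_CJ-1940-01-05-a", 1), 1)
print(page.id)      # SOC_CJ-1940-01-05-a-p0001
print(page.parsed)  # False
page.parse()        # contents are processed only now
print(page.parsed)  # True
```

Deferring the work to `parse()` keeps object creation cheap, so many page objects can be created while listing an issue and processed one at a time later.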
SWISSINFO Detect functions
This module contains helper functions to find SWISSINFO OCR data to be imported.
- text_preparation.importers.swissinfo.detect.SwissInfoIssueDir
A light-weight data structure to represent a radio bulletin issue.
This named tuple contains basic metadata about a newspaper issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
In case of bulletins published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
alias (str) – Bulletin alias.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
>>> from datetime import date
>>> i = SwissInfoIssueDir(
...     alias='SOC_CJ',
...     date=date(1940, 7, 22),
...     edition='a',
...     path='./SOC_CJ/1940/07/22/a',
...     metadata_file='../data/sample_data/SWISSINFO/bulletins_metadata.json',
... )
- text_preparation.importers.swissinfo.detect.detect_issues(base_dir: str) -> list[IssueDirectory]
Detect SWISSINFO Radio bulletins to import within the filesystem.
This function expects the directory structure that we created for Swissinfo.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Unused argument, kept for conformity for now.
- Returns:
List of SwissInfoIssueDir instances, to be imported.
- Return type:
list[SwissInfoIssueDir]
- text_preparation.importers.swissinfo.detect.dir2issue(path: str, metadata_file_path: str) -> IssueDirectory | None
Create a SwissInfoIssueDir object from a directory.
Note
This function is called internally by detect_issues().
- Parameters:
path (str) – The path of the issue.
access_rights (dict) – Dictionary for access rights.
- Returns:
New SwissInfoIssueDir object.
- Return type:
SwissInfoIssueDir | None
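To make the directory-to-issue mapping concrete, here is a minimal sketch. It assumes the layout `<alias>/<YYYY>/<MM>/<DD>/<edition>` suggested by the example path `./SOC_CJ/1940/07/22/a`; `IssueDirSketch` and `path_to_issue` are hypothetical names, the metadata file is left out, and the actual dir2issue may behave differently.

```python
import datetime
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class IssueDirSketch(NamedTuple):
    """Hypothetical stand-in for SwissInfoIssueDir (metadata file omitted)."""

    alias: str
    date: datetime.date
    edition: str
    path: str


def path_to_issue(path: str) -> Optional[IssueDirSketch]:
    """Parse '<alias>/<YYYY>/<MM>/<DD>/<edition>'; return None if it does not fit."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        return None
    # Use the last five components so any leading base directory is ignored.
    alias, year, month, day, edition = parts[-5:]
    try:
        pub_date = datetime.date(int(year), int(month), int(day))
    except ValueError:
        # Non-numeric components or an impossible calendar date.
        return None
    return IssueDirSketch(alias, pub_date, edition, path)


issue = path_to_issue("./SOC_CJ/1940/07/22/a")
print(issue.alias, issue.date.isoformat(), issue.edition)  # SOC_CJ 1940-07-22 a
```

Returning None for paths that do not match mirrors the `SwissInfoIssueDir | None` return type documented above, so a caller can simply skip directories that are not issues.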
- text_preparation.importers.swissinfo.detect.select_issues(base_dir: str, config: dict) -> list[IssueDirectory] | None
Selectively detect newspaper issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
- Returns:
List of SwissInfoIssueDir to import.
- Return type:
list[SwissInfoIssueDir] | None
SWISSINFO helper functions
Helper functions to parse SWISSINFO OCR files.
- text_preparation.importers.swissinfo.helpers.compute_agg_coords(all_coords: list[list[int]]) -> list[int]
Compute the coordinates of a paragraph from the coordinates of its lines.
- Parameters:
all_coords (list[list[int]]) – All line coordinates to merge into one block.
- Returns:
Line coordinates merged into one region block.
- Return type:
list[int]
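Merging line coordinates into one block amounts to computing a bounding box. The sketch below assumes each box is given as `[x, y, width, height]`, which is not stated on this page; `merge_coords` is a hypothetical re-implementation for illustration, not the actual compute_agg_coords.

```python
def merge_coords(all_coords: list[list[int]]) -> list[int]:
    """Hypothetical sketch: bounding box of [x, y, width, height] boxes."""
    if not all_coords:
        raise ValueError("need at least one line box")
    # Top-left corner: smallest x and y over all lines.
    x1 = min(x for x, _, _, _ in all_coords)
    y1 = min(y for _, y, _, _ in all_coords)
    # Bottom-right corner: largest right and bottom edge over all lines.
    x2 = max(x + w for x, _, w, _ in all_coords)
    y2 = max(y + h for _, y, _, h in all_coords)
    return [x1, y1, x2 - x1, y2 - y1]


lines = [[10, 20, 100, 12], [12, 34, 90, 12], [10, 48, 105, 12]]
print(merge_coords(lines))  # [10, 20, 105, 40]
```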
- text_preparation.importers.swissinfo.helpers.parse_lines(blocks_with_lines: dict, pg_id: str, pg_notes: list[str]) -> tuple[list[list[int]], list[dict]]
Parse the blocks from the OCR to extract the lines of text.
- Parameters:
blocks_with_lines (dict) – All blocks with text lines extracted from the PDF OCR.
pg_id (str) – Canonical ID of the page the text is on.
pg_notes (list[str]) – Notes of the page, to store potential issues found.
- Returns:
Parsed text lines corresponding to canonical format.
- Return type:
tuple[list[list[int]], list[dict]]