SWA Alto importer
This importer is a special case of Mets/Alto: only ALTO XML files are available, so only the pages are in Alto format. It was developed to handle OCR newspaper data in Alto format provided by the Schweizerisches Wirtschaftsarchiv (SWA) of Basel University Library.
SWA Custom classes
This module contains the definition of the SWA importer classes.
The classes define newspaper Issue and Page objects which convert OCR data in the SWA version of the Mets/Alto format to a unified canonical format. These classes are subclasses of the generic Mets/Alto importer classes.
- class text_preparation.importers.swa.classes.SWANewspaperIssue(issue_dir: IssueDirectory, temp_dir: str)
Newspaper issue in SWA Mets/Alto format.
Note
SWA data is in ALTO format, but there is no Mets file, so issues are simply collections of pages.
- Parameters:
issue_dir (SwaIssueDir) – Identifying information about the issue.
temp_dir (str) – Temporary directory to extract archives.
- id
Canonical Issue ID (e.g. GDL-1900-01-02-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- journal
Newspaper unique identifier or name.
- Type:
str
- path
Path to directory containing the issue’s OCR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- pages
List of NewspaperPage instances from this issue.
- Type:
list
- rights
Access rights applicable to this issue.
- Type:
str
- archive
Archive containing all the Alto XML files for the issue’s pages.
- Type:
ZipArchive
- temp_pages
Temporary list of pages found for this issue. A page is a tuple (page_canonical_id, alto_path), where alto_path is the path from within the archive.
- Type:
list[tuple[str, str]]
- content_items
Content items from this issue.
- Type:
list[dict[str,Any]]
- notes
Notes of missing pages gathered while parsing.
- Type:
list[str]
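The example IDs in this documentation (GDL-1900-01-02-a for an issue, GDL-1900-01-02-a-p0004 for a page) suggest how canonical IDs are composed from the journal, date, edition and page number. The following is a minimal sketch, assuming that pattern holds in general; the helper names canonical_issue_id and canonical_page_id are hypothetical and not part of the importer.

```python
from datetime import date


def canonical_issue_id(journal: str, day: date, edition: str) -> str:
    # Hypothetical helper: journal, ISO date and edition letter joined by hyphens.
    return f"{journal}-{day.isoformat()}-{edition}"


def canonical_page_id(issue_id: str, number: int) -> str:
    # Hypothetical helper: "p" prefix followed by the page number,
    # zero-padded to four digits (assumed from the example ID).
    return f"{issue_id}-p{number:04d}"


issue_id = canonical_issue_id("GDL", date(1900, 1, 2), "a")
page_id = canonical_page_id(issue_id, 4)
```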
- class text_preparation.importers.swa.classes.SWANewspaperPage(_id: str, number: int, alto_path: str)
Newspaper page in SWA (Mets/Alto) format.
- Parameters:
_id (str) – Canonical page ID.
number (int) – Page number.
alto_path (str) – Full path to the Alto XML file.
- id
Canonical Page ID (e.g. GDL-1900-01-02-a-p0004).
- Type:
str
- number
Page number.
- Type:
int
- page_data
Page data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this page is from.
- Type:
NewspaperIssue
- filename
Name of the Alto XML page file.
- Type:
str
- basedir
Base directory where Alto files are located.
- Type:
str
- encoding
Encoding of XML file.
- Type:
str, optional
- iiif
The iiif URI to the newspaper page image.
- Type:
str
- add_issue(issue: NewspaperIssue) None
Attach the parent newspaper issue to a page object.
This allows each page to preserve contextual information coming from the newspaper issue.
- Parameters:
issue (NewspaperIssue) – Newspaper issue containing this page.
- property ci_id: str
Return the content item ID of the page.
Because SWA data has no article-level segmentation, each page is treated as a content item. To mint the content item ID, we take the canonical page ID and replace the “p” prefix with “i”.
- Returns:
Content item id.
- Return type:
str
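The prefix swap described for ci_id can be illustrated as below. This is only a sketch of the rule as stated, not the importer's actual code; the function name page_id_to_ci_id is hypothetical.

```python
def page_id_to_ci_id(page_id: str) -> str:
    # Split off the last hyphen-separated part, e.g. "p0004".
    prefix, page_part = page_id.rsplit("-", 1)
    if not page_part.startswith("p"):
        raise ValueError(f"Not a canonical page ID: {page_id}")
    # Swap the leading "p" for "i" and keep the digits unchanged.
    return f"{prefix}-i{page_part[1:]}"


ci_id = page_id_to_ci_id("GDL-1900-01-02-a-p0004")
```

Splitting on the last hyphen keeps the swap from touching a "p" that may occur in the journal name.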
- property file_exists: bool
Check whether the Alto XML file exists for this page.
- Returns:
True if the Alto XML file exists, False otherwise.
- Return type:
bool
- get_iiif_image() str
Create the iiif URI to the full journal page image.
- Returns:
iiif URI of the image of the full page.
- Return type:
str
- parse() None
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
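The lazy pattern described in the note can be sketched as follows. LazyPage is a hypothetical stand-in, not SWANewspaperPage, and the dictionary it builds is a placeholder rather than the real canonical page format.

```python
from typing import Any, Dict, Optional


class LazyPage:
    """Hypothetical stand-in showing deferred parsing."""

    def __init__(self, _id: str, number: int) -> None:
        self.id = _id
        self.number = number
        # Nothing is read or processed at construction time.
        self.page_data: Optional[Dict[str, Any]] = None

    def parse(self) -> None:
        # The real importer would read the Alto XML here; this sketch
        # only fills in a placeholder dictionary.
        self.page_data = {"id": self.id}


page = LazyPage("GDL-1900-01-02-a-p0004", 4)
empty_before_parse = page.page_data is None
page.parse()
```

Deferring the work this way lets many page objects be created cheaply and parsed only when their contents are needed.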
SWA Detect functions
This module contains helper functions to find SWA OCR data to be imported.
- text_preparation.importers.swa.detect.SwaIssueDir
A light-weight data structure to represent a SWA newspaper issue.
This named tuple contains basic metadata about a newspaper issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its pages.
Note
For newspapers published multiple times per day, a lowercase letter indicates the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
journal (str) – Newspaper ID.
date (datetime.date) – Publication date of issue.
edition (str) – Edition of the newspaper issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s OCR data.
rights (str) – Access rights on the data (open, closed, etc.).
pages (list) – List of tuples (page_canonical_id, alto_path), where alto_path is the path from within the archive.
>>> from datetime import date
>>> i = IssueDirectory(
...     journal='arbeitgeber',
...     date=date(1908, 7, 4),
...     edition='a',
...     path='./SWA/impresso_ocr/schwar_000059110_DSV01_1908.zip',
...     rights='open_public',
...     pages=[(
...         'arbeitgeber-1908-07-04-a-p0001',
...         'schwar_000059110_DSV01_1908/ocr/schwar_000059110_DSV01_1908_alto/BAU_1_000059110_1908_0001.xml'
...     ), ...]
... )
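To make the named tuple and the edition-letter convention concrete, here is a self-contained sketch. IssueDir and edition_letter are hypothetical names defined only for this illustration; the field list mirrors the parameters documented above, and the pages list is left empty.

```python
import string
from collections import namedtuple
from datetime import date

# Hypothetical stand-in with the fields documented above.
IssueDir = namedtuple(
    "IssueDir", ["journal", "date", "edition", "path", "rights", "pages"]
)


def edition_letter(index: int) -> str:
    # 0 -> 'a' (first edition of the day), 1 -> 'b' (second), and so on.
    return string.ascii_lowercase[index]


issue = IssueDir(
    journal="arbeitgeber",
    date=date(1908, 7, 4),
    edition=edition_letter(0),
    path="./SWA/impresso_ocr/schwar_000059110_DSV01_1908.zip",
    rights="open_public",
    pages=[],
)
```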
- text_preparation.importers.swa.detect.detect_issues(base_dir: str, access_rights: str) list[IssueDirectory]
Detect newspaper issues to import within the filesystem.
This function expects the directory structure that SWA used to organize the dump of Alto OCR data.
The access rights information is not in place yet, but needs to be specified by the content provider (SWA).
TODO: Add the directory structure of SWA OCR data dumps.
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
access_rights (str) – Path to the access_rights.json file.
- Returns:
List of SwaIssueDir instances, to be imported.
- Return type:
list[SwaIssueDir]
- text_preparation.importers.swa.detect.select_issues(base_dir: str, config: dict, access_rights: str) list[IssueDirectory]
Selectively detect newspaper issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
The access rights information is not in place yet, but needs to be specified by the content provider (SWA).
- Parameters:
base_dir (str) – Path to the base directory of newspaper data.
config (dict) – Config dictionary for filtering.
access_rights (str) – Path to the access_rights.json file.
- Returns:
List of SwaIssueDir instances, to be imported.
- Return type:
list[SwaIssueDir]