INA AudioDoc importer

This importer handles a special case of the AudioDoc format, which is an output of the Whisper algorithm. It was developed to handle ASR radio data provided by the Institut National de l’Audiovisuel (INA) of France.

INA Custom classes

This module contains the definition of INA importer classes.

The classes define Issue and Audio Record objects, which convert ASR data to a unified canonical format.

class text_preparation.importers.ina.classes.INABroadcastAudioRecord(_id: str, number: int, xml_filepath: str)

Radio-Broadcast Audio Record for INA’s ASR format.

Parameters:

_id (str) – Canonical Audio Record ID (e.g. CFCE-1900-01-02-a-r0001).
number (int) – Record number (for compatibility with other source mediums).
xml_filepath (str) – Path to the XML file of the audio record.

id

Canonical Audio Record ID (e.g. CFCE-1900-01-02-a-r0001).

Type:: str

number

Record number.

Type:: int

record_data

Audio record data according to canonical format.

Type:: dict[str, Any]

issue

Issue this audio record is from.

Type:: CanonicalIssue | None

add_issue(issue: CanonicalIssue) → None

Add to an audio record object its parent, i.e. the canonical issue.

This allows each audio record to preserve contextual information coming from the canonical issue.

Parameters:: issue (CanonicalIssue) – Canonical issue containing this audio record.

create_iiif() → str

Create the IIIF URI for this audio record from all its parts.

Returns:: Created IIIF URI for this audio record.
Return type:: str

parse() → None: Process the audio record XML file and transform into canonical AudioRecord format.

Note

This lazy behavior means that the record contents are not processed upon creation of the audio record object, but only once the parse() method is called.
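
The lazy pattern described in this note can be illustrated with a minimal sketch. `LazyRecord` and its `_load` method are hypothetical stand-ins, not part of the importer. The only idea taken from the description above is that creating the object is cheap and the contents are processed when parse() is called.

```python
from typing import Any


class LazyRecord:
    """Hypothetical stand-in for an audio record with lazy parsing."""

    def __init__(self, _id: str, number: int, xml_filepath: str) -> None:
        # Construction only stores identifiers; the file is not read here.
        self.id = _id
        self.number = number
        self.xml_filepath = xml_filepath
        self.record_data: dict[str, Any] = {"id": _id}
        self.parsed = False

    def _load(self) -> str:
        # Assumption: stands in for reading and converting the XML file.
        return f"contents of {self.xml_filepath}"

    def parse(self) -> None:
        # The expensive work happens only when parse() is called.
        self.record_data["raw"] = self._load()
        self.parsed = True


record = LazyRecord("CFCE-1900-01-02-a-r0001", 1, "record.xml")
assert not record.parsed  # nothing has been processed yet
record.parse()
assert record.parsed  # contents are now available in record_data
```

A caller can therefore create many record objects up front and pay the parsing cost only for the records it actually needs.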

property xml: BeautifulSoup

Read XML file of the audio record and create a BeautifulSoup object.

Returns:: BeautifulSoup object with XML of the audio record.
Return type:: BeautifulSoup

class text_preparation.importers.ina.classes.INABroadcastIssue(issue_dir: IssueDir)

Radio-Broadcast Issue for INA’s ASR format.

Parameters:: issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. [alias]-1940-01-05-a).

Type:: str

edition

Lower case letter ordering issues of the same day.

Type:: str

alias

Media title unique alias (identifier or name).

Type:: str

path

Path to the directory containing the issue’s ASR data.

Type:: str

date

Publication date of issue.

Type:: datetime.date

issue_data

Issue data according to canonical format.

Type:: dict[str, Any]

audio_records

List of INABroadcastAudioRecord instances from this issue.

Type:: list

INA Detect functions

This module contains helper functions to find INA ASR data to import.

text_preparation.importers.ina.detect.INAIssueDir

A light-weight data structure to represent a radio audio broadcast issue.

This named tuple contains basic metadata about a radio broadcast issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its audio records.

Note

In case of bulletins published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:

provider (str) – Provider for this alias, here always “INA”
alias (str) – Bulletin alias.
date (datetime.date) – Publication date of the issue.
edition (str) – Edition of the broadcast issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s ASR data.

>>> from datetime import date
>>> i = INAIssueDir(
    provider='INA',
    alias='SOC_CJ',
    date=date(1940, 7, 22),
    edition='a',
    path='./SOC_CJ/1940/07/22/a',
)
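
The fields of this tuple are enough to derive identifiers shaped like the examples given in this document (`[alias]-1940-01-05-a` for an issue, `CFCE-1900-01-02-a-r0001` for an audio record). The sketch below is an illustration only: `IssueDirSketch`, `issue_id` and `record_id` are hypothetical names, and the ID pattern is inferred from those examples rather than taken from the importer's code.

```python
from collections import namedtuple
from datetime import date

# Hypothetical re-creation of the named tuple, using the fields listed above.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)


def issue_id(issue_dir) -> str:
    """Build an ID shaped like '[alias]-1940-01-05-a' (assumed pattern)."""
    return f"{issue_dir.alias}-{issue_dir.date.isoformat()}-{issue_dir.edition}"


def record_id(issue_dir, number: int) -> str:
    """Append a zero-padded record suffix, as in '...-a-r0001' (assumed)."""
    return f"{issue_id(issue_dir)}-r{number:04d}"


i = IssueDirSketch("INA", "SOC_CJ", date(1940, 7, 22), "a", "./SOC_CJ/1940/07/22/a")
print(issue_id(i))      # SOC_CJ-1940-07-22-a
print(record_id(i, 1))  # SOC_CJ-1940-07-22-a-r0001
```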

text_preparation.importers.ina.detect.detect_issues(base_dir: str) → list[IssueDirectory]

Detect INA Radio broadcasts to import within the filesystem.

This function expects the directory structure that we created for Swissinfo.

Parameters:: base_dir (str) – Path to the base directory of radio broadcast data.
Returns:: List of INAIssueDir instances, to be imported.
Return type:: list[INAIssueDir]
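
As a rough illustration of this kind of detection, the sketch below walks a base directory and builds one tuple per issue directory. It assumes a layout of `<alias>/<YYYY>/<MM>/<DD>/<edition>`, inferred only from the example path `./SOC_CJ/1940/07/22/a`. `IssueDirSketch` and `detect_issues_sketch` are hypothetical names, and the real function may rely on metadata files or other rules not shown here.

```python
from collections import namedtuple
from datetime import date
from pathlib import Path

# Hypothetical stand-in for the issue named tuple.
IssueDirSketch = namedtuple(
    "IssueDirSketch", ["provider", "alias", "date", "edition", "path"]
)


def detect_issues_sketch(base_dir: str) -> list:
    """Collect one tuple per <alias>/<YYYY>/<MM>/<DD>/<edition> directory."""
    issues = []
    for edition_dir in sorted(Path(base_dir).glob("*/*/*/*/*")):
        if not edition_dir.is_dir():
            continue
        alias, year, month, day, edition = edition_dir.relative_to(base_dir).parts
        try:
            issue_date = date(int(year), int(month), int(day))
        except ValueError:
            # Skip directories whose names do not form a valid date.
            continue
        issues.append(
            IssueDirSketch("INA", alias, issue_date, edition, str(edition_dir))
        )
    return issues
```

Calling `detect_issues_sketch("./data")` on a tree that contains `./data/SOC_CJ/1940/07/22/a` would yield a single tuple for alias `SOC_CJ`, edition `a`, dated 22 July 1940.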

text_preparation.importers.ina.detect.dir2issue(path: str, metadata_file_path: str) → IssueDirectory | None

Create an INAIssueDir object from a directory.

Note

This function is called internally by detect_issues().

Parameters:

path (str) – The path of the issue.
metadata_file_path (str) – Path to the metadata file.

Returns:

New INAIssueDir object.

Return type:

INAIssueDir | None

text_preparation.importers.ina.detect.select_issues(base_dir: str, config: dict) → list[IssueDirectory] | None

Selectively detect issues to import.

The behavior is very similar to detect_issues(); the only difference is that config specifies rules to filter the data to import. See the section on filtering configuration for further details.

Parameters:

base_dir (str) – Path to the base directory of radio broadcast data.
config (dict) – Config dictionary for filtering.

Returns:

List of INAIssueDir to import.

Return type:

list[INAIssueDir] | None

INA helper functions

Helper functions used by the INA Importer.

text_preparation.importers.ina.helpers.extract_time_coords_from_elem(elem: Tag) → list[float] | None

Extract the time coordinates from a given speech element.

Parameters:: elem (Tag) – Element from the BeautifulSoup object extracted from the ASR output.
Raises:: NotImplementedError – The element did not have one of the expected names.
Returns:: The time coordinates for the given ASR element.
Return type:: list[float] | None

text_preparation.importers.ina.helpers.get_utterances(xml_doc: BeautifulSoup) → list[dict]

Construct the utterances composed of speech segments for a given record.

An utterance is a list of consecutive speech segments with the same speaker ID.

Parameters:: xml_doc (BeautifulSoup) – Contents of the ASR XML document of the record.
Returns:: List of utterances, composed of speech segments for the record.
Return type:: list[dict]
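
The grouping rule stated above (consecutive speech segments sharing a speaker ID form one utterance) can be sketched with plain dictionaries. This is not the importer's implementation: the `speaker`, `text` and `segments` keys and the function name `group_into_utterances` are assumptions made for illustration, and the real function works on a BeautifulSoup document rather than a list of dicts.

```python
from itertools import groupby


def group_into_utterances(segments: list[dict]) -> list[dict]:
    """Merge runs of consecutive segments that share a speaker ID."""
    utterances = []
    # groupby only groups adjacent items, so a speaker who returns later
    # starts a new utterance instead of being merged with the earlier one.
    for speaker, run in groupby(segments, key=lambda seg: seg["speaker"]):
        utterances.append({"speaker": speaker, "segments": list(run)})
    return utterances


segments = [
    {"speaker": "spk1", "text": "Bonjour"},
    {"speaker": "spk1", "text": "à tous"},
    {"speaker": "spk2", "text": "Merci"},
    {"speaker": "spk1", "text": "Voici"},
]
utterances = group_into_utterances(segments)
print([(u["speaker"], len(u["segments"])) for u in utterances])
# [('spk1', 2), ('spk2', 1), ('spk1', 1)]
```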