INA AudioDoc importer

This importer handles a special case of the AudioDoc format, which is an output of the Whisper algorithm. It was developed to handle ASR radio data provided by the Institut National de l’Audiovisuel (INA) of France.

INA Custom classes

This module contains the definition of INA importer classes.

The classes define Issue and Audio Record objects, which convert ASR data to a unified canonical format.

class text_preparation.importers.ina.classes.INABroadcastAudioRecord(_id: str, number: int, xml_filepath: str)

Radio-Broadcast Audio Record for INA’s ASR format.

Parameters:
  • _id (str) – Canonical Audio Record ID (e.g. CFCE-1900-01-02-a-r0001).

  • number (int) – Record number (for compatibility with other source mediums).

  • xml_filepath (str) – Path to the XML file of the audio record.

id

Canonical Audio Record ID (e.g. CFCE-1900-01-02-a-r0001).

Type:

str

number

Record number.

Type:

int

record_data

Audio record data according to canonical format.

Type:

dict[str, Any]

issue

Issue this audio record is from.

Type:

CanonicalIssue | None

add_issue(issue: CanonicalIssue) → None

Add to an audio record object its parent, i.e. the canonical issue.

This allows each audio record to preserve contextual information coming from the canonical issue.

Parameters:

issue (CanonicalIssue) – Canonical issue containing this audio record.

create_iiif() → str

Create the IIIF URI for this audio record from all its parts.

Returns:

Created IIIF URI for this audio record.

Return type:

str

parse() → None

Process the audio record XML file and transform it into the canonical AudioRecord format.

Note

This lazy behavior means that the record contents are not processed upon creation of the audio record object, but only once the parse() method is called.
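The lazy behavior described in the note can be illustrated with a minimal sketch. The class name LazyRecordSketch and the raw_xml key are hypothetical and not part of the importer. The sketch only shows the pattern: construction stores the file path, and the file is read once parse() is called.

```python
from typing import Any


class LazyRecordSketch:
    """Hypothetical stand-in for an audio record (not the importer's class)."""

    def __init__(self, _id: str, number: int, xml_filepath: str) -> None:
        # Construction only stores metadata; the XML file is not opened here.
        self.id = _id
        self.number = number
        self.xml_filepath = xml_filepath
        self.record_data: dict[str, Any] = {"id": _id}

    def parse(self) -> None:
        # The file is read only when parse() is called explicitly.
        with open(self.xml_filepath, encoding="utf-8") as xml_file:
            # "raw_xml" is an illustrative key, not a canonical-format field.
            self.record_data["raw_xml"] = xml_file.read()
```

With this pattern, a large number of record objects can be created cheaply, and a missing or malformed file only surfaces as an error when parse() is invoked on that record.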

property xml: BeautifulSoup

Read XML file of the audio record and create a BeautifulSoup object.

Returns:

BeautifulSoup object with XML of the audio record.

Return type:

BeautifulSoup

class text_preparation.importers.ina.classes.INABroadcastIssue(issue_dir: IssueDir)

Radio-Broadcast Issue for INA’s ASR format.

Parameters:

issue_dir (IssueDir) – Identifying information about the issue.

id

Canonical Issue ID (e.g. [alias]-1940-01-05-a).

Type:

str

edition

Lower case letter ordering issues of the same day.

Type:

str

alias

Media title unique alias (identifier or name).

Type:

str

path

Path to the directory containing the issue’s ASR data.

Type:

str

date

Publication date of issue.

Type:

datetime.date

issue_data

Issue data according to canonical format.

Type:

dict[str, Any]

audio_records

List of INABroadcastAudioRecord instances from this issue.

Type:

list

INA Detect functions

This module contains helper functions to find INA ASR data to import.

text_preparation.importers.ina.detect.INAIssueDir

A light-weight data structure to represent a radio audio broadcast issue.

This named tuple contains basic metadata about a radio broadcast issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its audio records.

Note

In case of bulletins published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.

Parameters:
  • alias (str) – Bulletin alias.

  • date (datetime.date) – Publication date of the issue.

  • edition (str) – Edition of the broadcast issue (‘a’, ‘b’, ‘c’, etc.).

  • path (str) – Path to the directory containing the issue’s ASR data.

>>> from datetime import date
>>> i = INAIssueDir(
...     alias='SOC_CJ',
...     date=date(1940, 7, 22),
...     edition='a',
...     path='./SOC_CJ/1940/07/22/a',
... )
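
As a rough illustration of how these four fields can yield canonical identifiers, the sketch below builds IDs in the shape of the examples given in this documentation ([alias]-1940-01-05-a for issues, CFCE-1900-01-02-a-r0001 for audio records). IssueDirSketch, canonical_issue_id, canonical_record_id and edition_letter are hypothetical names, and the four-digit zero padding of the record number is inferred from the example ID rather than confirmed by the importer.

```python
import datetime
from typing import NamedTuple


class IssueDirSketch(NamedTuple):
    """Hypothetical mirror of the four documented INAIssueDir fields."""

    alias: str
    date: datetime.date
    edition: str
    path: str


def edition_letter(index: int) -> str:
    """Map a zero-based edition index to its letter: 0 gives 'a', 1 gives 'b'."""
    if not 0 <= index < 26:
        raise ValueError("only editions 'a' to 'z' are covered by this sketch")
    return chr(ord("a") + index)


def canonical_issue_id(issue: IssueDirSketch) -> str:
    """Build an issue ID shaped like '[alias]-YYYY-MM-DD-[edition]'."""
    return f"{issue.alias}-{issue.date.isoformat()}-{issue.edition}"


def canonical_record_id(issue_id: str, number: int) -> str:
    """Append a record suffix; the 4-digit padding is an assumption."""
    return f"{issue_id}-r{number:04d}"


issue = IssueDirSketch(
    alias="SOC_CJ",
    date=datetime.date(1940, 7, 22),
    edition=edition_letter(0),
    path="./SOC_CJ/1940/07/22/a",
)
print(canonical_issue_id(issue))
```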
text_preparation.importers.ina.detect.detect_issues(base_dir: str) → list[IssueDirectory]

Detect INA Radio broadcasts to import within the filesystem.

This function expects the directory structure that we created for Swissinfo.

Parameters:

base_dir (str) – Path to the base directory of the radio broadcast data.

Returns:

List of INAIssueDir instances, to be imported.

Return type:

list[INAIssueDir]

text_preparation.importers.ina.detect.dir2issue(path: str, metadata_file_path: str) → IssueDirectory | None

Create an INAIssueDir object from a directory.

Note

This function is called internally by detect_issues().

Parameters:
  • path (str) – The path of the issue.

  • metadata_file_path (str) – Path to the metadata file.

Returns:

New INAIssueDir object.

Return type:

INAIssueDir | None
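
A minimal sketch of turning a directory path into an issue tuple follows. It assumes the layout alias/YYYY/MM/DD/edition, inferred only from the example path './SOC_CJ/1940/07/22/a' above; the actual dir2issue may rely on the metadata file and behave differently. IssueDirSketch and path_to_issue_dir are hypothetical names.

```python
import datetime
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class IssueDirSketch(NamedTuple):
    """Hypothetical mirror of the four documented INAIssueDir fields."""

    alias: str
    date: datetime.date
    edition: str
    path: str


def path_to_issue_dir(path: str) -> Optional[IssueDirSketch]:
    """Parse 'alias/YYYY/MM/DD/edition' (assumed layout) into a tuple.

    Returns None when the path is too short or the date is invalid,
    mirroring the documented 'INAIssueDir | None' return type.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        return None
    alias, year, month, day, edition = parts[-5:]
    try:
        issue_date = datetime.date(int(year), int(month), int(day))
    except ValueError:
        # Non-numeric components or an impossible calendar date.
        return None
    return IssueDirSketch(alias, issue_date, edition, path)


print(path_to_issue_dir("./SOC_CJ/1940/07/22/a"))
```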

text_preparation.importers.ina.detect.select_issues(base_dir: str, config: dict) → list[IssueDirectory] | None

Selectively detect issues to import.

The behavior is very similar to detect_issues(), the only difference being that config specifies rules to filter the data to import. See this section for further details on how to configure filtering.

Parameters:
  • base_dir (str) – Path to the base directory of the radio broadcast data.

  • config (dict) – Config dictionary for filtering.

Returns:

List of INAIssueDir to import.

Return type:

list[INAIssueDir] | None

INA helper functions

Helper functions used by the INA Importer.

text_preparation.importers.ina.helpers.extract_time_coords_from_elem(elem: Tag) → list[float] | None

Extract the time coordinates from a given speech element.

Parameters:

elem (Tag) – Element from the BeautifulSoup object extracted from the ASR.

Raises:

NotImplementedError – The element did not have one of the expected names.

Returns:

The time coordinates for the given ASR element.

Return type:

list[float] | None
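
The documented contract (dispatch on the element name, NotImplementedError for unexpected names, None when no coordinates are available) can be sketched as below. Everything specific here is an assumption: the tag names SpeechSegment and Word, the attributes stime, etime and dur, and the [start, duration] output layout are hypothetical, and the sketch uses the standard library's ElementTree instead of BeautifulSoup to stay self-contained.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def time_coords_sketch(elem: ET.Element) -> Optional[list[float]]:
    """Return [start, duration] in seconds for an assumed ASR element."""
    if elem.tag == "SpeechSegment":
        # Assumed attributes: start time and end time.
        start, end = elem.get("stime"), elem.get("etime")
        if start is None or end is None:
            return None
        return [float(start), float(end) - float(start)]
    if elem.tag == "Word":
        # Assumed attributes: start time and duration.
        start, dur = elem.get("stime"), elem.get("dur")
        if start is None or dur is None:
            return None
        return [float(start), float(dur)]
    raise NotImplementedError(f"Unexpected element name: {elem.tag}")


segment = ET.fromstring('<SpeechSegment stime="0.5" etime="5.75"/>')
print(time_coords_sketch(segment))
```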

text_preparation.importers.ina.helpers.get_utterances(xml_doc: BeautifulSoup) → list[dict]

Construct the utterances composed of speech segments for a given record.

An utterance is a list of consecutive speech segments with the same speaker ID.

Parameters:

xml_doc (BeautifulSoup) – Contents of the ASR XML document of the record.

Returns:

List of utterances, composed of speech segments for the record.

Return type:

list[dict]
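
The grouping rule stated above (consecutive speech segments with the same speaker ID form one utterance) can be sketched with itertools.groupby. The input here is a plain list of dictionaries rather than a BeautifulSoup document, and the keys speaker, text and segments are hypothetical; the real get_utterances output structure may differ.

```python
from itertools import groupby


def group_into_utterances(segments: list[dict]) -> list[dict]:
    """Merge runs of consecutive segments that share a speaker ID."""
    utterances = []
    # groupby only groups adjacent items, which matches the "consecutive"
    # requirement: a speaker who returns later starts a new utterance.
    for speaker, run in groupby(segments, key=lambda seg: seg["speaker"]):
        utterances.append({"speaker": speaker, "segments": list(run)})
    return utterances


segments = [
    {"speaker": "S1", "text": "Bonjour"},
    {"speaker": "S1", "text": "à tous"},
    {"speaker": "S2", "text": "Merci"},
    {"speaker": "S1", "text": "Voici les nouvelles"},
]
for utterance in group_into_utterances(segments):
    print(utterance["speaker"], len(utterance["segments"]))
```

Because the grouping is by adjacency and not by speaker alone, the second appearance of S1 yields a separate utterance instead of being merged with the first.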