INA AudioDoc importer
This importer handles a special case of the AudioDoc format, which is produced as output of the Whisper algorithm. It was developed to handle ASR radio data provided by the Institut National de l’Audiovisuel (INA) of France.
INA Custom classes
This module contains the definition of INA importer classes.
The classes define Issue and Audio Record objects, which convert ASR data to a unified canonical format.
- class text_preparation.importers.ina.classes.INABroadcastAudioRecord(_id: str, number: int, xml_filepath: str)
Radio-Broadcast Audio Record for INA’s ASR format.
- Parameters:
_id (str) – Canonical Audio Record ID (e.g. CFCE-1900-01-02-a-r0001).
number (int) – Record number (for compatibility with other source mediums).
- id
Canonical Audio Record ID (e.g. CFCE-1900-01-02-a-r0001).
- Type:
str
- number
Record number.
- Type:
int
- record_data
Audio record data according to canonical format.
- Type:
dict[str, Any]
- issue
Issue this audio record is from.
- Type:
CanonicalIssue | None
- add_issue(issue: CanonicalIssue) None
Add to an audio record object its parent, i.e. the canonical issue.
This allows each audio record to preserve contextual information coming from the canonical issue.
- Parameters:
issue (CanonicalIssue) – Canonical issue containing this audio record.
- create_iiif() str
Create the IIIF URI for this audio record from all its parts.
- Returns:
Created IIIF URI for this audio record.
- Return type:
str
- parse() None
Process the audio record XML file and transform it into the canonical AudioRecord format.
Note
This lazy behavior means that the record contents are not processed upon creation of the audio record object, but only once the parse() method is called.
- property xml: BeautifulSoup
Read XML file of the audio record and create a BeautifulSoup object.
- Returns:
BeautifulSoup object with XML of the audio record.
- Return type:
BeautifulSoup
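The lazy behavior described in the note for parse() can be illustrated with a minimal sketch. The class LazyRecord and the loader callable are hypothetical stand-ins introduced only for this illustration; they are not part of the INA importer, and the real class reads and parses an XML file instead.

```python
from typing import Any, Callable


class LazyRecord:
    """Hypothetical stand-in showing parse-on-demand; not the INA class."""

    def __init__(self, _id: str, number: int, loader: Callable[[], str]):
        self.id = _id
        self.number = number
        self._loader = loader  # assumed: a callable that returns the raw contents
        # Only cheap metadata is stored when the object is created.
        self.record_data: dict[str, Any] = {"id": _id}
        self.parsed = False

    def parse(self) -> None:
        # The (potentially expensive) read happens here, not in __init__.
        raw = self._loader()
        self.record_data["text"] = raw.strip()
        self.parsed = True


calls: list[str] = []


def fake_loader() -> str:
    # Records each read so the laziness can be observed.
    calls.append("read")
    return " bonjour "


record = LazyRecord("CFCE-1900-01-02-a-r0001", 1, fake_loader)
reads_before_parse = len(calls)  # nothing has been read yet
record.parse()
reads_after_parse = len(calls)  # exactly one read after parse()
```

Creating many record objects therefore stays cheap, and the cost of reading each file is paid only for the records that are actually parsed.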
- class text_preparation.importers.ina.classes.INABroadcastIssue(issue_dir: IssueDir)
Radio-Broadcast Issue for INA’s ASR format.
- Parameters:
issue_dir (IssueDir) – Identifying information about the issue.
- id
Canonical Issue ID (e.g. [alias]-1940-01-05-a).
- Type:
str
- edition
Lower case letter ordering issues of the same day.
- Type:
str
- alias
Media title unique alias (identifier or name).
- Type:
str
- path
Path to the directory containing the issue’s ASR data.
- Type:
str
- date
Publication date of issue.
- Type:
datetime.date
- issue_data
Issue data according to canonical format.
- Type:
dict[str, Any]
- audio_records
List of INABroadcastAudioRecord instances from this issue.
- Type:
list
INA Detect functions
This module contains helper functions to find INA ASR data to import.
- text_preparation.importers.ina.detect.INAIssueDir
A light-weight data structure to represent a radio audio broadcast issue.
This named tuple contains basic metadata about a broadcast issue. It can then be used to locate the relevant data in the filesystem or to create canonical identifiers for the issue and its audio records.
Note
In case of bulletins published multiple times per day, a lowercase letter is used to indicate the edition number: ‘a’ for the first, ‘b’ for the second, etc.
- Parameters:
alias (str) – Bulletin alias.
date (datetime.date) – Publication date of the issue.
edition (str) – Edition of the broadcast issue (‘a’, ‘b’, ‘c’, etc.).
path (str) – Path to the directory containing the issue’s ASR data.
>>> from datetime import date
>>> i = INAIssueDir(
...     alias='SOC_CJ',
...     date=date(1940, 7, 22),
...     edition='a',
...     path='./SOC_CJ/1940/07/22/a',
... )
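As a rough illustration of how such a tuple can be turned into canonical identifiers, the sketch below rebuilds the ID patterns shown in this documentation ([alias]-1940-01-05-a for issues, CFCE-1900-01-02-a-r0001 for audio records). IssueDirSketch, canonical_issue_id and canonical_record_id are hypothetical names introduced here, and the four-digit record suffix is an assumption read off the example ID; the importer's own code may build IDs differently.

```python
import datetime
from typing import NamedTuple


class IssueDirSketch(NamedTuple):
    """Hypothetical mirror of the documented fields of INAIssueDir."""

    alias: str
    date: datetime.date
    edition: str
    path: str


def canonical_issue_id(issue: IssueDirSketch) -> str:
    # Pattern assumed from the documented example: alias-YYYY-MM-DD-edition
    return f"{issue.alias}-{issue.date.isoformat()}-{issue.edition}"


def canonical_record_id(issue: IssueDirSketch, number: int) -> str:
    # Pattern assumed from the documented example: issue ID + "-r" + 4-digit number
    return f"{canonical_issue_id(issue)}-r{number:04d}"


issue = IssueDirSketch(
    alias="SOC_CJ",
    date=datetime.date(1940, 7, 22),
    edition="a",
    path="./SOC_CJ/1940/07/22/a",
)
issue_id = canonical_issue_id(issue)
record_id = canonical_record_id(issue, 1)
```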
- text_preparation.importers.ina.detect.detect_issues(base_dir: str) list[IssueDirectory]
Detect INA Radio broadcasts to import within the filesystem.
This function expects the directory structure that we created for Swissinfo.
- Parameters:
base_dir (str) – Path to the base directory of the broadcast data.
- Returns:
List of INAIssueDir instances, to be imported.
- Return type:
list[INAIssueDir]
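The directory layout itself is not spelled out here. Assuming it follows the alias/year/month/day/edition pattern visible in the example path './SOC_CJ/1940/07/22/a', mapping one directory to issue metadata could be sketched as follows. DetectedIssue and issue_from_path are hypothetical names for this illustration, not the importer's API.

```python
import datetime
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class DetectedIssue(NamedTuple):
    """Hypothetical container with the documented INAIssueDir fields."""

    alias: str
    date: datetime.date
    edition: str
    path: str


def issue_from_path(path: str) -> Optional[DetectedIssue]:
    """Parse an assumed alias/YYYY/MM/DD/edition path; return None if it does not fit."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        return None
    alias, year, month, day, edition = parts[-5:]
    try:
        issue_date = datetime.date(int(year), int(month), int(day))
    except ValueError:
        # Non-numeric components or an impossible calendar date.
        return None
    return DetectedIssue(alias, issue_date, edition, path)


found = issue_from_path("./SOC_CJ/1940/07/22/a")
rejected = issue_from_path("./SOC_CJ/1940/13/22/a")  # month 13 is invalid
```

Returning None for paths that do not fit, rather than raising, lets a directory walk skip unrelated folders silently.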
- text_preparation.importers.ina.detect.dir2issue(path: str, metadata_file_path: str) IssueDirectory | None
Create an INAIssueDir object from a directory.
Note
This function is called internally by detect_issues().
- Parameters:
path (str) – The path of the issue.
metadata_file_path (str) – Path to the metadata file.
- Returns:
New INAIssueDir object.
- Return type:
INAIssueDir | None
- text_preparation.importers.ina.detect.select_issues(base_dir: str, config: dict) list[IssueDirectory] | None
Selectively detect issues to import.
The behavior is very similar to detect_issues(), with the only difference that config specifies some rules to filter the data to import. See this section for further details on how to configure filtering.
- Parameters:
base_dir (str) – Path to the base directory of the broadcast data.
config (dict) – Config dictionary for filtering.
- Returns:
List of INAIssueDir to import.
- Return type:
list[INAIssueDir] | None
INA helper functions
Helper functions used by the INA Importer.
- text_preparation.importers.ina.helpers.extract_time_coords_from_elem(elem: Tag) list[float] | None
Extract the time coordinates from a given speech element.
- Parameters:
elem (Tag) – Element from the BeautifulSoup object extracted from the ASR.
- Raises:
NotImplementedError – The element did not have one of the expected names.
- Returns:
The time coordinates for the given ASR element.
- Return type:
list[float] | None
- text_preparation.importers.ina.helpers.get_utterances(xml_doc: BeautifulSoup) list[dict]
Construct the utterances composed of speech segments for a given record.
An utterance is a list of consecutive speech segments with the same speaker ID.
- Parameters:
xml_doc (BeautifulSoup) – Contents of the ASR XML document of the record.
- Returns:
List of utterances, composed of speech segments for the record.
- Return type:
list[dict]
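The grouping rule stated for get_utterances (consecutive speech segments with the same speaker ID form one utterance) can be sketched with itertools.groupby. This is an illustration under assumptions, not the importer's implementation: the segments are plain dictionaries with hypothetical "speaker" and "text" keys, whereas the real function works on the parsed XML document.

```python
from itertools import groupby


def group_utterances(segments: list[dict]) -> list[dict]:
    """Group consecutive segments that share a speaker ID into utterances."""
    utterances = []
    # groupby only merges adjacent items, which matches "consecutive" here:
    # a speaker who returns later starts a new utterance.
    for speaker, group in groupby(segments, key=lambda seg: seg["speaker"]):
        utterances.append({"speaker": speaker, "segments": list(group)})
    return utterances


# Hypothetical sample data for illustration only.
segments = [
    {"speaker": "S1", "text": "Bonjour."},
    {"speaker": "S1", "text": "Voici les nouvelles."},
    {"speaker": "S2", "text": "Merci."},
    {"speaker": "S1", "text": "À demain."},
]
utterances = group_utterances(segments)
```

With this sample, speaker S1 appears in two separate utterances because the S2 segment interrupts the run.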