FedGaz TETML importer
This importer adapts the generic TETML importer to parse the Federal Gazette (FedGaz) data, which comes with a separate metadata file in addition to the document files in TETML format.
The separate metadata file supplies document information that the respective TETML files do not provide. It must be located in the top folder of the input directory and be named metadata.tsv. It provides the following columns: article_docid, issue_date, article_title, volume_language, canonical_page_first, canonical_page_last, pruned. The name of each TETML file must correspond to the article_docid in the metadata (e.g., 10000032.word.tetml), because it serves as the key to look up the remaining information.
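The lookup described above can be sketched as follows. This is a minimal illustration, not the importer's own code: the helper names load_metadata and docid_from_filename are hypothetical, the sample rows are invented, and the sketch assumes that the key is the part of the file name before the first dot.

```python
import csv
import io
from pathlib import Path


def load_metadata(tsv_file):
    """Index the rows of a metadata TSV by their article_docid column."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row["article_docid"]: row for row in reader}


def docid_from_filename(filename):
    """Derive the lookup key, assuming it is the name part before the first dot."""
    return Path(filename).name.split(".")[0]


# Invented two-row sample standing in for <input_dir>/metadata.tsv.
sample = io.StringIO(
    "article_docid\tissue_date\tarticle_title\tvolume_language\t"
    "canonical_page_first\tcanonical_page_last\tpruned\n"
    "10000032\t1900-01-01\tExample title A\tde\t1\t3\tFalse\n"
    "10000033\t1900-01-01\tExample title B\tde\t3\t5\tTrue\n"
)
metadata = load_metadata(sample)

# The TETML file name is the key into the metadata.
row = metadata[docid_from_filename("10000032.word.tetml")]
print(row["article_title"], row["canonical_page_first"], row["canonical_page_last"])
```

Note that csv.DictReader returns every value as a string, so a column such as pruned would still need to be converted before it is used as a boolean.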
By default, the importer assumes that an article starts on a new page. In practice, many articles start on the same page on which the previous one ends (in-page segmentation). The FedGaz importer therefore also performs a heuristic article segmentation for documents that share a page with the subsequent article, as indicated by the attribute pruned. Unless pruned is set to True, the content of the shared page is assigned entirely to the subsequent article, which limits the preceding article to its last full page. If pruned is set to True, the importer performs a fuzzy search to locate the title of the subsequent article on its starting page. If the search succeeds, the article boundary is set at the matching position and the content is reassigned accordingly.
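To make the idea of a fuzzy title search concrete, here is a minimal sketch built on difflib from the standard library. It is not the importer's implementation: the function names, the sliding-window strategy, the 0.8 threshold, and the fallback for a failed match are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def find_title_start(tokens, title, threshold=0.8):
    """Return the index of the token window that best matches the title, or None.

    The threshold is an assumed cut-off, not a value taken from the importer.
    """
    title_tokens = title.lower().split()
    width = len(title_tokens)
    if width == 0 or len(tokens) < width:
        return None
    target = " ".join(title_tokens)
    best_index, best_score = None, 0.0
    # Slide a window of the title's length over the page and keep the best score.
    for start in range(len(tokens) - width + 1):
        window = " ".join(tokens[start:start + width]).lower()
        score = SequenceMatcher(None, window, target).ratio()
        if score > best_score:
            best_index, best_score = start, score
    return best_index if best_score >= threshold else None


def split_shared_page(tokens, next_title):
    """Split the tokens of a shared page into (current article, subsequent article)."""
    boundary = find_title_start(tokens, next_title)
    if boundary is None:
        # Assumed fallback: leave the whole page with the subsequent article.
        return [], tokens
    return tokens[:boundary], tokens[boundary:]


# Invented page text; the title from the metadata differs slightly from the page.
page_tokens = "end of the first report . Bundesbeschluss betreffend die Zölle Art. 1".split()
current, following = split_shared_page(page_tokens, "Bundesbeschlufs betreffend die Zölle")
print(current)
print(following)
```

A similarity ratio tolerates small spelling differences between the metadata title and the page text, which an exact string comparison would miss.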
FedGaz Custom classes
Classes to handle the TETML OCR format.
- class text_preparation.importers.fedgaz.classes.FedgazNewspaperIssue(issue_dir: IssueDir)
Class representing an issue in FedGaz TETML format.
All functions defined in this child class are used to parse additional information specific for FedGaz and extend the generic TETML importer.
Upon object initialization the following things happen:
index all the tetml documents of an issue
parse the metadata file to determine the logical structure of the issue
parse the tetml file that contains the actual content and some metadata
perform a heuristic article segmentation
redefine metadata and initialize page objects (instances of TetmlNewspaperPage).
- Parameters:
issue_dir (IssueDir) – Newspaper issue with relevant information.
- parse_articles()
Parse all articles of this issue
- class text_preparation.importers.fedgaz.classes.FedgazNewspaperPage(_id: str, number: int, page_content: dict, page_xml)
Class representing a page in FedGaz TETML format.
- Parameters:
number (int) – Page number.
page_content (dict) – Nested article content of a single page
page_xml (str) – Path to the Tetml file of the page.
- parse()
Process the page XML file and transform into canonical Page format.
Note
This lazy behavior means that the page contents are not processed upon creation of the page object, but only once the parse() method is called.
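The lazy behavior described in the note can be illustrated with a small stand-in class. LazyPage and its attributes are hypothetical and far simpler than FedgazNewspaperPage; the sketch only shows that construction stores the input and that the work happens in parse().

```python
class LazyPage:
    """Hypothetical stand-in illustrating deferred parsing; not the importer's class."""

    def __init__(self, number, raw_content):
        self.number = number
        self._raw_content = raw_content
        self.page_data = None  # stays empty until parse() is called

    def parse(self):
        """Transform the raw content into a simplified page dictionary."""
        self.page_data = {
            "number": self.number,
            "tokens": self._raw_content.split(),
        }


page = LazyPage(1, "Bundesblatt erste Seite")
print(page.page_data)  # nothing has been processed at construction time
page.parse()
print(page.page_data["tokens"])
```

Deferring the work this way keeps object creation cheap when many pages are instantiated but only some of them are processed.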
- class text_preparation.importers.fedgaz.classes.TokPosition(art, page, reg, para, line, tok)
Create an identifier to store the position of a fuzzy match.
- Parameters:
art (int) – Article number.
page (int) – Page number.
reg (int) – Region number.
para (int) – Paragraph number.
line (int) – Line number.
tok (int) – Token number.
Create new instance of TokPosition(art, page, reg, para, line, tok)
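The signature above matches what collections.namedtuple generates, so the following sketch re-creates a comparable type for illustration. It is an assumed re-creation rather than an import of the importer's own class, and the field values are invented.

```python
from collections import namedtuple

# Assumed re-creation of the documented fields; the importer ships its own TokPosition.
TokPosition = namedtuple("TokPosition", ["art", "page", "reg", "para", "line", "tok"])

# Invented position: article 2, page 0, region 1, paragraph 0, line 3, token 5.
match = TokPosition(art=2, page=0, reg=1, para=0, line=3, tok=5)
print(match.line, match.tok)

# Tuples compare field by field, so positions sort in reading order.
later = TokPosition(art=2, page=0, reg=1, para=0, line=4, tok=0)
print(match < later)
```

If the type is a named tuple, a position can be unpacked, compared, and used as a dictionary key without extra code.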