Overview ======== Data architecture ----------------- `Impreso Text Preparation`, composed of the `Importer` and the `Rebuilder` is the main part of the data architecture defined in the framework of the impresso project to store and process a large-scale archive of historical newspapers. To understand the importer's logic is worth touching upon the key points of the architecure into which it fits. Canonical identifiers ********************* Canonical identifiers are defined at the following levels: 1. newspaper issue 2. newspaper page 3. content item (e.g. article, advertisement, weather forecast, obituary, etc.) Issue IDs ######### - template: ``{newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}`` - examples: ``GDL-1900-01-02-a``, ``luxzeit1858-1858-12-7-a`` Page IDs ######## - template: ``{newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-p{page_number}`` - examples: ``GDL-1900-01-02-a-p0004``, ``luxzeit1858-1858-12-7-a-p0002`` Content item IDs ################ - template: ``{newspaper_id}-{date.year}-{date.month}-{date.day}-{edition}-i{item_number}`` - examples: ``GDL-1900-01-02-a-i0048``, ``JDG-1901-01-01-a-i0031`` Some things to note about these templates: - ``newspaper_id`` is an arbitrary string, not containing white spaces, unambiguously identifying a given newspaper - ``page_number`` is a four-digits integer (zeroes are used for filling) - ``edition``: in case of newspapers published multiple times per day, a lowercase letter is used to indicate the edition number: ``a`` for the first, ``b`` for the second, etc. - ``item_number``: is a four-digits integer (zeroes are used for filling); **NB**: content item IDs are **expected to remain stable** across any two runs of the importer given the same input data. Data packaging ************** The JSON data produced by the ``Importer`` and ``Rebuilder`` are packaged into ``.bz2`` archives for efficient storage. Each archive consists of one JSON-line file, where each line contains a JSON document. The JSON schemas are described `here `_. In Impresso we use an S3 solution for distributed storage to store newspaper data and accessed them at processing time. Issue data ########## They are packaged **by newspaper and by year** (as they tend to be very small files). Each archive contains, one document per line, all issues of a newspaper that appeared in that year. Examples: ``GDL-1900-issues.jsonl.bz2`` contains all issues of the *Gazette de Lausanne* published in 1900. Page data ######### They are packaged **by newspaper issue**. Each archive contains, one document per line, all JSON pages belonging to a given newspaper issue (edition). Examples: ``GDL-1900-01-01-a-pages.jsonl.bz2`` contains all issues of the *Gazette de Lausanne* (= ``GDL``) published on January 1, 1900. Rebuilt data ############ They are packaged **by newspaper and by year**. Each archive contains, one document per line, all JSON content-items belonging to a given newspaper and year. Examples: ``GDL-1900.jsonl.bz2`` contains all rebuilt data of the *Gazette de Lausanne* (= ``GDL``) published in 1900. Image data ********** They are expected to be delivered via a dedicated IIIF endpoint, and typically stored in an image server. To each newspaper page corresponds an image file. .. note :: In case the canonical ID of a page and the internal ID of its image differ, the content provider is expected to be able to provide a mapping of the two identifier systems. Processing ---------- .. automodule:: text_preparation.importers.core :members: