Input/Output
General
- impresso_commons.path.id2IssueDir(id, path)
TODO: documentation
- impresso_commons.path.parse_canonical_filename(filename)
Parse a canonical page name into its components.
- Parameters:
filename (string) – the filename to parse
- Returns:
a tuple
>>> filename = "GDL-1950-01-02-a-i0002"
>>> parse_canonical_filename(filename)
('GDL', ('1950', '01', '02'), 'a', 'i', 2, '')
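The parsing convention can be sketched with a simple regular expression. This is a hypothetical re-implementation for illustration only (`parse_canonical_filename_sketch` and `CANONICAL_RE` are not part of the library); the authoritative logic lives in `impresso_commons.path.parse_canonical_filename`:

```python
import re

# Assumed canonical naming convention, inferred from the doctest above:
# <journal>-<yyyy>-<mm>-<dd>-<edition>-<type><number>[<extension>]
CANONICAL_RE = re.compile(
    r"^(?P<journal>[A-Z]+)-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-(?P<edition>[a-z])-(?P<type>[a-z])(?P<number>\d+)(?P<ext>.*)$"
)

def parse_canonical_filename_sketch(filename):
    """Parse a canonical page/item name into its components (sketch)."""
    m = CANONICAL_RE.match(filename)
    if m is None:
        raise ValueError(f"{filename} is not a canonical filename")
    return (
        m.group("journal"),
        (m.group("year"), m.group("month"), m.group("day")),
        m.group("edition"),
        m.group("type"),
        int(m.group("number")),
        m.group("ext"),
    )

print(parse_canonical_filename_sketch("GDL-1950-01-02-a-i0002"))
# ('GDL', ('1950', '01', '02'), 'a', 'i', 2, '')
```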
I/O from file system
Code for parsing impresso’s canonical directory structures.
- impresso_commons.path.path_fs.ContentItem
alias of Item
- class impresso_commons.path.path_fs.IssueDir(journal, date, edition, path)
Bases:
tuple
- date
Alias for field number 1
- edition
Alias for field number 2
- journal
Alias for field number 0
- path
Alias for field number 3
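The field aliases above describe a plain named tuple. An equivalent sketch (the real class is defined in `impresso_commons.path.path_fs`; the `path` value here is a hypothetical location):

```python
from collections import namedtuple
import datetime

# Field order matches the documented aliases:
# journal=0, date=1, edition=2, path=3
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

issue = IssueDir(
    journal="GDL",
    date=datetime.date(1950, 1, 2),
    edition="a",
    path="/data/GDL/1950/01/02/a",  # hypothetical location
)
print(issue.journal, issue[0])  # field 0 is the journal acronym
```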
- impresso_commons.path.path_fs.canonical_path(dir, name=None, extension=None, path_type='file')
Create a canonical dir/file path from an IssueDir object.
- Parameters:
dir (IssueDir) – an object representing a newspaper issue
name (string) – the file name (used only if path_type==’file’)
extension (string) – the file extension (used only if path_type==’file’)
path_type (string) – type of path to build (‘dir’ | ‘file’)
- Return type:
string
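A minimal sketch of what such a path builder might do, assuming the canonical layout `<journal>/<yyyy>/<mm>/<dd>/<edition>` (an assumption inferred from the naming convention; `canonical_path_sketch` is hypothetical, not the library function):

```python
import os
import datetime
from collections import namedtuple

IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def canonical_path_sketch(issue_dir, name=None, extension=None, path_type="file"):
    """Build <journal>/<yyyy>/<mm>/<dd>/<edition>[/<name><extension>] (sketch).

    The exact layout is an assumption; check the library source for the
    authoritative rules."""
    d = issue_dir.date
    base = os.path.join(
        issue_dir.journal, f"{d.year:04}", f"{d.month:02}", f"{d.day:02}",
        issue_dir.edition,
    )
    if path_type == "dir":
        return base
    return os.path.join(base, f"{name}{extension or ''}")

issue = IssueDir("GDL", datetime.date(1950, 1, 2), "a", "./GDL/1950/01/02/a")
print(canonical_path_sketch(issue, path_type="dir"))
```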
- impresso_commons.path.path_fs.check_filenaming(file_basename)
Checks whether a filename complies with the naming convention (e.g. GDL-1900-01-10-a-p0001).
- Parameters:
file_basename (str) – page file (txt or image)
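A hedged sketch of such a check, assuming the page-name pattern shown above (`check_filenaming_sketch` and `PAGE_NAME_RE` are illustrative names, not the library's):

```python
import re

# Assumed pattern for page basenames like GDL-1900-01-10-a-p0001
PAGE_NAME_RE = re.compile(r"^[A-Z]+-\d{4}-\d{2}-\d{2}-[a-z]-p\d{4}$")

def check_filenaming_sketch(file_basename):
    """Return a match object if the basename follows the convention, else None."""
    return PAGE_NAME_RE.match(file_basename)

print(bool(check_filenaming_sketch("GDL-1900-01-10-a-p0001")))  # True
print(bool(check_filenaming_sketch("GDL-1900-1-10-a-p1")))      # False
```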
- impresso_commons.path.path_fs.detect_canonical_issues(base_dir, newspapers)
Parse a directory structure and detect newspaper issues to be imported.
NB: invalid directories are skipped, and a warning message is logged.
- Parameters:
base_dir (IssueDir) – the root of the directory structure
newspapers (str) – the list of newspapers to consider (blank-separated acronyms)
- Returns:
list of IssueDir instances
- Return type:
list
- impresso_commons.path.path_fs.detect_issues(base_dir, journal_filter=None, exclude=False)
Parse a directory structure and detect newspaper issues to be imported.
NB: invalid directories are skipped, and a warning message is logged.
- Parameters:
base_dir (basestring) – the root of the directory structure
journal_filter (set) – set of newspapers to filter (positive or negative)
exclude (boolean) – whether journal_filter is positive or negative
- Return type:
list of IssueDir instances
- impresso_commons.path.path_fs.detect_journal_issues(base_dir, newspapers)
Parse a directory structure and detect newspaper issues to be imported.
- Parameters:
base_dir (IssueDir) – the root of the directory structure
newspapers (str) – the list of newspapers to consider (blank-separated acronyms)
- Returns:
list of IssueDir instances
- Return type:
list
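The detection functions above all walk a directory tree and collect issue directories. A self-contained sketch of the idea, assuming the layout `<base>/<journal>/<yyyy>/<mm>/<dd>/<edition>` (the real functions also validate directories and log warnings; `detect_issues_sketch` is hypothetical):

```python
import os
import tempfile
from collections import namedtuple

IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def detect_issues_sketch(base_dir):
    """Treat <base>/<journal>/<yyyy>/<mm>/<dd>/<edition> dirs as issues (sketch)."""
    issues = []
    for journal in sorted(os.listdir(base_dir)):
        jdir = os.path.join(base_dir, journal)
        for root, dirs, _ in os.walk(jdir):
            rel = os.path.relpath(root, jdir).split(os.sep)
            if len(rel) == 4:  # year/month/day/edition
                year, month, day, edition = rel
                issues.append(IssueDir(journal, (year, month, day), edition, root))
    return issues

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "GDL", "1950", "01", "02", "a"))
    found = detect_issues_sketch(tmp)
    print(found[0].journal, found[0].date)  # GDL ('1950', '01', '02')
```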
- impresso_commons.path.path_fs.get_issueshortpath(issuedir)
Returns a short version of the issue directory path.
- impresso_commons.path.path_fs.pair_issue(issue_list1, issue_list2)
Associates pairs of issues originating from original and canonical repositories.
- Parameters:
issue_list1 (array) – list of IssueDir
issue_list2 (array) – list of IssueDir
- Returns:
list containing tuples of issue pairs [(issue1, issue2), (…)]
- Return type:
list
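A plausible way to pair such lists is to index one list by the identifying fields (journal, date, edition) and look the other up against it. This is a sketch under that assumption, not the library's actual implementation (`pair_issues_sketch` is a hypothetical name):

```python
from collections import namedtuple

IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def pair_issues_sketch(issue_list1, issue_list2):
    """Pair issues that share (journal, date, edition), wherever each copy lives."""
    index = {(i.journal, i.date, i.edition): i for i in issue_list2}
    return [
        (i, index[(i.journal, i.date, i.edition)])
        for i in issue_list1
        if (i.journal, i.date, i.edition) in index
    ]

orig = [IssueDir("GDL", "1950-01-02", "a", "/original/GDL/1950/01/02/a")]
canon = [IssueDir("GDL", "1950-01-02", "a", "/canonical/GDL/1950/01/02/a")]
pairs = pair_issues_sketch(orig, canon)
print(len(pairs))  # 1
```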
- impresso_commons.path.path_fs.select_issues(config_dict, inp_dir)
Reads a configuration file and selects the newspapers/issues to consider. See config.example.md for explanations.
- Usage example:
if config_file and os.path.isfile(config_file):
    with open(config_file, 'r') as f:
        config = json.load(f)
    issues = select_issues(config, inp_dir)
else:
    issues = detect_issues(inp_dir)
- Parameters:
config_dict (dict) – dict of newspaper filter parameters
inp_dir (str) – base directory from which to get the issues
I/O from S3
Code for parsing impresso’s S3 directory structures.
- impresso_commons.path.path_s3.IssueDir
alias of IssueDirectory
- impresso_commons.path.path_s3.fetch_files(bucket_name: str, compute: bool = True, file_type: str = 'issues', newspapers_filter: list[str] | None = None) tuple[Bag | list[str] | None, Bag | list[str] | None]
Fetch issue and/or page canonical JSON files from an s3 bucket.
If compute=True, the output will be a list of the contents of all files in the bucket for the specified newspapers and type of files. If compute=False, the output will remain in a distributed dask.bag.
Based on file_type, the issue files, page files or both will be returned. In the returned tuple, issues are always in the first element and pages in the second, hence if file_type is not ‘both’, the tuple entry corresponding to the undesired type of files will be None.
Note
adapted from https://github.com/impresso/impresso-data-sanitycheck/tree/master/sanity_check/contents/s3_data.py
- Parameters:
bucket_name (str) – Name of the s3 bucket to fetch the files from
compute (bool, optional) – Whether to compute result and output as list. Defaults to True.
file_type (str, optional) – Type of files to list, possible values are “issues”, “pages” and “both”. Defaults to “issues”.
newspapers_filter (list[str] | None, optional) – List of newspapers to consider. If None, all will be considered. Defaults to None.
- Raises:
NotImplementedError – The given file_type is not one of [‘issues’, ‘pages’, ‘both’].
- Returns:
[0] Issue files’ contents or None and [1] Page files’ contents or None based on file_type
- Return type:
tuple[db.core.Bag|None, db.core.Bag|None] | tuple[list[str]|None, list[str]|None]
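The return contract described above (issues always first, pages second, `None` in the unrequested slot) can be sketched as plain dispatch logic. This is an illustration of the documented contract only, not the library's code (`dispatch_file_types_sketch` is a hypothetical name):

```python
def dispatch_file_types_sketch(file_type, issue_files, page_files):
    """Place issues first, pages second; the unrequested slot is None (sketch)."""
    if file_type == "issues":
        return issue_files, None
    if file_type == "pages":
        return None, page_files
    if file_type == "both":
        return issue_files, page_files
    raise NotImplementedError(
        f"file_type must be one of ['issues', 'pages', 'both'], got {file_type!r}"
    )

print(dispatch_file_types_sketch("pages", ["i1"], ["p1"]))  # (None, ['p1'])
```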
- impresso_commons.path.path_s3.impresso_iter_bucket(bucket_name, item_type=None, prefix=None, filter_config=None, partition_size=15)
Iterate over a bucket, possibly with a filter, and return an array of either IssueDir or ContentItem. VALID ONLY for original-canonical data, where there are individual files for issues and content items (articles).
- Parameters:
bucket_name (str) – the name of the bucket, e.g. ‘original-canonical-data’
item_type (str) – ‘issue’ or ‘item’
prefix (str) – e.g. ‘GDL/1950’, used to filter keys; exclusive of ‘filter_config’
filter_config (dict) – a dict with newspaper acronyms as keys and arrays of year intervals as values, e.g. { “GDL”: [1950, 1960], “JDG”: [1890, 1900] }; the last year is excluded
partition_size (int) – partition size of the dask bag used to build the objects (IssueDir or ContentItem)
- Returns:
an array of (filtered) IssueDir or ContentItem objects
- impresso_commons.path.path_s3.list_files(bucket_name: str, file_type: str = 'issues', newspapers_filter: list[str] | None = None) tuple[list[str] | None, list[str] | None]
List the canonical files located in a given S3 bucket.
Note
adapted from https://github.com/impresso/impresso-data-sanitycheck/tree/master/sanity_check/contents/s3_data.py
- Parameters:
bucket_name (str) – S3 bucket name.
file_type (str, optional) – Type of files to list, possible values are “issues”, “pages” and “both”. Defaults to “issues”.
newspapers_filter (list[str] | None, optional) – List of newspapers to consider. If None, all will be considered. Defaults to None.
- Raises:
NotImplementedError – The given file_type is not one of [‘issues’, ‘pages’, ‘both’].
- Returns:
- [0] List of issue files or None and
[1] List of page files or None based on file_type
- Return type:
tuple[list[str] | None, list[str] | None]
- impresso_commons.path.path_s3.list_newspapers(bucket_name: str, s3_client=<botocore.client.S3 object>, page_size: int = 10000) list[str]
List newspapers contained in an s3 bucket with impresso data.
Note
25,000 seems to be the maximum PageSize value supported by SwitchEngines’ S3 implementation (ceph).
Note
Copied from https://github.com/impresso/impresso-data-sanitycheck/tree/master/sanity_check/contents/s3_data.py
- Parameters:
bucket_name (str) – Name of the S3 bucket to consider
s3_client (optional) – S3 client to use. Defaults to get_s3_client().
page_size (int, optional) – Pagination configuration. Defaults to 10000.
- Returns:
List of newspapers (aliases) present in the given S3 bucket.
- Return type:
list[str]
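Assuming canonical keys are prefixed with the newspaper alias (e.g. `GDL/...`, which the canonical layout suggests but this page does not state explicitly), listing newspapers amounts to collecting distinct first path components of the bucket's keys. A hedged sketch with hypothetical example keys:

```python
def newspapers_from_keys_sketch(keys):
    """Collect distinct first path components of S3 keys (sketch)."""
    return sorted({key.split("/")[0] for key in keys})

# Hypothetical keys, for illustration only
keys = [
    "GDL/GDL-1950-01-02-a-issue.json",
    "GDL/GDL-1950-01-03-a-issue.json",
    "JDG/JDG-1890-05-01-a-issue.json",
]
print(newspapers_from_keys_sketch(keys))  # ['GDL', 'JDG']
```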
- impresso_commons.path.path_s3.read_s3_issues(newspaper, year, input_bucket)
- class impresso_commons.path.path_s3.s3ContentItem(journal, date, edition, number, key_name, doc_type=None, rebuilt_version=None, canonical_version=None)
Bases:
object
- impresso_commons.path.path_s3.s3_filter_archives(bucket_name, config, suffix='.jsonl.bz2')
Iterate over a bucket and filter according to config and suffix. Config is a dict where keys are newspaper acronyms and values are arrays of 2 years, considered as a time interval.
- Example:
config = {
    "GDL": [1960, 1970],      # will take all years in the interval
    "JDG": [],                # an empty array means no filter, all years
    "GDL": [1798, 1999, 10],  # take every 10th item within the sequence of years
}
- Parameters:
bucket_name (str) – the name of the bucket
config (Dict) – newspaper/years to consider
suffix (str) – end of the key
- Returns:
array of keys
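The documented config semantics map naturally onto Python's `range`: a two-element array is a half-open interval (last year excluded), a three-element array adds a step, and an empty array keeps everything. A hedged sketch of that per-year test (`year_filter_sketch` is hypothetical, not the library function):

```python
def year_filter_sketch(years_config, year):
    """Apply the documented config semantics to a single year (sketch).

    [] -> keep all years; [start, end] -> start <= year < end (end excluded);
    [start, end, step] -> keep every step-th year in range(start, end, step)."""
    if not years_config:
        return True
    return year in range(*years_config)

print(year_filter_sketch([1960, 1970], 1965))      # True
print(year_filter_sketch([1960, 1970], 1970))      # False (last year excluded)
print(year_filter_sketch([1798, 1999, 10], 1808))  # True
print(year_filter_sketch([], 1850))                # True
```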
- impresso_commons.path.path_s3.s3_iter_bucket(bucket_name, prefix, suffix)
Iterate over a bucket and return all keys with the given prefix and suffix.
>>> b = get_bucket("myBucket", create=False)
>>> k = s3_iter_bucket(b.name, prefix='GDL', suffix=".bz2")
- Parameters:
bucket_name (str) – the name of the bucket
prefix (str) – beginning of the key
suffix (str) – how the key ends
- Returns:
array of keys