Input/Output
General
- impresso_commons.path.id2IssueDir(id, path)
TODO: documentation
- impresso_commons.path.parse_canonical_filename(filename)
Parse a canonical page name into its components.
- Parameters:
filename (string) – the filename to parse
- Returns:
a tuple
>>> filename = "GDL-1950-01-02-a-i0002"
>>> parse_canonical_filename(filename)
('GDL', ('1950', '01', '02'), 'a', 'i', 2, '')
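The parsing convention can be sketched with a simple regular expression. This is a hypothetical re-implementation for illustration only (`parse_canonical_filename_sketch` and `CANONICAL_RE` are not part of the library); the authoritative logic lives in `impresso_commons.path.parse_canonical_filename`:

```python
import re

# Assumed canonical naming convention, inferred from the doctest above:
# <journal>-<yyyy>-<mm>-<dd>-<edition>-<type><number>[<extension>]
CANONICAL_RE = re.compile(
    r"^(?P<journal>[A-Z]+)-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-(?P<edition>[a-z])-(?P<type>[a-z])(?P<number>\d+)(?P<ext>.*)$"
)

def parse_canonical_filename_sketch(filename):
    """Parse a canonical page/item name into its components (sketch)."""
    m = CANONICAL_RE.match(filename)
    if m is None:
        raise ValueError(f"{filename} is not a canonical filename")
    return (
        m.group("journal"),
        (m.group("year"), m.group("month"), m.group("day")),
        m.group("edition"),
        m.group("type"),
        int(m.group("number")),
        m.group("ext"),
    )

print(parse_canonical_filename_sketch("GDL-1950-01-02-a-i0002"))
# ('GDL', ('1950', '01', '02'), 'a', 'i', 2, '')
```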
I/O from file system
Code for parsing impresso’s canonical directory structures.
- impresso_commons.path.path_fs.ContentItem
alias of Item
- class impresso_commons.path.path_fs.IssueDir(journal, date, edition, path)
Bases:
tuple
- date
Alias for field number 1
- edition
Alias for field number 2
- journal
Alias for field number 0
- path
Alias for field number 3
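The field aliases above describe a plain named tuple. An equivalent sketch (the real class is defined in `impresso_commons.path.path_fs`; the `path` value here is a hypothetical location):

```python
from collections import namedtuple
import datetime

# Field order matches the documented aliases:
# journal=0, date=1, edition=2, path=3
IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

issue = IssueDir(
    journal="GDL",
    date=datetime.date(1950, 1, 2),
    edition="a",
    path="/data/GDL/1950/01/02/a",  # hypothetical location
)
print(issue.journal, issue[0])  # field 0 is the journal acronym
```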
- impresso_commons.path.path_fs.canonical_path(dir, name=None, extension=None, path_type='file')
Create a canonical dir/file path from an IssueDir object.
- Parameters:
dir (IssueDir) – an object representing a newspaper issue
name (string) – the file name (used only if path_type==’file’)
extension (string) – the file extension (used only if path_type==’file’)
path_type (string) – type of path to build (‘dir’ | ‘file’)
- Return type:
string
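A minimal sketch of what such a path builder might do, assuming the canonical layout `<journal>/<yyyy>/<mm>/<dd>/<edition>` (an assumption inferred from the naming convention; `canonical_path_sketch` is hypothetical, not the library function):

```python
import os
import datetime
from collections import namedtuple

IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def canonical_path_sketch(issue_dir, name=None, extension=None, path_type="file"):
    """Build <journal>/<yyyy>/<mm>/<dd>/<edition>[/<name><extension>] (sketch).

    The exact layout is an assumption; check the library source for the
    authoritative rules."""
    d = issue_dir.date
    base = os.path.join(
        issue_dir.journal, f"{d.year:04}", f"{d.month:02}", f"{d.day:02}",
        issue_dir.edition,
    )
    if path_type == "dir":
        return base
    return os.path.join(base, f"{name}{extension or ''}")

issue = IssueDir("GDL", datetime.date(1950, 1, 2), "a", "./GDL/1950/01/02/a")
print(canonical_path_sketch(issue, path_type="dir"))
```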
- impresso_commons.path.path_fs.check_filenaming(file_basename)
Checks whether a filename complies with the naming convention (e.g. GDL-1900-01-10-a-p0001).
- Parameters:
file_basename (str) – page file (txt or image)
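A hedged sketch of such a check, assuming the page-name pattern shown above (`check_filenaming_sketch` and `PAGE_NAME_RE` are illustrative names, not the library's):

```python
import re

# Assumed pattern for page basenames like GDL-1900-01-10-a-p0001
PAGE_NAME_RE = re.compile(r"^[A-Z]+-\d{4}-\d{2}-\d{2}-[a-z]-p\d{4}$")

def check_filenaming_sketch(file_basename):
    """Return a match object if the basename follows the convention, else None."""
    return PAGE_NAME_RE.match(file_basename)

print(bool(check_filenaming_sketch("GDL-1900-01-10-a-p0001")))  # True
print(bool(check_filenaming_sketch("GDL-1900-1-10-a-p1")))      # False
```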
- impresso_commons.path.path_fs.detect_canonical_issues(base_dir, newspapers)
Parse a directory structure and detect newspaper issues to be imported.
NB: invalid directories are skipped, and a warning message is logged.
- Parameters:
base_dir (IssueDir) – the root of the directory structure
newspapers (str) – the list of newspapers to consider (blank-separated acronyms)
- Returns:
list of IssueDir instances
- Return type:
list
- impresso_commons.path.path_fs.detect_issues(base_dir, journal_filter=None, exclude=False)
Parse a directory structure and detect newspaper issues to be imported.
NB: invalid directories are skipped, and a warning message is logged.
- Parameters:
base_dir (basestring) – the root of the directory structure
journal_filter (set) – set of newspapers to filter (positive or negative)
exclude (boolean) – whether journal_filter is positive or negative
- Return type:
list of IssueDir instances
- impresso_commons.path.path_fs.detect_journal_issues(base_dir, newspapers)
Parse a directory structure and detect newspaper issues to be imported.
- Parameters:
base_dir (IssueDir) – the root of the directory structure
newspapers (str) – the list of newspapers to consider (blank-separated acronyms)
- Returns:
list of IssueDir instances
- Return type:
list
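The detection functions above all walk a directory tree and collect issue directories. A self-contained sketch of the idea, assuming the layout `<base>/<journal>/<yyyy>/<mm>/<dd>/<edition>` (the real functions also validate directories and log warnings; `detect_issues_sketch` is hypothetical):

```python
import os
import tempfile
from collections import namedtuple

IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def detect_issues_sketch(base_dir):
    """Treat <base>/<journal>/<yyyy>/<mm>/<dd>/<edition> dirs as issues (sketch)."""
    issues = []
    for journal in sorted(os.listdir(base_dir)):
        jdir = os.path.join(base_dir, journal)
        for root, dirs, _ in os.walk(jdir):
            rel = os.path.relpath(root, jdir).split(os.sep)
            if len(rel) == 4:  # year/month/day/edition
                year, month, day, edition = rel
                issues.append(IssueDir(journal, (year, month, day), edition, root))
    return issues

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "GDL", "1950", "01", "02", "a"))
    found = detect_issues_sketch(tmp)
    print(found[0].journal, found[0].date)  # GDL ('1950', '01', '02')
```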
- impresso_commons.path.path_fs.get_issueshortpath(issuedir)
Returns a short version of the issue directory path.
- impresso_commons.path.path_fs.pair_issue(issue_list1, issue_list2)
Associates pairs of issues originating from original and canonical repositories.
- Parameters:
issue_list1 (array) – list of IssueDir
issue_list2 (array) – list of IssueDir
- Returns:
list containing tuples of issue pairs [(issue1, issue2), (…)]
- Return type:
list
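A plausible way to pair such lists is to index one list by the identifying fields (journal, date, edition) and look the other up against it. This is a sketch under that assumption, not the library's actual implementation (`pair_issues_sketch` is a hypothetical name):

```python
from collections import namedtuple

IssueDir = namedtuple("IssueDir", ["journal", "date", "edition", "path"])

def pair_issues_sketch(issue_list1, issue_list2):
    """Pair issues that share (journal, date, edition), wherever each copy lives."""
    index = {(i.journal, i.date, i.edition): i for i in issue_list2}
    return [
        (i, index[(i.journal, i.date, i.edition)])
        for i in issue_list1
        if (i.journal, i.date, i.edition) in index
    ]

orig = [IssueDir("GDL", "1950-01-02", "a", "/original/GDL/1950/01/02/a")]
canon = [IssueDir("GDL", "1950-01-02", "a", "/canonical/GDL/1950/01/02/a")]
pairs = pair_issues_sketch(orig, canon)
print(len(pairs))  # 1
```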
- impresso_commons.path.path_fs.select_issues(config_dict, inp_dir)
Reads a configuration file and selects the newspapers/issues to consider. See config.example.md for explanations.
- Usage example:
if config_file and os.path.isfile(config_file):
    with open(config_file, 'r') as f:
        config = json.load(f)
    issues = select_issues(config, inp_dir)
else:
    issues = detect_issues(inp_dir)
- Parameters:
config_dict (dict) – dict of newspaper filter parameters
inp_dir (str) – base directory from which to get the issues
I/O from S3
Code for parsing impresso’s S3 directory structures.
- impresso_commons.path.path_s3.IssueDir
alias of IssueDirectory
- impresso_commons.path.path_s3.fetch_files(bucket_name: str, compute: bool = True, file_type: str = 'issues', newspapers_filter: list[str] | None = None) tuple[Bag | list[str] | None, Bag | list[str] | None]
Fetch issue and/or page canonical JSON files from an s3 bucket.
If compute=True, the output will be a list of the contents of all files in the bucket for the specified newspapers and type of files. If compute=False, the output will remain in a distributed dask.bag.
Based on file_type, the issue files, page files or both will be returned. In the returned tuple, issues are always in the first element and pages in the second, hence if file_type is not ‘both’, the tuple entry corresponding to the undesired type of files will be None.
Note
adapted from https://github.com/impresso/impresso-data-sanitycheck/tree/master/sanity_check/contents/s3_data.py
- Parameters:
bucket_name (str) – Name of the s3 bucket to fetch the files from
compute (bool, optional) – Whether to compute result and output as list. Defaults to True.
file_type (str, optional) – Type of files to list, possible values are “issues”, “pages” and “both”. Defaults to “issues”.
newspapers_filter (list[str] | None, optional) – List of newspapers to consider. If None, all will be considered. Defaults to None.
- Raises:
NotImplementedError – The given file_type is not one of [‘issues’, ‘pages’, ‘both’].
- Returns:
[0] Issue files’ contents or None and [1] Page files’ contents or None based on file_type
- Return type:
tuple[db.core.Bag|None, db.core.Bag|None] | tuple[list[str]|None, list[str]|None]
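The return contract described above (issues always first, pages second, `None` in the unrequested slot) can be sketched as plain dispatch logic. This is an illustration of the documented contract only, not the library's code (`dispatch_file_types_sketch` is a hypothetical name):

```python
def dispatch_file_types_sketch(file_type, issue_files, page_files):
    """Place issues first, pages second; the unrequested slot is None (sketch)."""
    if file_type == "issues":
        return issue_files, None
    if file_type == "pages":
        return None, page_files
    if file_type == "both":
        return issue_files, page_files
    raise NotImplementedError(
        f"file_type must be one of ['issues', 'pages', 'both'], got {file_type!r}"
    )

print(dispatch_file_types_sketch("pages", ["i1"], ["p1"]))  # (None, ['p1'])
```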
- impresso_commons.path.path_s3.impresso_iter_bucket(bucket_name, item_type=None, prefix=None, filter_config=None, partition_size=15)
Iterate over a bucket, possibly with a filter, and return an array of either IssueDir or ContentItem. VALID ONLY for original-canonical data, where there are individual files for issues and content items (articles).
- Parameters:
bucket_name (str) – the name of the bucket, e.g. ‘original-canonical-data’
item_type (str) – ‘issue’ or ‘item’
prefix (str) – e.g. ‘GDL/1950’, used to filter keys; exclusive of ‘filter_config’
filter_config (dict) – a dict with newspaper acronyms as keys and arrays of year intervals as values, e.g. { “GDL”: [1950, 1960], “JDG”: [1890, 1900] }; the last year is excluded
partition_size (int) – partition size of the dask bag used to build the objects (IssueDir or ContentItem)
- Returns:
an array of (filtered) IssueDir or ContentItem objects
- impresso_commons.path.path_s3.list_files(bucket_name: str, file_type: str = 'issues', newspapers_filter: list[str] | None = None) tuple[list[str] | None, list[str] | None]
List the canonical files located in a given S3 bucket.
Note
adapted from https://github.com/impresso/impresso-data-sanitycheck/tree/master/sanity_check/contents/s3_data.py
- Parameters:
bucket_name (str) – S3 bucket name.
file_type (str, optional) – Type of files to list, possible values are “issues”, “pages” and “both”. Defaults to “issues”.
newspapers_filter (list[str] | None, optional) – List of newspapers to consider. If None, all will be considered. Defaults to None.
- Raises:
NotImplementedError – The given file_type is not one of [‘issues’, ‘pages’, ‘both’].
- Returns:
- [0] List of issue files or None and
[1] List of page files or None based on file_type
- Return type:
tuple[list[str] | None, list[str] | None]
- impresso_commons.path.path_s3.list_newspapers(bucket_name: str, s3_client=<botocore.client.S3 object>, page_size: int = 10000) list[str]
List newspapers contained in an s3 bucket with impresso data.
Note
25,000 seems to be the maximum PageSize value supported by SwitchEngines’ S3 implementation (ceph).
Note
Copied from https://github.com/impresso/impresso-data-sanitycheck/tree/master/sanity_check/contents/s3_data.py
- Parameters:
bucket_name (str) – Name of the S3 bucket to consider
s3_client (optional) – S3 client to use. Defaults to get_s3_client().
page_size (int, optional) – Pagination configuration. Defaults to 10000.
- Returns:
List of newspapers (aliases) present in the given S3 bucket.
- Return type:
list[str]
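Assuming canonical keys are prefixed with the newspaper alias (e.g. `GDL/...`, which the canonical layout suggests but this page does not state explicitly), listing newspapers amounts to collecting distinct first path components of the bucket's keys. A hedged sketch with hypothetical example keys:

```python
def newspapers_from_keys_sketch(keys):
    """Collect distinct first path components of S3 keys (sketch)."""
    return sorted({key.split("/")[0] for key in keys})

# Hypothetical keys, for illustration only
keys = [
    "GDL/GDL-1950-01-02-a-issue.json",
    "GDL/GDL-1950-01-03-a-issue.json",
    "JDG/JDG-1890-05-01-a-issue.json",
]
print(newspapers_from_keys_sketch(keys))  # ['GDL', 'JDG']
```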
- impresso_commons.path.path_s3.read_s3_issues(newspaper, year, input_bucket)
- class impresso_commons.path.path_s3.s3ContentItem(journal, date, edition, number, key_name, doc_type=None, rebuilt_version=None, canonical_version=None)
Bases:
object
- impresso_commons.path.path_s3.s3_filter_archives(bucket_name, config, suffix='.jsonl.bz2')
Iterate over a bucket and filter according to config and suffix. Config is a dict where keys are newspaper acronyms and values are arrays of 2 years, considered as a time interval.
- Example:
config = {
    "GDL": [1960, 1970],      # will take all years in the interval
    "JDG": [],                # an empty array means no filter, all years
    "GDL": [1798, 1999, 10],  # take every 10th item within the sequence of years
}
- Parameters:
bucket_name (str) – the name of the bucket
config (Dict) – newspaper/years to consider
suffix (str) – end of the key
- Returns:
array of keys
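The documented config semantics map naturally onto Python's `range`: a two-element array is a half-open interval (last year excluded), a three-element array adds a step, and an empty array keeps everything. A hedged sketch of that per-year test (`year_filter_sketch` is hypothetical, not the library function):

```python
def year_filter_sketch(years_config, year):
    """Apply the documented config semantics to a single year (sketch).

    [] -> keep all years; [start, end] -> start <= year < end (end excluded);
    [start, end, step] -> keep every step-th year in range(start, end, step)."""
    if not years_config:
        return True
    return year in range(*years_config)

print(year_filter_sketch([1960, 1970], 1965))      # True
print(year_filter_sketch([1960, 1970], 1970))      # False (last year excluded)
print(year_filter_sketch([1798, 1999, 10], 1808))  # True
print(year_filter_sketch([], 1850))                # True
```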
- impresso_commons.path.path_s3.s3_iter_bucket(bucket_name, prefix, suffix)
Iterate over a bucket and return all keys with the given prefix and suffix.
>>> b = get_bucket("myBucket", create=False)
>>> k = s3_iter_bucket(b.name, prefix='GDL', suffix=".bz2")
- Parameters:
bucket_name (str) – the name of the bucket
prefix (str) – beginning of the key
suffix (str) – how the key ends
- Returns:
array of keys