hdx.utilities.retriever

Retrieve Objects

class Retrieve(BaseDownload)

Retrieve class which takes in a Download object and can either download, download and save, or use previously downloaded and saved data. It can also use a static fallback when downloading fails.

Arguments:

  • downloader Download - Download object
  • fallback_dir str - Directory containing static fallback data
  • saved_dir str - Directory to save or load downloaded data
  • temp_dir str - Temporary directory for when data is not needed after downloading
  • save bool - Whether to save downloaded data. Defaults to False.
  • use_saved bool - Whether to use saved data. Defaults to False.
  • prefix str - Prefix to add to filenames. Defaults to "".
  • delete bool - Whether to delete saved_dir if save is True. Defaults to True.
  • log_level int - Level at which to log messages. Defaults to logging.INFO.
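
The three modes described above (download only, download and save, use saved data) can be pictured as a choice of directory. The sketch below is a standalone illustration, not the library's code: `choose_path` is a hypothetical helper, and both the directory rule and the underscore joining of `prefix` are assumptions drawn from the argument descriptions.

```python
from os.path import join


def choose_path(filename, saved_dir, temp_dir, save=False, use_saved=False, prefix=""):
    """Hypothetical helper: pick where a retrieved file would live."""
    # Assumption: the prefix is joined to the filename with an underscore.
    name = f"{prefix}_{filename}" if prefix else filename
    if save or use_saved:
        # Saved data is written to (save) or read from (use_saved) saved_dir.
        return join(saved_dir, name)
    # Otherwise the data is not needed after downloading, so use temp_dir.
    return join(temp_dir, name)


print(choose_path("data.csv", "saved", "tmp"))
print(choose_path("data.csv", "saved", "tmp", save=True, prefix="population"))
```

With neither flag set the file lands in the temporary directory; with `save=True` it lands in `saved_dir` under the prefixed name.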

check_flags

@staticmethod
def check_flags(saved_dir: str, save: bool, use_saved: bool,
                delete: bool) -> None

Check flags. Also delete saved_dir if save and delete are True.

Arguments:

  • saved_dir str - Directory to save or load downloaded data
  • save bool - Whether to save downloaded data
  • use_saved bool - Whether to use saved data
  • delete bool - Whether to delete saved_dir if save is True

Returns:

None
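
A minimal sketch of what such a flag check might look like follows. It is not the library's implementation: `check_flags_sketch` is a hypothetical name, and two behaviours are assumptions, namely that `save` and `use_saved` cannot both be True and that the directory is recreated empty after deletion.

```python
import tempfile
from os import mkdir
from os.path import exists, join
from shutil import rmtree


def check_flags_sketch(saved_dir, save, use_saved, delete):
    """Hypothetical stand-in for the flag check described above."""
    if save:
        # Assumption: saving and reusing saved data are mutually exclusive.
        if use_saved:
            raise ValueError("Only one of save and use_saved can be True")
        if delete:
            # Assumption: start again from an empty saved_dir so that
            # stale files are not mixed with fresh downloads.
            rmtree(saved_dir, ignore_errors=True)
            mkdir(saved_dir)


with tempfile.TemporaryDirectory() as base:
    saved_dir = join(base, "saved")
    mkdir(saved_dir)
    stale = join(saved_dir, "old.json")
    with open(stale, "w") as f:
        f.write("{}")
    check_flags_sketch(saved_dir, save=True, use_saved=False, delete=True)
    stale_removed = not exists(stale)
    print(stale_removed)
```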

get_url_logstr

@staticmethod
def get_url_logstr(url: str) -> str

Get the URL string to use in log messages. It is truncated to 100 characters if it is longer than that.

Arguments:

  • url str - URL to download

Returns:

  • str - Url string to use in logs
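
The truncation can be illustrated with a few lines of standalone code. `url_logstr_sketch` is a hypothetical stand-in, and appending an ellipsis to mark the cut is an assumption rather than documented behaviour.

```python
def url_logstr_sketch(url, limit=100):
    """Hypothetical stand-in: shorten long URLs for log messages."""
    if len(url) > limit:
        # Assumption: an ellipsis marks that the URL was cut.
        return f"{url[:limit]}..."
    return url


short = "https://example.com/data.csv"
long_url = "https://example.com/" + "a" * 200
print(url_logstr_sketch(short))
print(len(url_logstr_sketch(long_url)))
```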

clone

def clone(downloader: Download) -> "Retrieve"

Clone this retriever, but use the given downloader in the clone.

Arguments:

  • downloader Download - Downloader to use

Returns:

  • Retrieve - Cloned retriever

get_filename

def get_filename(url: str,
                 filename: Optional[str] = None,
                 possible_extensions: Tuple[str, ...] = tuple(),
                 **kwargs: Any) -> Tuple[str, Any]

Get filename from url and given parameters.

Arguments:

  • url str - Url from which to get filename
  • filename Optional[str] - Filename to use. Defaults to None (infer from url).
  • possible_extensions Tuple[str, ...] - Possible extensions to look for in url
  • **kwargs - See below
  • format str - Given extension to look for in url
  • file_type str - Given extension to look for in url

Returns:

Tuple[str, Any]: Tuple of (filename, kwargs)
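
One way to picture the inference is sketched below. This is an illustration under stated assumptions, not the library's algorithm: `filename_from_url_sketch` is a hypothetical name, the filename is taken from the last path segment of the URL, and the first candidate extension is appended when the name has none of the known ones.

```python
from os.path import basename, splitext
from urllib.parse import urlsplit


def filename_from_url_sketch(url, filename=None, possible_extensions=(), **kwargs):
    """Hypothetical stand-in: derive a filename and pass kwargs back."""
    if filename is None:
        # Use the last path segment of the URL, ignoring any query string.
        filename = basename(urlsplit(url).path)
    extension = splitext(filename)[1]
    # Candidate extensions: format and file_type from kwargs, then the rest.
    candidates = [kwargs.get("format"), kwargs.get("file_type"), *possible_extensions]
    candidates = [f".{c.lstrip('.')}" for c in candidates if c]
    if candidates and extension.lower() not in candidates:
        # Assumption: with no known extension present, append the first candidate.
        filename = f"{filename}{candidates[0]}"
    return filename, kwargs


print(filename_from_url_sketch("https://example.com/files/data.csv?x=1"))
print(filename_from_url_sketch("https://example.com/export", format="json"))
```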

set_bearer_token

def set_bearer_token(bearer_token: str) -> None

Set the bearer token in the downloader.

Arguments:

  • bearer_token str - Bearer token

Returns:

None

download_file

def download_file(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> str

Retrieve file.

Arguments:

  • url str - URL to download
  • filename Optional[str] - Filename of saved file. Defaults to getting from url.
  • logstr Optional[str] - Text to use in log string to describe download. Defaults to filename.
  • fallback bool - Whether to use static fallback if download fails. Defaults to False.
  • log_level int - Level at which to log messages. Overrides level from constructor.
  • **kwargs - Parameters to pass to download_file call

Returns:

  • str - Path to downloaded file
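
The `fallback` flag, which this method shares with the other download methods, can be sketched as a try/except around the download. The code is a standalone illustration: `retrieve_with_fallback` and `failing_download` are hypothetical, and the assumption is that the fallback file has the same name inside `fallback_dir`.

```python
from os.path import join


def retrieve_with_fallback(download, url, filename, fallback_dir, fallback=False):
    """Hypothetical helper: try a download, fall back to a static file."""
    try:
        return download(url)
    except Exception:
        if not fallback:
            # No fallback requested, so the failure propagates to the caller.
            raise
        # Assumption: the fallback file has the same name inside fallback_dir.
        return join(fallback_dir, filename)


def failing_download(url):
    # Simulates a network failure for the example.
    raise OSError(f"cannot reach {url}")


path = retrieve_with_fallback(
    failing_download, "https://example.com/data.csv", "data.csv", "fallbacks", fallback=True
)
print(path)
```

With `fallback=False` the same call would re-raise the `OSError` instead of returning a path.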

download_text

def download_text(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> str

Download text.

Arguments:

  • url str - URL to download
  • filename Optional[str] - Filename of saved file. Defaults to getting from url.
  • logstr Optional[str] - Text to use in log string to describe download. Defaults to filename.
  • fallback bool - Whether to use static fallback if download fails. Defaults to False.
  • log_level int - Level at which to log messages. Overrides level from constructor.
  • **kwargs - Parameters to pass to download_text call

Returns:

  • str - The text from the file

download_yaml

def download_yaml(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> Any

Retrieve YAML.

Arguments:

  • url str - URL to download
  • filename Optional[str] - Filename of saved file. Defaults to getting from url.
  • logstr Optional[str] - Text to use in log string to describe download. Defaults to filename.
  • fallback bool - Whether to use static fallback if download fails. Defaults to False.
  • log_level int - Level at which to log messages. Overrides level from constructor.
  • **kwargs - Parameters to pass to download_yaml call

Returns:

  • Any - The data from the YAML file

download_json

def download_json(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> Any

Retrieve JSON.

Arguments:

  • url str - URL to download
  • filename Optional[str] - Filename of saved file. Defaults to getting from url.
  • logstr Optional[str] - Text to use in log string to describe download. Defaults to filename.
  • fallback bool - Whether to use static fallback if download fails. Defaults to False.
  • log_level int - Level at which to log messages. Overrides level from constructor.
  • **kwargs - Parameters to pass to download_json call

Returns:

  • Any - The data from the JSON file

get_tabular_rows

def get_tabular_rows(url: Union[str, ListTuple[str]],
                     has_hxl: bool = False,
                     headers: Union[int, ListTuple[int], ListTuple[str]] = 1,
                     dict_form: bool = False,
                     filename: Optional[str] = None,
                     logstr: Optional[str] = None,
                     fallback: bool = False,
                     **kwargs: Any) -> Tuple[List[str], Iterator[ListDict]]

Returns the header of the tabular file(s) pointed to by url and an iterator where each row is returned as a list or dictionary depending on the dict_form argument.

When a list of urls is supplied (in url), the has_hxl flag indicates whether the files are HXLated, so that the HXL row is only included from the first file. The headers argument is either a row number or a list of row numbers (in the case of multi-line headers) to be treated as headers (rows are counted from 1), or the actual headers given as a list of strings. It defaults to 1. The dict_form argument specifies whether each row should be returned as a dictionary or a list, defaulting to a list.

Arguments:

  • url Union[str, ListTuple[str]] - A single URL or path, or a list of URLs or paths, to read from
  • has_hxl bool - Whether files have HXL hashtags. Defaults to False.
  • headers Union[int, ListTuple[int], ListTuple[str]] - Number of row(s) containing headers or list of headers. Defaults to 1.
  • dict_form bool - Whether to return each row as a dict (True) or a list (False). Defaults to False (list).
  • filename Optional[str] - Filename of saved file. Defaults to getting from url.
  • logstr Optional[str] - Text to use in log string to describe download. Defaults to filename.
  • fallback bool - Whether to use static fallback if download fails. Defaults to False.
  • **kwargs - Parameters to pass to download_file call

Returns:

  • Tuple[List[str],Iterator[ListDict]] - Tuple (headers, iterator where each row is a list or dictionary)
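
The shape of the return value, a header list plus a row iterator, can be illustrated with the standard library. `tabular_rows_sketch` is a hypothetical stand-in that reads CSV text from memory and supports only a single integer `headers` value; it does not handle HXL rows, multi-line headers or multiple files.

```python
import csv
from io import StringIO


def tabular_rows_sketch(text, headers=1, dict_form=False):
    """Hypothetical stand-in: return (headers, row iterator) from CSV text."""
    rows = list(csv.reader(StringIO(text)))
    # headers is a 1-based row number, so row 1 is index 0.
    header = rows[headers - 1]
    data = rows[headers:]

    def iterator():
        for row in data:
            # Each row is a dict keyed by header, or a plain list.
            yield dict(zip(header, row)) if dict_form else row

    return header, iterator()


text = "country,value\nKenya,1\nMali,2\n"
header, rows = tabular_rows_sketch(text)
list_rows = list(rows)
_, rows = tabular_rows_sketch(text, dict_form=True)
dict_rows = list(rows)
print(header, list_rows, dict_rows)
```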

generate_retrievers

@classmethod
def generate_retrievers(cls,
                        fallback_dir: str,
                        saved_dir: str,
                        temp_dir: str,
                        save: bool = False,
                        use_saved: bool = False,
                        ignore: ListTuple[str] = tuple(),
                        delete: bool = True,
                        **kwargs: Any) -> None

Generate retrievers. Retrievers are generated from downloaders, so Download.generate_downloaders() must be called first. Each retriever can either download, download and save, or use previously downloaded and saved data. Each can also use a static fallback when downloading fails.

Arguments:

  • fallback_dir str - Directory containing static fallback data
  • saved_dir str - Directory to save or load downloaded data
  • temp_dir str - Temporary directory for when data is not needed after downloading
  • save bool - Whether to save downloaded data. Defaults to False.
  • use_saved bool - Whether to use saved data. Defaults to False.
  • ignore ListTuple[str] - Don't generate retrievers for these downloaders
  • delete bool - Whether to delete saved_dir if save is True. Defaults to True.
  • **kwargs Any - Any other arguments to pass.

Returns:

None

get_retriever

@classmethod
def get_retriever(cls, name: Optional[str] = None) -> "Retrieve"

Get a generated retriever given a name. If name is not supplied, the default one will be returned.

Arguments:

  • name Optional[str] - Name of retriever. Defaults to None (get default).

Returns:

  • Retrieve - Retrieve object
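
The generate-then-get pattern used by generate_retrievers and get_retriever resembles a class-level registry keyed by name. The sketch below is a generic illustration, not the library's code: `RegistrySketch`, its methods and the use of "default" as the name of the default entry are all assumptions.

```python
class RegistrySketch:
    """Hypothetical stand-in for a class-level registry of named objects."""

    _items = {}
    _default_name = "default"  # assumption: name used when none is given

    @classmethod
    def generate(cls, sources, ignore=()):
        # Build one entry per source, skipping the names listed in ignore.
        cls._items = {
            name: f"retriever for {name}"
            for name in sources
            if name not in ignore
        }

    @classmethod
    def get(cls, name=None):
        # Fall back to the default entry when no name is supplied.
        return cls._items[name or cls._default_name]


RegistrySketch.generate(["default", "api", "skip"], ignore=("skip",))
print(RegistrySketch.get())
print(RegistrySketch.get("api"))
```

Asking for an ignored or unknown name raises a `KeyError` in this sketch.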