hdx.utilities.retriever
Retrieve Objects
class Retrieve(BaseDownload)
Retrieve class that takes a Download object and can download, download and save, or use previously downloaded and saved data. It can also use a static fallback when downloading fails.
Arguments:
downloader
Download - Download object
fallback_dir
str - Directory containing static fallback data
saved_dir
str - Directory to save or load downloaded data
temp_dir
str - Temporary directory for when data is not needed after downloading
save
bool - Whether to save downloaded data. Defaults to False.
use_saved
bool - Whether to use saved data. Defaults to False.
prefix
str - Prefix to add to filenames. Defaults to "".
delete
bool - Whether to delete saved_dir if save is True. Defaults to True.
log_level
int - Level at which to log messages. Defaults to logging.INFO.
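The modes described for this class (download only, download and save, use saved data) can be sketched roughly as follows. This is a standard-library illustration, not the class's actual implementation: `retrieve_text` and the `fetch` callable are hypothetical stand-ins for the retriever and its Download object.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable, List


def retrieve_text(
    fetch: Callable[[str], str],  # hypothetical stand-in for a downloader
    url: str,
    filename: str,
    saved_dir: str,
    save: bool = False,
    use_saved: bool = False,
) -> str:
    """Return text for url, honouring the save and use_saved flags."""
    saved_path = Path(saved_dir) / filename
    if use_saved:
        # Use previously downloaded and saved data; nothing is fetched.
        return saved_path.read_text(encoding="utf-8")
    text = fetch(url)
    if save:
        # Download and save so that a later run can set use_saved=True.
        saved_path.parent.mkdir(parents=True, exist_ok=True)
        saved_path.write_text(text, encoding="utf-8")
    return text


with TemporaryDirectory() as saved_dir:
    calls: List[str] = []

    def fake_fetch(url: str) -> str:
        # Records each call so we can see when a real download would happen.
        calls.append(url)
        return "hello"

    url = "https://example.com/a.txt"
    first = retrieve_text(fake_fetch, url, "a.txt", saved_dir, save=True)
    second = retrieve_text(fake_fetch, url, "a.txt", saved_dir, use_saved=True)
```

The second call reads the saved copy, so the stand-in downloader is invoked only once.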
check_flags
@staticmethod
def check_flags(saved_dir: str, save: bool, use_saved: bool,
                delete: bool) -> None
Check flags. Also delete saved_dir if save and delete are True.
Arguments:
saved_dir
str - Directory to save or load downloaded data
save
bool - Whether to save downloaded data
use_saved
bool - Whether to use saved data
delete
bool - Whether to delete saved_dir if save is True
Returns:
None
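A minimal sketch of what such a flag check might look like. It rests on two assumptions that the text above does not state: that setting both save and use_saved is rejected as contradictory, and that saved_dir is recreated empty after being deleted. `check_flags_sketch` is a hypothetical name, not the library's method.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory


def check_flags_sketch(saved_dir: str, save: bool, use_saved: bool, delete: bool) -> None:
    # Assumption: saving and reading saved data in the same run is rejected.
    if save and use_saved:
        raise ValueError("Set either save or use_saved, not both")
    if save and delete:
        # Assumption: start from an empty saved_dir so stale files are not mixed in.
        shutil.rmtree(saved_dir, ignore_errors=True)
        Path(saved_dir).mkdir(parents=True)


with TemporaryDirectory() as tmp:
    saved = Path(tmp) / "saved"
    saved.mkdir()
    (saved / "stale.json").write_text("{}")
    check_flags_sketch(str(saved), save=True, use_saved=False, delete=True)
    remaining = list(saved.iterdir())

try:
    check_flags_sketch("unused", save=True, use_saved=True, delete=False)
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```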
get_url_logstr
@staticmethod
def get_url_logstr(url: str) -> str
URL string that will be logged. It is limited to 100 characters if necessary.
Arguments:
url
str - URL to download
Returns:
str
- Url string to use in logs
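The length limit can be illustrated with a short sketch. The trailing "..." marker and the name `url_logstr_sketch` are assumptions made for illustration; the text above only states that the string is limited to 100 characters.

```python
def url_logstr_sketch(url: str, limit: int = 100) -> str:
    # Assumption: over-long URLs are cut at `limit` and marked with "...".
    if len(url) > limit:
        return f"{url[:limit]}..."
    return url


short = url_logstr_sketch("https://example.com/data.csv")
long = url_logstr_sketch("https://example.com/?q=" + "x" * 200)
```

Short URLs pass through unchanged, so log lines stay readable without losing information in the common case.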
clone
def clone(downloader: Download) -> "Retrieve"
Clone a given retriever but use the given downloader.
Arguments:
downloader
Download - Downloader to use
Returns:
Retrieve
- Cloned retriever
get_filename
def get_filename(url: str,
                 filename: Optional[str] = None,
                 possible_extensions: Tuple[str, ...] = tuple(),
                 **kwargs: Any) -> Tuple[str, Any]
Get filename from url and given parameters.
Arguments:
url
str - Url from which to get filename
filename
Optional[str] - Filename to use. Defaults to None (infer from url).
possible_extensions
Tuple[str, ...] - Possible extensions to look for in url
**kwargs
- See below
format
str - Given extension to look for in url
file_type
str - Given extension to look for in url
Returns:
Tuple[str, Any]
- Tuple of (filename, kwargs)
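One way to infer a filename from a URL, shown as a standard-library sketch. The rule used here (take the last path segment, and if it has no extension, append the first possible extension that appears anywhere in the URL) is an assumption for illustration and may differ from what get_filename actually does; `filename_from_url` is a hypothetical helper that ignores the format and file_type keyword arguments.

```python
from os.path import splitext
from typing import Optional, Tuple
from urllib.parse import urlsplit


def filename_from_url(
    url: str,
    filename: Optional[str] = None,
    possible_extensions: Tuple[str, ...] = tuple(),
) -> str:
    if filename:
        # An explicit filename always wins over inference.
        return filename
    # Take the last path segment, ignoring query string and fragment.
    last_segment = urlsplit(url).path.rstrip("/").split("/")[-1]
    stem, extension = splitext(last_segment)
    if not extension:
        # Assumption: use the first candidate extension found in the URL.
        for candidate in possible_extensions:
            if candidate in url:
                extension = f".{candidate}"
                break
    return f"{stem}{extension}"


from_path = filename_from_url("https://example.com/data/report.csv?download=1")
from_query = filename_from_url(
    "https://example.com/export?format=xlsx", possible_extensions=("csv", "xlsx")
)
explicit = filename_from_url("https://example.com/export", filename="mine.json")
```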
set_bearer_token
def set_bearer_token(bearer_token: str) -> None
Set bearer token in downloader.
Arguments:
bearer_token
str - Bearer token
Returns:
None
download_file
def download_file(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> str
Retrieve file.
Arguments:
url
str - URL to download
filename
Optional[str] - Filename of saved file. Defaults to getting from url.
logstr
Optional[str] - Text to use in log string to describe download. Defaults to filename.
fallback
bool - Whether to use static fallback if download fails. Defaults to False.
log_level
int - Level at which to log messages. Overrides level from constructor.
**kwargs
- Parameters to pass to download_file call
Returns:
str
- Path to downloaded file
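The fallback flag shared by the download methods can be pictured with the sketch below. It is an illustration built on the standard library only: `download_with_fallback`, the `fetch` callable and the choice of OSError as the failure signal are all assumptions, not the library's behaviour.

```python
import logging
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable

logger = logging.getLogger(__name__)


def download_with_fallback(
    fetch: Callable[[str], str],  # hypothetical: downloads url, returns a path
    url: str,
    filename: str,
    fallback_dir: str,
    fallback: bool = False,
    log_level: int = logging.INFO,
) -> str:
    try:
        path = fetch(url)
        logger.log(log_level, "Downloaded %s", filename)
        return path
    except OSError:
        if not fallback:
            # Without a fallback the failure is passed on to the caller.
            raise
        fallback_path = Path(fallback_dir) / filename
        logger.log(log_level, "Download failed, using fallback %s", fallback_path)
        return str(fallback_path)


def failing_fetch(url: str) -> str:
    # Simulates a download that always fails.
    raise OSError("network unreachable")


with TemporaryDirectory() as fallback_dir:
    (Path(fallback_dir) / "data.csv").write_text("a,b\n1,2\n")
    url = "https://example.com/data.csv"
    path = download_with_fallback(failing_fetch, url, "data.csv", fallback_dir, fallback=True)
    used_fallback = Path(path).parent == Path(fallback_dir)
    try:
        download_with_fallback(failing_fetch, url, "data.csv", fallback_dir)
        reraised = False
    except OSError:
        reraised = True
```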
download_text
def download_text(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> str
Download text.
Arguments:
url
str - URL to download
filename
Optional[str] - Filename of saved file. Defaults to getting from url.
logstr
Optional[str] - Text to use in log string to describe download. Defaults to filename.
fallback
bool - Whether to use static fallback if download fails. Defaults to False.
log_level
int - Level at which to log messages. Overrides level from constructor.
**kwargs
- Parameters to pass to download_text call
Returns:
str
- The text from the file
download_yaml
def download_yaml(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> Any
Retrieve YAML.
Arguments:
url
str - URL to download
filename
Optional[str] - Filename of saved file. Defaults to getting from url.
logstr
Optional[str] - Text to use in log string to describe download. Defaults to filename.
fallback
bool - Whether to use static fallback if download fails. Defaults to False.
log_level
int - Level at which to log messages. Overrides level from constructor.
**kwargs
- Parameters to pass to download_yaml call
Returns:
Any
- The data from the YAML file
download_json
def download_json(url: str,
                  filename: Optional[str] = None,
                  logstr: Optional[str] = None,
                  fallback: bool = False,
                  log_level: int = None,
                  **kwargs: Any) -> Any
Retrieve JSON.
Arguments:
url
str - URL to download
filename
Optional[str] - Filename of saved file. Defaults to getting from url.
logstr
Optional[str] - Text to use in log string to describe download. Defaults to filename.
fallback
bool - Whether to use static fallback if download fails. Defaults to False.
log_level
int - Level at which to log messages. Overrides level from constructor.
**kwargs
- Parameters to pass to download_json call
Returns:
Any
- The data from the JSON file
get_tabular_rows
def get_tabular_rows(url: Union[str, ListTuple[str]],
                     has_hxl: bool = False,
                     headers: Union[int, ListTuple[int], ListTuple[str]] = 1,
                     dict_form: bool = False,
                     filename: Optional[str] = None,
                     logstr: Optional[str] = None,
                     fallback: bool = False,
                     **kwargs: Any) -> Tuple[List[str], Iterator[ListDict]]
Returns the header of the tabular file(s) pointed to by url and an iterator where each row is returned as a list or dictionary depending on the dict_form argument.
When a list of urls is supplied (in url), the has_hxl flag indicates whether the files are HXLated so that the HXL row is only included from the first file. The headers argument is either a row number or a list of row numbers (in case of multi-line headers) to be considered as headers (rows start counting at 1), or the actual headers defined as a list of strings. It defaults to 1. The dict_form argument specifies whether each row should be returned as a dictionary or a list, defaulting to a list.
Arguments:
url
Union[str, ListTuple[str]] - A single or list of URLs or paths to read from
has_hxl
bool - Whether files have HXL hashtags. Defaults to False.
headers
Union[int, ListTuple[int], ListTuple[str]] - Number of row(s) containing headers or list of headers. Defaults to 1.
dict_form
bool - Return dict or list for each row. Defaults to False (list)
filename
Optional[str] - Filename of saved file. Defaults to getting from url.
logstr
Optional[str] - Text to use in log string to describe download. Defaults to filename.
fallback
bool - Whether to use static fallback if download fails. Defaults to False.
**kwargs
- Parameters to pass to download_file call
Returns:
Tuple[List[str], Iterator[ListDict]]
- Tuple (headers, iterator where each row is a list or dictionary)
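The relationship between the headers row number and the list or dictionary row form can be sketched with the standard csv module. This covers only the simplest case (a single file and a single integer header row); `tabular_rows_sketch` is a hypothetical helper and the sample data is made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, List, Tuple, Union

Row = Union[List[str], Dict[str, str]]


def tabular_rows_sketch(
    text: str, headers: int = 1, dict_form: bool = False
) -> Tuple[List[str], Iterator[Row]]:
    reader = csv.reader(io.StringIO(text))
    # Rows are counted from 1; skip anything above the header row.
    for _ in range(headers - 1):
        next(reader)
    header_row = next(reader)

    def iterate() -> Iterator[Row]:
        for row in reader:
            # Pair each value with its header when a dictionary is requested.
            yield dict(zip(header_row, row)) if dict_form else row

    return header_row, iterate()


data = "Exported table\nname,value\na,1\nb,2\n"
headers, rows = tabular_rows_sketch(data, headers=2)
as_lists = list(rows)
_, dict_rows = tabular_rows_sketch(data, headers=2, dict_form=True)
as_dicts = list(dict_rows)
```

Returning an iterator rather than a list means rows are only read as they are consumed, which matters for large files.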
generate_retrievers
@classmethod
def generate_retrievers(cls,
                        fallback_dir: str,
                        saved_dir: str,
                        temp_dir: str,
                        save: bool = False,
                        use_saved: bool = False,
                        ignore: ListTuple[str] = tuple(),
                        delete: bool = True,
                        **kwargs: Any) -> None
Generate retrievers. Retrievers are generated from downloaders so Download.generate_downloaders() needs to have been called first. Each retriever can either download, download and save or use previously downloaded and saved data. It also allows the use of a static fallback when downloading fails.
Arguments:
fallback_dir
str - Directory containing static fallback data
saved_dir
str - Directory to save or load downloaded data
temp_dir
str - Temporary directory for when data is not needed after downloading
save
bool - Whether to save downloaded data. Defaults to False.
use_saved
bool - Whether to use saved data. Defaults to False.
ignore
ListTuple[str] - Don't generate retrievers for these downloaders
delete
bool - Whether to delete saved_dir if save is True. Defaults to True.
**kwargs
Any - Any other arguments to pass.
Returns:
None
get_retriever
@classmethod
def get_retriever(cls, name: Optional[str] = None) -> "Retrieve"
Get a generated retriever given a name. If name is not supplied, the default one will be returned.
Arguments:
name
Optional[str] - Name of retriever. Defaults to None (get default).
Returns:
Retrieve
- Retriever object
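The generate-then-get pattern used by generate_retrievers and get_retriever resembles a class-level registry keyed by name. The sketch below shows that general pattern only. `RegistrySketch` is hypothetical, and the rule that the first generated entry becomes the default is an assumption; the text above does not say how the default retriever is chosen.

```python
from typing import Dict, Iterable, Optional, Tuple


class RegistrySketch:
    _instances: Dict[str, "RegistrySketch"] = {}
    _default_name: Optional[str] = None

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def generate(cls, names: Iterable[str], ignore: Tuple[str, ...] = tuple()) -> None:
        # Rebuild the registry from scratch on every call.
        cls._instances = {}
        cls._default_name = None
        for name in names:
            if name in ignore:
                continue
            cls._instances[name] = cls(name)
            if cls._default_name is None:
                # Assumption: the first generated instance is the default.
                cls._default_name = name

    @classmethod
    def get(cls, name: Optional[str] = None) -> "RegistrySketch":
        # With no name, return the default instance.
        return cls._instances[name or cls._default_name]


RegistrySketch.generate(["main", "api", "scraper"], ignore=("scraper",))
default = RegistrySketch.get()
api = RegistrySketch.get("api")
```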