dratio.models.Dataset#

class dratio.models.Dataset(client, code: str, version: str | None = None)#

Representation of a dataset in the database. This class provides access to information about the dataset and its versions, and downloads the data as a pandas or geopandas dataframe.

Parameters:
  • code (str) – Unique identifier of the dataset in the database.

  • version (str | None) – Version of the dataset to be used. If None, the latest version is used.

  • client (Client) – Client object used to perform requests to the database.

  • **kwargs – Additional keyword arguments used to initialize the metadata information.

Examples

Retrieve a dataset from the dratio.io marketplace:

>>> from dratio import Client
>>> client = Client('YOUR_API_KEY')
>>> dataset = client.get('municipalities')
>>> dataset
Dataset('municipalities')

Access fields included in the metadata of the dataset:

>>> dataset.name
'Municipalities'
>>> dataset.description
'Municipalities of Spain according to the name under which they are registered ...'

Get a dictionary with all metadata:

>>> dataset.metadata
{'code': 'municipalities', 'name': 'Municipalities', 'description': ...}

Get the current version of the dataset:

>>> dataset.version
Version('municipalities-v1')

Download a dataset as a pandas dataframe:

>>> df = dataset.to_pandas()

Download as a geopandas dataframe (for geospatial datasets):

>>> gdf = dataset.to_geopandas()
__init__(client, code: str, version: str | None = None)#

Initializes the Dataset object

Methods

__init__(client, code[, version])

Initializes the Dataset object

add_feature(feature)

Adds a feature to the dataset.

delete()

Deletes the object from the database.

describe()

Returns a string representation of the object's metadata.

fetch([fail_not_found])

Updates the metadata dictionary of the object by performing an HTTP request to the server.

from_dict(metadata)

Updates the internal state of the object with the provided metadata.

keys()

Returns the keys of the metadata dictionary.

list_features([format])

Returns the features associated to the object.

list_files([filetype, format])

Returns a list of files associated to the version.

list_versions([format])

List available versions of the dataset

metadata_from_pandas(df, publisher[, ...])

Automatically generates the metadata of the dataset from a pandas dataframe.

save()

Saves the object's metadata to the database.

set_version(version)

to_geopandas([cross_strategy])

Downloads the dataset as a geopandas geodataframe.

to_pandas()

Downloads the dataset as a pandas dataframe.

upload_file(file[, filetype, update])

Upload a file to the dataset.

Attributes

categories

Returns the categories associated to the object.

columns

Return a list with all the columns of the dataset (List[str], read-only).

description

Returns the description of the object.

features

Dictionary with features indexed by column name (Dict[str, Feature], read-only).

granularity

Granularity of the dataset, i.e., the time between consecutive timestamps (str, read-only).

last_data

Last date of the dataset (str, read-only).

last_update

Last update of the dataset (str, read-only).

level

Level of the dataset (dict, read-only).

license

License of the dataset (str, read-only).

metadata

Retrieves the metadata associated with the object.

n_features

Number of features in the dataset (int, read-only).

n_time_slices

Number of time slices in the dataset (int, read-only).

n_values

Number of values in the dataset (int, read-only).

n_variables

Number of variables in the dataset (int, read-only).

name

Returns the name of the object.

next_update

Next scheduled update of the dataset (str, read-only).

publisher

Name of the publisher of the dataset (str, read-only).

scope

Scope of the dataset (dict, read-only).

start_data

Start date of the dataset (str, read-only).

timestamp_column

Name of the column used as timestamp (str, read-only).

update_frequency

Update frequency of the dataset (str, read-only).

version

Return the current version of the dataset (Version, read-only).
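
Many of the attributes above are typed as possibly None, so code that reports on a dataset has to skip missing values. A minimal sketch, assuming only the attribute names listed here: `summarize` is a hypothetical helper (not part of dratio), and the stand-in object carries made-up values so the snippet runs without an API key.

```python
from types import SimpleNamespace


def summarize(dataset):
    """Build a one-line summary from documented Dataset attributes."""
    parts = [dataset.name]
    # Several attributes are typed `... | None`, so missing ones are skipped.
    for label, attr in (
        ("features", "n_features"),
        ("variables", "n_variables"),
        ("time slices", "n_time_slices"),
    ):
        value = getattr(dataset, attr)
        if value is not None:
            parts.append(f"{value} {label}")
    return ", ".join(parts)


# Stand-in with made-up values; a real Dataset would come from client.get(...).
demo = SimpleNamespace(
    name="Municipalities", n_features=12, n_variables=None, n_time_slices=3
)
print(summarize(demo))  # Municipalities, 12 features, 3 time slices
```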

add_feature(feature: Feature) → None#

Adds a feature to the dataset.

Parameters:

feature (Feature) – Feature to add to the dataset.

Raises:

requests.exceptions.RequestException – If the request fails due to an HTTP or connection error.

property categories: List[Category]#

Returns the categories associated to the object.

property columns: List[str]#

Return a list with all the columns of the dataset (List[str], read-only).

delete() → None#

Deletes the object from the database.

Raises:

requests.exceptions.RequestException – If the request fails.

describe() → str#

Returns a string representation of the object’s metadata.

property description: str#

Returns the description of the object.

property features: List[Feature]#

Dictionary with features indexed by column name (Dict[str, Feature], read-only).

fetch(fail_not_found: bool = True) → DatabaseResource#

Updates the metadata dictionary of the object by performing an HTTP request to the server.

Parameters:

fail_not_found (bool, default True) – Whether to raise an exception if the object is not found in the database.

Returns:

self (DatabaseResource) – The object itself.

Notes

This method modifies the object’s internal state.

Raises:
from_dict(metadata)#

Updates the internal state of the object with the provided metadata.

Parameters:

metadata (dict) – Dictionary containing the metadata of the object.

Returns:

self – The object itself.

Return type:

DatabaseResource

Notes

This method modifies the object’s internal state.

property granularity: str | None#

Granularity of the dataset, i.e., the time between consecutive timestamps (str, read-only).

keys() → List[str]#

Returns the keys of the metadata dictionary.

property last_data: str | None#

Last date of the dataset (str, read-only).

property last_update: str | None#

Last update of the dataset (str, read-only).

property level: DataLevel | None#

Level of the dataset (dict, read-only).

property license: License | None#

License of the dataset (str, read-only).

list_features(format: Literal['pandas', 'json', 'api'] = 'pandas') → pd.DataFrame | List[Dict[str, Any]] | List[Feature]#

Returns the features associated to the object.

Parameters:

format (str, optional) – Format of the output. Either “pandas”, “json” or “api”. Defaults to “pandas”. If “pandas”, the output is a pandas DataFrame. If “json”, the output is a list of dictionaries. If “api”, the output is a list of Feature objects.

Returns:

List of features associated to the object.

Return type:

Union[“pd.DataFrame”, List[Dict[str, Any]], List[“Feature”]]

Examples

List all features available in the database:

>>> from dratio import Client
>>> client = Client("Your API key")
>>> client.list_features()

List all features associated to the publisher “ine” (National Institute of Statistics):

>>> publisher = client.get_publisher("ine")
>>> publisher.list_features()

List all features of a dataset (its columns):

>>> dataset = client.get_dataset("municipalities")
>>> dataset.list_features()

List all features available at census level:

>>> level = client.get("census", kind="data-level")
>>> level.list_features()
Raises:
  • ValueError – If the format is not “pandas”, “json” or “api”.

  • HTTPError – If the request to the API fails.

  • DratioException – If the response from the API is not valid (e.g., an invalid API key or insufficient permissions).

list_files(filetype: Literal['parquet', 'geoparquet'] | None = None, format: Literal['pandas', 'json', 'api'] = 'pandas') → pd.DataFrame | List[Dict[str, Any]] | List[File]#

Returns a list of files associated to the version.

Parameters:
  • filetype (Optional[Literal["parquet", "geoparquet"]]) – Type of file to filter. If None, all files are returned.

  • format (Literal["pandas", "json", "api"]) – Format of the returned list: a pandas DataFrame, a list of dictionaries, or a list of File objects. Defaults to "pandas".

Returns:

List of files associated to the version.

Return type:

Union[“pd.DataFrame”, List[Dict[str, Any]], List[“File”]]

list_versions(format: Literal['pandas', 'json', 'api'] = 'pandas') → pd.DataFrame | List[Dict[str, Any]] | List[Version]#

List available versions of the dataset

Returns:

List of versions.

Return type:

Union[“pd.DataFrame”, List[Dict[str, Any]], List[“Version”]]

Raises:

requests.exceptions.RequestException – If the request fails due to an HTTP or connection error.

property metadata: Dict[str, Any]#

Retrieves the metadata associated with the object.

Notes

The first time this property is accessed, a request is made to the server to fetch the metadata. Subsequent accesses return the previously loaded information. To update the metadata, create a new instance of the object.
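
The fetch-once behaviour described in this note can be illustrated with a toy class. This is only a sketch of the general lazy-loading pattern, not dratio's implementation: `LazyMetadata`, its loader callable and the `requests_made` counter are all hypothetical.

```python
class LazyMetadata:
    """Toy stand-in that loads metadata on first access and caches it."""

    def __init__(self, loader):
        self._loader = loader  # callable standing in for the HTTP request
        self._metadata = None
        self.requests_made = 0

    @property
    def metadata(self):
        # Only the first access triggers the (simulated) request.
        if self._metadata is None:
            self._metadata = self._loader()
            self.requests_made += 1
        return self._metadata


resource = LazyMetadata(lambda: {"code": "municipalities", "name": "Municipalities"})
first = resource.metadata
second = resource.metadata
print(resource.requests_made)  # 1
```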

metadata_from_pandas(df: pd.DataFrame | gpd.GeoDataFrame, publisher: str | Publisher, license: str | License | None = None, timestamp_column: str = 'timestamp') → Dataset#

Automatically generates the metadata of the dataset from a pandas dataframe. This method is intended for data providers who want to create a dataset from a dataframe and upload their data to dratio.io.

Parameters:
  • df (Union[pandas.DataFrame, geopandas.GeoDataFrame]) – Pandas dataframe with the data.

  • publisher (Union[str, Publisher]) – Publisher of the dataset.

  • license (Optional[Union[str, License]]) – License of the dataset.

  • timestamp_column (str) – Name of the column used as timestamp (if applicable).

Returns:

Dataset object with the metadata generated from the pandas dataframe.

Return type:

Dataset
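
A possible provider workflow chains this method with save() and upload_file(). The order below (generate metadata, save it, then upload the data) is an assumption rather than something this page states. `publish` is a hypothetical helper, the publisher code "my-org" is a placeholder, and the stub dataset only records which calls were made so the snippet runs offline.

```python
def publish(dataset, df, publisher, license=None):
    """Generate metadata from df, save it, then upload df (assumed order)."""
    dataset = dataset.metadata_from_pandas(df, publisher, license=license)
    dataset.save()
    return dataset.upload_file(df, filetype="parquet")


class _StubDataset:
    """Offline stand-in that mirrors the documented method signatures."""

    def __init__(self):
        self.calls = []

    def metadata_from_pandas(self, df, publisher, license=None,
                             timestamp_column="timestamp"):
        self.calls.append("metadata_from_pandas")
        return self

    def save(self):
        self.calls.append("save")
        return self

    def upload_file(self, file, filetype=None, update=False):
        self.calls.append("upload_file")
        return "file-handle"  # placeholder for the returned File object


stub = _StubDataset()
rows = [{"timestamp": "2024-01-01", "value": 1}]  # stands in for a DataFrame
result = publish(stub, rows, publisher="my-org")
print(stub.calls)
```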

property n_features: int | None#

Number of features in the dataset (int, read-only).

property n_time_slices: int | None#

Number of time slices in the dataset (int, read-only).

property n_values: int | None#

Number of values in the dataset (int, read-only).

property n_variables: int | None#

Number of variables in the dataset (int, read-only).

property name: str#

Returns the name of the object.

property next_update: str | None#

Next scheduled update of the dataset (str, read-only).

property publisher: Publisher | None#

Name of the publisher of the dataset (str, read-only).

save() → Dataset#

Saves the object’s metadata to the database.

Returns:

self – The object itself.

Return type:

DatabaseResource

Raises:

requests.exceptions.RequestException – If the request fails.

property scope: Scope | None#

Scope of the dataset (dict, read-only).

property start_data: str | None#

Start date of the dataset (str, read-only).

property timestamp_column: str | None#

Name of the column used as timestamp (str, read-only).

to_geopandas(cross_strategy: str = 'auto') → gpd.GeoDataFrame#

Downloads the dataset as a geopandas geodataframe.

Returns:

GeoDataFrame with the dataset.

Return type:

geopandas.GeoDataFrame

Notes

This method requires the geopandas library to be installed.

Raises:
  • ImportError – If the geopandas library is not installed. You can install it using pip install dratio[geo].

  • requests.exceptions.RequestException – If the request fails due to an HTTP or connection error.

to_pandas() → pd.DataFrame#

Downloads the dataset as a pandas dataframe.

Returns:

Dataframe with the dataset.

Return type:

pandas.DataFrame

Raises:

requests.exceptions.RequestException – If the request fails due to an HTTP or connection error.

property update_frequency: str | None#

Update frequency of the dataset (str, read-only).

upload_file(file: str | Path | pd.DataFrame | gpd.GeoDataFrame, filetype: Literal['parquet', 'geoparquet'] | None = None, update: bool = False) → File#

Upload a file to the dataset.

property version: Version#

Return the current version of the dataset (Version, read-only).