Refresh/read schema on dataset via API

tomas · ‎07-10-2020

Hi,

I just created a brand-new dataset using the Python API, pointing to an S3 location. Is there any way to automatically reload the schema from the data? The schema is empty, and I would like to trigger the check-data process to reload it.

I checked the methods of the dataset object, but found nothing suitable:

'__weakref__',
 'clear',
 'client',
 'compute_metrics',
 'create_statistics_worksheet',
 'dataset_name',
 'delete',
 'get_definition',
 'get_last_metric_values',
 'get_metadata',
 'get_metric_history',
 'get_object_discussions',
 'get_schema',
 'get_statistics_worksheet',
 'get_usages',
 'iter_rows',
 'list_partitions',
 'list_statistics_worksheets',
 'project_key',
 'run_checks',
 'set_definition',
 'set_metadata',
 'set_schema',
 'synchronize_hive_metastore',
 'update_from_hive'

The only candidate is update_from_hive, but that's not what I want.

Thanks

dimitri · ‎07-10-2020

Hi @tomas ,

I'm not sure I fully understand what you're trying to achieve. Where does the data from which you want to read the schema come from? How is it stored?

Since a Python recipe can alter the data in unpredictable ways, the output schema cannot be inferred automatically from the input dataset(s). The schema of the output dataset must therefore either be declared before running the script or be declared in the script itself.

To specify it in the script, if you use DataFrames, you can use the write_with_schema() or write_schema_from_dataframe() methods, which set the dataset schema based on the DataFrame column types. Otherwise, you can specify the dataset schema manually using the write_schema() method, as described in the documentation.

I hope it helps!

tomas · ‎07-28-2020

Hi @dimitri, no, you did not understand what I am trying to achieve, or I was not clear enough.

The data is stored in S3, as CSV (let's say, but it could be Parquet as well).

The project does not contain anything, no datasets, no recipes.

Now I am using the Python API (the dataiku client) to create a dataset. During creation I pass the correct connection, bucket, and path to the CSV. This creates one dataset in my project.

Now I need to "reload" the schema programmatically, because the dataset does not have any columns when I open it in the UI. Dataiku DSS provides a button for this (check schema, reload, or something like that). After I click it manually, DSS reads the column names correctly and stores them in the schema of the dataset.

So the question is: is this functionality available via the public API? Or do I need to open the CSV myself (maybe using boto, dealing with access/secret keys), read the CSV header, and use the DSS public API to set the schema of the dataset?
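
For reference, the manual fallback described above might look roughly like the sketch below. It only covers turning a CSV header into a schema; fetching the file from S3 is left out. The helper name schema_from_csv_header is made up for illustration, and the {"columns": [{"name": ..., "type": ...}]} layout and the "string" type are my assumption about what set_schema expects.

```python
import csv
import io


def schema_from_csv_header(csv_text, column_type="string"):
    """Build a schema dict from the first row of some CSV text.

    Hypothetical helper: every column gets the same type, since a
    header row alone carries no type information.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields no columns
    return {
        "columns": [
            {"name": name.strip(), "type": column_type}
            for name in header
        ]
    }


# Hypothetical usage: csv_text would first have to be downloaded from S3
# (for example with boto), and `dataset` is the dataset object from the
# Python API client.
# dataset.set_schema(schema_from_csv_header(csv_text))
```

This would work for CSV only, and it duplicates what the button in the UI already does, which is why a built-in API call would be preferable.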

dimitri · ‎07-29-2020

Hi @tomas ,

Thanks for these additional details!

With Dataiku 8.0, you can now perform the schema autodetection programmatically using the autodetect_settings() method of the dataset object. It returns the detected settings, which you can then persist using DSSDatasetSettings.save().

A full code sample is available in the new Files-based dataset: Programmatic creation section of our documentation.

For your information, we've also added a helper for creating S3 datasets more conveniently using DSSProject.create_s3_dataset().
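
As a minimal sketch of how these pieces could fit together (not the official documentation sample), the function below creates the S3 dataset and then saves the autodetected settings. The function name and the placeholder values in the usage comment are made up for illustration, and the argument order of create_s3_dataset() is assumed to be dataset name, connection, path in connection, and optional bucket.

```python
def create_s3_dataset_with_schema(project, dataset_name, connection,
                                  path_in_connection, bucket=None):
    """Create an S3 dataset, then autodetect and persist its settings.

    `project` is expected to be a DSSProject handle from dataikuapi.
    """
    dataset = project.create_s3_dataset(
        dataset_name, connection, path_in_connection, bucket=bucket
    )
    # Detect format and schema from the files behind the dataset.
    settings = dataset.autodetect_settings()
    # Nothing is stored on the dataset until the settings are saved.
    settings.save()
    return dataset


# Hypothetical usage (host, API key, and names are placeholders):
# import dataikuapi
# client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
# project = client.get_project("MYPROJECT")
# create_s3_dataset_with_schema(project, "my_dataset", "my-s3-connection",
#                               "/path/to/data.csv", bucket="my-bucket")
```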

Have a great day!

tomas · ‎07-29-2020

Great! I was waiting for this feature, thanks!
