Refresh/read schema on dataset via API

tomas Registered, Neuron 2022 Posts: 120 ✭✭✭✭✭
edited July 16 in Using Dataiku

Hi,

I just created a brand-new dataset using the Python API, pointing to an S3 location. Is there a way to automatically reload the schema from the data? The schema is empty, and I would like to trigger the "check data" process to reload it.

I tried to check on the methods of dataset object, but found nothing there:

'__weakref__',
 'clear',
 'client',
 'compute_metrics',
 'create_statistics_worksheet',
 'dataset_name',
 'delete',
 'get_definition',
 'get_last_metric_values',
 'get_metadata',
 'get_metric_history',
 'get_object_discussions',
 'get_schema',
 'get_statistics_worksheet',
 'get_usages',
 'iter_rows',
 'list_partitions',
 'list_statistics_worksheets',
 'project_key',
 'run_checks',
 'set_definition',
 'set_metadata',
 'set_schema',
 'synchronize_hive_metastore',
 'update_from_hive'

except update_from_hive, but that's not what I want.

Thanks

Best Answer

Answers

  • dimitri Dataiker, Product Ideas Manager Posts: 33 Dataiker

    Hi @tomas,

    I'm not sure I fully understand what you're trying to achieve. Where does the data from which you want to read the schema come from? How is it stored?

    Since a Python recipe can alter the data in unpredictable ways, the output schema cannot be inferred automatically from the input dataset(s). The schema of the output dataset must therefore be declared either before running the script or in the script itself.

    To specify it in the script, if you use DataFrames, you can use the write_with_schema() or write_schema_from_dataframe() methods, which set the dataset schema based on the DataFrame column types. Otherwise, you can specify the dataset schema manually using the write_schema() method, as described here.

    I hope it helps!
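For the recipe case described in this answer, declaring the schema manually could be sketched roughly as below. This is only an illustration: `declare_string_schema` is a hypothetical helper, and it assumes that write_schema() accepts a list of `{"name", "type"}` dicts. The dataset is passed in as an argument so the helper can be tried outside DSS with a stand-in object.

```python
def declare_string_schema(dataset, column_names):
    """Declare every column as a string on a dataiku.Dataset-like object.

    Hypothetical helper: assumes `dataset.write_schema` takes a list of
    {"name": ..., "type": ...} dicts.
    """
    columns = [{"name": name, "type": "string"} for name in column_names]
    dataset.write_schema(columns)
    return columns


class FakeDataset:
    """Stand-in for dataiku.Dataset so the sketch runs outside DSS."""

    def __init__(self):
        self.columns = None

    def write_schema(self, columns):
        # Record what would have been written to the dataset definition.
        self.columns = columns


fake = FakeDataset()
declare_string_schema(fake, ["id", "name"])
print(fake.columns)

# Inside a DSS Python recipe (not run here; "my_output" is a placeholder):
#     import dataiku
#     declare_string_schema(dataiku.Dataset("my_output"), ["id", "name"])
```

When the output is a DataFrame, write_with_schema() is usually the shorter route, since it derives the column types for you.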

  • tomas Registered, Neuron 2022 Posts: 120 ✭✭✭✭✭

    Hi @dimitri,
    no, you did not understand what I am trying to achieve, or I was not clear enough.

    The data is stored in S3, as CSV (let's say, but it could be Parquet as well).

    The project does not contain anything, no datasets, no recipes.

    Now I am using the Dataiku Python API client to create a dataset. During creation I pass the correct connection, bucket, and path to the CSV. This gives me one dataset in my project.

    Now I need to "reload" the schema programmatically, because the dataset has no columns when I open it in the UI. Dataiku DSS provides a button for this (check schema, reload, or something like that). After I click it manually, DSS reads the column names and stores them in the dataset's schema.

    So the question is: is this functionality available via the public API? Or do I need to open the CSV myself (maybe using boto, dealing with access/secret keys), read the CSV header, and use the DSS public API to set the schema of the dataset?
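The manual fallback mentioned in this post (read the CSV header, then set the schema through the public API) might look roughly like the sketch below. It is only an illustration under a few assumptions: `schema_from_csv_header` is a hypothetical helper, every column is typed as `string` because a header alone carries no type information, and the `{"columns": [...]}` shape is assumed to match what get_schema returns. Fetching the first bytes of the file from S3 is left out.

```python
import csv
import io


def schema_from_csv_header(csv_text, delimiter=","):
    """Build a schema dict from the first row of CSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, None)
    if not header:
        raise ValueError("CSV text has no header row")
    # A header carries names only, so every column is declared as a string.
    columns = [{"name": name.strip(), "type": "string"} for name in header]
    return {"columns": columns}


sample = "id,name,created_at\n1,alice,2022-01-01\n"
schema = schema_from_csv_header(sample)
print([column["name"] for column in schema["columns"]])

# Against a real instance (assumed, not run here), with `dataset` being the
# object whose methods are listed in the question:
#     dataset.set_schema(schema)
```

Only the header row is needed, so reading the first few kilobytes of the object would be enough; the whole file does not have to be downloaded.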

  • tomas Registered, Neuron 2022 Posts: 120 ✭✭✭✭✭

    Great! I was waiting for this feature, thanks!
