Allow datasets to automatically reload schema when jobs run.

Currently, if columns in a dataset source are added or removed, jobs and scenarios that read from that dataset will fail until you reload the schema from the table, even if nothing downstream depends on the changed columns.
We would like a setting that allows datasets to always reload their schema when first read or built. This would be much more manageable than adding a "Reload Schema" step to each and every scenario.
Comments
-
Turribeach
This feature already exists and it’s called schema propagation:
-
Selecting automatic schema propagation in the Schema Propagation tool does not prevent scenarios from failing when the schema of the underlying input dataset changes. The error is not caused by schema changes to the input or output datasets of downstream recipes, but by the table being read as the dataset input. Specifically, when
[dataset].get_dataframe()
is called in a Python recipe immediately after a dataset import from an external table, the scenario throws the following error even with automatic schema propagation selected:
Error in Python process: At line XX: <class 'Exception'>: Reading dataset failed: b'Invalid number of columns in query (YY, expected ZZ) [...]. Please check dataset schema'
The Schema Propagation tool's settings do not appear to have the same effect as manually opening the dataset and forcing a schema update via its settings page.
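For context, the failing pattern is just the standard input read at the top of a Python recipe; a minimal sketch (the dataset name is a placeholder):

import dataiku

# Standard input read in a Python recipe; "my_input" is an illustrative name
ds = dataiku.Dataset("my_input")
df = ds.get_dataframe()  # fails with "Invalid number of columns in query" if the source table gained or lost columns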
-
Turribeach
This can easily be done via the Dataiku API:
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("dataset_name")
# Reset the dataset schema
dataset.set_schema({})
# Autodetect the dataset schema and save it
dataset.autodetect_settings(infer_storage_types=True).save()
Add this to your Python recipe before the [dataset].get_dataframe() call and it will auto-update the schema of the input dataset. I would not recommend doing this, though: code that updates schemas autonomously, without human review, can have unintended consequences.
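Put together, a recipe using this workaround might look roughly like this (the dataset name is a placeholder):

import dataiku

# Refresh the input dataset's schema from the source table before reading
client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("my_input")  # placeholder dataset name
dataset.set_schema({})
dataset.autodetect_settings(infer_storage_types=True).save()

# The read now sees the refreshed schema
df = dataiku.Dataset("my_input").get_dataframe()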
I would also suggest posting on the regular forum or asking support before submitting a product enhancement, as you may be unaware of existing functionality like the above and assume the feature is missing from the product.
-
Turribeach
One way to add some protection from schema changes is to call dataset.get_schema(), store the result in a variable, and then compare it against dataset.autodetect_settings(infer_storage_types=True). That way you can detect changes and, for instance, send an email rather than just silently updating the input dataset.
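A rough sketch of that check (the exact layout of the autodetected settings payload is an assumption; adjust to your DSS version):

import dataiku

client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("dataset_name")

# Column names as currently saved in DSS
saved_cols = [c["name"] for c in dataset.get_schema()["columns"]]

# Column names as they would be re-detected from the source table
detected = dataset.autodetect_settings(infer_storage_types=True)
detected_cols = [c["name"] for c in detected.get_raw()["schema"]["columns"]]  # assumed payload layout

if saved_cols != detected_cols:
    # Notify a human (e.g. via a scenario reporter or your own mail code) instead of silently saving
    raise Exception("Input dataset schema changed: %s -> %s" % (saved_cols, detected_cols))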
-
In a scenario, add a Reload schema step with the dataset you want to reload, or click the dataset and build it UPSTREAM or DOWNSTREAM so the schema propagates.
You can reload a changed schema from inside the dataset's settings, but you can also right-click the dataset and reload it directly.