Allow datasets to automatically reload schema when jobs run.

Currently, if columns in a dataset source are added or removed, jobs and scenarios that read from that dataset will fail until you reload the schema from the table, even if nothing downstream depends on the changed columns.
We would like a setting that allows datasets to always reload their schema when first read or built. This would be much more manageable than adding a "Reload Schema" step to each and every scenario.
Comments
-
Turribeach
This feature already exists and it’s called schema propagation:
-
Selecting automatic schema propagation in the Schema Propagation tool does not prevent scenarios from failing when the schema of the underlying input dataset changes. The error is not caused by schema changes to the input or output datasets of downstream recipes, but by the table being read as the dataset input. Specifically, when
[dataset].get_dataframe()
is called in a Python recipe immediately after a dataset import from an external table, the scenario throws the following error even with automatic schema propagation selected:
Error in Python process: At line XX: <class 'Exception'>: Reading dataset failed: b'Invalid number of columns in query (YY, expected ZZ) [...]. Please check dataset schema'
The Schema Propagation tool's settings do not appear to have the same effect as manually opening the dataset and forcing a schema update via its settings page.
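For context, the failing pattern is just the standard input read at the top of a Python recipe; a minimal sketch (the dataset name is a placeholder):

import dataiku

# Standard input read in a Python recipe; "my_input" is an illustrative name
ds = dataiku.Dataset("my_input")
df = ds.get_dataframe()  # fails with "Invalid number of columns in query" if the source table gained or lost columns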
-
Turribeach
This can easily be done via the Dataiku API:
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("dataset_name")
# Reset the dataset schema
dataset.set_schema({})
# Autodetect the dataset schema and save it
dataset.autodetect_settings(infer_storage_types=True).save()
Add this to your Python recipe before the [dataset].get_dataframe() call and it will auto-update the schema of the input dataset. I would not recommend doing this, though: code that updates schemas autonomously, without human review, can have unintended consequences.
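Put together, a recipe using this workaround might look roughly like this (the dataset name is a placeholder):

import dataiku

# Refresh the input dataset's schema from the source table before reading
client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("my_input")  # placeholder dataset name
dataset.set_schema({})
dataset.autodetect_settings(infer_storage_types=True).save()

# The read now sees the refreshed schema
df = dataiku.Dataset("my_input").get_dataframe()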
I would also suggest posting on the regular forum or asking support before submitting a product enhancement, as you may be unaware of existing functionality like the above and assume the feature is missing from the product.
-
Turribeach
One way to add some protection from schema changes is to call dataset.get_schema(), store the result in a variable, and then compare it against dataset.autodetect_settings(infer_storage_types=True). That way you can detect changes and, for instance, send an email rather than just silently updating the input dataset.
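A rough sketch of that check (the exact layout of the autodetected settings payload is an assumption; adjust to your DSS version):

import dataiku

client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("dataset_name")

# Column names as currently saved in DSS
saved_cols = [c["name"] for c in dataset.get_schema()["columns"]]

# Column names as they would be re-detected from the source table
detected = dataset.autodetect_settings(infer_storage_types=True)
detected_cols = [c["name"] for c in detected.get_raw()["schema"]["columns"]]  # assumed payload layout

if saved_cols != detected_cols:
    # Notify a human (e.g. via a scenario reporter or your own mail code) instead of silently saving
    raise Exception("Input dataset schema changed: %s -> %s" % (saved_cols, detected_cols))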
-
In a scenario, add a Reload schema step with the dataset you want to reload, or click the dataset and build it UPSTREAM or DOWNSTREAM so the schema propagates.
You can reload a changed schema from inside the dataset's settings, but you can also right-click the dataset and reload it directly.