Force schema syncs
Hi all,
In my pipeline, shown below, I build the final (right-most) dataset every day with a forced recursive rebuild of dependent datasets.
I have a recurring issue where the data pulled by the Python jobs sometimes has a slightly different schema, and my pipeline breaks. I fix this by manually running the Python step in question; DSS then prompts me to update the schema based on the new data. Once I do this, I can re-run the scenario without error.
My question is: can my forced recursive rebuild also update the schema (if required), so that this does not cause my scenario to fail?
Thank you,
Ben
Best Answer
Hi,
When automating a flow, the assumption is that you keep control over your dataset schemas yourself.
Note that DSS never automatically changes the schema of a dataset while running a job. Changing the schema of a dataset is a dangerous operation, which can lead to previous data becoming unreadable, especially for partitioned datasets.
You can find more information on this page: https://doc.dataiku.com/dss/latest/schemas/index.html
In your case, I would advise implementing schema control in your Python recipes, to prevent downstream schema changes.
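One way to sketch this (a minimal example using pandas; the column list and dataset name below are placeholders, not from your flow) is to conform each DataFrame to a fixed, expected column list before writing it out, so the output schema never drifts even when the pulled data changes:

```python
import pandas as pd

# Expected output schema -- placeholder column names, adjust to your flow
EXPECTED_COLUMNS = ["id", "name", "amount"]

def conform_schema(df: pd.DataFrame, expected: list) -> pd.DataFrame:
    """Force df to match the expected column list:
    - columns missing from df are added and filled with NA
    - unexpected extra columns are dropped
    - column order is normalized
    """
    extra = set(df.columns) - set(expected)
    if extra:
        # Log rather than fail, so the scenario keeps running
        print(f"Dropping unexpected columns: {sorted(extra)}")
    return df.reindex(columns=expected)

# In a DSS Python recipe you would then write the conformed frame, e.g.:
#   output = dataiku.Dataset("my_output")  # hypothetical dataset name
#   output.write_with_schema(conform_schema(df, EXPECTED_COLUMNS))
```

With this in place, a new column appearing upstream is logged and dropped instead of breaking the downstream schema; if you'd rather be alerted, you could raise an exception on unexpected columns instead of printing.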
Cheers,
Alex
Answers
ben_p
Thanks Alex, this makes sense, I'll look to enforce the schema in my code instead!
Thanks,
Ben