
Force schema syncs

Level 4

Hi all,

In my pipeline, shown below, I build the final (right-most) dataset every day with a forced recursive rebuild of its dependent datasets.

[Screenshot: pipeline flow]

I have a recurring issue where sometimes the data pulled by the Python jobs has a slightly different schema, and my pipeline breaks. I fix this by manually running the Python step in question; DSS then prompts me to update the schema based on the new data. Once I do this, I can re-run the scenario without error.

My question is, can I enable my forced recursive rebuild to also update the schema (if required), so that this does not cause my scenario to fail?

Thank you,
Ben

2 Replies
Dataiker

Hi,

When automating a flow, the assumption is that you keep control over your dataset schemas.

Note that DSS never automatically changes the schema of a dataset while running a job. Changing the schema of a dataset is a dangerous operation, which can lead to previous data becoming unreadable, especially for partitioned datasets.

You can find more information on this page: https://doc.dataiku.com/dss/latest/schemas/index.html

In your case, I would advise implementing schema control in your Python recipes, to prevent downstream schema changes.
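For example, a minimal sketch of what schema control inside a Python recipe could look like, using pandas. The column names and dtypes below are purely illustrative placeholders, not Ben's actual dataset; the idea is to project whatever the upstream pull returned onto a fixed, expected schema before writing the output:

```python
import pandas as pd

# Illustrative expected schema; replace with your real column names/dtypes.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "amount": "float64",
    "country": "object",
}

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Project the DataFrame onto the expected columns and cast dtypes,
    so the recipe always writes the same schema regardless of what the
    upstream pull returned."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        # Fail loudly rather than silently writing a changed schema.
        raise ValueError(f"Upstream data is missing columns: {sorted(missing)}")
    # Drop any unexpected extra columns, fix column order, and cast dtypes.
    return df[list(EXPECTED_SCHEMA)].astype(EXPECTED_SCHEMA)
```

Calling `enforce_schema(raw_df)` just before writing the recipe's output dataset means extra or reordered columns from the source no longer propagate downstream, and a genuinely missing column fails the job with an explicit error instead of a schema mismatch later in the flow.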

Cheers,

Alex

Level 4
Author

Thanks Alex, this makes sense, I'll look to enforce the schema in my code instead!

Thanks,
Ben