
Force schema syncs

Neuron

Hi all,

In my pipeline, shown below, I build the final (right-most) dataset every day with a forced recursive rebuild of dependent datasets.

[Screenshot: pipeline flow (ben_p_0-1587483928995.png)]

I have a recurring issue where the data pulled by the Python jobs sometimes has a slightly different schema, and my pipeline breaks. I fix this by manually running the Python step in question, after which DSS prompts me to update the schema based on the new data. Once I do this, I can re-run the scenario without error.

My question is, can I enable my forced recursive rebuild to also update the schema (if required), so that this does not cause my scenario to fail?

Thank you,
Ben

2 Replies
Dataiker

Hi,

When automating a flow, the assumption is that you keep dataset schemas under your control.

Note that DSS never automatically changes the schema of a dataset while running a job. Changing the schema of a dataset is a dangerous operation, which can lead to previous data becoming unreadable, especially for partitioned datasets.

You can find more information on this page: https://doc.dataiku.com/dss/latest/schemas/index.html

In your case, I would advise implementing schema control in your Python recipes, to prevent downstream schema changes.
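One way to sketch that schema control, assuming a pandas-based Python recipe (the `EXPECTED_SCHEMA` mapping and column names below are hypothetical, not from the original flow): coerce the DataFrame to a fixed column list and fixed dtypes just before writing the output, so variations in the pulled data cannot change the output dataset's schema.

```python
import pandas as pd

# Hypothetical fixed schema for the recipe's output dataset:
# column name -> pandas dtype. Define this once to match the
# schema the downstream flow expects.
EXPECTED_SCHEMA = {"user_id": "int64", "event": "object", "score": "float64"}

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce a DataFrame to the expected schema:
    drop unexpected columns, add missing ones as nulls,
    and cast every column to its declared dtype."""
    out = df.reindex(columns=list(EXPECTED_SCHEMA))
    for col, dtype in EXPECTED_SCHEMA.items():
        out[col] = out[col].astype(dtype)
    return out

# Example: the pulled data gained an extra column and lost one.
raw = pd.DataFrame({"user_id": [1, 2], "event": ["a", "b"], "debug": [0, 0]})
clean = enforce_schema(raw)
```

In a DSS recipe you would then write `clean` to the output dataset as usual; because the columns and dtypes are pinned in code, the rebuild never needs a schema update.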

Cheers,

Alex

Neuron
Author

Thanks Alex, this makes sense, I'll look to enforce the schema in my code instead!

Thanks,
Ben
