
Force schema syncs

Level 4

Hi all,

In my pipeline, shown below, I build the final (right-most) dataset every day with a forced recursive rebuild of its dependent datasets.

[Screenshot: pipeline flow]

I have a recurring issue where sometimes the data pulled by the Python jobs has a slightly different schema, and my pipeline breaks. I fix this by manually running the Python step in question; DSS then prompts me to update the schema based on the new data. Once I do this, I can re-run the scenario without error.

My question is, can I enable my forced recursive rebuild to also update the schema (if required), so that this does not cause my scenario to fail?

Thank you,
Ben

2 Replies
Dataiker

Hi,

When automating a flow, the assumption is that you keep control over your dataset schemas.

Note that DSS never automatically changes the schema of a dataset while running a job. Changing the schema of a dataset is a dangerous operation, which can lead to previous data becoming unreadable, especially for partitioned datasets.

You can find more information on this page: https://doc.dataiku.com/dss/latest/schemas/index.html

In your case, I would advise implementing schema control in your Python recipes, to prevent downstream schema changes.
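For example, a minimal sketch of what schema control inside a Python recipe could look like, using pandas. The column names and dtypes below are purely illustrative placeholders, not Ben's actual dataset; the idea is to project whatever the upstream pull returned onto a fixed, expected schema before writing the output:

```python
import pandas as pd

# Illustrative expected schema; replace with your real column names/dtypes.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "amount": "float64",
    "country": "object",
}

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Project the DataFrame onto the expected columns and cast dtypes,
    so the recipe always writes the same schema regardless of what the
    upstream pull returned."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        # Fail loudly rather than silently writing a changed schema.
        raise ValueError(f"Upstream data is missing columns: {sorted(missing)}")
    # Drop any unexpected extra columns, fix column order, and cast dtypes.
    return df[list(EXPECTED_SCHEMA)].astype(EXPECTED_SCHEMA)
```

Calling `enforce_schema(raw_df)` just before writing the recipe's output dataset means extra or reordered columns from the source no longer propagate downstream, and a genuinely missing column fails the job with an explicit error instead of a schema mismatch later in the flow.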

Cheers,

Alex

Level 4
Author

Thanks Alex, this makes sense, I'll look to enforce the schema in my code instead!

Thanks,
Ben