Force schema syncs

ben_p
ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi all,

In my pipeline, shown below, I build the final (right-most) dataset every day with a forced recursive rebuild of dependent datasets.

[Flow screenshot: ben_p_0-1587483928995.png]

I have a recurring issue where the data pulled by the Python jobs sometimes has a slightly different schema, and my pipeline breaks. I fix this by manually running the Python step in question, at which point DSS prompts me to update the schema based on the new data. Once I do this, I can re-run the scenario without error.

My question is: can my forced recursive rebuild also update the schema (if required), so that schema changes no longer cause my scenario to fail?

Thank you,
Ben

Best Answer

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Answer ✓

    Hi,

    When automating a flow, the assumption is that you need control over dataset schemas.

    Note that DSS never automatically changes the schema of a dataset while running a job. Changing the schema of a dataset is a dangerous operation, which can lead to previous data becoming unreadable, especially for partitioned datasets.

    You can find more information on this page: https://doc.dataiku.com/dss/latest/schemas/index.html

    In your case, I would advise implementing schema control in your Python recipes, to prevent downstream schema changes.
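    A minimal sketch of that idea in pandas (the column names and dtypes here are hypothetical, and in a DSS Python recipe you would apply this to the dataframe before writing the output dataset): pin the expected schema in the recipe, drop unexpected columns, add missing ones as nulls, and cast dtypes, so the output always matches what downstream recipes expect.

    ```python
    import pandas as pd

    # Hypothetical expected schema: column name -> pandas dtype
    EXPECTED_SCHEMA = {
        "customer_id": "int64",
        "order_date": "object",
        "amount": "float64",
    }

    def enforce_schema(df, expected=EXPECTED_SCHEMA):
        """Coerce a dataframe to a fixed schema: drop unexpected
        columns, add missing ones as nulls, and cast dtypes."""
        # reindex drops extra columns and adds missing ones as NaN
        out = df.reindex(columns=list(expected))
        for col, dtype in expected.items():
            out[col] = out[col].astype(dtype)
        return out

    # Example: the source arrived with an extra column and a missing one
    raw = pd.DataFrame({
        "customer_id": [1, 2],
        "amount": ["3.5", "4.0"],  # came in as strings this time
        "extra": ["x", "y"],       # unexpected column
    })
    clean = enforce_schema(raw)
    print(list(clean.columns))  # ['customer_id', 'order_date', 'amount']
    ```

    With this in place, the recipe output schema stays stable even when the upstream source drifts, so the forced recursive rebuild never hits a schema mismatch.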

    Cheers,

    Alex

Answers

  • ben_p
    ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

    Thanks Alex, this makes sense, I'll look to enforce the schema in my code instead!

    Thanks,
    Ben
