Scenarios to Run a Code Recipe that Appends to a Dataset
How do I set up a Scenario to produce cumulative results over many runs of a Python Code Recipe?
I have a very simple Flow with a Python Recipe and one SQL-based dataset.
I'm tracking the results over time in the WalkupResults table, which is stored in PostgreSQL. I want long-term cumulative results in this table, so for the Recipe I set the output to "Append instead of overwrite."
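For reference, here is a minimal sketch of the recipe pattern being described; the input dataset name is an assumption, while WalkupResults is the output named above:

```python
import dataiku

# Input dataset name is hypothetical; WalkupResults is the
# PostgreSQL output configured with "Append instead of overwrite".
input_ds = dataiku.Dataset("WalkupRaw")
df = input_ds.get_dataframe()

# ... compute this run's batch of results into df ...

# With append enabled on the output, each run should add rows
# rather than replace them.
output_ds = dataiku.Dataset("WalkupResults")
output_ds.write_with_schema(df)
```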
Here is my Run Trigger for my first scenario setup.
Here is my Run Step.
I've tried a number of variants of the Build Mode.
In some cases, I've lost a bunch of test records.
Was this because I used the "Force-rebuild dataset and dependencies" and this somehow overrode the append directive of the Python Recipe?
I've not found a way of using scenarios to consistently run a recipe.
I'm running DSS 5.1.5.
Does anyone have any suggestions that will move me forward?
--Tom
Best Answers
-
tgb417
So, I think that I've worked out my own issue. I hope.
I used "Force-rebuilds dataset and dependency." I think that this is the one that wiped my data table.
I've chosen to build only this dataset for now. And this seems to work OK.
I think that other parts of my problems were a couple of remaining Python bugs that needed to be cleaned up.
--Tom
-
The table will be overwritten even if "Append" is enabled when the schema output by the Python recipe does not match the table.
We plan to add options so that the write would fail instead of overwriting, but that does not exist yet. What you can do, however, is use `write_dataframe` instead of `write_with_schema` in your Python code after the first run. That will cause the write to fail in case of a schema mismatch, but the data will not be dropped.
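A minimal sketch of that pattern, assuming the output dataset is the WalkupResults dataset from the original post and `df` is the dataframe the recipe produces:

```python
import dataiku

output_ds = dataiku.Dataset("WalkupResults")

# First run only: write_with_schema() sets the dataset schema from
# the dataframe, which can drop and recreate the underlying table.
# output_ds.write_with_schema(df)

# Subsequent runs: write_dataframe() writes against the existing
# schema. On a schema mismatch the write fails instead of the table
# being dropped, so previously appended data is preserved.
output_ds.write_dataframe(df)
```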
Answers
-
This already happened to me. The issue was not the 'force rebuild' but the fact that the initial table had not been created by Dataiku, and its schema (column types) did not match the types Dataiku is able to handle. Therefore, the first time the recipe ran, it dropped the table completely to rebuild it with a schema corresponding to the output of the recipe. Subsequent runs were fine because the schema matched. This means we need to be very careful when sending data to pre-existing tables. You can see the 'DROP TABLE' in the logs from the first time the scenario ran.
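One way to guard against that, as a sketch (assuming the pre-existing table is exposed as the WalkupResults dataset and `df` is the recipe's output dataframe), is to compare the dataframe against the schema DSS has recorded and fail loudly before writing:

```python
import dataiku

out = dataiku.Dataset("WalkupResults")

# Abort before writing if the column names diverge from the schema
# DSS has recorded, rather than letting a schema update drop the
# pre-existing table.
expected = [col["name"] for col in out.read_schema()]
if list(df.columns) != expected:
    raise ValueError(
        "Schema mismatch: recipe output %s vs dataset %s"
        % (list(df.columns), expected)
    )

out.write_dataframe(df)
```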
-
tgb417
Thanks for sharing.
In my case, my experience is a little different from yours. Although I may not have made this clear in my description, the table had already been created by a Dataiku DSS code recipe and updated by running that same recipe.
My problem happened only on subsequent runs of the recipe, initiated while I was trying to set up a scenario.
Now, I also had some Python bugs in the recipe. I don't believe these were causal in my experience. (Of course, one can't know for sure.)
It would be interesting to know from the Dataiku folks in which cases table overwrites occur and when the append directive is followed, and maybe a little more about how the Scenario editor functions. Is there a video or some more information about this? Would this be a topic for a webinar provided by Dataiku staff or community members here?
--Tom
-
Absolutely, thank you Clément. The `write_dataframe` tip is one of my favorite 'best practice tips' I received from Dataiku. The issue with the drop actually occurred with a Sync recipe, but it is great that Dataiku's roadmap already takes it into account.
-
tgb417
I had this same situation hit me again a few weeks ago.
Has there been any progress on improving this situation, where schema changes cause append datasets to be cleared when, for example, the column order changes or a column gets added to the data? I have datasets that can take days to rebuild, and every time this happens it is something of a nightmare, particularly if I'm on some kind of deadline. In the meantime, a possible workaround is sketched below.
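One possible interim workaround, sketched here under the assumption that the output is the WalkupResults dataset, only the column order drifts, and `df` is the recipe's output dataframe, is to realign the dataframe to the dataset's existing schema before writing:

```python
import dataiku

out = dataiku.Dataset("WalkupResults")

# Reorder the dataframe's columns to match the schema DSS already
# knows for the dataset, so a mere column-order change does not
# register as a schema mismatch.
existing_cols = [col["name"] for col in out.read_schema()]
df = df[existing_cols]

# write_dataframe() writes rows against the existing schema instead
# of redefining it, so the table is not dropped.
out.write_dataframe(df)
```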