Scenarios to Run Code Recipe that appends to data set.

Solved!
tgb417

How do I set up a Scenario to produce cumulative results over many runs of a Python Code Recipe?

I have a very simple Flow with a Python Recipe and one SQL-based dataset.

[Screenshot: tgb417_0-1580869087314.png]

I'm tracking the results over time in the WalkupResults table, which is stored in PostgreSQL. I want long-term cumulative results in this table, so I set the recipe's output to "Append instead of overwrite."

[Screenshot: tgb417_1-1580869344430.png]
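
For reference, the write side of the recipe is roughly the shape below. This is a simplified sketch rather than my actual code: the example dataframe stands in for the real logic, and only the output dataset name (WalkupResults) matches my project.

```python
import dataiku
import pandas as pd

# Placeholder dataframe -- in the real recipe this is built from the walk-up data.
df = pd.DataFrame({"visit_date": ["2020-02-04"], "walkups": [12]})

# Output dataset backed by the PostgreSQL table.
out = dataiku.Dataset("WalkupResults")

# With "Append instead of overwrite" checked on the output,
# I expect this to add rows rather than replace the table.
out.write_with_schema(df)
```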

Here is my Run Trigger for my first scenario setup.

[Screenshot: tgb417_2-1580869564786.png]

Here is my Run Step.

[Screenshot: tgb417_3-1580869664297.png]

I've tried a number of variants of the Build Mode.

[Screenshot: tgb417_4-1580869751231.png]

In some cases, I've lost a bunch of test records.

Was this because I used the "Force-rebuild dataset and dependencies" and this somehow overrode the append directive of the Python Recipe?

I've not found a way of using scenarios to consistently run a recipe.

[Screenshot: tgb417_5-1580870002650.png]

I'm running DSS 5.1.5.

Does anyone have any suggestions that will move me forward?

--Tom

6 Replies
tgb417
Author

So, I think that I've worked out my own issue.  I hope.

I used "Force-rebuild dataset and dependencies." I think that this is the setting that wiped my data table.

I've chosen to build only this dataset for now, and this seems to work OK.

I think that other parts of my problems were a couple of remaining Python bugs that needed to be cleaned up.

--Tom

Caroline
Level 2

This has happened to me as well. The issue was not the "force rebuild" but the fact that the initial table had not been created by Dataiku, and its schema (column types) did not match the types Dataiku is able to handle. Therefore, the first time the recipe ran, it dropped the table completely to rebuild it with a schema corresponding to the output of the recipe. Subsequent runs were OK, since the schemas matched. But this means we need to be very careful when sending data to pre-existing tables. You can see the "DROP TABLE" in the logs from the first scenario run.

tgb417
Author

@Caroline 

Thanks for sharing. 

In my case, the experience was a little different from yours. Although I may not have made this clear in my description, the table had already been created by a Dataiku DSS code recipe and updated by running that same recipe.

My problem happened only in subsequent runs of the recipe initiated by a scenario, while I was trying to set up that scenario.

I also had some Python code bugs in the recipe, but I don't believe they were the cause. (Of course, one can't know for sure.)

It would be interesting to hear from the Dataiku folks in which cases table overwrites occur and when the append directive is followed, and maybe a little more about how the Scenario editor works. Is there a video or more information about this? Would this be a topic for a webinar by Dataiku staff or community members here?

--Tom

Clément_Stenac
Dataiker

The table will be overwritten, even if "Append" is enabled, if the schema output by the Python recipe does not match the table.

We plan to add an option so that it would fail instead of overwriting, but that does not exist yet. What you can do, however, is use `write_dataframe` instead of `write_with_schema` in your Python code after the first run. That will cause the write to fail in case of a schema mismatch, but the data will not be dropped.
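
In a recipe, that pattern looks roughly like the sketch below (the dataframe and the WalkupResults dataset name are just examples):

```python
import dataiku
import pandas as pd

# Example dataframe standing in for the recipe's real output.
df = pd.DataFrame({"visit_date": ["2020-02-04"], "walkups": [12]})

out = dataiku.Dataset("WalkupResults")  # example output dataset name

# First run only: write_with_schema defines the table schema from the dataframe.
# out.write_with_schema(df)

# Later runs: write_dataframe writes against the existing schema and fails
# on a mismatch instead of dropping and recreating the table.
out.write_dataframe(df)
```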

Caroline
Level 2

Absolutely, thank you Clément. The `write_dataframe` tip is one of my favorite "best practice tips" I have received from Dataiku. In my case, the drop issue actually occurred with a Sync recipe, but it is great that Dataiku's roadmap already takes it into account.

tgb417
Author

@Clément_Stenac 

I had this same situation hit me again a few weeks ago.

Has there been any progress on improving this situation, where schema changes (for example, a change in column order, or a column added to the data) cause append datasets to be cleared? I have datasets that can take days to rebuild, and every time this happens it is something of a nightmare, particularly if I'm on some kind of deadline.
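
In the meantime, a guard I've been considering is to compare the dataframe's columns to the existing table schema and fail fast before anything is written. This is only an untested sketch with example names:

```python
import dataiku
import pandas as pd

# Example dataframe standing in for the recipe's real output.
df = pd.DataFrame({"visit_date": ["2020-03-15"], "walkups": [8]})

out = dataiku.Dataset("WalkupResults")  # example output dataset name

# Compare column names (and order) against the table's current schema
# and stop before writing if anything has drifted.
expected = [col["name"] for col in out.read_schema()]
if list(df.columns) != expected:
    raise ValueError(
        "Schema drift: recipe produces %s but the table expects %s"
        % (list(df.columns), expected)
    )

out.write_dataframe(df)
```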

--Tom