Limit the size of a dataset with appending behavior

J-L · January 2024

Hi, I want to limit the number of rows in a Dataiku dataset. It should keep only the latest 90 rows and delete the oldest. The dataset is built by appending one row at a time.

Instead of appending directly to the dataset, I tried creating another dataset containing just the newest row, plus a Python recipe that implements the appending logic itself. That didn't work because the recipe's output dataset can't also be its own input.

Is there any way of implementing such behavior?

Thanks for your help

Turribeach · January 2024

Yes, that’s a circular dependency in the flow design. Don’t run the whole flow. Instead, create a scenario with one step that recursively builds the last dataset before the circular reference, then add two steps that build only the final datasets, without recursion.

Turribeach · January 2024

I would discourage you from using the append option. It doesn't work the way most people expect: if you push schema changes to the dataset, Dataiku will still drop the table, deleting all historical data.

Dataiku doesn't allow circular references, but you can get around that with a two-dataset approach. The logic works like this. First, create a Python recipe that writes 90 rows to a dataset called top_90_output. Then sync those 90 rows to another dataset (let's call it top_90_output_copy) using the Sync recipe. Next, go back to the Python recipe and add the synced dataset as an input. Finally, modify the Python recipe so that it reads top_90_output_copy, drops the oldest row, appends the new row from the source dataset, and writes the result to top_90_output, which then gets synced to top_90_output_copy again.
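For illustration, the drop-and-append step could look roughly like the sketch below. It is written as a plain pandas function so it can be tried outside Dataiku. The function name roll_window, the column name value, and the dataset name newest_row are made up for this example, and it assumes rows arrive oldest first (if your data has a timestamp column, sorting on it before trimming would be safer).

```python
import pandas as pd


def roll_window(history: pd.DataFrame, new_rows: pd.DataFrame, max_rows: int = 90) -> pd.DataFrame:
    """Append new_rows to history and keep only the last max_rows rows.

    Assumes both frames share the same columns and are ordered oldest first.
    """
    combined = pd.concat([history, new_rows], ignore_index=True)
    # tail() keeps the most recent rows, so the oldest ones fall off the top.
    return combined.tail(max_rows).reset_index(drop=True)


# Inside the Dataiku Python recipe, the wiring might look like this
# (not executed here; "newest_row" is a hypothetical dataset name):
#   import dataiku
#   history = dataiku.Dataset("top_90_output_copy").get_dataframe()
#   new_rows = dataiku.Dataset("newest_row").get_dataframe()
#   dataiku.Dataset("top_90_output").write_with_schema(roll_window(history, new_rows))

if __name__ == "__main__":
    history = pd.DataFrame({"value": range(90)})   # rows 0..89
    new_rows = pd.DataFrame({"value": [90]})       # one new row
    result = roll_window(history, new_rows)
    print(len(result), result["value"].iloc[0], result["value"].iloc[-1])
```

While the history holds fewer than 90 rows, nothing is dropped, so the same function also covers the initial fill-up phase.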

Screenshot 2024-01-16 at 22.53.48.png

J-L · January 2024

This approach looks good, but I can't get it to work correctly. If I run the whole flow (or everything downstream of the Python recipe), a StackOverflowException occurs (I think because it tries to run 'in circles' indefinitely). How do I stop the flow after the Python recipe has run once?

Turribeach · January 2024

If you want to avoid the circular reference altogether, you can dump the second dataset to a Dataiku folder as CSV, then use the folder as an input to the Python recipe and load the CSV files from it. Folders won’t cause recursive build issues in the flow, but I think they look a bit less clean.
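As a rough sketch of the loading side, the helper below concatenates CSV content from any file-like objects. The helper name read_csv_streams and the folder name history_folder are invented for this example. The commented lines show how the streams might be obtained from a managed folder, and that part is an untested assumption about your setup.

```python
import io

import pandas as pd


def read_csv_streams(streams) -> pd.DataFrame:
    """Read each file-like object as CSV and stack the rows in the given order."""
    frames = [pd.read_csv(stream) for stream in streams]
    if not frames:
        # No files in the folder yet: start from an empty history.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Inside the Dataiku Python recipe, the streams might come from a managed
# folder (not executed here; "history_folder" is a hypothetical name):
#   import dataiku
#   folder = dataiku.Folder("history_folder")
#   paths = folder.list_paths_in_partition()
#   history = read_csv_streams(folder.get_download_stream(p) for p in paths)

if __name__ == "__main__":
    first = io.StringIO("value\n1\n2\n")
    second = io.StringIO("value\n3\n")
    print(read_csv_streams([first, second]))
```

The loaded history can then be trimmed to the latest 90 rows in the recipe before it is written to the output dataset.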
