'Edit in notebook' in python recipe on partitioned dataset

Solved!
rene
Level 2

For an application we place a high priority on transparency and understandability of the code. For that reason we use the 'Edit in notebook' function on our Python recipes, so that new team members and others can easily skim through the code as if it were a notebook and study the outputs of each step.

However, on a partitioned dataset this does not work, because a huge dataframe containing the entire dataset is loaded instead of just the latest partition.

Any workarounds?

4 Replies
Turribeach

Hi, you can use the Dataset.list_partitions() method to list all partitions, then Dataset.add_read_partitions() to select the partition or partitions you want to read, and finally Dataset.get_dataframe() to load only the selected partitions. Note that you cannot manually add read partitions when running inside a Python recipe: there they are computed automatically from the partition dependencies defined on the recipe's Input/Output tab. This approach therefore only works in a notebook, so you may want to use this trick to have that code run only when executing as a notebook.

https://developer.dataiku.com/latest/api-reference/python/datasets.html#dataiku.Dataset.list_partiti...

https://developer.dataiku.com/latest/api-reference/python/datasets.html#dataiku.Dataset.add_read_par...

https://developer.dataiku.com/latest/api-reference/python/datasets.html#dataiku.Dataset.get_datafram...
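Putting the three calls together, a minimal notebook sketch might look like the following. The helper function and the dataset name are illustrative, not part of the Dataiku API; only list_partitions(), add_read_partitions(), and get_dataframe() come from the docs linked above.

```python
# Minimal sketch, assuming the dataiku.Dataset API described above.

def read_latest_partition(dataset):
    """Read only the most recent partition of a partitioned dataset."""
    partitions = dataset.list_partitions()
    # Date-style partition ids (e.g. "2024-01-31") sort lexicographically,
    # so the last one after sorting is the most recent.
    latest = sorted(partitions)[-1]
    dataset.add_read_partitions(latest)  # only allowed outside a recipe
    return dataset.get_dataframe()

# Inside a DSS notebook this would be used roughly as:
# import dataiku
# df = read_latest_partition(dataiku.Dataset("my_partitioned_dataset"))
```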

 

 

rene
Level 2
Author

Update: I initially accepted this as the solution, however hasattr(dataiku, 'dku_flow_variables') does NOT work when you use the 'Edit in notebook' option on a Python recipe: even when running in the notebook, it still returns True.

apfk
Level 1

Unsure if you're still looking, but we use the native dataiku in_ipython helper to check whether something is running in a notebook or not.

import dataiku
dataiku.in_ipython

I use it in my code libraries, and it seems able to differentiate between running those library functions in my notebook vs. running them through a scenario.
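For the partitioned-dataset case in the original question, that check could gate the partition selection so the same code works both as a recipe and as a notebook. A sketch under the assumption that dataiku.in_ipython reports notebook execution; the notebook check is passed in as a boolean flag here (and the function and dataset names are illustrative) so the logic stands on its own outside DSS:

```python
# Sketch: restrict the read to the latest daily partition only when running
# interactively. Inside DSS you would pass the result of dataiku.in_ipython.

def load_dataframe(dataset, in_notebook):
    if in_notebook:
        # In a notebook, narrow the read to the current day's partition;
        # in a recipe, partition dependencies are applied automatically
        # from the Input/Output tab and must not be set manually.
        dataset.add_read_partitions("CURRENT_DAY")
    return dataset.get_dataframe()

# Inside DSS:
# import dataiku
# df = load_dataframe(dataiku.Dataset("my_partitioned_dataset"),
#                     dataiku.in_ipython)
```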

Hope this helps!

rene
Level 2
Author

Yes, that's what I was looking for. I now directly use add_read_partitions('CURRENT_DAY') so the notebook only shows the latest day's data whenever it is run.

Thanks!

 
