'Edit in notebook' in python recipe on partitioned dataset
For our application, transparency and understandability of the code are a high priority. For that reason we use the 'Edit in notebook' function on our Python recipes, so that new team members and others can easily skim through the code as if it were a notebook and study the outputs of each step.
However, on a partitioned dataset this does not work, because a huge dataframe containing the entire dataset is loaded instead of just the latest partition.
Any workarounds?
Best Answer
-
Turribeach
Hi, you can use the Dataset.list_partitions() method to list all partitions, then use the Dataset.add_read_partitions() method to set the partition or partitions you want to read, and finally call the Dataset.get_dataframe() method to read the selected partitions. Note that you cannot manually add read partitions when running inside a Python recipe: they are automatically computed according to the partition dependencies defined on the recipe's Input/Output tab, so this will only work in a notebook. Therefore you may want to use this trick to have your notebook code run only when it is in a notebook.
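Putting those three calls together, a minimal sketch might look like this ("mydataset" is a placeholder dataset name, and picking the latest partition by lexical sort assumes date-style partition identifiers):

import dataiku

ds = dataiku.Dataset("mydataset")  # placeholder dataset name

# List the available partitions and pick the most recent one
# (lexical sort works for date-style partition identifiers).
partitions = ds.list_partitions()
latest = sorted(partitions)[-1]

# Restrict reads to that partition, then load it. Inside a recipe this
# is not allowed: read partitions come from the Input/Output tab.
ds.add_read_partitions(latest)
df = ds.get_dataframe()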
Answers
-
Yes, that's what I was looking for. I just call add_read_partitions('CURRENT_DAY') directly to show only the latest data in the notebook whenever it is run.
Thanks!
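For reference, a minimal sketch of that approach (the dataset name is a placeholder; the CURRENT_DAY keyword is used as described in the post above):

import dataiku

ds = dataiku.Dataset("mydataset")  # placeholder dataset name
# Per the post above, CURRENT_DAY resolves to today's partition.
ds.add_read_partitions("CURRENT_DAY")
df = ds.get_dataframe()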
-
Update: I initially accepted this as the solution, however the hasattr(dataiku, 'dku_flow_variables') check does NOT work when you use the 'Edit in notebook' option on a Python recipe. Even when running the notebook, it still returns True.
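The check that failed, for reference (per the update above, it returns True both when the recipe runs in the Flow and when it is opened via 'Edit in notebook', so it cannot distinguish the two contexts):

import dataiku

# Per the update above, this is True even in the 'Edit in notebook' view.
is_in_flow = hasattr(dataiku, 'dku_flow_variables')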
-
Unsure if you're still looking, but we use the native dataiku in_ipython to check whether something is running in a notebook or not.
import dataiku
dataiku.in_ipython()
I use it in my code libraries, and it seems able to differentiate between running those library functions in a notebook vs. running them through a scenario.
Hope this helps!
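Combining this with the partition trick from the accepted answer, a library helper might look like the sketch below (read_for_context is a hypothetical helper name, and the lexical sort again assumes date-style partition identifiers):

import dataiku

def read_for_context(dataset_name):
    # Hypothetical helper: in a notebook, read only the latest partition;
    # in a recipe or scenario, let Dataiku compute the read partitions
    # from the recipe's Input/Output tab.
    ds = dataiku.Dataset(dataset_name)
    if dataiku.in_ipython():
        ds.add_read_partitions(sorted(ds.list_partitions())[-1])
    return ds.get_dataframe()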