'Edit in notebook' in Python recipe on partitioned dataset
For an application we place a high priority on the transparency and understandability of the code. For that reason we use the 'Edit in notebook' function on our Python recipes, so that new team members and others can easily skim through the code as if it were a notebook and study the outputs of each step.
However, on a partitioned dataset this does not work: a huge dataframe containing the entire dataset is loaded, instead of just the latest partition.
Any workarounds?
Best Answer
-
Turribeach
Hi, you can use the Dataset.list_partitions() method to list all partitions, then use the Dataset.add_read_partitions() method to set the partition or partitions you want to read, and finally call Dataset.get_dataframe() to read only the selected partitions. Note that you cannot manually add read partitions when running inside a Python recipe: they are automatically computed according to the partition dependencies defined on the recipe's Input/Output tab. This will therefore only work in a notebook, so you may want to use this trick to have your notebook code run only when it is executed as a notebook.
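For illustration, a minimal sketch of that flow when run in a notebook, assuming a partitioned dataset named "my_dataset" (hypothetical) whose partition identifiers sort chronologically:

import dataiku

dataset = dataiku.Dataset("my_dataset")

# List all partition identifiers available on the dataset
partitions = dataset.list_partitions()

# Restrict reads to the most recent partition
# (assumes identifiers sort chronologically, e.g. ISO dates)
dataset.add_read_partitions(sorted(partitions)[-1])

# Only the selected partition is loaded into the dataframe
df = dataset.get_dataframe()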
Answers
-
Yes, that's what I was looking for. I just use add_read_partitions('CURRENT_DAY') directly, to show only the latest data in the notebook whenever it is run.
Thanks!
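In case it helps others, a sketch of that call, assuming a day-partitioned dataset named "my_dataset" (hypothetical name; CURRENT_DAY is the relative partition keyword the poster reports using):

import dataiku

dataset = dataiku.Dataset("my_dataset")
# CURRENT_DAY resolves to today's partition on a day-partitioned dataset
dataset.add_read_partitions("CURRENT_DAY")
df = dataset.get_dataframe()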
-
Update: I initially accepted this as the solution, however the hasattr(dataiku, 'dku_flow_variables') check does NOT work when you use the 'Edit in notebook' option on a Python recipe. Even when running the notebook, it still returns True.
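For reference, this is the check that turned out to be unreliable in this setup:

import dataiku

# Returns True in both contexts when using 'Edit in notebook',
# so it cannot tell a notebook run apart from a recipe run
hasattr(dataiku, 'dku_flow_variables')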
-
Unsure if you're still looking, but we use the native dataiku in_ipython function to check whether something is running in a notebook or not.
import dataiku
dataiku.in_ipython()

I use it in my code libraries, and it seems able to differentiate between when I'm running those library functions in my notebook vs running them through a scenario.
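Putting the thread together, a sketch of how this check could guard notebook-only partition selection. The dataset name is hypothetical, and this assumes in_ipython() returns True when running in a notebook kernel:

import dataiku

dataset = dataiku.Dataset("my_dataset")  # hypothetical dataset name
if dataiku.in_ipython():
    # Interactive (notebook) run: read only today's partition
    dataset.add_read_partitions("CURRENT_DAY")
# In the recipe itself, partitions come from the dependencies
# defined on the Input/Output tab instead
df = dataset.get_dataframe()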
Hope this helps!