Variable Usage in Partition's Custom Python Function
Hey everyone,
I have a recipe that connects two partitioned datasets.
To map which partitions to use from the input dataset, I'm using a Python dependency function. To keep it as dynamic and practical as possible, I would like to use a previously defined global variable called threshold_date, but I can't seem to use it in the function.
I've tried using '${threshold_date}' but it returns the literal string ${threshold_date}.
I've also tried using the code approach like so:
import dataiku
date_val = dataiku.get_custom_variables()['threshold_date']
But I'm promptly greeted with the error No module named dataiku as you can see in the attached image.
So my question is - is there any way I can use a variable in a Python dependency function when mapping between partitioned datasets?
Thanks in advance.
Best regards,
Márcio Coelho
Operating system used: Windows
Best Answer
-
JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 296 Dataiker
Hi @MarcioCoelho,
Thank you for clarifying. A custom Python dependency function can't use Dataiku APIs, so you can't read variables directly. Even if you read them from disk, you will run into the same issue as when passing a variable directly: the list passed is not interpreted correctly.
We have logged this limitation and will look to resolve it in the future. In the meantime, you can incorporate your variables and use set_write_partition within your Python recipe. Note that you will need to override the partition setting in the recipe by adding "ignore_flow=True" to every occurrence of "Dataset" in your Python code. For example:
    import dataiku
    # ignore_flow=True bypasses the partition settings defined in the Flow
    input_dataset = dataiku.Dataset("Menu_item", ignore_flow=True)
    input_df = input_dataset.get_dataframe()
    output = dataiku.Dataset("menu_item_partition", ignore_flow=True)
    output.write_schema_from_dataframe(input_df)
    # choose the output partition explicitly
    output.set_write_partition("${Menu Category}")
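As a rough sketch of how the variable could feed the "middle calculations" inside the recipe instead of the dependency function: the helper below builds the list of partitions to read from threshold_date. It assumes day-level partitions with YYYY-MM-DD identifiers and a threshold_date stored in the same format; partitions_since, build_output and target_day are hypothetical names for illustration, and the dataset names are just the ones from the example above.

```python
from datetime import datetime, timedelta
from typing import List


def partitions_since(threshold_date: str, target_day: str) -> List[str]:
    """Daily partition ids from threshold_date up to target_day, inclusive.

    Assumes both arguments are YYYY-MM-DD strings (an assumption, not a DSS rule).
    """
    start = datetime.strptime(threshold_date, "%Y-%m-%d").date()
    end = datetime.strptime(target_day, "%Y-%m-%d").date()
    n_days = (end - start).days
    # An empty list is returned when the threshold is after the target day.
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days + 1)]


def build_output(target_day: str) -> None:
    """Only runs inside a DSS Python recipe, where the dataiku package exists."""
    import dataiku  # imported here so the helper above stays usable elsewhere

    threshold = dataiku.get_custom_variables()["threshold_date"]

    # ignore_flow=True so the recipe picks the partitions itself
    source = dataiku.Dataset("Menu_item", ignore_flow=True)
    for partition_id in partitions_since(threshold, target_day):
        source.add_read_partitions(partition_id)
    df = source.get_dataframe()

    output = dataiku.Dataset("menu_item_partition", ignore_flow=True)
    output.set_write_partition(target_day)
    output.write_with_schema(df)
```

Keeping the date arithmetic in a plain function means it can be checked outside DSS, while only build_output depends on the dataiku package.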
Answers
-
JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 296 Dataiker
Hi @MarcioCoelho,
You should not need a Python dependency function to use a variable as a partition identifier. Instead, you can use your project global variable with the "Explicit values" dependency and enter the variable as ${variable}.
You can also use a variable when partitioning the input dataset:
Note that importing dataiku will not work in a Python dependency function, which is why you are seeing the "No module named dataiku" error.
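For reference, here is a minimal sketch of a dependency function that sticks to the standard library. It assumes the function is named get_dependencies and receives the target partition id as a YYYY-MM-DD string; THRESHOLD_DATE and WINDOW_DAYS are hypothetical constants, hard-coded precisely because project variables can't be read from here.

```python
from datetime import datetime, timedelta

# Hypothetical values, hard-coded because variables are not readable here.
THRESHOLD_DATE = "2023-01-01"
WINDOW_DAYS = 7


def get_dependencies(target_partition_id):
    """Return the input partitions needed to build the target partition.

    Takes the WINDOW_DAYS days ending at the target day, but never goes
    back further than THRESHOLD_DATE.
    """
    target = datetime.strptime(target_partition_id, "%Y-%m-%d").date()
    threshold = datetime.strptime(THRESHOLD_DATE, "%Y-%m-%d").date()
    start = max(threshold, target - timedelta(days=WINDOW_DAYS - 1))
    n_days = (target - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days + 1)]
```

The obvious downside is that the threshold has to be edited in the function itself whenever it changes.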
Please give this a try and let me know if you run into any issues.
Thanks!
Jordan
-
Hey @JordanB,
Thank you for your reply.
I might not have explained it properly, but my goal isn't to define a partition via a variable; it is to use the variable in the intermediate calculations inside the dependency function.
I noticed the import error, which is why I hoped there would be something similar to the ${variable_value} syntax you propose.
Thanks.
-
Great, thanks for the help @JordanB, and for taking notes for this to be implemented in the future. I really liked your snippet and will use it in the future!