Variable Usage in Partition's Custom Python Function

MarcioCoelho
MarcioCoelho Dataiku DSS Core Designer, Registered Posts: 12 ✭✭✭✭

Hey everyone,

I have a recipe that connects two partitioned datasets.

In order to map which partitions to use from the input dataset, I'm using a python dependency function. In order for it to be as dynamic and practical as possible, I would like to use a global variable previously defined called threshold_date, but I can't seem to use it in the function.

I've tried using '${threshold_date}' but it returns the literal string ${threshold_date}.

I've also tried using the code approach like so:

import dataiku
date_val = dataiku.get_custom_variables()['threshold_date']

But I'm promptly greeted with the error "No module named dataiku", as you can see in the attached image.

So my question is: is there any way I can use a variable in a Python dependency function when mapping between partitioned datasets?

Thanks in advance.

Best regards,

Márcio Coelho


Operating system used: Windows


Best Answer

  • JordanB
    JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 293 Dataiker
    Answer ✓

    Hi @MarcioCoelho,

    Thank you for clarifying. Custom Python dependency functions can't use the Dataiku APIs, so you can't read variables directly. Even if you read them from disk, you will run into the same issue as when passing a variable directly: the list passed is not interpreted correctly.
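To make the limitation concrete: a dependency function can only use plain Python, so the threshold has to be hard-coded rather than read from project variables. A minimal sketch of the mapping logic the question describes, assuming day-level partitions in YYYY-MM-DD format (the function name, signature, and threshold value are illustrative assumptions, not the exact callback DSS expects):

```python
from datetime import date, timedelta

# Hard-coded threshold (hypothetical value): dependency functions
# cannot call dataiku.get_custom_variables(), so the date must be
# written into the function itself.
THRESHOLD_DATE = date(2023, 1, 1)

def day_range_dependency(output_partition_id):
    """Map an output day partition to the list of input day partitions
    from THRESHOLD_DATE up to that day (inclusive)."""
    end = date.fromisoformat(output_partition_id)
    n_days = (end - THRESHOLD_DATE).days
    if n_days < 0:
        # Output partition predates the threshold: no input partitions.
        return []
    return [(THRESHOLD_DATE + timedelta(days=d)).isoformat()
            for d in range(n_days + 1)]
```

Changing the threshold then means editing the function by hand, which is exactly the inconvenience the workaround below avoids.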

    We have logged this limitation and will look to resolve it in the future. In the meantime, you can incorporate your variables and use set_write_partition within your Python recipe. Note that you will need to override the partition settings in the recipe by adding ignore_flow=True to every occurrence of dataiku.Dataset() in your Python code. For example:

    import dataiku

    input_dataset = dataiku.Dataset("Menu_item", ignore_flow=True)
    input_df = input_dataset.get_dataframe()
    output = dataiku.Dataset("menu_item_partition", ignore_flow=True)
    output.write_schema_from_dataframe(input_df)
    output.set_write_partition("${Menu Category}")
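A side note on the original '${threshold_date}' attempt: DSS only expands ${...} variables in contexts it controls, such as the string passed to set_write_partition above; anywhere else, the literal text comes through unchanged. A rough pure-Python analogy of that expansion using string.Template (an illustration of the substitution behaviour, not the actual DSS implementation):

```python
from string import Template

# Hypothetical project variables, standing in for what DSS stores.
project_variables = {"threshold_date": "2023-01-01"}

# Where DSS performs expansion, "${threshold_date}" becomes the value.
expanded = Template("${threshold_date}").safe_substitute(project_variables)

# Where no expansion happens (e.g. inside a dependency function),
# the literal text survives, matching the behaviour in the question.
unexpanded = Template("${threshold_date}").safe_substitute({})
```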

Answers

  • JordanB
    JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 293 Dataiker

    Hi @MarcioCoelho,

    You should not need a Python dependency function to define a variable as a partition identifier. What you can do instead is use your project global variable and select "Explicit values". Then add your variable as ${variable}.

    [Screenshot: Screen Shot 2023-01-27 at 3.17.51 PM.png]

    You can also use a variable when partitioning the input dataset:

    [Screenshot: Screen Shot 2023-01-27 at 3.44.18 PM.png]

    Note, importing dataiku will not work in a Python dependency function, which is why you are seeing the module-not-found error.

    Please give this a try and let me know if you run into any issues.

    Thanks!

    Jordan

  • MarcioCoelho
    MarcioCoelho Dataiku DSS Core Designer, Registered Posts: 12 ✭✭✭✭

    Hey @JordanB,

    Thank you for your reply.

    I might not have explained it properly, but my goal isn't to define a partition via a variable, but instead to use the variable in intermediate calculations.

    I had noticed the import error, which is why I hoped there would be something similar to the ${variable} approach you propose.

    Thanks.

  • MarcioCoelho
    MarcioCoelho Dataiku DSS Core Designer, Registered Posts: 12 ✭✭✭✭

    Great, thanks for the help @JordanB, and for taking notes for this to be implemented in the future.

    I really liked your snippet and will use it in the future!

