Application as Recipe Inputs are Broken (or insanely obtuse to use)

Jared Registered Posts: 2
edited August 21 in Using Dataiku

I have a project that I built out to be an Application-As-A-Recipe to upload a Dataiku dataset as a file to our API. I will refer to my Application-as-a-Recipe as my "child process" for brevity's sake. Another Project calls this recipe within its flow. The child process has a scenario to build out all datasets and effectively run my Python code.

Expected Behavior: When a parent project/process/flow calls an Application-As-A-Recipe and provides a dataset as an input, that App Recipe should intake that dataset and have full access to it.

Actual Behavior: This does not happen at all, and App Recipe inputs don't actually contain any data.

I'm going to outline everything I've done to limit the need for clarifying questions, so this will be a long post.

Application as a Recipe configuration

In my child process's flow, I have an empty dataset called "recipe_input", which is used as the input for a Python recipe. My Python recipe takes that dataset as an input within the configuration. My code attempts to read the dataset as such:

import dataiku

input_dataset = dataiku.Dataset("recipe_input")
df = input_dataset.get_dataframe()

When I populate the empty dataset "recipe_input" from within the Flow editor of the child process with dummy data, everything works perfectly fine and the code runs to completion when run as its own scenario. The Python recipe has an output called "dummy-output", which is only there because Dataiku requires it. My Python code does not need to write out any datasets, so this dataset is effectively unused.
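For context, the whole recipe is roughly this shape (a sketch; run_child_recipe is a hypothetical wrapper name, and the actual API-upload step is elided):

```python
def run_child_recipe():
    """Sketch of the child Python recipe described above (hypothetical
    helper name; the upload step is elided)."""
    import dataiku  # only importable inside a DSS-managed environment

    df = dataiku.Dataset("recipe_input").get_dataframe()
    # ... upload df to our API here ...
    # Dataiku insists on an output dataset, so write a trivial frame to it.
    dataiku.Dataset("dummy-output").write_with_schema(df.head(0))
```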

As far as I can tell, my child process is configured correctly as an Application Recipe. All permissions are set to be as open as possible. The "Included Content" of my App Recipe has all Datasets Data configured for export. Under the Datasets button where I can define which Datasets are to be included, only the "dummy-output" is listed (and selected); I'm not sure why this is the case but for whatever reason, my input dataset "recipe_input" does not even show in the list. Within my Recipe Definition, my "recipe_input" dataset is configured to be my input using type "Dataset". No Output is configured. I also have a series of global variables that can get updated via a small module from within the Parent Process which calls it. Finally, I created a scenario to run this Child Process Flow, which is also configured to run when the App Recipe gets called from a Parent Process.

Integrating with the Parent Process as an Application Recipe…

When I integrate my child process from a parent project/Flow, I have a very small test dataset called "process_output", which is used as the input for my Child Process (app as recipe). My expectation is that other Projects will be able to send a dataset to my app recipe child process to be uploaded to our API. The Global Variables are also configured for the child process from within the Parent Process. Under the Inputs tab for my App Recipe child process, I have selected my "process_output" dataset as the recipe's input. Under the Advanced tab, I have checked the "Keep Instance" box so I may review the logs of the child process when it runs from my parent process.

In theory, this should now work perfectly fine. The child process app recipe has its inputs configured on its end, and passes those inputs to my Python script. The child process is configured (as far as I can tell) correctly such that inputs can be passed into it. It works on its own, but not as an App Recipe.

My scenario is configured to build "recipe_input" and "dummy-output", which effectively runs my Python script (truly, I wish I didn't need a fake output to run a Python script but that's a complaint for another day.) Build Mode is configured as "build dependencies, then these items". Handling of Dependencies is configured as "Force-Build". Update Output Schemas is checked.

This is where it becomes very frustrating… here is what actually happens:

When I run my flow, it hits the app recipe stage. I've reviewed the logs and can see Dataiku is correctly replacing the child process's dataset called "recipe_input" with the parent process's dataset called "process_output":

[INFO] [dku.flow.app] act.compute_app_output_NP - Performing input replacements
[INFO] [dku.flow.app] act.compute_app_output_NP - Found 1 objects to expose
[INFO] [dku.flow.app] act.compute_app_output_NP - Will add the following rules: {}
[INFO] [dku.flow.app] act.compute_app_output_NP - Adding rules to project CXAPIPARENT in to expose: [{"type":"DATASET","localName":"process_output","quickSharingEnabled":false,"rules":[{"targetProject":"RUN_compute_app_output_iP4v11US","appearOnFlow":true}]}]
[INFO] [dku.flow.app] act.compute_app_output_NP - Swapping recipes' inputs for the exposed ones
[INFO] [dku.flow.app] act.compute_app_output_NP - Replacing input recipe_input of recipe cx_api by CXAPIPARENT.process_output
[INFO] [dku.flow.app] act.compute_app_output_NP - Changed 1 inputs in recipe cx_api
[INFO] [dku.flow.app] act.compute_app_output_NP - Setting project variables
[INFO] [dku.flow.app] act.compute_app_output_NP - Replacements done, running scenario

This looks like it works, but it in fact does not. The scenario immediately fails. When reviewing the logs for the instance of the sub-process that is automatically created, I see this error:

Error in Python process: At line 495: <class 'Exception'>: Dataset recipe_input cannot be used : declare it as input or output of your recipe

It is truly vexing that I have configured "recipe_input" as my input, the empty dataset exists in the child-process flow, and the Python script has it configured as its recipe input. So why doesn't this work?!

My troubleshooting attempts

I added a Python Code step to my child-process scenario to attempt to read the inputs and log them to validate that the child process is indeed receiving the input. I saw in other questions on these forum boards, as well as in the documentation, that I can run this code to fetch the recipe's inputs:

from dataiku import recipe
inputA = recipe.get_input()
print("****RECIPE INPUT****")
print(inputA)

Yet this yields the following error:

[INFO] [process]  - Traceback (most recent call last):
[INFO] [process] - File "/dataikuData/design/scenarios/RUN_compute_app_output_iP4v11US/RUN_CX_FLOW/2024-08-21-19-32-33-052/custom-step-Step #2/script.py", line 2, in <module>
[INFO] [process] - inputA = recipe.get_input()
[INFO] [process] - File "/dataiku/dataiku-dss/python/dataiku/recipe/__init__.py", line 74, in get_input
[INFO] [process] - l = get_inputs(index, role, object_type, as_type)
[INFO] [process] - File "/dataiku/dataiku-dss/python/dataiku/recipe/__init__.py", line 71, in get_inputs
[INFO] [process] - return [_get_typed_recipe_input_output(x, as_type) for x in flow_spec["in"] if _recipe_input_output_matches(x, index, role, object_type)]
[INFO] [process] - TypeError: 'NoneType' object is not subscriptable

This exception — flow_spec coming back as None — can only mean that the code sees no recipe inputs at all; the app recipe is in fact not receiving any.

Whilst reading through other community posts, I found this one, which is fairly similar to my issue; however, I DO specify the dataset as an input, and yet it still does not work: https://community.dataiku.com/discussion/comment/3234

The suggestion is to add in the argument "ignore_flow=True" when constructing my dataset class. This does in fact stop the errors from being raised and the process completes successfully. BUT, the output data from the parent process STILL does not pass into my child process. The data that gets uploaded to my API ends up being an empty file, rather than the data that my Parent Process passes into the child App Recipe.
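In code, the workaround looks like this (a sketch with a hypothetical helper name; as described above, it merely suppresses the check rather than delivering the parent's data):

```python
def read_input_ignoring_flow(name="recipe_input"):
    """Sketch of the ignore_flow workaround (hypothetical helper name).
    The 'declare it as input or output' error goes away, but the read
    still hits the child project's own empty dataset."""
    import dataiku  # only importable inside a DSS-managed environment

    return dataiku.Dataset(name, ignore_flow=True).get_dataframe()
```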

What is the solution here?

I cannot figure out for the life of me how to get this to work. I've spent multiple days working on this with no success. Dataiku does not seem to process inputs and outputs like a programming language. Instead, Inputs are these vague ideas which hold no value, which makes them useless in my opinion. Why is it SO HARD to pass a dataset from a parent flow to a child App as a Recipe? It really should not be this difficult, but I welcome any assistance if you've managed to read through all this!

*SPECIAL NOTE: Can someone who works at Dataiku please update your documentation to accurately reflect your Dataiku library? I exported the library from my account to avoid a lot of errors when developing locally within my IDE. The Dataiku library does not have a "recipe" module, nor does it have a method called "get_inputs"; however, your documentation states it does: https://doc.dataiku.com/dss/13/applications/application-as-recipe.html

from dataiku import recipe
inputA = recipe.get_input()
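Until the exported library matches the docs, one option for local IDE work is a tiny hand-written stand-in that mimics the documented interface (all names below are hypothetical stubs, not the real library):

```python
# Hypothetical stand-ins for dataiku.Dataset / dataiku.recipe.get_inputs(),
# useful for exercising recipe logic in an IDE where the real module fails.
class StubDataset:
    def __init__(self, name, rows):
        self.name = name
        self._rows = rows

    def get_dataframe(self):
        # the real API returns a pandas DataFrame; plain rows-of-dicts here
        return self._rows

def get_inputs():
    # stands in for dataiku.recipe.get_inputs()
    return [StubDataset("process_output", [{"id": 1}, {"id": 2}])]

frames = {ds.name: ds.get_dataframe() for ds in get_inputs()}
print(frames["process_output"])  # [{'id': 1}, {'id': 2}]
```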

Answers

  • Jared Registered Posts: 2

    In case anyone comes across this later, I met with the Dataiku team and my "hacky" approach is in fact their expected method. Here is my code to read dataset inputs when using an App as a Recipe:

    import dataiku
    import dataiku.recipe
    import pandas as pd

    df = pd.DataFrame()
    for _input in dataiku.recipe.get_inputs():
        print(f"Working on input {_input.name}")
        if isinstance(_input, dataiku.Dataset):
            try:
                print(f'Input {_input.name} is of type Dataset. Reading as pd.DataFrame object.')
                df = _input.get_dataframe()
                print(df.head())
            except Exception as e:
                raise Exception(f'Failed to fetch the input dataset as a Pandas DataFrame: {e}')
