How to handle empty inputs to a predict or scoring recipe in the flow?
I have a flow with two scoring models. Depending on the data, the input dataset to one of the scoring models is sometimes empty, and this causes the flow to fail. Is there a way to get an empty output, i.e. empty predictions, from the predict recipe for an empty input? I am fine with getting empty predictions for an empty input, since I can handle the empty output of one of my models in my Python recipe. How should I handle this case?
Answers
-
I believe you are running a scenario with multiple branches, and some parts of the flow are not built. I worked on something similar using Python recipes that write data to a dataset. In some cases a particular part of the flow did not need to be built, and as a result any recipe that consumed the unbuilt dataset failed.
If that is your case, you can make the recipe create an empty dataframe when no processing is required and write it to the output dataset. This is different from an unbuilt dataset, and when your prediction recipe tries to read this blank dataset, you can use try/except to handle any error it throws. A minimal sketch of the pattern is below.
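A rough sketch, assuming hypothetical dataset names ("model_input", "model_scored") and a hypothetical column list; the real schema would come from your own flow:

```python
import dataiku
import pandas as pd

input_ds = dataiku.Dataset("model_input")  # hypothetical name

try:
    df = input_ds.get_dataframe()
except Exception:
    # Reading a blank/unbuilt dataset may raise; fall back to an empty
    # dataframe with the columns the downstream recipe expects.
    df = pd.DataFrame(columns=["id", "feature_1", "prediction"])  # hypothetical schema

# Writing the (possibly empty) dataframe keeps the downstream recipe runnable.
dataiku.Dataset("model_scored").write_with_schema(df)
```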
Another alternative is to use a variable that flags whether a particular recipe has to be executed, and then use the 'Run this step' condition on your scenario steps to run recipes selectively; a sketch of reading such a flag follows.
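For example, inside a Python recipe you could read such a flag with dataiku.get_custom_variables(); the variable name run_model_b and the dataset names here are hypothetical:

```python
import dataiku
import pandas as pd

# Project/scenario variables are exposed as strings via get_custom_variables().
flag = dataiku.get_custom_variables().get("run_model_b", "false")  # hypothetical variable

if flag.lower() == "true":
    df = dataiku.Dataset("model_input").get_dataframe()
    # ... run the actual scoring logic here ...
else:
    # Skip the work and emit an empty dataframe with the expected columns.
    df = pd.DataFrame(columns=["id", "prediction"])  # hypothetical schema

dataiku.Dataset("model_scored").write_with_schema(df)
```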
-
Thanks for your response @SuhailS7
I did try a custom (Python) scenario instead of a step-based one. If the input dataset to a predict recipe is empty, I create an empty dataframe and write it to the Dataiku dataset that is the output of my predict recipe. This empty output dataset is then the input to another Python recipe, and I am facing one more issue: when I call scenario.build_dataset("datasetname"), it tries to build all downstream datasets as well, which gives me an error. Roughly what I am doing is sketched below. In the custom scenario case, how do I control which datasets are built by scenario.build_dataset()?
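A rough sketch of the custom scenario, with hypothetical dataset names:

```python
from dataiku.scenario import Scenario
import dataiku
import pandas as pd

scenario = Scenario()

# If the model input is empty, write empty predictions instead of scoring.
input_df = dataiku.Dataset("model_b_input").get_dataframe()  # hypothetical name
if input_df.empty:
    empty = pd.DataFrame(columns=list(input_df.columns) + ["prediction"])
    dataiku.Dataset("model_b_scored").write_with_schema(empty)
else:
    # This is the call that also ends up building other datasets.
    scenario.build_dataset("model_b_scored")
```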
-
You can control this via the build_mode argument of the build_dataset method:
build_dataset(dataset_name, project_key=None, build_mode='RECURSIVE_BUILD', partitions=None, step_name=None, asynchronous=False, fail_fatal=True, **kwargs)

Executes the build of a dataset.

Parameters:
- dataset_name – name of the dataset to build
- project_key – optional, project key of the project in which the dataset is built
- build_mode – one of "RECURSIVE_BUILD" (default), "NON_RECURSIVE_FORCED_BUILD", "RECURSIVE_FORCED_BUILD", "RECURSIVE_MISSING_ONLY_BUILD"
- partitions – can be given as a partitions spec; variables expansion is supported
More info in the scenario docs - https://doc.dataiku.com/dss/latest/python-api/scenarios-inside.html
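For your case, a non-recursive build should stop DSS from rebuilding the rest of the flow around that dataset; a minimal sketch with a hypothetical dataset name:

```python
from dataiku.scenario import Scenario

scenario = Scenario()

# Build only this dataset, without recursively building its dependencies.
scenario.build_dataset("model_b_scored", build_mode="NON_RECURSIVE_FORCED_BUILD")
```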