Create scoring recipe using the API

sseveur (Partner, L2 Designer, Registered Posts: 16)
edited July 16 in General Discussion

Hi,

I'm having trouble creating a scoring recipe and keeping only some of the features in the output dataset.
To get the engine and Spark config I wanted, I first created the recipe manually and retrieved its payload with recipe.get_settings().get_payload().
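
For reference, retrieving that payload from the manually created recipe looks like this (the project key and recipe name are placeholders):

import dataiku

client = dataiku.api_client()
project = client.get_project('MY_PROJECT')  # placeholder project key
# payload of the manually configured scoring recipe, as a JSON string
payload = project.get_recipe('scoring_manual').get_settings().get_payload()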

Then I wrote the following function, which creates the scoring recipe with the desired parameters.

from dataikuapi.dss.recipe import PredictionScoringRecipeCreator
import dataiku
import json

def creation_recipe_scoring(project, recipe_name, dataset_to_score, model_id,
                            output_dataset_name="scoring_temp_name",
                            output_connection="some_connection",
                            payload=None, keep_columns=None):
    """
    Create a scoring recipe and its output dataset using a deployed model.
    If the output dataset already exists, delete it and recreate it.

    Params
    :project object: a :class: dataikuapi.dss.project.DSSProject in which to create the recipe
    :recipe_name str: the name of the recipe
    :dataset_to_score str: the name of the dataset to score
    :model_id str: the ID of the deployed model that will score dataset_to_score
    :output_dataset_name str: the name of the output dataset
    :output_connection str: the connection used to store the output dataset
    :payload unicode (str): the payload as a unicode JSON string
        If None, the default engine (in-memory) is left unchanged
        The payload has to be taken from a manually created recipe using:
        project.get_recipe('recipe').get_settings().get_payload()
    :keep_columns list: list of the columns to keep in the scored dataset

    Returns:
    :recipe_handle object: handle to the recipe just created,
        a :class: dataikuapi.dss.recipe.DSSRecipe
    """
    try:
        builder = PredictionScoringRecipeCreator(name=recipe_name, project=project)
        builder.with_input_model(model_id)
        builder.with_input(dataset_to_score)
        builder.with_new_output(output_dataset_name, output_connection)
        recipe_handle = builder.build()

    except Exception as e:  # the output dataset probably already exists
        print(e)
        project.get_dataset(output_dataset_name).delete(drop_data=True)
        print('Dataset dropped')

        builder = PredictionScoringRecipeCreator(name=recipe_name, project=project)
        builder.with_input_model(model_id)
        builder.with_input(dataset_to_score)
        builder.with_new_output(output_dataset_name, output_connection)
        recipe_handle = builder.build()

    if payload is not None and keep_columns is not None:
        print('Modifying payload')
        settings = recipe_handle.get_settings()
        payload = json.loads(payload)
        # the columns have to be unicode strings in the payload
        unicode_columns = [unicode(col) for col in keep_columns]
        payload['keptInputColumns'] = unicode_columns  # only keep those columns
        settings.set_payload(json.dumps(payload))
        settings.save()

        print('Payload of the recipe:\n{0}'.format(settings.get_payload()))

    print('Recipe set')

    return recipe_handle
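
For context, a typical call looks like this (the project key, dataset, model ID and column names are placeholders):

client = dataiku.api_client()
project = client.get_project('MY_PROJECT')  # placeholder project key
scoring_recipe_handle = creation_recipe_scoring(
    project,
    recipe_name='score_customers',
    dataset_to_score='customers_prepared',
    model_id='deployed_model_id',        # placeholder deployed model ID
    output_dataset_name='customers_scored',
    output_connection='some_connection',
    payload=payload,                     # the unicode JSON string from the manual recipe
    keep_columns=['customer_id'])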

But when the dataset is created, all my input columns are still there.


I've compared all the settings between a manually created recipe and an API-created one, and everything is the same: recipe_settings, payload and status.
(All checked via the corresponding methods: recipe.get_settings().recipe_settings, recipe.get_settings().get_payload(), and recipe.get_settings().get_status().get_engines_details() / .get_selected_engine_details().)
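
In case it helps someone reproduce the comparison, the two payloads can be parsed and diffed like this (the recipe names are placeholders):

import json

manual = json.loads(project.get_recipe('scoring_manual').get_settings().get_payload())
scripted = json.loads(project.get_recipe('scoring_api').get_settings().get_payload())
# keys whose values differ between the two payloads
diff = {k: (manual.get(k), scripted.get(k))
        for k in set(manual) | set(scripted)
        if manual.get(k) != scripted.get(k)}
print(diff)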


Do you have any idea how to correctly keep only the wanted columns?


Regards,
Steven

PS: I want to use the scoring recipe with Spark and the Java scoring engine because it is much quicker than getting the predictor and the dataframe and applying the former to the latter (15 minutes vs 20-30 minutes to load the dataframe into DSS RAM + 10 minutes of scoring).
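
For comparison, the in-memory approach I'm referring to is roughly the following (the model ID and dataset name are placeholders):

import dataiku

model = dataiku.Model('deployed_model_id')  # placeholder deployed model
predictor = model.get_predictor()
# loading the whole dataset into DSS RAM is the slow part
df = dataiku.Dataset('dataset_to_score').get_dataframe()
scored_df = predictor.predict(df)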

PS2: The recipe name passed to the recipe creator isn't the one displayed: it is always "compute_" + [dataset_name]

Edit 1: adjusted display, added PS2

Best Answer

  • Sarina (Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317)
    edited July 17 Answer ✓

    Hi there,

    It looks like the keptInputColumns adjustment is not getting applied as expected because the parameter filterInputColumns must also be set to True in the payload.

    To apply this to your code, your if statement would probably look like this:

    if payload is not None and keep_columns is not None:
        print('Modifying payload')
        settings = recipe_handle.get_settings()
        payload = json.loads(payload)
        unicode_columns = [unicode(col) for col in keep_columns]

        # newly added line:
        payload['filterInputColumns'] = True

        payload['keptInputColumns'] = unicode_columns
        settings.set_payload(json.dumps(payload))
        settings.save()

        print('Payload of the recipe:\n{0}'.format(settings.get_payload()))
    
    

    The additional setting will have the same effect as checking the “Input columns to include” checkbox in the scoring recipe UI.

    You may also want to take a look at the compute_schema_updates() function to see if it should be applied as well.

    Thanks,
    Sarina

Answers

  • sseveur (Partner, L2 Designer, Registered Posts: 16)
    edited July 17

    Thanks!

    Those two points were exactly what I was missing, especially compute_schema_updates().

    The code I used (same as on the wiki):

    required_updates = scoring_recipe_handle.compute_schema_updates()
    required_updates.apply()
