Use Python script to predict test data with script steps

deguodedongxi Registered Posts: 4
edited July 16 in Using Dataiku

Hey everyone,

I deployed a model from the lab to my project's flow and I want to predict my test data.

I do not want to use the Score or Evaluate recipe. Instead, I want to use a Python Recipe for my predictions. In my lab, I defined some Script Steps, which added some extra columns.

If I used the Score recipe, those script steps would be applied as part of the model. When I use the model inside my Python recipe, the model pipeline does not seem to include those script steps, and I receive this error:

ValueError: The feature "xyz" doesn't exist in the dataset

I am using the following code:

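import dataiku

# Load the deployed (saved) model and get a predictor for it
model_1 = dataiku.Model("5Galy2dl")
pred_1 = model_1.get_predictor()

# Read the test data into a pandas dataframe
dataset_xy = dataiku.Dataset("dataset_xy")
df = dataset_xy.get_dataframe()

# Fails with the ValueError above: the columns created by the script steps are missing
pred_1.predict(df.head())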

I also tried accessing the lab's trained models. The model details contain the definition of the steps, but I found no interface for applying them to my incoming data:

import dataiku

# p is the current project, obtained through the public API client
client = dataiku.api_client()
p = client.get_default_project()

labs = p.list_ml_tasks()["mlTasks"]
mltask = p.get_ml_task(labs[0]['analysisId'], labs[0]['mlTaskId'])
model_ids = mltask.get_trained_models_ids()
model_detail = mltask.get_trained_model_details(model_ids[0])

print(model_detail.details)  # The details contain the steps information, but provide no means of applying it

Is there any way to apply my script steps without writing them by hand in python?

Thanks in advance and best regards,

Jakob

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @deguodedongxi,

    Indeed, script steps may not be included when you use get_predictor directly in code, so you would need to handle them yourself.
    You would need to deploy the script to the flow and apply it to your input dataset. Also, DSS casts all categorical features to str in a scoring recipe, so you need to convert any booleans and integers used as categorical features to strings in your input data before reading the dataframe in the Python recipe that uses get_predictor. This ensures the processing is the same as if it had been run in the scoring recipe.
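
As a rough illustration of the casting point above, here is a minimal pandas sketch. It casts inside the recipe, after the dataframe has been read, which is a variation on converting the input data upstream. The helper name `cast_categoricals_to_str` and the sample column names are hypothetical, and the exact string form DSS expects (for example "True" versus "true") is not stated in this thread, so check the result against the output of a scoring recipe.

```python
import pandas as pd


def cast_categoricals_to_str(df, categorical_columns):
    """Return a copy of df with the given columns converted to strings.

    Hypothetical helper: the list of categorical columns must be supplied
    by the caller, it is not read from the model.
    """
    out = df.copy()
    for col in categorical_columns:
        # Keep missing values missing instead of turning them into "nan".
        out[col] = out[col].map(lambda v: v if pd.isna(v) else str(v))
    return out


# Made-up sample data standing in for dataset_xy.get_dataframe()
sample = pd.DataFrame({
    "is_active": [True, False],
    "store_id": [101, 102],
    "amount": [9.5, 3.0],
})
casted = cast_categoricals_to_str(sample, ["is_active", "store_id"])
print(casted.dtypes)
```

The casted dataframe would then be passed to `pred_1.predict(...)` in place of the raw one, after the columns produced by the script steps have been added.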

    Another way to handle this would be to first deploy your model to an API endpoint and then call that endpoint from a Python step in a Prepare recipe to run the prediction. This ensures that all the correct casting is done and the script steps are run, without you having to handle this yourself.
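
A minimal sketch of the endpoint approach, assuming the model is already deployed as a prediction endpoint on an API node. The function takes any client object exposing `predict_record(endpoint_id, features)`, which is the method offered by `dataikuapi.APINodeClient`. The `"result"` key in the response, the endpoint id, the URL and the service id are assumptions for illustration, and `FakeClient` exists only so the snippet runs without an API node.

```python
def score_records(client, endpoint_id, records):
    """Send each record (a dict of raw feature values) to a prediction endpoint.

    Assumes each response is a dict with a "result" entry, so verify this
    against the actual response of your endpoint.
    """
    results = []
    for record in records:
        response = client.predict_record(endpoint_id, record)
        results.append(response["result"])
    return results


class FakeClient:
    """Stand-in for an API node client, used only to exercise the function."""

    def predict_record(self, endpoint_id, features):
        return {"result": {"prediction": "yes", "features_seen": sorted(features)}}


# With a real API node, the client would instead be built along these lines
# (hypothetical URL and service id):
#   import dataikuapi
#   client = dataikuapi.APINodeClient("https://apinode.example.com:12443", "my_service")
demo = score_records(FakeClient(), "my_endpoint", [{"a": 1, "b": 2}])
print(demo)
```

Because the records are sent raw, the script steps and the casting would be applied on the endpoint side rather than in the calling code.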



    Thanks,


  • deguodedongxi
    deguodedongxi Registered Posts: 4

    Thanks for your answer, Alex.

    Deploying a model directly from the lab is sometimes a bit inconvenient, because we need to do some post-processing. As our license only allows a limited number of endpoints, adding extra Python function endpoints is not an option.

    Doing all the transformation tasks in the Python endpoint "by hand" can get messy, but I don't see any way around it for now.

    Maybe consider adding the preprocessing steps to the predictor class in a future release.

    Best regards,

    Jakob
