Optimization for predict recipe and method

Hi,

I've recently been working with the predict method from the dataiku package to run predictions on a dataframe, and it seems we cannot feed the recipe with only the variables needed to run the prediction.

We train our model on a table with 500+ variables, and we do a variable selection to keep only the top 50. When deploying the model and using it to run predictions, however, it looks like Dataiku expects the prediction table to contain all 500+ variables, even though it only needs the 50. That in itself is not much of a problem (we produce those variables anyway). But when computing predictions, it looks like the preprocessing is done on all variables before only the 50 selected ones are used to produce the output.

So one idea to improve prediction performance for this kind of task would be to run the preprocessing only on the variables a model actually selected. That would also let users further optimize upstream data management processes, since the list of selected variables could be used to restrict what gets computed.
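
To make the idea concrete, here is a minimal sketch in plain Python. It does not use the dataiku package or reflect how Dataiku implements preprocessing internally; `preprocess`, the `var_*` column names and the min-max rescaling are all hypothetical stand-ins I made up for illustration. It only shows that, when preprocessing is column-wise, restricting to the selected variables first gives the same result while touching far fewer columns.

```python
# Hypothetical illustration only: none of these names come from the dataiku package.

def preprocess(rows, columns):
    """Toy column-wise preprocessing: min-max rescale each listed column.

    Returns the rescaled rows and the number of columns that were processed.
    """
    out = [dict() for _ in rows]
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid dividing by zero on constant columns
        for new_row, value in zip(out, values):
            new_row[col] = (value - lo) / span
    return out, len(columns)


all_columns = [f"var_{i}" for i in range(500)]
selected = all_columns[:50]  # stand-in for the model's top 50 variables

# Three fake rows with 500 numeric variables each.
rows = [{col: float(r + i) for i, col in enumerate(all_columns)} for r in range(3)]

# What seems to happen today: preprocess everything, then keep the selection.
full, cost_full = preprocess(rows, all_columns)
full_selected = [{col: row[col] for col in selected} for row in full]

# Suggested behaviour: restrict to the selected variables, then preprocess.
restricted, cost_restricted = preprocess(rows, selected)

assert full_selected == restricted  # same model inputs either way
print(cost_full, cost_restricted)   # prints: 500 50
```

Of course this only holds when each variable is preprocessed independently; steps that combine several input columns would need their source columns kept as well.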

Let me know what you guys think, or if I'm missing something 😊

Cheers,

Pierre.