Scoring partitioned dataset

Newuser01 · May 2022

Hi,

I have created a partitioned dataset with 10 partitions, and I am now training 3 regression models on it.
What I have observed is that the model (out of the 3) that gives the best result differs from partition to partition.

For example: Partition 1 had best RMSE from model 2 while Partition 2 had best RMSE from model 1.

Is it possible, when scoring, to use for each partition the model that gave the best result on that partition? And can this be automated, instead of manually selecting the model and then selecting the partition?

Any help would be appreciated.

Thanks

tgb417 · May 2022

@Newuser01

Welcome to the Dataiku community.

One way might be NOT to use the built-in partitioning for building models, and instead to use a Split recipe to split the data along the partitions you currently have. Then, rather than creating one partitioned model, train the three candidate models on each of the resulting subsets. Once all are trained, pick the best model for each subset, where each subset corresponds to one partition of your original partitioned dataset.
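As a rough sketch of the "pick the best model per subset" step, here is some plain Python (not the Dataiku API). It assumes you have already collected the RMSE of each model on each subset into a dictionary; the names `rmse_by_partition` and `pick_best_models` and all the numbers are made up for illustration.

```python
# Hypothetical RMSE results: {partition: {model_name: rmse}}.
# The values are invented for illustration; in practice you would fill this
# in from the metrics of the models you trained on each subset.
rmse_by_partition = {
    "partition_1": {"model_1": 4.2, "model_2": 3.1, "model_3": 3.8},
    "partition_2": {"model_1": 2.5, "model_2": 2.9, "model_3": 3.0},
}


def pick_best_models(rmse_table):
    """Return {partition: model_name}, choosing the lowest RMSE per partition."""
    return {
        partition: min(scores, key=scores.get)
        for partition, scores in rmse_table.items()
    }


best_model_for = pick_best_models(rmse_by_partition)
print(best_model_for)
```

The resulting mapping (partition to model name) is the only thing the scoring side needs to know, so it can be stored and reused rather than chosen by hand each time.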

The next challenge will be how you operationalize your models. When submitting data for production scoring, you will need a split routine for incoming data that sends each new record to the right one of these separate models.
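The routing idea could look something like the sketch below. It is plain Python, not the Dataiku API: the "models" are stand-in functions, and `best_model_for`, `score_rows`, and the `partition` column name are all assumptions for illustration. In a real flow each entry would be one of your trained models.

```python
# Stand-in "models": plain functions that map a row to a prediction.
# These are placeholders for real trained models.
def model_1(row):
    return 2.0 * row["x"]


def model_2(row):
    return row["x"] + 10.0


models = {"model_1": model_1, "model_2": model_2}

# Hypothetical mapping from partition to the model chosen for it.
best_model_for = {"partition_1": "model_2", "partition_2": "model_1"}


def score_rows(rows, partition_key="partition"):
    """Score each row with the model selected for that row's partition."""
    scored = []
    for row in rows:
        model_name = best_model_for[row[partition_key]]  # KeyError if unseen partition
        prediction = models[model_name](row)
        scored.append({**row, "model_used": model_name, "prediction": prediction})
    return scored


incoming = [
    {"partition": "partition_1", "x": 1.0},
    {"partition": "partition_2", "x": 3.0},
]
scored = score_rows(incoming)
for row in scored:
    print(row)
```

Keeping the `model_used` column in the output makes it easier to check later which model produced each prediction. You would also need to decide what to do with records from a partition that was not seen at training time; this sketch simply fails on them.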

The next question I would have is how big a difference these different model types make. If the differences are large and important, then it might be worth the extra work and compute. If the differences are very small, you will need to decide whether the extra effort is worthwhile in your context.
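One simple way to put a number on that difference is to compare the single best model overall against the "best model per partition" setup. The sketch below is plain Python with invented RMSE values. It uses an unweighted average of RMSE across partitions, which is a simplification: it ignores partition sizes and is not the same as the RMSE computed on the pooled data.

```python
# Hypothetical RMSE results: {partition: {model_name: rmse}} (invented numbers).
rmse_by_partition = {
    "partition_1": {"model_1": 4.0, "model_2": 3.0},
    "partition_2": {"model_1": 2.0, "model_2": 3.5},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


model_names = sorted({name for scores in rmse_by_partition.values() for name in scores})

# Average RMSE of each model if it were used on every partition.
avg_rmse_single = {
    name: mean(scores[name] for scores in rmse_by_partition.values())
    for name in model_names
}
best_single_model = min(avg_rmse_single, key=avg_rmse_single.get)

# Average RMSE if each partition uses its own best model.
avg_rmse_per_partition_best = mean(
    min(scores.values()) for scores in rmse_by_partition.values()
)

# Relative improvement of the per-partition setup over the best single model.
relative_gain = 1 - avg_rmse_per_partition_best / avg_rmse_single[best_single_model]
print(best_single_model, avg_rmse_single[best_single_model])
print(avg_rmse_per_partition_best, round(relative_gain, 3))
```

If the relative gain comes out at only a percent or two, a single model is probably easier to maintain. If it is large, the extra split and routing work may pay off.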

Others here may have other ideas. Please jump in with your perspectives.
