Currently I am using one of the columns in my training set to weight the samples. To my surprise, I saw that these weights are also used when scoring the test set. As I created the weights based on occurrence in the training set, samples that do not occur in the training set are not scored.
Could you elaborate on whether this is the expected behaviour in Dataiku, and whether you see an easy way for me to turn it off?
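To make the setup concrete, here is a minimal sketch of what "weights based on occurrence in the training set" could look like, assuming inverse-frequency weights computed per class. This is an illustrative reconstruction, not the poster's actual recipe or Dataiku's internal logic:

```python
# Hypothetical example: build a per-sample weight column from class
# occurrence counts in the training set (inverse frequency).
from collections import Counter

train_labels = ["a", "a", "a", "b"]   # toy training labels

counts = Counter(train_labels)        # {"a": 3, "b": 1}
n = len(train_labels)

# Rare classes get larger weights; classes absent from the training
# set would have no entry in `counts`, hence no weight at scoring time.
weights = [n / counts[y] for y in train_labels]
print(weights)
```

A weight column built this way is only defined for values seen during training, which is why rows with unseen values cannot be weighted (or, as reported above, scored) at test time.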
I tried to reproduce your issue but didn't succeed; could you provide a bit more context to help me? To start, I would need, if possible:
Thanks in advance,
- I have Dataiku version 8.0.2 installed.
- I split the training and test sets before opening the SVM model recipe. For the weighting strategy I selected 'Class and sample weights'.
- For train and test I selected 'Explicit extracts from two datasets', with no sampling method for both.
It already surprised me that, even though I did not select the sample weight column as a feature for my model, I was unable to train the model: I got an error message saying that the column was not present in the test set. After training the model and looking into the predicted data, I saw the issue.
Thanks for the answer. Just for clarification: is your issue linked to the fact that the weights column appears in the dataset displayed in the "predicted data" tab of your model report? In that case, this dataset is just the test set with the prediction columns added to it, so it's normal that you see your weights column in it.
Have you tried scoring new data using a Score recipe and a dataset that doesn't have the weight column in it? That works fine on my 8.0.2 instance, and the weights column is not needed.
Hi Alexandre, you are right: if I score the test set in the flow, it works as expected.
However, I still think the results shown in the training recipe indicate a bug. The weights are used to generate the confusion matrix as well, and I suspect this also happens when calculating the metrics used to evaluate the test set.
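To illustrate what "the weights are used to generate the confusion matrix" means in practice, here is a small self-contained sketch of a weighted versus unweighted binary confusion matrix. The function and the toy data are assumptions for illustration only, not Dataiku's actual implementation:

```python
# Illustrative sketch: each sample contributes its weight `w` to a
# confusion-matrix cell instead of a count of 1.
def confusion(y_true, y_pred, weights=None):
    """Return a 2x2 matrix [[TN, FP], [FN, TP]] for binary labels 0/1."""
    if weights is None:
        weights = [1.0] * len(y_true)
    m = [[0.0, 0.0], [0.0, 0.0]]
    for t, p, w in zip(y_true, y_pred, weights):
        m[t][p] += w
    return m

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
w      = [1.0, 5.0, 1.0, 1.0]   # one misclassified sample heavily weighted

print(confusion(y_true, y_pred))      # unweighted: [[1.0, 1.0], [0.0, 2.0]]
print(confusion(y_true, y_pred, w))   # weighted:   [[1.0, 5.0], [0.0, 2.0]]
```

Because the weighted false-positive cell grows from 1.0 to 5.0, any metric derived from the matrix (precision, recall, accuracy) changes too, which is why weighting the test-set evaluation can look like a bug when the weights were only meant for training.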