
Sample weights scoring on test set

joostjansenn
Level 1

Dear Dataiku,

I am currently using one of the columns in my training set to weight the samples. To my surprise, these sample weights are also used when scoring the test set. Because I created the weights based on occurrence in the training set, samples that are not present in the training set are not scored.

Could you explain whether this is the expected behaviour in Dataiku, and whether there is an easy way for me to turn it off?

 

 

4 Replies
AlexandreL
Dataiker

Hi, 

I tried to reproduce your issue but could not. Could you provide a bit more context to help me? To start, I would need, if possible:

  • your DSS version
  • the model algorithm on which the issue occurs
  • the settings you used in the "Weighting Strategy" section when you trained your model

Thanks in advance,

Alex

joostjansenn
Level 1
Author

Hi Alexandre,

- I have Dataiku DSS version 8.0.2 installed.

- I split the training and test sets before opening the SVM model recipe. For the Weighting Strategy I selected 'class and sample weights'.

- For train and test I selected 'Explicit extracts from two datasets', with 'no sampling method' for both.

I was already surprised that, although I had not selected the sample weight column as a feature for my model, I could not train the model: I got an error message saying the column was not present in the test set. After I trained the model and looked at the predicted data, I saw the issue.

 

AlexandreL
Dataiker

Thanks for the answer. Just to clarify: is your issue that the weights column appears in the dataset displayed in the "predicted data" tab of your model report? If so, that dataset is simply the test set with the prediction columns added to it, so it is normal to see your weights column there.

Have you tried scoring new data using a Score recipe and a dataset that doesn't have the weights column in it? That works fine on my 8.0.2 instance, and the weights column is not needed.

joostjansenn
Level 1
Author

Hi Alexandre, you are right: if I score the test set in the Flow, it works as expected.

However, I still think the results shown in the training recipe are a bug. The weights are used to generate the confusion matrix as well, and I suspect the same happens when calculating the metrics used to evaluate the test set.
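
To make the concern concrete, here is a minimal plain-Python sketch of the effect described in this thread. It is not Dataiku's implementation. The helper names (`occurrence_weights`, `confusion_matrix`), the `group` values and the labels are all hypothetical, and it assumes that a test row whose value never occurred in the training set ends up with a weight of 0.

```python
from collections import Counter


def occurrence_weights(train_keys):
    """Weight each key by its relative frequency in the training set."""
    counts = Counter(train_keys)
    total = len(train_keys)
    return {key: n / total for key, n in counts.items()}


def confusion_matrix(y_true, y_pred, labels, weights=None):
    """Return {(actual, predicted): summed weight}; each row counts as 1 if no weights are given."""
    if weights is None:
        weights = [1.0] * len(y_true)
    matrix = {(a, p): 0.0 for a in labels for p in labels}
    for actual, predicted, w in zip(y_true, y_pred, weights):
        matrix[(actual, predicted)] += w
    return matrix


# Hypothetical data: "group" stands for the column the weights were derived from.
train_groups = ["a", "a", "a", "b"]
weight_by_group = occurrence_weights(train_groups)

test_groups = ["a", "b", "c"]  # "c" never occurs in the training set
y_true = [1, 0, 1]
y_pred = [1, 1, 0]

# Assumption: a group unseen during training gets weight 0 on the test set.
test_weights = [weight_by_group.get(g, 0.0) for g in test_groups]

weighted = confusion_matrix(y_true, y_pred, labels=[0, 1], weights=test_weights)
unweighted = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("weighted:  ", weighted)
print("unweighted:", unweighted)
```

Under that assumption, the misclassified "c" row contributes nothing to the weighted matrix, while it counts as one full error in the unweighted one. Any metric computed from the weighted matrix would therefore ignore test samples that were unseen in training.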
