Currently I am using one of the columns in my training set to weight the samples. To my surprise, I saw that these weights are also used when scoring the test set. As I created the weights based on occurrence in the training set, samples that do not occur in the training set are not scored.
Could you elaborate on whether this is the expected behaviour in Dataiku, and whether you see an easy way for me to turn it off?
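To make the setup concrete, here is a minimal sketch of what "weights based on occurrence in the training set" could look like, assuming inverse-frequency weights computed per class. This is an illustrative reconstruction, not the poster's actual recipe or Dataiku's internal logic:

```python
# Hypothetical example: build a per-sample weight column from class
# occurrence counts in the training set (inverse frequency).
from collections import Counter

train_labels = ["a", "a", "a", "b"]   # toy training labels

counts = Counter(train_labels)        # {"a": 3, "b": 1}
n = len(train_labels)

# Rare classes get larger weights; classes absent from the training
# set would have no entry in `counts`, hence no weight at scoring time.
weights = [n / counts[y] for y in train_labels]
print(weights)
```

A weight column built this way is only defined for values seen during training, which is why rows with unseen values cannot be weighted (or, as reported above, scored) at test time.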
I tried to reproduce your issue but didn't succeed; could you provide a bit more context to help me? To start, I would need, if possible:
Thanks in advance,
- I have Dataiku version 8.0.2 installed.
- I split the training and test sets before opening the SVM model recipe. For the weighting strategy I selected 'Class and sample weights'.
- For train and test I selected 'Explicit extracts from two datasets', with no sampling method for both.
It already surprised me that, even though I did not select the sample weight column as a feature for my model, I was unable to train the model: I got an error message saying that the column was not present in the test set. After training the model and looking into the predicted data, I saw the issue.
Thanks for the answer. Just for clarification: is your issue linked to the fact that the weights column appears in the dataset displayed in the "predicted data" tab of your model report? In that case, this dataset is just the test set with the prediction columns added to it, so it's normal that you see your weights column in it.
Have you tried scoring new data using a Score recipe and a dataset that doesn't have the weight column in it? That works fine on my 8.0.2 instance, and the weights column is not needed.
Hi Alexandre, you are right: if I score the test set in the flow, it works as expected.
However, I still think the results shown in the training recipe indicate a bug. The weights are used to generate the confusion matrix as well, and I suspect this also happens when calculating the metrics used to evaluate the test set.
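To illustrate what "the weights are used to generate the confusion matrix" means in practice, here is a small self-contained sketch of a weighted versus unweighted binary confusion matrix. The function and the toy data are assumptions for illustration only, not Dataiku's actual implementation:

```python
# Illustrative sketch: each sample contributes its weight `w` to a
# confusion-matrix cell instead of a count of 1.
def confusion(y_true, y_pred, weights=None):
    """Return a 2x2 matrix [[TN, FP], [FN, TP]] for binary labels 0/1."""
    if weights is None:
        weights = [1.0] * len(y_true)
    m = [[0.0, 0.0], [0.0, 0.0]]
    for t, p, w in zip(y_true, y_pred, weights):
        m[t][p] += w
    return m

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
w      = [1.0, 5.0, 1.0, 1.0]   # one misclassified sample heavily weighted

print(confusion(y_true, y_pred))      # unweighted: [[1.0, 1.0], [0.0, 2.0]]
print(confusion(y_true, y_pred, w))   # weighted:   [[1.0, 5.0], [0.0, 2.0]]
```

Because the weighted false-positive cell grows from 1.0 to 5.0, any metric derived from the matrix (precision, recall, accuracy) changes too, which is why weighting the test-set evaluation can look like a bug when the weights were only meant for training.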