Manual Mappings in Multi-class Classification
I'm looking for some help/advice on how to use the manual mappings on the multi-class classifier. Here's the setup:
I have a dataset where the target variable is given in percentages, so immediately you would guess this is a regression problem. However, the actual values are discretized in increments of 10%, so if I used regression I'd have to post-process and round to the nearest 10% anyway. I decided to try a classifier since, with only 10 classes to deal with at that point, it should be an easy win. When I configure a multi-class classifier, it gives me a preview graph of the target classes with their proportions. However, some of my classes are missing (for example, I have 214 examples of 40%, but it does not appear in this preview at all). If I train the model with this configuration, being careful to select no sampling in training (to guarantee all data is used), the 40% class is completely missing from the confusion matrix, and the training logs contain no complaints about the missing classes.
In the setup screen, below the class proportions, there is a hyperlink that says "manually edit the mapping (advanced)". If I use this link, I get a box where some JSON can be entered, pre-populated with the current classes and ratios (though the values of the ratios make no sense). It would seem a simple task, then, to add the missing classes here and, voilà, problem solved. However, when I try to add my classes here I have a few problems: a) I don't know what these ratio numbers actually represent, so I'm not sure what values to give. b) They aren't given as percentages, as shown in the graph above, and they don't match the actual number of available samples. c) If I try a few values to see if I can "fake it till I make it," the training fails in every case.
Here are the reasons I think this is happening: a) my dataset is extremely imbalanced, with the 0% class much bigger than all the others. b) To fix the imbalance I am running some Python to select all non-zero-percent samples (up to 10k examples) and then randomly selecting an equal number of 0% samples. c) I write the zero-percent items first, then the positive cases. This means the data is in its natural order, and when the model takes a guess sample, it sees the negative cases first. Critically, I don't think it is using the actual settings for model sampling (the train/test set link on the left), because in that location I have selected no sampling (since I have already subsampled the data). It would therefore seem that the sampling used to determine the valid class labels is not something I can change, and the configuration is therefore broken.
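For reference, my selection logic is roughly the following (a sketch; the file and column names here are illustrative):

import pandas as pd

df = pd.read_csv("raw_training_data.csv")  # illustrative path

# Keep every non-zero example, capped at 10k.
positives = df[df["target_pct"] != "0%"].head(10000)

# Randomly pick an equal number of 0% rows to balance the classes.
negatives = df[df["target_pct"] == "0%"].sample(n=len(positives), random_state=42)

# Zero-percent rows are written first, then the positive cases,
# so any sample taken from the head of the file sees only negatives.
pd.concat([negatives, positives]).to_csv("balanced_training_data.csv", index=False)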
To make matters worse, this is a partitioned model, and I have several partitions. The datasets for each partition are built sequentially, meaning that any classes that only appear in later partitions, and are not represented in the first one or two, are excluded completely.
I've searched around a bit and can find no documentation of this feature anywhere. Can anyone help? Is there a different way to configure this? My fear is that even if I succeed in "hacking" the configuration, it will become permanent, meaning that when I retrain the model, the ratios and class identifiers will be permanently set to whatever I configured in the manual mapping field.
Thanks,
-Jason
Operating system used: Windows
Answers
Hi Jason,
You are correct that editing this JSON is the way to achieve what you are looking for.
The reason some of your classes are missing from this JSON is that Dataiku attempts to guess good parameters for your model to save you time, and for this guessing it uses a small sample of your training data. If your training data is very unbalanced, this guess sample might not contain all your classes, so they are not included in the mapping. Thankfully, you can manually edit the mapping, via this JSON, to correct exactly this situation.
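Before editing the mapping, it may be worth double-checking that every class really is present in your training data, for example with a quick count in Python (assuming the data is accessible as a pandas DataFrame; the names are illustrative):

import pandas as pd

df = pd.read_csv("balanced_training_data.csv")  # illustrative path

# One row per class; any class absent from this listing
# cannot appear in the generated mapping either.
print(df["target_pct"].value_counts().sort_index())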
The JSON is a list of objects, with the following keys:
- sourceValue: A value from your dataset.
- mappedValue: An incrementing number that the source value will be transformed into.
- sampleFreq: The number of times this value was seen in the guess sample.
So, to correctly edit this JSON, you need to make sure that:
- For every class in your data, you have one JSON object with the appropriate sourceValue.
- Every JSON object has a different mappedValue, and these are incrementing numbers (0, 1, 2, 3, ...).
- The value of sampleFreq doesn't matter; you can set it to any positive number. It is not used during training, only to render the chart that you see on this page.
When retraining your model, the same mapping file will indeed be used, but in your case this should be fine as long as you always have the same classes (increments of 10%). If, for example, you changed to increments of 5%, you would need to rewrite the mapping JSON. The frequencies of the classes in your data can change freely, however, since the sampleFreq values are not used during training.
Here is an example of a valid mapping. I'm not sure how your classes are represented in your dataset, so you will need to modify the sourceValue entries in this file to match your data, and add enough objects to cover all of your classes.
[ { "sourceValue": "0%", "mappedValue": 0, "sampleFreq": 660 }, { "sourceValue": "10%", "mappedValue": 1, "sampleFreq": 357 }, { "sourceValue": "20%", "mappedValue": 2, "sampleFreq": 217 }, { "sourceValue": "30%", "mappedValue": 3, "sampleFreq": 35 } ]
I hope that helps. If anything is not clear, please feel free to ask. And if you have any further issues then please post your JSON here.