Probability calibration in Dataiku
Hello,
I'm working with a client who needs probability calibration in Dataiku. You can learn about probability calibration from the sklearn documentation. Basically, I need to instantiate an object of the class sklearn.calibration.CalibratedClassifierCV from the classifier trained in Dataiku.
My understanding is that the only way to do this is by creating a custom Python model. This is not the best option since the users are not Python developers and are only using Dataiku through the visual machine learning interface.
Did I miss another way to modify a classifier in Dataiku once it is trained?
Answers
-
Hello Simon,
It's true that, as of the current version of DSS (4.2), you need to use a Custom Python model in the machine learning interface to leverage scikit-learn's probability calibration.
Even if the users are not Python developers, it is just a matter of writing 4 lines of code in the Custom Python model editor, for instance:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.calibration import CalibratedClassifierCV

# base classifier to calibrate
c = AdaBoostClassifier(n_estimators=20)
# calibrated wrapper (isotonic regression, 2-fold CV) used as the model
clf = CalibratedClassifierCV(c, cv=2, method='isotonic')
Note that it can be used as a snippet of code to be reused and adapted easily by your client.
-
Note that logistic regression is calibrated by default: http://scikit-learn.org/stable/modules/calibration.html. This is also what we apply in DSS, which is based on scikit-learn.
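For illustration, here is a small standalone sketch (my own, on a synthetic dataset rather than a DSS dataset) comparing the calibration of logistic regression with a boosted model, using sklearn.calibration.calibration_curve:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("AdaBoost", AdaBoostClassifier(n_estimators=20))]:
    proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    # For a well-calibrated model, the fraction of positives in each bin
    # is close to the mean predicted probability in that bin.
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))

Typically the logistic regression pairs sit close to the diagonal, while the boosted model's probabilities benefit more from explicit calibration.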
-
Hello,
Thanks for the reply. I made it work, and I'll explain to my client how to use and tune it, but I'm having a hard time understanding how to use CalibratedClassifierCV with GridSearchCV.
Can you confirm that in the code below, grid search with cross validation will be performed before model calibration?
Furthermore, I find it strange to calibrate the model after the grid search. Shouldn't the calibration process be performed at each iteration of the grid search? Or is it that the calibration process is monotonic, so it only has an impact on the threshold?
import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.grid_search import GridSearchCV
# set search grid parameters here
grid_parameters = {
    'n_estimators': [100, 300],
    'max_depth': [1, 2, 5],
    'learning_rate': [.02, .1, .2],
    'colsample_bytree': [.5, .75, 1]
}
# instantiate base classifier here
clf_base = xgb.XGBClassifier()
# set cross validation parameters here
clf_grid = GridSearchCV(
    clf_base,
    param_grid=grid_parameters,
    scoring='roc_auc',
    fit_params=None,
    n_jobs=2,
    iid=True,
    refit=True,
    cv=5,
    verbose=0,
    pre_dispatch='2*n_jobs'
)
# calibration
clf = CalibratedClassifierCV(clf_grid, cv=5)
-
When passing a "clf" object in the custom Python models screen, we call the fit method on the entire object. So it will fit the full pipeline of CalibratedClassifier(GridSearchCV(clf_base))). Then the fitted pipeline is applied (by the predict method) to the test set.
Note that the scikit-learn documentation advises *not* to use the same data for fitting and calibration: http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html. So the cleanest approach from a statistical point of view is not to pipeline CalibratedClassifierCV with a classifier on the same data. Instead, you can train from the visual interface, then export to a Jupyter notebook, and use the notebook as a starting template to calibrate your classifier on new data.
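As an example of that workflow, here is a minimal sketch (mine, with a synthetic dataset standing in for the model exported from the DSS notebook and for the fresh calibration data): fit the classifier on one split, then calibrate it with cv="prefit" on data it has never seen, so only the calibration mapping is fitted.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, test_size=0.3, random_state=0)

# In practice this would be the classifier exported from the DSS notebook.
base = AdaBoostClassifier(n_estimators=20).fit(X_train, y_train)

calibrated = CalibratedClassifierCV(base, cv="prefit", method="isotonic")
calibrated.fit(X_calib, y_calib)              # only the calibration mapping is fitted here
print(calibrated.predict_proba(X_calib[:5]))  # calibrated probabilities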
Another, simpler option when score calibration is important is to advise the users to use logistic regression.