Probability calibration in Dataiku

UserBird (Dataiker, Alpha Tester)

Hello,

I'm working with a client that needs probability calibration in Dataiku. You can learn about probability calibration from the sklearn documentation. Basically, I need to instantiate an object of the class sklearn.calibration.CalibratedClassifierCV from the classifier trained in Dataiku.
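
For reference, a minimal standalone sketch of what that would look like outside Dataiku (the synthetic dataset and the random forest base classifier below are placeholders, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=1000, random_state=0)

# Any scikit-learn classifier can be wrapped; a random forest is used here as an example
base = RandomForestClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=3)
calibrated.fit(X, y)

# Calibrated class probabilities
proba = calibrated.predict_proba(X)[:, 1]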

My understanding is that the only way to do this is by creating a custom Python model. This is not the best option since the users are not Python developers and are only using Dataiku through the visual machine learning interface.

Did I miss another way to modify a classifier in Dataiku once it is trained?

Answers

  • Thomas (Dataiker Alumni)
    edited July 17

    Hello Simon,

    It's true that, as of the current version of DSS (4.2), you need to use a Custom Python model in the machine learning interface to leverage scikit-learn's probability calibration.

    Even if the users are not Python developers, it is just a matter of writing 4 lines of code in the Custom Python model editor, for instance:


    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.calibration import CalibratedClassifierCV

    c = AdaBoostClassifier(n_estimators=20)
    clf = CalibratedClassifierCV(c, cv=2, method='isotonic')

    [Full screenshot of the Custom Python model editor omitted]

    Note that it can be used as a snippet of code to be reused and adapted easily by your client.
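
    For example, adapting the snippet to a different base classifier or calibration method only means changing the two lines that build the model (the random forest here is just an illustrative choice):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.calibration import CalibratedClassifierCV

    # Wrap a different base classifier and use sigmoid (Platt) calibration instead of isotonic
    c = RandomForestClassifier(n_estimators=100)
    clf = CalibratedClassifierCV(c, cv=3, method='sigmoid')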

  • Alex_Combessie (Alpha Tester, Dataiker Alumni)
    Note that logistic regression is calibrated by default: http://scikit-learn.org/stable/modules/calibration.html. This is also what we apply in DSS based on scikit-learn.
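    One way to check this empirically is to look at a reliability curve with sklearn.calibration.calibration_curve; a quick sketch (dataset and model chosen arbitrarily for illustration):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.calibration import calibration_curve

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]

    # Fraction of positives vs. mean predicted probability per bin;
    # for a well-calibrated model the two stay close to each other
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)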
  • UserBird (Dataiker, Alpha Tester)
    Hello,

    Thanks for the reply. I made it work, and I'll explain to my client how to use and tune it, but I'm having a hard time understanding how to use CalibratedClassifierCV with GridSearchCV.

    Can you confirm that in the code below, grid search with cross validation will be performed before model calibration?

    Furthermore, I find it odd to calibrate the model after the grid search. Shouldn't the calibration step be performed at each iteration of the grid search? Or maybe the calibration process is monotonic, so it only has an impact on the threshold?



    import xgboost as xgb
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.grid_search import GridSearchCV

    # set search grid parameters here
    grid_parameters = {
        'n_estimators': [100, 300],
        'max_depth': [1, 2, 5],
        'learning_rate': [.02, .1, .2],
        'colsample_bytree': [.5, .75, 1]
    }

    # instantiate base classifier here
    clf_base = xgb.XGBClassifier()

    # set cross validation parameters here
    clf_grid = GridSearchCV(
        clf_base,
        param_grid=grid_parameters,
        scoring='roc_auc',
        fit_params=None,
        n_jobs=2,
        iid=True,
        refit=True,
        cv=5,
        verbose=0,
        pre_dispatch='2*n_jobs'
    )

    # calibration
    clf = CalibratedClassifierCV(clf_grid, cv=5)
  • Alex_Combessie (Alpha Tester, Dataiker Alumni)
    When passing a "clf" object in the custom Python models screen, we call the fit method on the entire object. So it will fit the full pipeline CalibratedClassifierCV(GridSearchCV(clf_base)). Then the fitted pipeline is applied (via the predict method) to the test set, as illustrated below.
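
    Roughly, the training amounts to something like the following (X_train/X_test stand for the train and test splits that DSS builds; the names are illustrative):

    # DSS calls fit on the whole object. In scikit-learn, CalibratedClassifierCV
    # fits a clone of its base estimator (here the grid search) on each of its cv
    # folds and learns the calibration mapping on the held-out part of that fold.
    clf.fit(X_train, y_train)

    # The fitted pipeline is then used to score the test set
    y_proba = clf.predict_proba(X_test)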

    Note that the scikit-learn documentation advises *not* to use the same data for fitting and calibration: http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html. So the cleanest approach from a statistical point of view would be not to chain CalibratedClassifierCV and the classifier on the same data. Instead, you can train from the visual interface, then export to a Jupyter notebook and use the notebook as a starting template to calibrate your classifier on new data, as sketched below.
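
    A sketch of that notebook workflow, assuming fitted_model is the classifier exported from DSS and new_X/new_y is a fresh calibration dataset (all three names are hypothetical placeholders):

    from sklearn.calibration import CalibratedClassifierCV

    # cv='prefit' tells scikit-learn that fitted_model is already trained,
    # so only the calibration step is fitted, and only on the new data
    calibrated = CalibratedClassifierCV(fitted_model, cv='prefit', method='isotonic')
    calibrated.fit(new_X, new_y)

    # Calibrated probabilities from the wrapped model
    calibrated_proba = calibrated.predict_proba(new_X)[:, 1]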

    Another, simpler option when score calibration is important is to advise the users to use logistic regression.