# Custom Code Metric


I'm attempting to make the best use of the "Custom Code" option for hyperparameter optimization and have a few questions. For reference, here are the comments on how to write the custom function:

```
# - y_pred is a numpy ndarray with shape:
#   - (nb_records,) for regression problems and classification problems
#     where 'needs probas' (see below) is false
#     (for classification, the values are the numeric class indexes)
#   - (nb_records, nb_classes) for classification problems where 'needs probas' is true
```
• No real issues when "needs_probas" is false. By appending y_pred as a column to X_valid, I'm able to see which rows were predicted as "True" for my binary classification problem (for a given threshold)

With "needs_probas" set to true, I run into some problems.

• It appears that the shape of y_pred is different depending on whether the model is training or scoring. Here's the code I've implemented that seems to solve the problem (again for a binary classification problem). I'm wondering if this should be necessary or if I'm missing something?
```
if len(np.shape(y_pred)) == 2:
    # scoring the model
    ds['probas'] = y_pred[:, 1]
else:
    # training the model
    ds['probas'] = y_pred[0:]
```
• With "needs_probas" false, the scoring seems to depend on the threshold (a row's prediction will be "true" when the proba is above the threshold), whereas with "needs_probas" true, it appears that the threshold is not provided to the scoring function. Is this correct and expected, or am I missing something? Maybe the "needs_threshold" property just isn't implemented in DSS for binary classification? (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)

1 Solution
Dataiker

Hi,

You have indeed stumbled on a not-so-nice behavior of the custom scoring handling: it passes the full output of `predict_proba` when doing the final scoring, but only the second column (the positive case) when doing hyperparameter search and k-fold. Your solution is essentially the best one can come up with.

You needn't worry about the threshold, as it is computed after scoring with your code: DSS will try different threshold values and call your scoring code with a single column corresponding to the positive case.
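For illustration, the workaround can be folded into a small helper so the metric itself never has to care which shape it received. This is only a sketch under a few assumptions: the helper name `positive_class_probas` is made up for this example, the entry point is assumed to be a function `score(y_valid, y_pred)` (the exact signature may differ between DSS versions), `y_valid` is assumed to hold 0/1 class indexes, and the negative Brier score is just an arbitrary example metric.

```python
import numpy as np


def positive_class_probas(y_pred):
    """Return a 1-D array of positive-class probabilities.

    Hypothetical helper: accepts either shape (nb_records,) or
    (nb_records, 2), the two shapes described in this thread.
    """
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 2:
        # Full predict_proba-style output: column 1 is the positive class.
        return y_pred[:, 1]
    # Already a single column of positive-class probabilities.
    return y_pred


def score(y_valid, y_pred):
    """Example metric (an assumption, not a DSS built-in):
    negative Brier score, so that higher is better."""
    probas = positive_class_probas(y_pred)
    y_true = np.asarray(y_valid, dtype=float)  # assumes 0/1 class indexes
    return -float(np.mean((probas - y_true) ** 2))
```

Called with the full two-column array or with only its second column, `score` should return the same value, which is the point of the normalization step.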

Dataiker

Hi @rmoore,

This has been fixed in release 8.0.2.

More precisely, the custom metric function can now correctly assume a `y_pred` shape of `(N, 2)` in the case of binary classification with `needs_proba == True`, when performing a hyperparameters search.

Cheers
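On 8.0.2 or later, a metric can therefore rely on the `(N, 2)` shape instead of branching on it. Below is a minimal sketch of what that might look like. The `score(y_valid, y_pred)` signature, the 0/1 encoding of `y_valid`, and the choice of negative log loss are all assumptions made for the example; the explicit shape check is there only to fail loudly if the code runs on an older release.

```python
import numpy as np


def score(y_valid, y_pred):
    """Example metric (an assumption): negative log loss, higher is better.

    Expects y_pred of shape (N, 2), as described for binary classification
    with needs_proba enabled on DSS 8.0.2+.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim != 2 or y_pred.shape[1] != 2:
        raise ValueError(
            "expected y_pred of shape (N, 2), got %s" % (y_pred.shape,)
        )

    y_true = np.asarray(y_valid, dtype=float)  # assumes 0/1 class indexes
    # Clip to avoid log(0) on fully confident predictions.
    p_pos = np.clip(y_pred[:, 1], 1e-15, 1 - 1e-15)
    log_loss = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
    return -float(log_loss)
```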