
Custom Code Metric

I'm attempting to make the best use of the "Custom Code" option for hyperparameter optimization and have a few questions. For reference, here are the comments on how to write the custom function:

```
# - y_pred is a numpy ndarray with shape:
#     - (nb_records,) for regression problems and classification problems
#       where 'needs probas' (see below) is false
#       (for classification, the values are the numeric class indexes)
#     - (nb_records, nb_classes) for classification problems where 'needs probas' is true
```

- No real issues when "needs_probas" is false. By appending y_pred as a column to X_valid, I'm able to see which rows were predicted as "True" for my binary classification problem (for a given threshold)

With "needs_probas" set to true, I run into some problems.

- It appears that the shape of y_pred is different depending on whether the model is training or scoring. Here's the code I've implemented that seems to solve the problem (again for a binary classification problem). I'm wondering if this should be necessary or if I'm missing something?

```
if len(np.shape(y_pred)) == 2:
    # scoring the model: y_pred is 2-D, so keep the positive-class column
    ds['probas'] = y_pred[:, 1]
else:
    # training the model: y_pred is already one-dimensional
    ds['probas'] = y_pred
```

- With "needs_probas" false, the scoring seems to depend on the threshold (a row's prediction will be "true" when the proba is above a threshold), whereas with "needs_probas" true, it appears that the threshold is not provided to the scoring function. Is this correct and expected, or am I missing something? Maybe the "needs_threshold" property just isn't implemented in DSS for binary classification? (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)

1 Solution


Hi,

You have indeed stumbled on a not-so-nice behavior of the custom scoring handling: it passes the full output of `predict_proba` when doing the final scoring, but only the second column (the positive case) when doing hyperparameter search and k-fold. Your solution is essentially the best one can come up with.

You needn't worry about the threshold, as it is computed after scoring with your code: DSS will try different threshold values and call your scoring code with a single column corresponding to the positive case.
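For anyone hitting the same shape mismatch, here is a minimal sketch of a helper that accepts either form of `y_pred` for a binary classification problem. The name `positive_class_proba` is hypothetical (it is not part of DSS), and the sketch assumes the positive class is the last column whenever the array is two-dimensional.

```python
import numpy as np

def positive_class_proba(y_pred):
    """Return a 1-D array of positive-class probabilities (binary case).

    Hypothetical helper, not a DSS API. Assumes that a 2-D input has the
    positive class in its last column.
    """
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 2:
        # Full probability matrix, e.g. shape (nb_records, 2)
        return y_pred[:, -1]
    # Already a single column of positive-class probabilities
    return y_pred
```

Calling this at the top of a custom `score` function means the rest of the metric code can work on a one-dimensional array in both situations.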


6 Replies



Hi @rmoore ,

This has been fixed in release 8.0.2.

More precisely, the custom metric function can now correctly assume a `y_pred` shape of `(N, 2)` in the case of binary classification with `needs_proba == True`, when performing a hyperparameter search.

Cheers


Hello, I am trying to create a custom metric that returns the precision score for the top 100 predictions (ranked by predicted probability). The code is below:

```
import pandas as pd
from sklearn.metrics import precision_score

def score(y_valid, y_pred):
    """
    Custom scoring function.
    Must return a float quantifying the estimator prediction quality.
      - y_valid is a pandas Series
      - y_pred is a numpy ndarray with shape:
           - (nb_records,) for regression problems and classification problems
             where 'needs probas' (see below) is false
             (for classification, the values are the numeric class indexes)
           - (nb_records, nb_classes) for classification problems where
             'needs probas' is true
      - [optional] X_valid is a dataframe with shape (nb_records, nb_input_features)
      - [optional] sample_weight is a numpy ndarray with shape (nb_records,)
                   NB: this option requires a variable set as "Sample weights"
    """
    scoring = pd.DataFrame()
    scoring['actual'] = y_valid
    scoring['probability'] = y_pred[:, 1]
    scoring = scoring.sort_values(by = 'probability', ascending = False)
    top_100 = scoring.iloc[:100]
    pr_score = precision_score(top_100['actual'], top_100['probability'])
    return pr_score
```

I am getting the following error:

File "<string>", line 23, in score IndexError: too many indices for array
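The `IndexError` is what NumPy raises when `y_pred[:, 1]` is applied to a one-dimensional array, which suggests `y_pred` arrived here with shape `(nb_records,)`. Separately, `precision_score` expects class labels rather than probabilities, so passing the probability column to it would likely fail as well. Below is a rough sketch of a top-k precision metric that tolerates both shapes. It assumes a binary target encoded as 0/1 with 1 as the positive class, and that `y_pred` actually holds probabilities (that is, "needs probas" is enabled); the helper name `precision_at_top_k` is hypothetical.

```python
import numpy as np

def precision_at_top_k(y_valid, y_pred, k=100):
    """Fraction of actual positives among the k highest-scored records.

    Hypothetical helper. Assumes y_valid is encoded as 0/1 (1 = positive)
    and y_pred contains probabilities, either 1-D or (nb_records, 2).
    """
    y_true = np.asarray(y_valid)
    proba = np.asarray(y_pred)
    if proba.ndim == 2:
        # Keep the positive-class column (assumed to be the last one)
        proba = proba[:, -1]
    k = min(k, len(proba))
    # Indices of the k largest probabilities, highest first
    top = np.argsort(-proba, kind="stable")[:k]
    # Every record in the top k is treated as a predicted positive
    return float(np.mean(y_true[top] == 1))

def score(y_valid, y_pred):
    return precision_at_top_k(y_valid, y_pred, k=100)
```

If `y_pred` contains class indexes instead of probabilities, the ranking is not meaningful, so the "needs probas" setting is worth checking first.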


Hello, where do you pass the needs_proba parameter?


Hi, from contacting Dataiku I have found that this feature will be available within Dataiku 11.
