Using a dataset column as a custom metric

f00stx
f00stx Registered Posts: 6 ✭✭✭✭
edited July 16 in Using Dataiku

Hi

I'm building a model to estimate betting exchange trades. As part of the scoring metric, I want to use the amount paid out by the trade as the gain in the cost matrix, and the cost of the trade (if it loses, i.e. falls below the buying price) as the negative gain (-1 in this case, assuming $1 per trade).

For example, if I "bet" on an outcome for $1, and it pays $5, I want to use $5 as the "correct prediction" gain, as opposed to using a fixed value. Likewise, the next row in the dataset might pay $15, and so that should be the gain. A losing trade would have a gain of -1. The "amount paid" (in the case of a correct prediction) is available as a field in the dataset.

Is there a code sample that demonstrates how this can be done as a custom scoring function? From what I understand, this is the method I need to flesh out:

def score(y_valid, y_pred):
    """
    Custom scoring function.
    Must return a float quantifying the estimator prediction quality.
      - y_valid is a pandas Series
      - y_pred is a numpy ndarray with shape:
           - (nb_records,) for regression problems and classification problems
             where 'needs probas' (see below) is false
             (for classification, the values are the numeric class indexes)
           - (nb_records, nb_classes) for classification problems where
             'needs probas' is true
      - [optional] X_valid is a dataframe with shape (nb_records, nb_input_features)
      - [optional] sample_weight is a numpy ndarray with shape (nb_records,)
                   NB: this option requires a variable set as "Sample weights"
    """

Cheers!

Best Answer

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Answer ✓

    Hi,

    Unfortunately the use of X_valid in custom metric code is not supported for the Keras backend. I have logged your request in our backlog.

    What is your use case about? Could you use the regular ML backend (scikit-learn/xgboost) instead?

    Best regards,

    Alex

Answers

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hi,

    Thanks for the description of this interesting use case. There is no built-in code sample to do this, but the idea is fairly simple to implement:

    1. Use y_valid and y_pred to detect winning/losing trades

    2. Use the optional parameter X_valid to retrieve the "amount paid" in case of a correct prediction

    3. Combine 1. and 2. to compute an array of gains per trade (I assume that one row = one trade)

    4. Aggregate it into sum/average of gains for all trades
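
A minimal sketch of these four steps, not an official Dataiku sample. It assumes a two-class problem where class index 1 means "the trade wins", that predicting class 1 means a $1 trade is placed, and that no trade is placed otherwise. The column name `amount_paid` is hypothetical, and the sketch assumes X_valid contains that column with its original, unscaled values, which is worth checking against the feature handling settings. It only applies to backends that actually pass X_valid.

```python
import numpy as np
import pandas as pd

# Hypothetical column name: replace with the dataset's "amount paid" field.
PAYOUT_COLUMN = "amount_paid"
# Cost of one trade, per the question ($1 per trade).
STAKE = 1.0


def score(y_valid, y_pred, X_valid):
    """Average gain per row, using a per-row payout instead of a fixed gain."""
    if X_valid is None:
        raise ValueError("X_valid was not provided to the scoring function")

    y_true = np.asarray(y_valid)
    y_hat = np.asarray(y_pred)
    # With 'needs probas' enabled, y_pred has shape (nb_records, nb_classes):
    # reduce it to the most likely class index.
    if y_hat.ndim == 2:
        y_hat = y_hat.argmax(axis=1)

    # Step 1: detect winning and losing trades (assumption: class 1 = win).
    placed = y_hat == 1
    won = placed & (y_true == 1)
    lost = placed & (y_true != 1)

    # Step 2: retrieve the per-row payout from X_valid.
    payout = X_valid[PAYOUT_COLUMN].to_numpy(dtype=float)

    # Step 3: gain per row: payout if won, -STAKE if lost, 0 if no trade.
    gains = np.where(won, payout, 0.0) - np.where(lost, STAKE, 0.0)

    # Step 4: aggregate into a single float.
    return float(gains.mean())
```

Replacing `gains.mean()` with `gains.sum()` gives the total gain rather than the average per row.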

    Hope it helps,

    Alex

  • f00stx
    f00stx Registered Posts: 6 ✭✭✭✭
    edited July 17

    Hi Alex,

    I know it's been a loooong time since this was raised, but I can't figure out how to get X_valid populated. If I just throw it into the cost function definition as a third parameter, it's empty, i.e. doing this:

    def score(y_valid, y_pred, X_valid):

    ... yields `None`.

    Is there something I'm missing with getting `X_valid` into the scoring function and available so I can pull a value from the data? I've tried to find examples of custom cost functions that make use of `X_valid` but I've come up blank.

    Cheers

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hi,

    What code are you using after your function definition? If there is no return statement, it is expected that a function will yield "None".

    Best regards,

    Alex

  • f00stx
    f00stx Registered Posts: 6 ✭✭✭✭
    edited July 17

    Hi Alex,

    Sorry, I should have clarified. I'm essentially just attempting to dump `X_valid` to a CSV file to inspect its contents:

    def score(y_valid, y_pred, X_valid):
        X_valid.to_csv(r'/home/richard/xvalid.csv', sep=',', header=True)
         # ... etc

    This throws an error upon training (obviously I don't care that the training failed, I'm simply trying to determine whether X_valid is present and contains data):

    [Screenshot of the training error: Screen Shot 2020-12-10 at 7.17.38 am.png]

    This may seem daft but without comprehensive examples, or a way to debug cost functions in a notebook (plus Python not being the primary language that I work with, so my knowledge is a little sketchy) I'm flying blind.

    In a nutshell, I need to be able to use a value from the input dataset for each row that goes through this function. It seems `X_valid` is meant to be the way to go about doing that, but I'm at a loss as to how it can be accessed and used inside the `score` function.
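
One way to check what the function receives without crashing on a missing argument is to guard against `None` before touching X_valid. The sketch below is only an illustration: `summarize_inputs` is a hypothetical helper, and it assumes that output from `print` ends up somewhere readable, such as the training log. Because it uses plain pandas and NumPy objects, the same functions can also be called by hand in a notebook with small made-up inputs.

```python
import numpy as np
import pandas as pd


def summarize_inputs(y_valid, y_pred, X_valid):
    """Describe the scoring inputs without assuming X_valid is present."""
    parts = [
        "y_valid: %s, length %d" % (type(y_valid).__name__, len(y_valid)),
        "y_pred: shape %s" % (np.asarray(y_pred).shape,),
    ]
    if X_valid is None:
        parts.append("X_valid: None (not passed to the function)")
    else:
        parts.append(
            "X_valid: shape %s, columns %s"
            % (X_valid.shape, list(X_valid.columns))
        )
    return "; ".join(parts)


def score(y_valid, y_pred, X_valid):
    # Report what was received, then return a placeholder score.
    print(summarize_inputs(y_valid, y_pred, X_valid))
    return 0.0
```

Calling `score(pd.Series([1, 0]), np.array([1, 0]), None)` in a notebook prints a summary ending in "X_valid: None (not passed to the function)" instead of raising an AttributeError.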

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hi,

    Thanks for clarifying. We will log the need for code samples of custom metric code.

    Can you confirm what version of DSS you are using? I tried reproducing your error, but on the latest DSS version 8.0.4 the line X_valid.to_csv() is executed successfully.

    Best regards,

    Alex

  • f00stx
    f00stx Registered Posts: 6 ✭✭✭✭

    I'm running 7.0.2.

    I'll upgrade to 8.0.4 and report back.

    Cheers!

  • f00stx
    f00stx Registered Posts: 6 ✭✭✭✭
    edited July 17

    No dice I'm afraid - I've upgraded to the latest available version (8.0.2, for Linux) and I still get the same issue. Not sure if it makes a difference, but this is for a Keras model using 2-class classification.

    from numpy import asarray
    from numpy import savetxt
    
    def score(y_valid, y_pred, X_valid):
        """
        Custom scoring function.
        Must return a float quantifying the estimator prediction quality.
          - y_valid is a pandas Series
          - y_pred is a numpy ndarray with shape:
               - (nb_records,) for regression problems and classification problems
                 where 'needs probas' (see below) is false
                 (for classification, the values are the numeric class indexes)
               - (nb_records, nb_classes) for classification problems where
                 'needs probas' is true
          - [optional] X_valid is a dataframe with shape (nb_records, nb_input_features)
          - [optional] sample_weight is a numpy ndarray with shape (nb_records,)
                       NB: this option requires a variable set as "Sample weights"
        """
        # Dump X_valid to disk to check whether it is populated.
        X_valid.to_csv(r'/home/richard/xvalid.csv', sep=',', header=True)
        return 0.0

    n.b. I'm returning 0.0 as I haven't determined whether I can actually do what I'm hoping to do...

    Cheers

  • f00stx
    f00stx Registered Posts: 6 ✭✭✭✭

    Hi Alex,

    I can confirm it works with XGBoost, ANN etc - thank you very much for your help (and patience).

    Being able to do this with Keras would be awesome, I'd love to see that in a future version of Dataiku.

    Cheers

  • powellmenezes
    powellmenezes Registered Posts: 2

    Hey, can we get X_pred here? I need the train predictions.
