Using catboost as a custom Python model

tjh
tjh Registered Posts: 20 ✭✭✭✭
Hi, I would like to use catboost (https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/). The minimum required configuration is to tell the constructor which features are categorical.

How can I specify these correctly, when I have no access to the X matrix?

Also, how can I prevent Dataiku from transforming these categorical features?

Thanks for your help,



Thomas.

Answers

  • tjh
    tjh Registered Posts: 20 ✭✭✭✭
    If the above does not work, can I use a custom encoding for categorical features, such as a target encoder?
  • tjh
    tjh Registered Posts: 20 ✭✭✭✭
    Here is my target encoder ...

    Unfortunately it seems that fit is only called with X ... so this will not work.


    import numpy as np
    import pandas as pd
    import sklearn.base

    class TargetEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
        def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):
            # Per-column lookup tables, filled in by fit()
            self.dict_averages = {}
            self.dict_priors = {}

            self.min_samples_leaf = min_samples_leaf
            self.smoothing = smoothing
            self.noise_level = noise_level  # kept for configuration; not used below

        def fit(self, X, y=None):
            # X and y must both be named pandas Series
            assert y is not None
            target = y
            self.y_col = y.name

            trn_series = X
            col = X.name

            temp = pd.concat([trn_series, target], axis=1)
            # Compute target mean and count per category
            averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
            # Compute smoothing weight (sigmoid of the category count)
            smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
            # Global target mean, used as the prior
            prior = target.mean()
            # The bigger the count, the less the prior is taken into account
            averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
            averages.drop(["mean", "count"], axis=1, inplace=True)
            self.dict_averages.update({col: averages})
            self.dict_priors.update({col: prior})
            return self

        def transform(self, X):
            trn_series = X
            col = X.name
            # Look up the smoothed average of each category; unseen categories get the prior
            ft_trn_series = pd.merge(
                trn_series.to_frame(trn_series.name),
                self.dict_averages[col].reset_index().rename(columns={self.y_col: 'average'}),
                on=trn_series.name, how='left')['average'].rename(trn_series.name).fillna(self.dict_priors[col])
            # pd.merge does not keep the index, so restore it
            ft_trn_series.index = trn_series.index
            return ft_trn_series

    processor = TargetEncoder()
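To make the smoothing in the encoder above concrete, here is a small worked sketch in plain Python (no pandas), assuming the same formula: a sigmoid weight on the category count blends the category mean with the global prior. The function name `smoothed_target_means` and the toy data are made up for illustration and are not part of the original post.

```python
import math
from collections import defaultdict


def smoothed_target_means(categories, targets, min_samples_leaf=1, smoothing=1):
    """Return (dict of smoothed target mean per category, global prior)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1

    # Global target mean, used as the prior
    prior = sum(targets) / len(targets)

    encoded = {}
    for cat, n in counts.items():
        # Sigmoid weight: approaches 1 as the category count grows
        weight = 1 / (1 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoded[cat] = prior * (1 - weight) + (sums[cat] / n) * weight
    return encoded, prior


# Toy data, for illustration only
categories = ["a", "a", "a", "b"]
targets = [1, 1, 0, 0]
encoded, prior = smoothed_target_means(categories, targets)

# A category never seen during fitting falls back to the prior
unseen_value = encoded.get("c", prior)
```

With the default parameters, category "b" appears once, so its weight is exactly 0.5 and its encoding sits halfway between its own mean (0) and the prior (0.5). Category "a" has three rows and is pulled much closer to its own mean.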
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Hi, it is not currently possible to change the way the visual ML interface of Dataiku processes categorical variables. This request has already been logged. I would advise using Dataiku's categorical variable handling and then catboost as a custom Python model, without any catboost-specific code for categorical variables. Alternatively, if you want something fully custom, you can code your own processing and ML pipeline in a Python recipe or notebook. Hope it helps, Alexandre
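For the fully custom route in a Python recipe or notebook, a minimal sketch could look like the following, assuming the catboost package is installed in the code environment. The helper names `categorical_indices` and `train_catboost` and the example column names are hypothetical. `cat_features` is the catboost constructor argument that lists the categorical columns, so the raw, unencoded values can be passed straight to `fit`.

```python
def categorical_indices(column_names, categorical_names):
    """Map categorical column names to their positions in the feature matrix."""
    wanted = set(categorical_names)
    missing = wanted - set(column_names)
    if missing:
        raise ValueError(f"Unknown categorical columns: {sorted(missing)}")
    return [i for i, name in enumerate(column_names) if name in wanted]


def train_catboost(rows, labels, column_names, categorical_names):
    """Fit a CatBoostClassifier on raw rows, leaving categorical values unencoded."""
    # Imported here so the helper above stays usable without catboost installed
    from catboost import CatBoostClassifier

    cat_idx = categorical_indices(column_names, categorical_names)
    model = CatBoostClassifier(cat_features=cat_idx, verbose=False)
    model.fit(rows, labels)
    return model


# Hypothetical columns, for illustration only
columns = ["age", "city", "plan"]
cat_idx = categorical_indices(columns, ["city", "plan"])
```

Computing the indices from column names keeps the categorical list in sync with the column order actually fed to the model, which matters when columns are added or reordered upstream.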
  • tjh
    tjh Registered Posts: 20 ✭✭✭✭
    Hi Alex,

    Unfortunately, catboost needs unprocessed categorical variables as input, and the visual ML interface does not offer a "do nothing" processor.

    I could use my own do-nothing processor with catboost in the visual ML interface, but then the output has 0 rows during prediction.

    Can you support catboost natively in the visual ML interface in future versions?
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.
  • OrsonWelles
    OrsonWelles Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 3 ✭✭✭

    Dear Alex,

    Have you managed to solve this issue since then?

    Thanks, best
