Using catboost as custom python model

tjh
Level 3
Hi, I would like to use catboost (https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/). The minimum required configuration is to tell the constructor which features are categorical.

How can I specify these correctly when I have no access to the X matrix?

Also, how can I prevent Dataiku from transforming these categorical features?

Thanks for your help,



Thomas.
5 Replies
tjh
Level 3
Re: Using catboost as custom python model
If the above does not work, can I use a custom encoding for the categorical features, such as a target encoder?
tjh
Level 3
Re: Using catboost as custom python model
Here is my target encoder ...

Unfortunately, it seems that fit is only called with X (without y), so this will not work.


import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TargetEncoder(BaseEstimator, TransformerMixin):
    """Smoothed target (mean) encoding for a single categorical pandas Series."""

    def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):
        self.dict_averages = {}
        self.dict_priors = {}

        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
        self.noise_level = noise_level  # stored but not used in this snippet

    def fit(self, X, y=None):
        # X and y are expected to be named pandas Series sharing the same index
        assert y is not None
        target = y
        self.y_col = y.name

        trn_series = X
        col = X.name

        temp = pd.concat([trn_series, target], axis=1)
        # Compute the target mean and the row count per category
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Compute the smoothing weight (a sigmoid of the category count)
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Global target mean, used as the prior
        prior = target.mean()
        # The bigger the count, the less the prior is taken into account
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(["mean", "count"], axis=1, inplace=True)
        self.dict_averages.update({col: averages})
        self.dict_priors.update({col: prior})
        return self

    def transform(self, X):
        trn_series = X
        col = X.name
        # Look up the encoded value per category; unseen categories fall back to the prior
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            self.dict_averages[col].reset_index().rename(columns={self.y_col: 'average'}),
            on=trn_series.name, how='left')['average'].rename(trn_series.name).fillna(self.dict_priors[col])
        # pd.merge does not keep the index, so restore it
        ft_trn_series.index = trn_series.index
        return ft_trn_series


processor = TargetEncoder()
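
To make the smoothing step in the encoder easier to follow, here is a small worked sketch of the same formula on plain numbers, using only the standard library. The function name `smoothed_target_mean` and the sample values are made up for illustration; they are not part of the encoder or of any Dataiku or catboost API.

```python
import math


def smoothed_target_mean(category_mean, count, prior, min_samples_leaf=1, smoothing=1):
    """Blend a category's target mean with the global prior.

    Same formula as in the encoder's fit(): the weight is a sigmoid of the
    category count, so rare categories are pulled towards the prior.
    """
    weight = 1 / (1 + math.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + category_mean * weight


# Hypothetical numbers: global target mean 0.5, category target mean 1.0
prior = 0.5
rare = smoothed_target_mean(category_mean=1.0, count=1, prior=prior)
common = smoothed_target_mean(category_mean=1.0, count=20, prior=prior)

print(rare)    # weight is exactly 0.5 when count == min_samples_leaf, giving 0.75
print(common)  # weight is close to 1, so the result is close to the category mean
```

With the default settings, a category seen only once lands halfway between the prior and its own mean, while a category seen 20 times is encoded almost exactly as its own mean.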
Alex_Combessie Dataiker
Re: Using catboost as custom python model
Hi, it is not currently possible to change the way the visual ML interface of Dataiku processes categorical variables. This request has already been logged. I would advise using Dataiku's categorical variable handling and then catboost as a custom Python model, without specific code for categorical variable handling. Alternatively, if you want something fully custom, you can code your own processing and ML pipeline in a Python recipe or notebook. Hope it helps, Alexandre
tjh
Level 3
Re: Using catboost as custom python model
Hi Alex,

Unfortunately, catboost needs unprocessed categorical variables as input, and a do-nothing processor does not exist in the visual ML interface.

As mentioned, I could use my own do-nothing processor with catboost in the visual ML interface, but then the output somehow has 0 rows during prediction.

Could you support catboost natively in the visual ML interface in future versions?
Alex_Combessie Dataiker
Re: Using catboost as custom python model
The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.