Using CatBoost as a custom Python model

tjh
Level 3
Hi, I would like to use CatBoost (https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/). The minimum required configuration is to tell the constructor which features are categorical.

How can I specify these correctly when I have no access to the X matrix?

Also, how can I prevent Dataiku from transforming these categorical features?

Thanks for your help,

Thomas.
tjh
Level 3
Author
If the above does not work, can I use a custom encoding for the categorical features, such as a target encoder?
tjh
Level 3
Author
Here is my target encoder.

Unfortunately, it seems that fit is only called with X and never receives y, so this will not work.


import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TargetEncoder(BaseEstimator, TransformerMixin):
    """Smoothed target (mean) encoder for a single categorical pandas Series."""

    def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):
        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
        self.noise_level = noise_level  # stored but not used below

        # Per-column lookup tables, filled in by fit()
        self.dict_averages = {}
        self.dict_priors = {}

    def fit(self, X, y=None):
        # Target encoding needs the target: X and y must both be named pandas Series
        assert y is not None
        target = y
        self.y_col = y.name

        trn_series = X
        col = X.name

        temp = pd.concat([trn_series, target], axis=1)
        # Compute the target mean and the row count per category
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Compute the smoothing weight (a sigmoid of the category count)
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Global target mean, used as the prior
        prior = target.mean()
        # The bigger the count, the less the prior is taken into account
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(["mean", "count"], axis=1, inplace=True)
        self.dict_averages.update({col: averages})
        self.dict_priors.update({col: prior})
        return self

    def transform(self, X):
        trn_series = X
        col = X.name
        # Look up the smoothed average per category; unseen categories get the prior
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            self.dict_averages[col].reset_index().rename(columns={self.y_col: 'average'}),
            on=trn_series.name, how='left')['average'].rename(trn_series.name).fillna(self.dict_priors[col])
        # pd.merge does not keep the index, so restore it
        ft_trn_series.index = trn_series.index
        return ft_trn_series


processor = TargetEncoder()
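
To make the smoothing step in fit easier to check by hand, here is a minimal pure-Python sketch of the same blend for a single category. The function name smoothed_target_mean and the numbers are made up for illustration; only the formula is taken from the encoder above.

```python
import math


def smoothed_target_mean(cat_mean, cat_count, prior, min_samples_leaf=1, smoothing=1):
    """Blend a category's target mean with the global prior.

    The weight is a sigmoid of the category count, so rare categories
    stay close to the prior and frequent ones close to their own mean.
    """
    weight = 1 / (1 + math.exp(-(cat_count - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + cat_mean * weight


# Hypothetical numbers: global target mean 0.5, category mean 1.0
rare = smoothed_target_mean(cat_mean=1.0, cat_count=1, prior=0.5)
common = smoothed_target_mean(cat_mean=1.0, cat_count=50, prior=0.5)
print(rare)    # 0.75, because count == min_samples_leaf gives a weight of 0.5
print(common)  # very close to the category mean of 1.0
```

A category seen only once ends up halfway between its own mean and the prior, while a category seen 50 times keeps essentially its own mean.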
Alex_Combessie
Dataiker Alumni
Hi,

It is not currently possible to change the way the visual ML interface of Dataiku processes categorical variables. This request has already been logged.

I would advise using Dataiku's categorical variable handling and then CatBoost as a custom Python model, without specific code for categorical variable handling. Otherwise, if you want something fully custom, another option is to code your own processing and ML pipeline in a Python recipe or notebook.

Hope it helps,
Alexandre
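
As an illustration of the fully custom route in a Python recipe or notebook, here is a minimal sketch, assuming the catboost package is installed in the code environment. The helper names categorical_column_indices and fit_catboost and the toy data are hypothetical; cat_features is the parameter of CatBoost's fit method that takes the indices of the categorical columns. This has not been tested inside Dataiku.

```python
def categorical_column_indices(rows):
    """Return the indices of columns that hold a string value in any row."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if any(isinstance(row[j], str) for row in rows)]


def fit_catboost(rows, labels, iterations=50):
    """Fit a CatBoostClassifier on raw rows, leaving categorical columns unencoded."""
    # Imported here so the helper above can be used without catboost installed
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(iterations=iterations, verbose=False)
    # cat_features tells CatBoost which columns to treat as categorical
    model.fit(rows, labels, cat_features=categorical_column_indices(rows))
    return model


# Hypothetical toy data: columns are (colour, size, weight)
rows = [["red", "S", 1.0], ["blue", "M", 2.5], ["red", "L", 3.0], ["blue", "S", 0.5]]
labels = [0, 1, 1, 0]
print(categorical_column_indices(rows))  # [0, 1]
```

Because the categorical columns are detected from the raw data at fit time, nothing has to be hard-coded in the constructor; calling fit_catboost(rows, labels) then returns a fitted model whose predict method accepts rows in the same raw layout.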
tjh
Level 3
Author
Hi Alex,

Unfortunately, CatBoost needs unprocessed categorical variables as input, and a do-nothing processor does not exist in the visual ML interface.

As mentioned, I could use my own do-nothing processor with CatBoost in the visual ML interface, but during prediction the output somehow has 0 rows.

Could you support CatBoost natively in the visual ML interface in future versions?
Alex_Combessie
Dataiker Alumni
The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.
OrsonWelles
Level 2

Dear Alex,

Have you managed to solve this issue since then?

Thanks, best
