
Using catboost as custom python model

Hi, I would like to use catboost as a custom Python model. The minimum required configuration is to tell the constructor which features are categorical.

How can I specify these correctly, when I have no access to the X matrix?

Also, how can I prevent Dataiku from transforming these categorical features?

Thanks for your help,
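For reference, CatBoost's constructor accepts a `cat_features` argument listing the categorical columns by index. The sketch below assumes the column order of the training matrix is known; the helper names `categorical_column_indices` and `build_catboost` and the column names are hypothetical, and whether the visual ML interface passes raw category values through to the model is a separate question that this sketch does not answer.

```python
def categorical_column_indices(column_names, categorical_names):
    """Return the positions of the categorical columns within column_names."""
    positions = {name: i for i, name in enumerate(column_names)}
    missing = [name for name in categorical_names if name not in positions]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [positions[name] for name in categorical_names]


def build_catboost(cat_idx):
    """Create a classifier that treats the given column indices as categorical."""
    # Imported lazily so the helper above works without catboost installed.
    from catboost import CatBoostClassifier
    return CatBoostClassifier(cat_features=cat_idx, verbose=False)


columns = ["age", "city", "income", "job"]  # illustrative column order
cat_idx = categorical_column_indices(columns, ["city", "job"])
print(cat_idx)  # [1, 3]
# model = build_catboost(cat_idx); model.fit(X, y)  # X must keep raw category values
```

Passing indices rather than names keeps the sketch usable when the model only ever sees a plain array instead of a DataFrame.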

0 Kudos
6 Replies
If the above does not work, can I use a custom encoding for the categorical features, such as a target encoder?
0 Kudos
Here is my target encoder ...

Unfortunately it seems that fit is only called with X ... so this will not work.

import numpy as np
import pandas as pd
import sklearn.base


class TargetEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):
        self.dict_averages = {}
        self.dict_priors = {}

        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
        self.noise_level = noise_level

    def fit(self, X, y=None):
        # Assumption: X and y are named pandas Series (one feature column and the target)
        assert y is not None
        target = y
        self.y_col = target.name

        trn_series = X
        col = trn_series.name

        temp = pd.concat([trn_series, target], axis=1)
        # Compute target mean and row count per category
        averages = temp.groupby(col)[self.y_col].agg(["mean", "count"])
        # Compute smoothing weight (sigmoid of the category count)
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Global target mean, used as the prior
        prior = target.mean()
        # The bigger the count, the less the prior is taken into account
        averages[self.y_col] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(["mean", "count"], axis=1, inplace=True)
        self.dict_averages.update({col: averages})
        self.dict_priors.update({col: prior})
        return self

    def transform(self, X):
        trn_series = X
        col = trn_series.name
        ft_trn_series = pd.merge(
            trn_series.to_frame(col),
            self.dict_averages[col].reset_index().rename(columns={'index': self.y_col, self.y_col: 'average'}),
            on=col, how='left')['average'].rename(col)
        # Categories unseen during fit fall back to the global prior
        ft_trn_series = ft_trn_series.fillna(self.dict_priors[col])
        # pd.merge does not keep the index so restore it
        ft_trn_series.index = trn_series.index
        X = ft_trn_series
        return X


processor = TargetEncoder()
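To make the smoothing in the encoder above concrete, here is a small worked example using only the standard library. It applies the same formula (a sigmoid of the category count blends the category mean with the global prior) to toy data; the function name `smoothed_target_mean` and the data are illustrative, not part of the original post.

```python
import math


def smoothed_target_mean(category_mean, category_count, prior,
                         min_samples_leaf=1, smoothing=1):
    """Blend a category's target mean with the global prior."""
    # Weight approaches 1 as the category count grows past min_samples_leaf.
    weight = 1 / (1 + math.exp(-(category_count - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + category_mean * weight


# Toy data: (category, binary target)
rows = [("a", 1), ("a", 1), ("a", 0), ("b", 1)]
prior = sum(t for _, t in rows) / len(rows)  # global target mean

# Accumulate (sum of targets, row count) per category
stats = {}
for cat, t in rows:
    total, count = stats.get(cat, (0, 0))
    stats[cat] = (total + t, count + 1)

encoded = {cat: smoothed_target_mean(total / count, count, prior)
           for cat, (total, count) in stats.items()}
print(encoded)
```

Category "b" appears once, so with `min_samples_leaf=1` its weight is exactly 0.5 and its encoding sits halfway between its own mean (1.0) and the prior (0.75). Category "a" has three rows, so its encoding stays closer to its own mean.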
0 Kudos
Hi,

It is not currently possible to change the way the visual ML interface of Dataiku processes categorical variables. This request has already been logged.

I would advise using Dataiku's categorical variable handling and then catboost as a custom Python model, without specific code for categorical variable handling. Alternatively, if you want something fully custom, you can code your own processing and ML pipeline in a Python recipe or notebook.

Hope it helps,
Alexandre
0 Kudos
Hi Alex,

Unfortunately, catboost needs unprocessed categorical variables as input, and a do-nothing processor does not exist in the visual ML interface.

As mentioned, I could use my own do-nothing processor with catboost in the visual ML interface, but during prediction the output somehow has 0 rows.

Could you support catboost natively in the visual ML interface in future versions?
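For readers wondering what a do-nothing processor looks like, a minimal sketch follows. It is a hypothetical class with the scikit-learn style `fit`/`transform` interface that returns its input unchanged; it is not the poster's actual code, and whether the visual ML interface accepts unencoded categorical output from such a processor is not confirmed in this thread (the post above reports 0 rows at prediction time).

```python
class PassThrough:
    """Do-nothing processor: learns nothing and returns its input unchanged."""

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        # Hand the raw values straight back to the caller.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


processor = PassThrough()
values = ["red", "green", "red"]
result = processor.fit(values).transform(values)
print(result)
```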
0 Kudos
The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.
0 Kudos
Dear Alex,

Have you managed to solve this issue since then?

Thanks, best

0 Kudos