Using CatBoost as a custom Python model
tjh
Registered Posts: 20 ✭✭✭✭
Hi, I would like to use CatBoost (https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/). The minimum required configuration is to tell the constructor which features are categorical (see the sketch below).
How can I specify these correctly when I have no access to the X matrix?
Also, how can I prevent Dataiku from transforming these categorical features?
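For context, what I need from plain CatBoost is roughly the following minimal sketch (column names and values are just placeholders, not my real data):
import pandas as pd
from catboost import CatBoostClassifier

# Toy frame with one categorical and one numeric feature (placeholders only)
df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],   # must stay as raw strings for CatBoost
    "size":  [1.0, 2.5, 3.0, 0.5],
    "label": [0, 1, 0, 1],
})
X = df[["color", "size"]]
y = df["label"]

# CatBoost is told which columns are categorical by index (or name)
cat_feature_indices = [X.columns.get_loc("color")]

model = CatBoostClassifier(iterations=10)
model.fit(X, y, cat_features=cat_feature_indices)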
Thanks for your help,
Thomas.
Answers
-
If the above does not work, can I use a custom encoding for the categorical features, e.g. a target encoder?
-
Here is my target encoder.
Unfortunately it seems that fit is only called with X (not y), so this will not work.
import numpy as np
import pandas as pd
import sklearn


class TargetEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):
        self.dict_averages = {}
        self.dict_priors = {}
        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
        self.noise_level = noise_level  # currently unused

    def fit(self, X, y=None):
        # X and y are expected to be pandas Series here
        assert y is not None
        target = y
        self.y_col = y.name
        trn_series = X
        col = X.name
        temp = pd.concat([trn_series, target], axis=1)
        # Compute target mean and count per category
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Compute smoothing factor (more samples -> closer to the category mean)
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Global target mean (prior)
        prior = target.mean()
        # The bigger the count, the less the prior is taken into account
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(["mean", "count"], axis=1, inplace=True)
        self.dict_averages.update({col: averages})
        self.dict_priors.update({col: prior})
        return self

    def transform(self, X):
        trn_series = X
        col = X.name
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            self.dict_averages[col].reset_index().rename(columns={'index': self.y_col, self.y_col: 'average'}),
            on=trn_series.name, how='left')['average'].rename(trn_series.name).fillna(self.dict_priors[col])
        # pd.merge does not keep the index, so restore it
        ft_trn_series.index = trn_series.index
        X = ft_trn_series
        return X


processor = TargetEncoder()
-
Hi,
It is not currently possible to change the way the visual ML interface of Dataiku processes categorical variables; this request has already been logged. I would advise using the categorical variable handling of Dataiku and then CatBoost as a custom Python model, without specific code for categorical variable handling (see the sketch below). Otherwise, if you want something fully custom, another option is to code your own processing and ML pipeline in a Python recipe/notebook.
Hope it helps,
Alexandre
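For the first option, a minimal sketch of the custom Python model code could look like this (assuming the custom model panel expects a scikit-learn compatible estimator assigned to a variable named clf; the hyperparameter values are just examples):
# Dataiku handles the categorical encoding; CatBoost only sees the
# already-processed feature matrix.
from catboost import CatBoostClassifier

# Assumption: the visual ML custom model expects a scikit-learn compatible
# estimator named `clf`; these hyperparameters are placeholders.
clf = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=6,
)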
-
Hi Alex,
Unfortunately, CatBoost needs unprocessed categorical variables as input, and a do-nothing processor does not exist in the visual ML interface. What I mean is roughly the sketch below.
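A minimal sketch of such a do-nothing processor, assuming the custom preprocessing code panel expects a scikit-learn style transformer assigned to a variable named processor (as with my target encoder above):
import sklearn


class IdentityProcessor(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn
        return self

    def transform(self, X):
        # Return the input unchanged so CatBoost would see the raw categories
        return X


processor = IdentityProcessor()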
As mentioned, I could use my do-nothing processor with CatBoost in the visual ML interface, but somehow during prediction the output has 0 rows.
Can you support CatBoost natively in future versions of the visual ML interface?
-
The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.
-
Dear Alex,
Did you manage to solve this issue since then?
Thanks, best