I am trying to create a custom Python model. As an example, consider the following simplified code:
from sklearn.base import BaseEstimator

class MyRegressor(BaseEstimator):
    def __init__(self):
        self.y = None

    def fit(self, X, y):
        self.y = y
        return self

    def predict(self, X):
        return self.y

clf = MyRegressor()
The documentation indicates the following requirements,
* your code must create a 'clf' variable. This clf must be a scikit-learn compatible model, ie, it should:
1. have at least fit(X,y) and predict(X) methods
2. inherit sklearn.base.BaseEstimator
3. handle the attributes in the __init__ function
# See: https://doc.dataiku.com/dss/latest/machine-learning/custom-models.ht
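Requirements 1 and 3 from the quoted comment can be checked locally before handing the object to DSS. The sketch below is only an assumption about what is worth checking, not the validation DSS itself performs; `check_estimator_shape`, `Toy` and `Broken` are hypothetical names used for illustration. Requirement 2 would additionally be `isinstance(clf, BaseEstimator)` in an environment where scikit-learn is installed.

```python
import inspect


def check_estimator_shape(clf):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: it mirrors requirements 1 and 3 from the quoted
    comment, not any validation that DSS itself performs.
    """
    problems = []

    # Requirement 1: fit and predict must exist and be callable.
    for name in ("fit", "predict"):
        if not callable(getattr(clf, name, None)):
            problems.append(f"missing method: {name}")

    # Requirement 3: every named __init__ parameter should be stored on the
    # instance under the same name (the scikit-learn convention).
    params = inspect.signature(type(clf).__init__).parameters
    for name, param in params.items():
        if name == "self" or param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if not hasattr(clf, name):
            problems.append(f"__init__ parameter not stored as attribute: {name}")

    return problems


class Toy:
    """Stand-in model used only to exercise the helper."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0.0 for _ in X]


class Broken:
    """Stand-in that forgets predict and does not store its parameter."""

    def __init__(self, alpha=1.0):
        pass

    def fit(self, X, y):
        return self


print(check_estimator_shape(Toy()))
print(check_estimator_shape(Broken()))
```

Passing these checks does not guarantee that training succeeds; it only rules out the most basic structural mistakes.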
As far as I can see, the code above satisfies the requirements. However, it fails. The documentation link is dead. The log file contains absolutely no useful information:
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction] - ******************************************
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction] - ** Start train session s33
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction] - ******************************************
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction.strat.prns] T-1QlduuEb - Preparing base & partitions splits
[2021/04/13-06:13:08.652] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 3] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.654] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 5] Checking if splits are up to date. Policy: type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true, instance id: 105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.654] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 5] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.655] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 6] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.656] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 7] Checking if splits are up to date. Policy: type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true, instance id: 105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.657] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 8] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.658] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 9] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.659] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml] T-1QlduuEb - Locking model train info file /data/dataiku/design/analysis-data/VMP_PFC_ADLS_TEST_TAC/QHvGYSC2/1QlduuEb/sessions/s33/pp1-base/m1/train_info.json
[2021/04/13-06:13:08.659] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml] T-1QlduuEb - Unlocking model train info file /data/dataiku/design/analysis-data/VMP_PFC_ADLS_TEST_TAC/QHvGYSC2/1QlduuEb/sessions/s33/pp1-base/m1/train_info.json
[2021/04/13-06:13:08.659] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction.strat.prns] T-1QlduuEb - Launching the training threads
[2021/04/13-06:13:08.660] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Joining processing thread ...
[2021/04/13-06:13:08.660] [MRT-1402320] [INFO] [dku.analysis.prediction.strat] - StratPredictionTrainAdditionalThread done
[2021/04/13-06:13:08.660] [MRT-1402320] [INFO] [dku.analysis.ml.python] - TrainAdditionalThread done
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Processing thread joined ...
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Joining processing thread ...
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Processing thread joined ...
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction.strat.prns] T-1QlduuEb - Train done
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction] T-1QlduuEb - Train done
[2021/04/13-06:13:10.560] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.trainingdetails] T-1QlduuEb - Publishing mltask-train-done reflected event
and the code runs just fine in a Jupyter Notebook in Dataiku with the same code environment. Is there any way to debug what is going on? Where can I find the appropriate log information?
Hello,
Custom models should be defined in a DSS library and then loaded in the Models > Design > Algorithms tab, as suggested in the doc.
Can you confirm that you defined the model in a library?
Could you also share a screenshot of the error you get when training your custom model?
Hope it helps,
Arnaud
Hi @arnaudde ,
I also tried moving the model into the project library, but it didn't make any difference. The error looks like this:
[screenshot of the error]
It seems that my problems might be related to the fact that I am using a partitioned model.
Apparently you cannot debug a partitioned model in Dataiku. Hence the only way to get debug info is to convert the model to a non-partitioned model.
Upon doing that, I found out that *args and **kwargs are not allowed in the constructor. After removing them, the model works as intended.
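scikit-learn discovers an estimator's parameters by introspecting the `__init__` signature, which is why explicit keyword arguments are the convention. A variadic constructor can be spotted locally with the standard `inspect` module. This is a minimal sketch; `find_variadic_params` and the three example classes are hypothetical names, and the check is not the one DSS or scikit-learn runs internally.

```python
import inspect


def find_variadic_params(cls):
    """Return the names of *args / **kwargs style parameters in cls.__init__."""
    # A class that does not define __init__ inherits object.__init__, whose
    # signature is variadic; that case is not a problem, so skip it.
    if cls.__init__ is object.__init__:
        return []
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    signature = inspect.signature(cls.__init__)
    return [name for name, p in signature.parameters.items() if p.kind in variadic]


class WithVarargs:
    """Constructor of the kind that caused the failure described above."""

    def __init__(self, alpha=1.0, *args, **kwargs):
        self.alpha = alpha


class Explicit:
    """Constructor with every parameter spelled out."""

    def __init__(self, alpha=1.0, beta=0.5):
        self.alpha = alpha
        self.beta = beta


class NoInit:
    """No constructor at all."""


print(find_variadic_params(WithVarargs))
print(find_variadic_params(Explicit))
print(find_variadic_params(NoInit))
```

Running a check like this on the model class before training gives a readable message even when the training log does not.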
Hello,
I think the problem with your first model is that the predict method returns a pandas Series with as many rows as the training dataset, whereas it should return a pandas Series with as many rows as the test set:
    def predict(self, X):
        return self.y
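A quick local check makes this kind of mismatch visible: fit on one set, predict on a set of a different size, and compare lengths. A test that predicts on the training data itself would not catch it, because the lengths happen to agree. The sketch below uses plain lists so that it runs without any dependency; `check_prediction_length`, `EchoTargets` and `MeanRegressor` are hypothetical names, and since the helper only relies on `len()`, it should work equally with NumPy arrays or pandas objects.

```python
def check_prediction_length(model, X_train, y_train, X_test):
    """Fit on the training data and report whether predict() returns
    exactly one value per row of X_test."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return len(predictions) == len(X_test)


class EchoTargets:
    """Same behaviour as the first model: predict() returns the training targets."""

    def fit(self, X, y):
        self.y = y
        return self

    def predict(self, X):
        return self.y


class MeanRegressor:
    """Returns the training mean once per input row."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 2.0, 3.0]
X_test = [[4.0], [5.0]]  # deliberately a different size from X_train

print(check_prediction_length(EchoTargets(), X_train, y_train, X_test))
print(check_prediction_length(MeanRegressor(), X_train, y_train, X_test))
```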
If you use the Random Regressor from the doc, you should be fine for both a regular model and a partitioned model. I encourage you to test with the sample from the doc:
from sklearn.base import BaseEstimator
import numpy as np
import pandas as pd

class MyRandomRegressor(BaseEstimator):
    """This model predicts random values between the minimum and the maximum of y"""

    def fit(self, X, y):
        self.y_range = [np.min(y), np.max(y)]

    def predict(self, X):
        return pd.Series(np.random.uniform(self.y_range[0], self.y_range[1], size=X.shape[0]))
We will try to improve the logging for partitioned models.
Best,
Arnaud