How to debug custom Python models?

emher Registered Posts: 32 ✭✭✭✭✭
edited July 16 in Using Dataiku

I am trying to create a custom Python model. As an example, consider the following simplified code:

from sklearn.base import BaseEstimator

class MyRegressor(BaseEstimator):
    def __init__(self):
        self.y = None

    def fit(self, X, y):
        self.y = y
        return self

    def predict(self, X):
        return self.y 

clf = MyRegressor()

The documentation indicates the following requirements,

* your code must create a 'clf' variable. This clf must be a scikit-learn compatible model, ie, it should:
1. have at least fit(X,y) and predict(X) methods
2. inherit sklearn.base.BaseEstimator
3. handle the attributes in the __init__ function
# See: https://doc.dataiku.com/dss/latest/machine-learning/custom-models.ht
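
The three requirements quoted above can be checked locally before training. The sketch below is only an illustration: `check_clf_requirements` is a hypothetical helper, not part of Dataiku or scikit-learn, and the use of scikit-learn's `clone` rests on the assumption that the training pipeline copies the estimator through its constructor parameters.

```python
from sklearn.base import BaseEstimator, clone


class MyRegressor(BaseEstimator):
    """The simplified model from the question."""

    def __init__(self):
        self.y = None

    def fit(self, X, y):
        self.y = y
        return self

    def predict(self, X):
        return self.y


def check_clf_requirements(clf):
    """Hypothetical helper: list the violated requirements (empty if none)."""
    problems = []
    # Requirement 1: fit(X, y) and predict(X) must exist and be callable.
    for name in ("fit", "predict"):
        if not callable(getattr(clf, name, None)):
            problems.append("missing method: " + name)
    # Requirement 2: the model must inherit sklearn.base.BaseEstimator.
    if not isinstance(clf, BaseEstimator):
        problems.append("does not inherit sklearn.base.BaseEstimator")
    else:
        # Requirement 3: __init__ must handle its attributes so that the
        # estimator can be rebuilt from get_params(), which clone() relies on.
        try:
            clone(clf)
        except Exception as exc:
            problems.append("cannot be cloned: %r" % exc)
    return problems


clf = MyRegressor()
problems = check_clf_requirements(clf)
print(problems)
```

For the code in the question this check reports no problems, so a check of this kind covers only the documented requirements and not the behaviour of `predict` on unseen data.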

As far as I can see, the code above satisfies these requirements. However, it fails. The documentation link is dead, and the log file contains no useful information:

[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction]  - ******************************************
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction]  - ** Start train session s33
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction]  - ******************************************
[2021/04/13-06:13:08.650] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction.strat.prns] T-1QlduuEb - Preparing base & partitions splits
[2021/04/13-06:13:08.652] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 3] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.654] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 5] Checking if splits are up to date. Policy: type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true, instance id: 105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.654] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 5] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.655] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 6] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.656] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 7] Checking if splits are up to date. Policy: type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true, instance id: 105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.657] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 8] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.658] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.splits] T-1QlduuEb - [ct: 9] Search for split: p=type=SPLIT_SINGLE_DATASET,split=SORTED,splitBeforePrepare=true,ds=train_pfcid,sel=(method=full,parts=CantabriaPART1),r=0.8,c=time,ascending=true i=105930c5d72eb560e7560ae6d1615a76-1-part-CantabriaPART1
[2021/04/13-06:13:08.659] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml] T-1QlduuEb - Locking model train info file /data/dataiku/design/analysis-data/VMP_PFC_ADLS_TEST_TAC/QHvGYSC2/1QlduuEb/sessions/s33/pp1-base/m1/train_info.json
[2021/04/13-06:13:08.659] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml] T-1QlduuEb - Unlocking model train info file /data/dataiku/design/analysis-data/VMP_PFC_ADLS_TEST_TAC/QHvGYSC2/1QlduuEb/sessions/s33/pp1-base/m1/train_info.json
[2021/04/13-06:13:08.659] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction.strat.prns] T-1QlduuEb - Launching the training threads
[2021/04/13-06:13:08.660] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Joining processing thread ...
[2021/04/13-06:13:08.660] [MRT-1402320] [INFO] [dku.analysis.prediction.strat]  - StratPredictionTrainAdditionalThread done
[2021/04/13-06:13:08.660] [MRT-1402320] [INFO] [dku.analysis.ml.python]  - TrainAdditionalThread done
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Processing thread joined ...
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Joining processing thread ...
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.ml.python] T-1QlduuEb - Processing thread joined ...
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction.strat.prns] T-1QlduuEb - Train done
[2021/04/13-06:13:10.557] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.prediction] T-1QlduuEb - Train done
[2021/04/13-06:13:10.560] [FT-TrainWorkThread-zFuQRxFh-1402318] [INFO] [dku.analysis.trainingdetails] T-1QlduuEb - Publishing mltask-train-done reflected event

The code itself runs fine in a Jupyter notebook in Dataiku with the same code environment. Is there any way to debug what is going on? Where can I find the appropriate log information?

Best Answer

  • emher
    emher Registered Posts: 32 ✭✭✭✭✭
    Answer ✓

    Apparently you cannot debug a partitioned model in Dataiku. Hence, the only way to get debug info is to convert the model to a non-partitioned model.

    After doing that, I found out that *args and **kwargs are not allowed in the constructor. With them removed, the model works as intended.
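
One plausible explanation for the constructor restriction, sketched below, is scikit-learn's own parameter introspection: `BaseEstimator.get_params()` reads the `__init__` signature and raises a `RuntimeError` when it finds `*args`. Whether this is the exact failure inside Dataiku is an assumption; the class names `VarargsRegressor` and `ExplicitRegressor` and the `offset` parameter are made up for illustration.

```python
from sklearn.base import BaseEstimator


class VarargsRegressor(BaseEstimator):
    """Hypothetical model whose constructor accepts *args and **kwargs."""

    def __init__(self, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs


class ExplicitRegressor(BaseEstimator):
    """Hypothetical model with every parameter named in the signature."""

    def __init__(self, offset=0.0):
        # Each constructor argument is stored under the same attribute name.
        self.offset = offset


# get_params() inspects the __init__ signature; *args makes it fail.
try:
    VarargsRegressor().get_params()
    varargs_error = None
except RuntimeError as exc:
    varargs_error = exc

# With explicit keyword parameters the introspection succeeds.
explicit_params = ExplicitRegressor(offset=1.5).get_params()

print(type(varargs_error).__name__)
print(explicit_params)
```

Calling `get_params()` on the model in a notebook is therefore a cheap check to run before training it in the visual ML interface.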

Answers

  • arnaudde
    arnaudde Dataiker Posts: 52 Dataiker

    Hello,
    Custom models should be defined in a DSS library and then loaded in the Models > Design > Algorithms tab, as suggested in the doc.

    Can you confirm that you defined the model in a library?
    Could you also share a screenshot of the error when training your custom model?

    Hope it helps,

    Arnaud

  • emher
    emher Registered Posts: 32 ✭✭✭✭✭

    Hi @arnaudde,

    I tried moving the model into the project library as well, but it didn't make any difference. The error looks like this:

    Error.png

  • emher
    emher Registered Posts: 32 ✭✭✭✭✭

    It seems that my problems might be related to the fact that I am using a partitioned model.

  • arnaudde
    arnaudde Dataiker Posts: 52 Dataiker
    edited July 17

    Hello,
    I think the problem with your first model is that the predict method returns a pandas Series with as many rows as the training dataset, whereas it should return a pandas Series with as many rows as the test set.

        def predict(self, X):
            return self.y 
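
The length mismatch can be reproduced outside Dataiku. The sketch below uses synthetic data with 80 training rows and 20 test rows (sizes chosen only for illustration) and assumes `X` exposes a `shape` attribute, as pandas DataFrames and NumPy arrays do. `MeanRegressor` is a hypothetical replacement, not the model from the documentation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator


class MyRegressor(BaseEstimator):
    """The model from the question: predict() returns the training target."""

    def __init__(self):
        self.y = None

    def fit(self, X, y):
        self.y = y
        return self

    def predict(self, X):
        return self.y


class MeanRegressor(BaseEstimator):
    """Hypothetical fix: predict the training mean, one value per row of X."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return pd.Series(np.full(X.shape[0], self.mean_))


# Synthetic train/test data, made up for this example.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(80, 3)))
y_train = pd.Series(rng.normal(size=80))
X_test = pd.DataFrame(rng.normal(size=(20, 3)))

bad_pred = MyRegressor().fit(X_train, y_train).predict(X_test)
good_pred = MeanRegressor().fit(X_train, y_train).predict(X_test)

# The first model returns 80 values for 20 test rows; the second returns 20.
print(len(bad_pred), len(good_pred), len(X_test))
```

Any scoring step that compares predictions with the test target row by row cannot work with the first model, because the two lengths differ.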

    If you use the Random Regressor from the doc, you should be fine for both a regular model and a partitioned model. I encourage you to test with the sample from the doc:

    from sklearn.base import BaseEstimator
    import numpy as np
    import pandas as pd
    
    class MyRandomRegressor(BaseEstimator):
        """This model predicts random values between the minimum and the maximum of y"""
    
        def fit(self, X, y):
            self.y_range = [np.min(y), np.max(y)]
    
        def predict(self, X):
            return pd.Series(np.random.uniform(self.y_range[0], self.y_range[1], size=X.shape[0]))
    


    We will try to improve the logging for partitioned models.

    Best,
    Arnaud
