Does DSS add rows to the dataset when training models?

Frank
Frank Registered Posts: 11 ✭✭✭✭
I noticed that after I modified the model settings, the number of rows (train + test) is larger than the number of rows in the original dataset.

Answers

  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Hi,

    DSS cannot add rows to a dataset out of the blue. Could you share the settings you used and some screenshots of the problem?

    Cheers,
  • Frank
    Frank Registered Posts: 11 ✭✭✭✭
    I have checked many times. After you train the model, the "Result" page shows something like:
    "Train set 3xxxx rows
    Test set 8xxx rows
    Train time about 6 seconds"
    but the original dataset has only about 9000 rows. I used the default settings, with a 0.8/0.2 split.
    Could be a bug.

    I searched the Q&A, and it seems someone else had a similar problem.
  • Frank
    Frank Registered Posts: 11 ✭✭✭✭
    Forgot to add: I checked the Python code for the model, and the row counts are correct for both the train set and the test set. I don't know what is going on.
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    If you go to the "Status" tab of the dataset, make sure the "Count of records" metric is displayed, and click "Compute", how many records does it report?

    Additionally, could you retrain your model? In the pre-train modal, make sure to check the "Drop existing sets, recompute new ones" checkbox.
  • Frank
    Frank Registered Posts: 11 ✭✭✭✭
    The "Status" tab shows the right record count, which is around 9k. Every time I retrain the model, I check the "Drop existing sets, recompute new ones" checkbox. The "Result" page still shows more than 30k records in total.
    Back in 2017, someone already had the same problem, but no fix was provided. See the link below:
    https://answers.dataiku.com/1155/train-set-for-a-model-has-more-rows-than-actual-dataset?show=1155#q1155
  • Frank
    Frank Registered Posts: 11 ✭✭✭✭
    Is it possible that DSS rebalances samples when dealing with an imbalanced dataset? In that case, would rebalancing simulate records and, as a result, add records to the dataset for training purposes?
  • Frank
    Frank Registered Posts: 11 ✭✭✭✭
    After I trained a model on a different dataset, I found these details:
    ALGORITHM DETAILS
    Algorithm Logistic regression
    Penalty None
    C
    TRAINING DATA
    Rows (before preprocessing) 36393
    Rows (after preprocessing) 18683
    Columns (before preprocessing) 73
    Columns (after preprocessing) 145
    Matrix type dense

    Maybe this will help. The number of "after preprocessing" records is right, but I don't understand what "Rows (before preprocessing) 36393" means. How did DSS come up with this number?
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Can you tell me which version of DSS you are using?
    And do you have any string columns in your dataset?
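
The row-count check discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions, not DSS's actual splitting logic: the helper names `split_counts` and `check_reported` are hypothetical, and the "reported" figures in the usage example are made-up stand-ins, since the thread only gives approximate values. It shows that a 0.8/0.2 split of a dataset can never yield more train + test rows than the dataset contains, which is why a larger reported total points at something other than the split itself.

```python
import random


def split_counts(n_rows, train_ratio=0.8, seed=0):
    """Shuffle row indices and split them; return (n_train, n_test).

    Hypothetical helper: a plain random split, not DSS's implementation.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_ratio)
    train, test = indices[:n_train], indices[n_train:]
    return len(train), len(test)


def check_reported(n_dataset, n_train_reported, n_test_reported):
    """Compare reported train/test sizes against the dataset's record count."""
    total = n_train_reported + n_test_reported
    return {
        "dataset_rows": n_dataset,
        "reported_total": total,
        # A split can only partition existing rows, never create new ones.
        "consistent": total <= n_dataset,
    }


if __name__ == "__main__":
    # 9000 is the approximate dataset size mentioned in the thread.
    n_train, n_test = split_counts(9000)
    print(n_train, n_test)
    print(check_reported(9000, n_train, n_test))
    # Made-up stand-in values for a "Result" page that reports too many rows.
    print(check_reported(9000, 30000, 8000))
```

If the reported total fails this check while the dataset's "Count of records" metric is correct, the extra rows must come from somewhere other than the split, which is what the questions above about the DSS version and string columns are trying to narrow down.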