After I trained the model on a different dataset, I found these details:

ALGORITHM DETAILS
Algorithm                     Logistic regression
Penalty                       None
C

TRAINING DATA
Rows (before preprocessing)   36393
Rows (after preprocessing)    18683
Columns (before preprocessing) 73
Columns (after preprocessing) 145
Matrix type                   dense

Maybe this will help. The number of "after preprocessing" rows is right, but I don't understand what "Rows (before preprocessing) 36393" means. How did DSS come up with this number?
The "Status" tab shows the right record count, which is around 9k. Every time I retrain the model, I check the "Drop existing sets, recompute new ones" checkbox, but the "Result" page still shows more than 30k records in total. Back in 2017 someone already had the same problem, but no fix was provided. See the link below: https://answers.dataiku.com/1155/train-set-for-a-model-has-more-rows-than-actual-dataset?show=1155#q1155
I have checked many times. After you train the model, the "Result" page shows something like "Train set 3xxxx rows Test set 8xxx rows Train time about 6 seconds", but the original dataset only has around 9000 rows. I used the default settings, with a 0.8-0.2 split. Could be a bug.
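As a quick sanity check on these numbers, here is a minimal sketch of what a 0.8-0.2 split should give. It assumes a round 9000 rows in total (the exact count isn't known), and the function names are purely illustrative; this is plain arithmetic and not how DSS computes anything internally.

```python
def expected_split(total_rows: int, train_ratio: float = 0.8) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a simple ratio-based split."""
    train_rows = round(total_rows * train_ratio)
    return train_rows, total_rows - train_rows


def implied_total(train_rows: int, train_ratio: float = 0.8) -> int:
    """Return the dataset size a given train-set size would imply."""
    return round(train_rows / train_ratio)


# Assumed figure: roughly 9000 rows, as shown on the "Status" tab.
train_rows, test_rows = expected_split(9000)
print(train_rows, test_rows)

# The 36393 rows reported before preprocessing, read as a 0.8 train share.
print(implied_total(36393))
```

So with about 9000 rows I would expect a train set of roughly 7200 rows and a test set of roughly 1800, nowhere near the 3xxxx and 8xxx that the "Result" page reports.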
I searched the Q&A site, and it seems someone else had a similar problem.