Using Dataiku
- Hi! I created a model using built-in Dataiku models. However, the results are quite suspicious, so I would like to ask you some questions. In the attached screenshot you can see that the model I created i…Last answer by Clément_Stenac: We confirm that all performance metrics shown in DSS are based on the test set - we currently never show performance on the train set. In the case of K-fold, it's the mean of the out-of-fold scores (so the test set too).
So you do see a downward trend when going from a "reasonable" to a "very deep" random forest (from 0.95 to 0.892), which is indeed probably indicative of overfitting, although it's not as severe as you expected - possibly because: (a) your train and test sets are very similar; (b) the random picking of features adds enough diversity to counteract part of the overfitting effect. It can also happen if you don't have much data, which means that your trees are not "full".
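The depth-vs-overfitting check discussed above can be reproduced outside DSS by comparing train and test R2 at different tree depths. This is a minimal sketch on synthetic data; in practice you would export the DSS model to a notebook and score your own train/test split.

```python
# Sketch: compare train vs test R2 for a "reasonable" vs a fully
# grown ("very deep") random forest, as a quick overfitting check.
# The dataset, depths, and forest size here are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for depth in (5, None):  # None lets trees grow until the leaves are pure
    rf = RandomForestRegressor(max_depth=depth, n_estimators=100, random_state=0)
    rf.fit(X_tr, y_tr)
    # A large gap between the two scores is the overfitting signal
    print(depth, round(rf.score(X_tr, y_tr), 3), round(rf.score(X_te, y_te), 3))
```

A shrinking test score as depth grows, while the train score stays near 1, is the pattern described in the answer above.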
- Hi, I'm wondering if the R2 scores on the grid search graph are calculated on a different dataset than the R2 score in the detailed metrics. The reason I'm curious is that they differ quite a bit. Ki…
- Hello, I'm working with a client that needs probability calibration in Dataiku. You can learn about probability calibration from the sklearn documentation. Basically, I need to instantiate an object o…Last answer by Alex_Combessie: When passing a "clf" object in the custom Python models screen, we call the fit method on the entire object, so it will fit the full pipeline of CalibratedClassifierCV(GridSearchCV(clf_base)). Then the fitted pipeline is applied (by the predict method) to the test set.
Note that the scikit-learn documentation advises *not* to use the same data for fitting and calibration: http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html. So the cleanest way from a statistical point of view would be not to make a pipeline of CalibratedClassifierCV with a Classifier on the same data. Instead, you can train from the visual interface, then export to a Jupyter notebook, and use the notebook as a starting template to calibrate your classifier on new data.
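One way to follow the cited advice is to rely on CalibratedClassifierCV's built-in cross-validation, which fits the base classifier on K-1 folds and calibrates on the held-out fold, so fitting and calibration never see the same rows. A minimal sketch, with an illustrative dataset and base model:

```python
# Sketch: calibration where fit data and calibration data stay
# disjoint within each internal CV fold, in line with the
# scikit-learn advice cited above. Data and model are illustrative.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# cv=3: the base forest is fit on 2/3 of X_tr, calibrated on the rest
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0), cv=3)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)  # calibrated probabilities
```

This is the notebook-style workflow the answer suggests, rather than wrapping the calibrator inside the DSS custom-model pipeline.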
Another, simpler answer when score calibration is important is to advise users to use logistic regression.
- I have a fairly small dataset where I'm trying to use RF, LR, and XGB algorithms to predict a labeled classification column. The other features in the CSV are mostly numerical (decimal, integer), plus 2 strings. E…
- Hi, when using nested cross-validation for hyperparameter search/performance estimation, it's not clear how the best hyperparameters are chosen. In nested CV, we can actually have different "best…Solution by Alex_Combessie
Hello Rui,
This is a good question and important topic indeed.
When using K-fold both for hyperparameter search and for testing in DSS visual ML interface, what happens for one given model type follows these steps:
- Hyperparameter search: The dataset is split into K_hyperparam random parts, stratified with respect to the target. For each combination of hyperparameters in the grid, a model is trained K times to find the best combination. Finally, the model with the best combination is retrained on the entire dataset. This is the model used for deployment purposes.
  Example for 3 folds:
  - for each combination of hyperparameters:
    - train the model on folds 1+2, then evaluate on fold 3
    - train the model on folds 1+3, then evaluate on fold 2
    - train the model on folds 2+3, then evaluate on fold 1
  - choose the combination of hyperparameters that maximizes the average of the chosen performance metric over the 3 folds
- Test: The dataset is split again into K_test random parts, independently from the previous randomization, and with no stratification with respect to the target. The model with the best hyperparameter combination of Step 1 is trained and evaluated on the new test folds in a similar way as before. The reported performance metrics are averaged across folds.
Hence the formulas for the number of model trainings are the following:
- For a given model type: (number of hyperparameter combinations x K_hyperparam) + 1 retrain on the full dataset + K_test
- For all model types selected: the sum of the above over all model types
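The two independent stages described above can be sketched with scikit-learn as a stand-in for what DSS does internally; the model, grid, and fold counts below are illustrative:

```python
# Sketch of the two-stage K-fold process described above:
# Stage 1 searches hyperparameters on stratified folds and refits the
# best model on the full dataset; Stage 2 estimates performance on a
# new, independent, non-stratified split. All names are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Stage 1 - hyperparameter search: K_hyperparam stratified folds
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)  # best_estimator_ is refit on the entire dataset

# Stage 2 - test: a new, independent, non-stratified K_test split
scores = cross_val_score(
    LogisticRegression(max_iter=1000, **search.best_params_),
    X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)
print(scores.mean())  # reported metric = mean across the test folds
```

Note how the Stage 2 folds are drawn with a different random state, mirroring the "independently from the previous randomization" point above.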
A few important clarifications:
- The hyperparameter search and test are done independently, but rely on the same random folds across model types.
- By design, this process differs from a "nested" strategy combining hyperparameter search and test sequentially (see https://sebastianraschka.com/faq/docs/evaluate-a-model.html - Scenario 3). We chose this to avoid the following drawbacks of that strategy:
  - high computational cost: the number of models to train scales as ~ K_hyperparam x K_test instead of K_hyperparam + K_test
  - smaller folds, which require having a lot of labels and can fail for rare classes
Cheers,
Alex
- Hi, just a clarification: on the GBT partial dependence plots, what data/model is being used for the PDP? E.g. when using test K-fold cross-validation, is it using a final model fitted with all the data? …
- Currently there doesn't seem to be a way to do this: I'm trying to reuse an existing model for predicting another column. How can I do that in Dataiku? It would be very useful; it seems like a very basic need, probably someth…Last answer by Thomas_K: I second this. After manually creating a model and configuring lots of stuff before training, I now want to test the model's performance on a lower-level target variable (think city level instead of state level). In code, I would simply change the target variable string to "cities" instead of "states". In Dataiku, I would have to create a new model and then go through all the configuration steps again (select explanatory variables, change their interpretation, set their levels, set tree depths...).
- I am currently leading a statistical analysis of absenteeism data. In this study, I am studying the influence of multiple factors on employees' presence at work. But any time I use the logistic regressi…Solution by Clément_Stenac: Hi,
DSS only shows p-values when there are fewer than 1000 coefficients (after preprocessing, so each categorical value becomes a coefficient). Even if you have fewer than 1000 coefficients, computing p-values is not always possible due to numerical issues.
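The "coefficients after preprocessing" count that the 1000-coefficient limit refers to can be estimated by counting dummy columns, since each category of each categorical feature becomes one coefficient. A small sketch with made-up features:

```python
# Sketch: estimating the post-preprocessing coefficient count for
# categorical features - each distinct category contributes one
# dummy coefficient. Column names and values are hypothetical.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "department": ["sales", "hr", "it", "sales"],
    "contract":   ["full", "part", "full", "full"],
})
encoded = OneHotEncoder().fit(df)
n_coefficients = sum(len(cats) for cats in encoded.categories_)
print(n_coefficients)  # 3 department categories + 2 contract categories = 5
```

A few high-cardinality categorical columns can therefore push a model past the limit even with only a handful of raw features.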
Beware that logistic regression in DSS is always regularized, and p-values are not strictly defined for regularized regressions.
- I did clustering with the K-means model and I wish to understand how the variable importance percentages in the histogram are calculated. What do they measure? Thanks
- From the log: python(5787,0x70000fcc5000) malloc: *** error for object 0x7ff8fe7317e0: incorrect checksum for freed object - object was probably modified after being freed. [2018/01/24-14:07:02.303] […Last answer by Clément_Stenac: Hi,
This looks like a memory corruption bug in one of the underlying numerical computation libraries (numpy, pandas, BLAS, ...). Is it reproducible? Is it reproducible with other algorithms on this dataset? Could you share details about your setup? Are you at liberty to share this dataset?