Using Dataiku

61 - 70 of 1197
  • Hi! I created a model using built-in Dataiku models. However, the results are quite suspicious, so I would like to ask you some questions. In the attached screenshot you can see that the model I created i…
    Question
    Started by Povilas
    Most recent by Clément_Stenac
    0
    3
    Last answer by Clément_Stenac
    We confirm that all performance metrics shown in DSS are based on the test set - we never show performance on the train set. In the case of K-fold, the reported metric is the mean across out-of-fold evaluations (so it is a test-set metric too).

    So the downward trend you see when going from a "reasonable" to a "very deep" random forest (from 0.95 to 0.892) is indeed probably indicative of overfitting, although it is less severe than you expected, possibly because: (a) your train and test sets are very similar; (b) the random picking of features adds enough diversity to counteract part of the overfitting effect. It could also happen if you do not have much data, which means your trees are not "full".
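    To illustrate, here is a minimal scikit-learn sketch (the synthetic data and parameters are assumptions, not the original setup) comparing train-set and test-set AUC for a shallow versus a fully grown random forest:

    ```python
    # Hypothetical illustration: test-set metrics of a shallow vs. a very
    # deep random forest. The data and parameters are assumed, not DSS's.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (5, None):  # None lets the trees grow fully ("very deep")
        clf = RandomForestClassifier(max_depth=depth, random_state=0)
        clf.fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(f"max_depth={depth}: train AUC={train_auc:.3f}, test AUC={test_auc:.3f}")
    ```

    A widening gap between train and test AUC as depth grows is the classic signature of overfitting.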
  • Hi, I'm wondering if the R2 scores are calculated on a different dataset in the grid search graph vs. the R2 score in the detailed metrics. The reason I'm curious is that they differ quite a bit. Ki…
    Question
    Started by nv
    Most recent by Alex_Combessie
    1
    1
    Last answer by Alex_Combessie
    Hi,

    Metrics reported in the graph are computed on the hyperparameter grid-search configuration (k-fold by default), while metrics in the "Detailed metrics" tab are based on the test set configuration. They are expected to differ by design.

    Cheers,

    Alex
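    As an illustration (not DSS internals), here is a minimal scikit-learn sketch of why the two numbers differ: the grid-search score is a cross-validation score computed on the search folds, while the detailed metric comes from a held-out test set. The dataset and grid are assumptions:

    ```python
    # Hypothetical illustration of grid-search (k-fold) R2 vs. held-out R2.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_regression(n_samples=400, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3, scoring="r2")
    search.fit(X_train, y_train)

    print("grid-search (k-fold) R2:", search.best_score_)     # the graph
    print("held-out test R2:", search.score(X_test, y_test))  # detailed metrics
    ```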
  • Hello, I'm working with a client that needs probability calibration in Dataiku. You can learn about probability calibration from the sklearn documentation. Basically, I need to instantiate an object o…
    Question
    Started by UserBird
    Most recent by Alex_Combessie
    0
    4
    Last answer by Alex_Combessie
    When passing a "clf" object in the custom Python models screen, we call the fit method on the entire object, so it fits the full pipeline CalibratedClassifierCV(GridSearchCV(clf_base)). The fitted pipeline is then applied (via its predict method) to the test set.

    Note that the scikit-learn documentation advises *not* to use the same data for fitting and calibration: http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html. So the cleanest approach from a statistical point of view is not to pipeline CalibratedClassifierCV with a classifier trained on the same data. Instead, train from the visual interface, export the model to a Jupyter notebook, and use the notebook as a starting template to calibrate your classifier on new data.

    A simpler alternative, in cases where score calibration is important, is to advise users to use logistic regression, whose outputs tend to be well calibrated.
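    As a sketch of the recommended workflow (synthetic data is assumed; the cv="prefit" option is used here, which newer scikit-learn releases replace with FrozenEstimator), fit the classifier on one split and calibrate it on data it has never seen:

    ```python
    # Hypothetical illustration: calibrate an already-fitted classifier on
    # held-out data, as the scikit-learn documentation recommends.
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_fit, X_calib, y_fit, y_calib = train_test_split(X, y, random_state=0)

    base = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)

    # cv="prefit" calibrates the fitted model on the calibration split only
    # (deprecated in recent scikit-learn in favor of FrozenEstimator).
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
    calibrated.fit(X_calib, y_calib)
    print(calibrated.predict_proba(X_calib[:5]))
    ```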
  • I have a fairly small data set where I'm trying to use RF, LR, and XGB algorithms to predict a labeled classification column. Other features in the CSV are mostly numerical (decimal and integer), plus 2 strings. E…
    Question
    Started by UserBird
    0
  • Hi, when using nested cross-validation for hyperparameter search/performance estimation, it's not clear how the best hyperparameters are chosen. In nested CV, we can actually have different "best…
    Answered ✓
    Started by UserBird
    Most recent by UserBird
    0
    4
    Solution by Alex_Combessie

    Hello Rui,

    This is a good question and an important topic indeed.

    When using K-fold both for hyperparameter search and for testing in the DSS visual ML interface, here is what happens for a given model type:

    1. Hyperparameter search: The dataset is split into K_hyperparam random parts, stratified with respect to the target. For each combination of hyperparameters in the grid, a model is trained and evaluated K_hyperparam times. Finally, the model with the best combination is retrained on the entire dataset; this is the model used for deployment.

      1. Example for 3 folds:

        • for each combination of hyperparameter:

          • train the model on folds 1+2 then evaluate on fold 3,

          • train the model on folds 1+3 then evaluate on fold 2,

          • train the model on folds 2+3 then evaluate on fold 1

        • Choose the combination of hyperparameters that maximizes the average of the chosen performance metric on all 3 folds

    2. Test: The dataset is split again into K_test random parts, independently of the previous randomization and with no stratification with respect to the target. The model with the best hyperparameter combination from step 1 is trained and evaluated on the new test folds in the same way as before. The reported performance metrics are averaged across folds (a code sketch of the full procedure follows the notes below).

    Hence the number of model trainings is:

    • For a given model type: (number of hyperparameter combinations × K_hyperparam) for the search, plus 1 retraining on the full dataset, plus K_test for the test

    • For all model types selected: the sum of the above over each selected model type

    A few important clarifications:

    • The hyperparameter search and test are done independently, but rely on the same random folds across model types.
    • By design, this process is different from a "nested" strategy that combines hyperparameter search and test sequentially (see https://sebastianraschka.com/faq/docs/evaluate-a-model.html, Scenario 3). We chose this to avoid the following drawbacks of that strategy:
      • high computational cost: number of models to train ~ K_hyperparam x K_test instead of K_hyperparam + K_test

      • smaller folds, which require having a lot of labels and can fail for rare classes
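
    For illustration, here is a minimal scikit-learn sketch (not DSS internals) of the two-step procedure described above; the dataset, grid, fold counts, and metric are assumptions:

    ```python
    # Hypothetical illustration: stratified k-fold hyperparameter search,
    # then performance estimation on independent, non-stratified folds.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                         cross_val_score)

    X, y = make_classification(n_samples=500, random_state=0)
    K_HYPERPARAM, K_TEST = 3, 3

    # Step 1: search on stratified folds; refit=True (the default) retrains
    # the best combination on the entire dataset.
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"max_depth": [3, 6, None]},
        cv=StratifiedKFold(n_splits=K_HYPERPARAM, shuffle=True, random_state=0),
        scoring="roc_auc",
    )
    search.fit(X, y)

    # Step 2: evaluate the best combination on independent test folds and
    # report the mean across folds.
    test_scores = cross_val_score(
        search.best_estimator_, X, y,
        cv=KFold(n_splits=K_TEST, shuffle=True, random_state=1),
        scoring="roc_auc",
    )
    print(search.best_params_, test_scores.mean())
    ```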

    Cheers,

    Alex

  • Hi, just a clarification: on the GBT partial dependence plots, what data/model is being used for the PDP? E.g., when using K-fold cross-validation for the test. Is it using a final model fitted with all the data? …
    Answered ✓
    Started by UserBird
    Most recent by Alex_Combessie
    0
    1
    Solution by Alex_Combessie
    Hi,

    We fit the model on the full train data and then compute the partial dependence plots.

    Cheers,

    Alex
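    As an illustrative sketch (synthetic data; sklearn.inspection.partial_dependence is used here, which may differ from DSS internals), fit a GBT on the full training data and then compute partial dependence from that fitted model:

    ```python
    # Hypothetical illustration: PDP computed from a model fitted on the
    # full training data.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import partial_dependence

    X, y = make_regression(n_samples=300, n_features=4, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)  # full train data

    # Partial dependence of the prediction on feature 0.
    pd_result = partial_dependence(model, X, features=[0])
    print(pd_result["average"].shape)
    ```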
  • There currently doesn't seem to be a way to do that? I'm trying to reuse an existing model to predict another column. How can I do that in Dataiku? It would be very useful. It seems a very basic need; probably someth…
    Question
    Started by UserBird
    Most recent by Thomas_K
    1
    3
    Last answer by Thomas_K
    I second this. After manually creating a model and configuring lots of stuff before training, I now want to test the model's performance on a lower-level target variable (think city level instead of state level). In code, I would simply change the target variable string to "cities" instead of "states". In Dataiku, I would have to create a new model and then go through all the configuration steps again (select explanatory variables, change their interpretation, set their levels, set tree depths...).
  • I am currently leading a statistical analysis of absenteeism data. In this study, I am examining the influence of multiple factors on employees' presence at work. But any time I use the logistic regressi…
    Answered ✓
    Started by SimonDeschamps
    Most recent by Alex_Combessie
    1
    3
    Solution by Clément_Stenac
    Hi,

    DSS only shows p-values when there are fewer than 1000 coefficients (counted after preprocessing, so each categorical value becomes its own coefficient). Even with fewer than 1000 coefficients, computing p-values is not always possible due to numerical issues.

    Beware that logistic regression in DSS is always regularized, and p-values are not strictly defined for regularized regressions.
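    DSS's own p-value computation is not shown here; as an illustration, p-values for an unregularized logistic regression can be obtained with statsmodels (synthetic data, assumed setup):

    ```python
    # Hypothetical illustration: unregularized logistic regression with
    # per-coefficient p-values via statsmodels.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

    model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)  # no regularization
    print(model.pvalues)  # one p-value per coefficient (plus intercept)
    ```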
  • I did clustering with the K-means model and I wish to understand how the variable importance percentages in the histogram are calculated. What do they measure? Thanks
    Answered ✓
    Started by tifo
    Most recent by Alex_Combessie
    1
    3
    Solution by Alex_Combessie
    We fit a simple supervised random forest model on the output classes of the k-means. This allows us to derive variable importances, following the standard random forest method (as implemented in scikit-learn).
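    A minimal sketch of this method with scikit-learn (synthetic data; the exact DSS parameters are not public):

    ```python
    # Hypothetical illustration: variable importance for k-means clusters
    # via a random forest trained on the cluster labels.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier

    X, _ = make_blobs(n_samples=500, n_features=5, centers=3, random_state=0)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    rf = RandomForestClassifier(random_state=0).fit(X, labels)
    print(rf.feature_importances_)  # one importance per input variable
    ```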
  • From the log: python(5787,0x70000fcc5000) malloc: *** error for object 0x7ff8fe7317e0: incorrect checksum for freed object - object was probably modified after being freed. [2018/01/24-14:07:02.303] […
    Question
    Started by UserBird
    Most recent by Clément_Stenac
    0
    1
    Last answer by Clément_Stenac
    Hi,

    This looks like a memory corruption bug in one of the underlying numerical computation libraries (numpy, pandas, BLAS, ...). Is it reproducible? Is it reproducible with other algorithms on this dataset? Could you share details about your setup? Are you at liberty to share this dataset?