Using Dataiku

61 - 70 of 1197
  • Hi! I created a model using built-in Dataiku models. However, the results are quite suspicious, so I would like to ask you some questions. In the attached screenshot you can see that the model I created i…
    Question
    Started by Povilas
    Most recent by Clément_Stenac
    0
    3
    Last answer by Clément_Stenac
    We confirm that all performance metrics shown in DSS are based on the test set - we never show performance on the train set. In the case of K-fold, the reported metric is the mean across out-of-fold evaluations (so it is a test-set metric too).

    So the downward trend you see when going from a "reasonable" to a "very deep" random forest (from 0.95 to 0.892) is indeed probably indicative of overfitting, although it is less severe than you expected, possibly because: (a) your train and test sets are very similar; (b) the random picking of features adds enough diversity to counteract part of the overfitting effect. It could also happen if you do not have much data, which means your trees are not "full".
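    To illustrate, here is a minimal scikit-learn sketch (the synthetic data and parameters are assumptions, not the original setup) comparing train-set and test-set AUC for a shallow versus a fully grown random forest:

    ```python
    # Hypothetical illustration: test-set metrics of a shallow vs. a very
    # deep random forest. The data and parameters are assumed, not DSS's.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (5, None):  # None lets the trees grow fully ("very deep")
        clf = RandomForestClassifier(max_depth=depth, random_state=0)
        clf.fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(f"max_depth={depth}: train AUC={train_auc:.3f}, test AUC={test_auc:.3f}")
    ```

    A widening gap between train and test AUC as depth grows is the classic signature of overfitting.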
  • Hi, I'm wondering if the R2 scores are calculated on a different dataset in the grid search graph vs. the R2 score in the detailed metrics. The reason I'm curious is that they differ quite a bit. Ki…
    Question
    Started by nv
    Most recent by Alex_Combessie
    1
    1
    Last answer by Alex_Combessie
    Hi,

    Metrics reported in the graph are computed on the hyperparameter grid-search configuration (k-fold by default), while metrics in the "Detailed metrics" tab are based on the test set configuration. They are expected to differ by design.

    Cheers,

    Alex
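    As an illustration (not DSS internals), here is a minimal scikit-learn sketch of why the two numbers differ: the grid-search score is a cross-validation score computed on the search folds, while the detailed metric comes from a held-out test set. The dataset and grid are assumptions:

    ```python
    # Hypothetical illustration of grid-search (k-fold) R2 vs. held-out R2.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_regression(n_samples=400, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3, scoring="r2")
    search.fit(X_train, y_train)

    print("grid-search (k-fold) R2:", search.best_score_)     # the graph
    print("held-out test R2:", search.score(X_test, y_test))  # detailed metrics
    ```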
  • Hello, I'm working with a client that needs probability calibration in Dataiku. You can learn about probability calibration from the sklearn documentation. Basically, I need to instantiate an object o…
    Question
    Started by UserBird
    Most recent by Alex_Combessie
    0
    4
    Last answer by Alex_Combessie
    When passing a "clf" object in the custom Python models screen, we call the fit method on the entire object, so it fits the full pipeline CalibratedClassifierCV(GridSearchCV(clf_base)). The fitted pipeline is then applied (via its predict method) to the test set.

    Note that the scikit-learn documentation advises *not* to use the same data for fitting and calibration: http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html. So the cleanest approach from a statistical point of view is not to pipeline CalibratedClassifierCV with a classifier trained on the same data. Instead, train from the visual interface, export the model to a Jupyter notebook, and use the notebook as a starting template to calibrate your classifier on new data.

    A simpler alternative, in cases where score calibration is important, is to advise users to use logistic regression, whose outputs tend to be well calibrated.
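    As a sketch of the recommended workflow (synthetic data is assumed; the cv="prefit" option is used here, which newer scikit-learn releases replace with FrozenEstimator), fit the classifier on one split and calibrate it on data it has never seen:

    ```python
    # Hypothetical illustration: calibrate an already-fitted classifier on
    # held-out data, as the scikit-learn documentation recommends.
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_fit, X_calib, y_fit, y_calib = train_test_split(X, y, random_state=0)

    base = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)

    # cv="prefit" calibrates the fitted model on the calibration split only
    # (deprecated in recent scikit-learn in favor of FrozenEstimator).
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
    calibrated.fit(X_calib, y_calib)
    print(calibrated.predict_proba(X_calib[:5]))
    ```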
  • I have a fairly small data set where I'm trying to use RF, LR, and XGB algorithms to predict a labeled classification column. Other features in the CSV are mostly numerical (decimal and integer), plus 2 strings. E…
    Question
    Started by UserBird
    0
  • Hi, when using nested cross-validation for hyperparameter search/performance estimation, it's not clear how the best hyperparameters are chosen. In nested CV, we can actually have different "best…
    Answered ✓
    Started by UserBird
    Most recent by UserBird
    0
    4
    Solution by Alex_Combessie

    Hello Rui,

    This is a good question and an important topic indeed.

    When using K-fold both for hyperparameter search and for testing in the DSS visual ML interface, here is what happens for a given model type:

    1. Hyperparameter search: The dataset is split into K_hyperparam random parts, stratified with respect to the target. For each combination of hyperparameters in the grid, a model is trained and evaluated K_hyperparam times. Finally, the model with the best combination is retrained on the entire dataset; this is the model used for deployment.

      1. Example for 3 folds:

        • for each combination of hyperparameter:

          • train the model on folds 1+2 then evaluate on fold 3,

          • train the model on folds 1+3 then evaluate on fold 2,

          • train the model on folds 2+3 then evaluate on fold 1

        • Choose the combination of hyperparameters that maximizes the average of the chosen performance metric on all 3 folds

    2. Test: The dataset is split again into K_test random parts, independently of the previous randomization and with no stratification with respect to the target. The model with the best hyperparameter combination from step 1 is trained and evaluated on the new test folds in the same way as before. The reported performance metrics are averaged across folds (a code sketch of the full procedure follows the notes below).

    Hence the number of model trainings is:

    • For a given model type: (number of hyperparameter combinations × K_hyperparam) for the search, plus 1 retraining on the full dataset, plus K_test for the test

    • For all model types selected: the sum of the above over each selected model type

    A few important clarifications:

    • The hyperparameter search and test are done independently, but rely on the same random folds across model types.
    • By design, this process is different from a "nested" strategy that combines hyperparameter search and test sequentially (see https://sebastianraschka.com/faq/docs/evaluate-a-model.html, Scenario 3). We chose this to avoid the following drawbacks of that strategy:
      • high computational cost: number of models to train ~ K_hyperparam x K_test instead of K_hyperparam + K_test

      • smaller folds, which require having a lot of labels and can fail for rare classes
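
    For illustration, here is a minimal scikit-learn sketch (not DSS internals) of the two-step procedure described above; the dataset, grid, fold counts, and metric are assumptions:

    ```python
    # Hypothetical illustration: stratified k-fold hyperparameter search,
    # then performance estimation on independent, non-stratified folds.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                         cross_val_score)

    X, y = make_classification(n_samples=500, random_state=0)
    K_HYPERPARAM, K_TEST = 3, 3

    # Step 1: search on stratified folds; refit=True (the default) retrains
    # the best combination on the entire dataset.
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"max_depth": [3, 6, None]},
        cv=StratifiedKFold(n_splits=K_HYPERPARAM, shuffle=True, random_state=0),
        scoring="roc_auc",
    )
    search.fit(X, y)

    # Step 2: evaluate the best combination on independent test folds and
    # report the mean across folds.
    test_scores = cross_val_score(
        search.best_estimator_, X, y,
        cv=KFold(n_splits=K_TEST, shuffle=True, random_state=1),
        scoring="roc_auc",
    )
    print(search.best_params_, test_scores.mean())
    ```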

    Cheers,

    Alex

  • Hi, just a clarification: on the GBT partial dependence plots, what data/model is being used for the PDP? E.g., when using K-fold cross-validation for the test. Is it using a final model fitted with all the data? …
    Answered ✓
    Started by UserBird
    Most recent by Alex_Combessie
    0
    1
    Solution by Alex_Combessie
    Hi,

    We fit the model on the full train data and then compute the partial dependence plots.

    Cheers,

    Alex
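    As an illustrative sketch (synthetic data; sklearn.inspection.partial_dependence is used here, which may differ from DSS internals), fit a GBT on the full training data and then compute partial dependence from that fitted model:

    ```python
    # Hypothetical illustration: PDP computed from a model fitted on the
    # full training data.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import partial_dependence

    X, y = make_regression(n_samples=300, n_features=4, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)  # full train data

    # Partial dependence of the prediction on feature 0.
    pd_result = partial_dependence(model, X, features=[0])
    print(pd_result["average"].shape)
    ```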
  • There currently doesn't seem to be a way to do that? I'm trying to reuse an existing model to predict another column. How can I do that in Dataiku? It would be very useful. It seems a very basic need; probably someth…
    Question
    Started by UserBird
    Most recent by Thomas_K
    1
    3
    Last answer by Thomas_K
    I second this. After manually creating a model and configuring lots of stuff before training, I now want to test the model's performance on a lower-level target variable (think city level instead of state level). In code, I would simply change the target variable string to "cities" instead of "states". In Dataiku, I would have to create a new model and then go through all the configuration steps again (select explanatory variables, change their interpretation, set their levels, set tree depths...).
  • I am currently leading a statistical analysis of absenteeism data. In this study, I am examining the influence of multiple factors on employees' presence at work. But any time I use the logistic regressi…
    Answered ✓
    Started by SimonDeschamps
    Most recent by Alex_Combessie
    1
    3
    Solution by Clément_Stenac
    Hi,

    DSS only shows p-values when there are fewer than 1000 coefficients (counted after preprocessing, so each categorical value becomes its own coefficient). Even with fewer than 1000 coefficients, computing p-values is not always possible due to numerical issues.

    Beware that logistic regression in DSS is always regularized, and p-values are not strictly defined for regularized regressions.
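    DSS's own p-value computation is not shown here; as an illustration, p-values for an unregularized logistic regression can be obtained with statsmodels (synthetic data, assumed setup):

    ```python
    # Hypothetical illustration: unregularized logistic regression with
    # per-coefficient p-values via statsmodels.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

    model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)  # no regularization
    print(model.pvalues)  # one p-value per coefficient (plus intercept)
    ```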
  • I did clustering with the K-means model and I wish to understand how the variable importance percentages in the histogram are calculated. What do they measure? Thanks
    Answered ✓
    Started by tifo
    Most recent by Alex_Combessie
    1
    3
    Solution by Alex_Combessie
    We fit a simple supervised random forest model on the output classes of the k-means. This allows us to derive variable importances, following the standard random forest method (as implemented in scikit-learn).
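    A minimal sketch of this method with scikit-learn (synthetic data; the exact DSS parameters are not public):

    ```python
    # Hypothetical illustration: variable importance for k-means clusters
    # via a random forest trained on the cluster labels.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier

    X, _ = make_blobs(n_samples=500, n_features=5, centers=3, random_state=0)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    rf = RandomForestClassifier(random_state=0).fit(X, labels)
    print(rf.feature_importances_)  # one importance per input variable
    ```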
  • From the log: python(5787,0x70000fcc5000) malloc: *** error for object 0x7ff8fe7317e0: incorrect checksum for freed object - object was probably modified after being freed. [2018/01/24-14:07:02.303] […
    Question
    Started by UserBird
    Most recent by Clément_Stenac
    0
    1
    Last answer by Clément_Stenac
    Hi,

    This looks like a memory corruption bug in one of the underlying numerical computation libraries (numpy, pandas, BLAS, ...). Is it reproducible? Is it reproducible with other algorithms on this dataset? Could you share details about your setup? Are you at liberty to share this dataset?