Stratified split

Cristinel
Cristinel Registered Posts: 3

So I got all the Dataiku certifications in less than 3 weeks; it was a great way to get familiar with the product.

Now I want to go off script a bit.

I want to see how AutoML does on the Iris dataset. I want to split the data so that the classes appear in about the same proportions in train and test. The Split recipe doesn't have a stratified option. Is this because I have a free and training edition, or is it just not there?

BTW, the Dataiku version below only goes up to 13, but it is already 14 now.

It looks like it's not maintained.

Answers

  • Sean
    Sean Dataiker, Alpha Tester, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Moderator Posts: 180 Dataiker

    Hi @Cristinel , I'm not 100% sure I've understood your question, but it doesn't have anything to do with the free version.

    There is a difference between the visual Split recipe and the split that happens within the design of an AutoML task. An AutoML task includes its own division into train and test datasets. You can define these settings in the "Train / Test set" panel within the Design tab of an AutoML task. If you want to do your own split of the data in the Flow, you might look at the "Randomly dispatch data" option and then set the Dispatch mode to "Random subset of column values".

    What material are you referring to that might not be maintained?

  • Cristinel
    Cristinel Registered Posts: 3

    When you click on Ask a question and fill in details, there is a drop-down for Dataiku version.
    That only goes up to 13.

    I want to split the data between Train and Test, and I want to keep the same proportion of classes in the test set as in the train set (and as in the whole dataset). There is no visual option to do that, so I need a Python recipe.

    It looks like this was already asked 6 years ago:
    https://community.dataiku.com/discussion/2151/split-dataset-by-stratified-sampling/p1

    I do not know if it has been implemented.
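
For reference, the Python-recipe route described above can be sketched roughly as follows. This is a minimal sketch, not an official Dataiku method: it assumes pandas and scikit-learn are available in the code environment and relies on scikit-learn's `train_test_split` with its `stratify` argument. The dataset names in the comments (`iris`, `iris_train`, `iris_test`) are hypothetical, and scikit-learn's bundled copy of Iris stands in for the recipe input so the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# In a Dataiku Python recipe the input would typically come from something like
#   dataiku.Dataset("iris").get_dataframe()
# where "iris" is a hypothetical dataset name. Here scikit-learn's bundled
# copy of Iris is used instead so the sketch is self-contained.
df = load_iris(as_frame=True).frame  # 150 rows: 4 feature columns plus "target"

# stratify=df["target"] asks for the class proportions of the full dataset
# to be preserved in both the train and the test part.
train_df, test_df = train_test_split(
    df,
    test_size=0.2,          # 20% of the rows go to the test set
    stratify=df["target"],  # column holding the class label
    random_state=42,        # fixed seed so the split is reproducible
)

# Per-class row counts, to check that both parts are balanced the same way.
print(train_df["target"].value_counts().sort_index())
print(test_df["target"].value_counts().sort_index())

# In a recipe, the two frames would then be written to the output datasets,
# e.g. dataiku.Dataset("iris_train").write_with_schema(train_df)
# (output dataset names are again hypothetical).
```

Because Iris has 50 rows per class, a stratified 80/20 split should give 40 rows per class in train and 10 per class in test.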
