Extract train/test/validation sets from visual ML

yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

Hi,

I am looking for ways to extract the exact train/test/validation sets used in visual ML. By this I mean not only the data splits, but also the datasets that include all new features created by the data processing in visual ML. For example, if dummy encoding is used for a text column, I would like the train/test/validation sets to include all the additional columns created before the datasets are passed to an ML model.

While exporting the model as a notebook exports some of the steps required to create these sets, it does not include all of them.

The use case is carrying out additional model stress testing/explainability analysis for which we are using custom code. For this analysis, in addition to the ML model, we also require the exact train/test/validation sets that were passed to the ML model.

Thanks,

Yash

Answers

  • HarizoR
    HarizoR Dataiker, Alpha Tester, Registered Posts: 138 Dataiker

    Hi,

    Exporting preprocessed train/test data is not currently possible in Dataiku, but we are looking into it. We'll take note of this post and will keep you posted if future product improvements can help solve your issue.

    Best,

    Harizo

  • shashank
    shashank Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 27 Dataiker

    @yashpuranik: In case you are not aware, the Train and Test Export Dataset feature in Visual ML is available from v11.2 onwards.

    It's under Model information --> Training Information

    https://doc.dataiku.com/dss/latest/release_notes/11.html#id25

  • yashpuranik
    yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

    @shashank, @HarizoR: Thank you! Yes, I have noticed the updates. This is directionally helpful, but not fully what I was hoping for. Let me explain.

    The feature helps identify which rows were in the training set and which were in the test set, which is great. But it does not help identify the exact data used for training/testing.

    If we look at the design in this project, the customerID column was rejected and the gender column was dummy encoded. The exported datasets do not show the result of those transformations.

    Granted, these transformations are simple and can easily be recomputed externally if needed. However, recomputing them externally does not provide the full level of transparency needed for auditing, which is required for specific applications.
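
For illustration, here is a minimal sketch of recomputing the two transformations mentioned above (rejecting customerID, dummy encoding gender) outside of Dataiku with pandas. The `preprocess` helper, the `tenure` column, and the sample rows are hypothetical; only the customerID and gender column names come from the post. This is an assumption about what an external recomputation could look like, not Dataiku's actual preprocessing, which may handle missing values, rare categories, or column naming differently.

```python
import pandas as pd


def preprocess(train, test, rejected, categorical):
    """Drop rejected columns and dummy-encode categoricals.

    Hypothetical helper: categories are learned from the training set only,
    so the test set gets the same encoded columns as the training set.
    """
    train = train.drop(columns=rejected)
    test = test.drop(columns=rejected)
    for col in categorical:
        # Learn the category list from the training data only.
        cats = sorted(train[col].dropna().unique())
        dtype = pd.CategoricalDtype(categories=cats)
        train[col] = train[col].astype(dtype)
        # Test values unseen in training become NaN (all-zero dummy row).
        test[col] = test[col].astype(dtype)
    train_enc = pd.get_dummies(train, columns=categorical, dtype=int)
    test_enc = pd.get_dummies(test, columns=categorical, dtype=int)
    return train_enc, test_enc


# Made-up sample rows standing in for the exported train/test datasets.
train = pd.DataFrame({
    "customerID": ["A1", "A2", "A3"],
    "gender": ["Female", "Male", "Female"],
    "tenure": [1, 34, 2],
})
test = pd.DataFrame({
    "customerID": ["B1"],
    "gender": ["Male"],
    "tenure": [45],
})

train_enc, test_enc = preprocess(
    train, test, rejected=["customerID"], categorical=["gender"]
)
print(train_enc)
print(test_enc)
```

Even when such a recomputation yields matching columns, nothing guarantees it is identical to what the model actually received, which is the auditing gap described above.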

  • Samruda
    Samruda Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered, Frontrunner 2022 Participant Posts: 2 ✭✭✭

    I see the option to export models under AutoML prediction but not under AutoML clustering sessions. Is that by design? Is there a way to keep track of the datasets used in building the clustering models?
