Extract train/test/validation sets from visual ML

Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

Hi,

I am looking for ways to extract the exact train/test/validation sets used in visual ML. By this I mean not only the row-level data splits, but also the datasets that include all new features created by the data preprocessing in visual ML. For example, if dummy encoding is used for a text column, I would like the train/test/validation sets to include all the additional columns created before the datasets are passed to an ML model.

Exporting the model as a notebook reproduces some of the steps required to create these sets, but not all of them.

The use case is carrying out additional model stress testing/explainability analysis for which we are using custom code. For this analysis, in addition to the ML model, we also require the exact train/test/validation sets that were passed to the ML model.

Thanks,

Yash

Answers

  • Dataiker, Alpha Tester, Registered Posts: 138 Dataiker

    Hi,

    Exporting the preprocessed train/test data is not currently possible in Dataiku, but we are looking into it. We'll take note of this post and will keep you posted if future product improvements can help solve your issue.

    Best,

    Harizo

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 28 Dataiker

    @yashpuranik: In case you are unaware, the Train and Test dataset export in Visual ML is available from v11.2 onwards.

    It's under Model information --> Training Information

    https://doc.dataiku.com/dss/latest/release_notes/11.html#id25

  • Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

    @shashank, @HarizoR: Thank you! Yes, I have noticed the updates. This is directionally helpful, but not fully what I was hoping for. Let me explain.

    The feature helps identify which rows were in the training set and which were in the test set, which is great. But it does not help identify the exact data used for training/testing.

    If we look at the design in this project, the customerID column was rejected and the gender column was dummy encoded. We don't see the result of those transformations in the exported data.

    Granted, these transformations are simple and can easily be recomputed externally if needed. However, an external recomputation does not provide the full level of transparency needed for auditing, which is required for specific applications.
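
    To illustrate what recomputing these two transformations externally could look like, here is a minimal sketch in plain Python. It is not how DSS implements its preprocessing: the helper names (`fit_dummy_categories`, `preprocess`), the sample rows, the `tenure` column, and the `column=value` naming of the indicator columns are all assumptions made for the example. The categories are learned from the training rows only and then reused for the test rows, which is one reasonable convention, not necessarily the one visual ML follows.

```python
def fit_dummy_categories(train_rows, dummy_columns):
    """Collect, per column, the category values seen in the training rows only."""
    return {
        col: sorted({row[col] for row in train_rows})
        for col in dummy_columns
    }


def preprocess(rows, rejected_columns, categories):
    """Drop rejected columns and expand categorical columns into 0/1 indicators."""
    prepared = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col in rejected_columns:
                continue  # rejected features never reach the model
            if col in categories:
                # One indicator column per category learned from the train split
                # (hypothetical naming, not the DSS column naming).
                for cat in categories[col]:
                    new_row[f"{col}={cat}"] = int(value == cat)
            else:
                new_row[col] = value
        prepared.append(new_row)
    return prepared


# Hypothetical sample rows; only customerID and gender come from the thread.
train = [
    {"customerID": "A1", "gender": "Female", "tenure": 12},
    {"customerID": "B2", "gender": "Male", "tenure": 3},
]
test = [{"customerID": "C3", "gender": "Male", "tenure": 7}]

categories = fit_dummy_categories(train, ["gender"])
train_prepared = preprocess(train, {"customerID"}, categories)
test_prepared = preprocess(test, {"customerID"}, categories)
```

    Even with something like this in place, the output is only my reconstruction of the preprocessing, which is exactly why it falls short for auditing: there is no guarantee it matches what the model actually received.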

  • Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered, Frontrunner 2022 Participant Posts: 2 ✭✭✭

    I see the option to export models under AutoML prediction but not under AutoML clustering sessions. Is that by design? Is there a way to keep track of the datasets used in building the clustering models?
