I am looking for ways to extract the exact train/test/validation sets used in visual ML. By this I mean not only the data splits, but also datasets that include all new features created by the data processing in visual ML. For example, if dummy encoding is used for a text column, I would like the train/test/validation sets to include all the additional columns created before the datasets are passed to an ML model.
While exporting the model as a notebook exports some of the steps required to create these sets, it does not include all of them.
The use case is carrying out additional model stress testing/explainability analysis for which we are using custom code. For this analysis, in addition to the ML model, we also require the exact train/test/validation sets that were passed to the ML model.
Exporting the preprocessed train/test data is not currently possible in Dataiku, but we are looking into it. We'll take note of this post and will keep you posted if future product improvements can help solve your issue.
@yashpuranik : In case you are unaware, the Train and Test Export Dataset feature in Visual ML is available from v11.2 onwards.
It's under Model information --> Training Information
The feature helps identify which rows were in the training set and which were in the test set, which is great. But it does not help identify the exact data used for training/testing.
If we look at the design in this project, the customerID column was rejected and the gender column was dummy encoded. The exported dataset does not show the result of those transformations.
Granted, these transformations are easy to reproduce and can be computed externally if needed. However, the export as it stands does not provide the full level of transparency needed for auditing, which is required for specific applications.
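For anyone who needs a stopgap, here is a rough sketch of recomputing those two transformations outside Dataiku in plain Python. The rows, the `tenure` column, the `gender:Female` naming scheme, and the helper names `drop_columns` and `dummy_encode` are all made up for illustration. Dataiku's actual preprocessing may name columns differently, drop a level, or group rare categories, so this is not guaranteed to match what the model saw, which is exactly the auditing gap described above. The one point worth keeping is that the category levels are learned on the train set and reused on the test set.

```python
def drop_columns(rows, rejected):
    """Return copies of the rows without the rejected columns."""
    return [{k: v for k, v in row.items() if k not in rejected} for row in rows]


def dummy_encode(rows, column, categories=None):
    """Dummy (one-hot) encode `column` in a list of dict rows.

    If `categories` is None, the levels are learned from `rows` (train set).
    Pass the learned levels back in when encoding the test set so both
    sets end up with the same columns.
    """
    if categories is None:
        categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            # Hypothetical naming scheme, not necessarily Dataiku's.
            new_row[f"{column}:{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded, categories


# Hypothetical rows standing in for the exported train/test datasets.
train = [
    {"customerID": "A1", "gender": "Female", "tenure": 12},
    {"customerID": "B2", "gender": "Male", "tenure": 3},
]
test = [{"customerID": "C3", "gender": "Male", "tenure": 7}]

# Reject customerID, then dummy encode gender using levels learned on train.
train_x, gender_levels = dummy_encode(drop_columns(train, {"customerID"}), "gender")
test_x, _ = dummy_encode(drop_columns(test, {"customerID"}), "gender", gender_levels)
```

With pandas, `pd.get_dummies` does the same encoding step, but the test set's columns still have to be aligned to the train set's.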
I see the option to export models under AutoML prediction but not under AutoML clustering sessions. Is that by design? Is there a way to keep track of the datasets used in building the clustering models?