Replicating Statistics Tab analyses on a different dataset

Doga
Doga Registered Posts: 18 ✭✭✭

Hi,

I am working on a correlation analysis. I prepared my data and started using the Statistics tab of the dataset to generate a correlation matrix and other statistical summaries. While looking at the scatter plots, I realized that there were outliers in the data that I wanted to clip. So I decided to run a Python recipe to determine the 95th percentile for each column and clip the outliers.
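For reference, the clipping step could look roughly like the sketch below. It is plain pandas and makes a few assumptions that are not stated in the post: only numeric columns are clipped, only the upper tail is clipped, and the helper name `clip_upper_percentile` is made up for illustration. Reading the input dataset and writing the output dataset inside the DSS recipe is left out.

```python
import pandas as pd


def clip_upper_percentile(df: pd.DataFrame, q: float = 0.95) -> pd.DataFrame:
    """Return a copy of df with each numeric column capped at its q-th quantile.

    Hypothetical helper for illustration; non-numeric columns are left untouched.
    """
    numeric_cols = df.select_dtypes(include="number").columns
    # One threshold per numeric column (pandas interpolates linearly by default).
    upper = df[numeric_cols].quantile(q)
    clipped = df.copy()
    # axis=1 aligns the per-column thresholds with the column labels.
    clipped[numeric_cols] = df[numeric_cols].clip(upper=upper, axis=1)
    return clipped


if __name__ == "__main__":
    # Made-up sample data with one obvious outlier in column "a".
    sample = pd.DataFrame(
        {
            "a": [1, 2, 3, 4, 100],
            "b": [10, 20, 30, 40, 50],
            "label": ["v", "w", "x", "y", "z"],
        }
    )
    print(clip_upper_percentile(sample))
```

The column names and order stay the same, which is what matters for reusing the analyses, although integer columns may come back as floats after clipping.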

Since this additional recipe will create another dataset, I will lose all the statistical analyses performed on the previous dataset and will need to recreate them, even though the data schema is exactly the same across both datasets.

Is there a way to copy the analyses in the Statistics tab from one dataset to another dataset that has the same schema? If not, do you have any recommendations so that I don't need to redo all the analyses?

Best,

Doga

Best Answer

  • AlexandreD
    AlexandreD Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 3 Dataiker
    Answer ✓

    Hello, just to add to the previous response, there is a way to copy the statistical analyses to another dataset:

    1. Open the statistics worksheet you want to copy.
    2. Click on the worksheet name in the top bar, just below the dataset name (it is "Worksheet" by default).
    3. Click on "Duplicate".
    4. In the modal that shows up, select the target project and dataset.

    Hope this helps,

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,973 Neuron

    I don't believe there is a way to copy the Statistics tab of a dataset, but you can easily swap the datasets around if you want. Suppose this is your flow:

    recipe1 => dataset1 => Python recipe => dataset2

    And you want to swap dataset1 and dataset2 so that you end up with your dataset1 Statistics tab on the output of the Python recipe. You need to do the following:

    1. Go to the Python recipe's Inputs/Outputs tab and delete the recipe input and output by clicking on the rubbish icon. Save and go back to the flow.
    2. Go to the recipe before the Python recipe (recipe1 in my example) and, in its Inputs/Outputs tab, change the output from dataset1 to dataset2 (which should now be standalone in the flow). Use the "Use Existing" option. Save and go back to the flow.
    3. Now go back to the Python recipe (which should now be standalone in the flow) and add dataset2 as input and dataset1 as output (using the "Use Existing" option for the output).

    All done! It's also worth noting that you can do a lot of this analysis and filtering in the Prepare recipe, using the Analyse option on a column. Finally, in v12 you can export a univariate analysis card from the Statistics tab into its own recipe, for more automation and flexibility.
