Publish result of statistics card to flow

sebastien_zinn
sebastien_zinn Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 2 Partner

Hello,

Is it possible to publish the result of a statistics card to the flow?

I am performing a two-sample t-test to see whether the difference between the means of two populations is significant. I would like to create a new table in my flow, composed of a single column and a single row, that is populated with a 0 or 1 depending on the result of the statistical test.

Is that feasible? Are there any alternative techniques?

Thank you,

Sébastien Zinn

Best Answer

  • tim-wright
    tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner
    edited July 17 Answer ✓

    @sebastien_zinn
    Can I ask what it is you are trying to do? I am not aware of an easy way to publish a statistics card to a flow dataset. In any event, here is a Python recipe that will get you close to what you want.

    For the Python recipe, I used the dataset on which the statistics are calculated (TRIPS_prepared) as the input and created a new dataset as the output. This makes the relationship between the source and the resulting dataset easier to see on the flow, but the recipe never actually reads the input dataset.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    
    # Instantiate client and get this project
    client=dataiku.api_client()
    proj = client.get_default_project()
    
    # ------------------------------------------------------------------------ #
    # -------------   START HELPER FUNCTIONS --------------------------------- #
    
    def get_worksheet(dataset_name, worksheet_name):
        """ 
        Helper function to return the Worksheet object for a particular dataset and worksheet
        """
        dataset = proj.get_dataset(dataset_name)
        worksheets = dataset.list_statistics_worksheets()
        # Get the worksheet you want (by name)
        my_worksheet = None
        for worksheet in worksheets:
            ws_settings = worksheet.get_settings()
            if ws_settings.get_raw()['name'] == worksheet_name:
                my_worksheet = worksheet
                break
        if not my_worksheet:
            raise ValueError('Worksheet by name of \'{}\' not found for dataset: {}'.format(worksheet_name, dataset_name))
        return my_worksheet
    
    
    def get_2sampttest_df(worksheet):
        """
        Helper function to run the statistical analyses present in the worksheet and
        return a dataframe consisting of the pvalue and test statistic for all the
        two sample t-tests present in that worksheet.
        """
        # Run the worksheet and get the results
        statistic_card_result = worksheet.run_worksheet()
        results = statistic_card_result.get_raw()['results']
        data=[]
        
        # Iterate over the results getting the pvalue and statistic for each t-test
        for result in results:
            if result['type'] == 'ttest_2samp':
                data.append({'pvalue': result['pvalue'], 'statistic': result['statistic']})
        if len(data) == 0:
            raise TypeError('No 2Sample T Tests were configured in the worksheet')
        else:
            return pd.DataFrame(data)
    # ------------------------------------------------------------------------ #
    # --------------   END HELPER FUNCTIONS ---------------------------------- #
    
    
    # Get the worksheet that has the 2Sample T-Test and build a DataFrame with pvalue and statistic values present
    worksheet = get_worksheet('TRIPS_prepared', 'Worksheet')
    t_test_output_df = get_2sampttest_df(worksheet)
        
    # Write recipe outputs
    t_test_output = dataiku.Dataset("t_test_output")
    t_test_output.write_with_schema(t_test_output_df)

    I tested it on a contrived example on my DSS instance and it appeared to work just fine. You will need to know your dataset name and the name of your worksheet containing the 2 sample t-test. If you have a worksheet that runs multiple 2-sample t-tests, you will want to modify the code to keep track of which results correspond to which test.
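
    As a rough sketch of that modification, the helper below records the position of each two-sample t-test in the results list and adds the 0/1 column asked for in the original question. It assumes the result dictionaries have the same 'type', 'pvalue' and 'statistic' keys used in the recipe above. The names summarize_ttests, card_position and significant, the alpha threshold of 0.05, and the 'other_card' placeholder type are illustrative and not part of the Dataiku API.

```python
import pandas as pd


def summarize_ttests(results, alpha=0.05):
    """
    Build one row per two-sample t-test found in `results`.

    `results` is assumed to be a list of dicts shaped like the entries of
    statistic_card_result.get_raw()['results'] in the recipe above.
    """
    rows = []
    for position, result in enumerate(results):
        # Skip cards that are not two-sample t-tests
        if result.get('type') != 'ttest_2samp':
            continue
        rows.append({
            'card_position': position,  # position of the card in the results list
            'pvalue': result['pvalue'],
            'statistic': result['statistic'],
            # 1 if the test is significant at level alpha, else 0
            'significant': int(result['pvalue'] < alpha),
        })
    columns = ['card_position', 'pvalue', 'statistic', 'significant']
    return pd.DataFrame(rows, columns=columns)


# Made-up results, for illustration only
example_results = [
    {'type': 'ttest_2samp', 'pvalue': 0.01, 'statistic': 2.9},
    {'type': 'other_card'},  # placeholder for any non t-test card
    {'type': 'ttest_2samp', 'pvalue': 0.40, 'statistic': -0.8},
]
summary_df = summarize_ttests(example_results)
```

    In the recipe, the list passed in would be the results variable from get_2sampttest_df, and the resulting DataFrame could be written to the output dataset in the same way as t_test_output_df.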

    Depending on how you plan to use this information, you may want to move this logic into a custom metric or check on your dataset, or into a step in a scenario (or into a check on the dataset that a scenario then references).
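
    If you go the check route, the decision logic could look something like the sketch below. It only maps a p-value to a status and a message. The function name pvalue_to_check_outcome and the 0.05 threshold are hypothetical, and how the p-value reaches the check (for example through a custom metric) is left open here. I am assuming a check reports a status such as 'OK' or 'WARNING' together with a message.

```python
def pvalue_to_check_outcome(pvalue, alpha=0.05):
    """
    Map a t-test p-value to a (status, message) pair.

    Hypothetical helper: the status strings are assumed to match what a
    DSS check expects; adjust them to your instance if they differ.
    """
    if pvalue < alpha:
        return 'OK', 'difference is significant (p={:.4f})'.format(pvalue)
    # Not significant: raise a warning so the downstream alert can be shown
    return 'WARNING', 'difference is not significant (p={:.4f})'.format(pvalue)
```

    For example, pvalue_to_check_outcome(0.2) gives a 'WARNING' status, which a scenario could then act on.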

Answers

  • sebastien_zinn
    sebastien_zinn Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 2 Partner
    edited July 17

    @tim-wright
    thank you so much for your help, it worked like a charm!

    I just had to edit the code slightly and replace

    proj = client.get_default_project()

    with the following line:

    proj = client.get_project(dataiku.default_project_key())

    The result of your code is a table containing the p-value and the t-statistic of my test, which is exactly what I needed.

    I needed it in a table because I want to display an alert message in Tableau (our BI tool) if the test is not statistically significant. I am now able to do that.

    Thanks again,

    Sébastien
