Publish result of statistics card to flow

sebastien_zinn · 12-01-2020

Hello,

Is it possible to publish the result of a statistics card to the flow?

I am performing a two-sample t-test to see whether the difference between the means of two populations is significant. I would like to create a new table in my flow, consisting of a single column and a single row, that is populated with a 0 or 1 depending on the result of the statistical test.

Is that feasible? Are there any alternative techniques?

Thank you,

Sébastien Zinn

tim-wright · 12-01-2020

@sebastien_zinn Can I ask what it is you are trying to do? I am not aware of an easy way to publish a statistics card result to a flow dataset. In any event, here is a Python recipe that should get you close to what you want.

For the Python recipe, I used the dataset on which the statistics are calculated (TRIPS_prepared) as the input and created a new dataset as the output. This makes the relationship to the resulting dataset easier to see in the flow, even though the recipe never actually reads the input dataset.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Instantiate client and get this project
client = dataiku.api_client()
proj = client.get_default_project()

# ------------------------------------------------------------------------ #
# -------------   START HELPER FUNCTIONS --------------------------------- #

def get_worksheet(dataset_name, worksheet_name):
    """ 
    Helper function to return the Worksheet object for a particular dataset and worksheet
    """
    dataset = proj.get_dataset(dataset_name)
    worksheets = dataset.list_statistics_worksheets()
    # Get the worksheet you want (by name)
    my_worksheet = None
    for worksheet in worksheets:
        ws_settings = worksheet.get_settings()
        if ws_settings.get_raw()['name'] == worksheet_name:
            my_worksheet = worksheet
            break
    if not my_worksheet:
        raise ValueError('Worksheet by name of \'{}\' not found for dataset: {}'.format(worksheet_name, dataset_name))
    return my_worksheet


def get_2sampttest_df(worksheet):
    """
    Helper function to run the statistical analyses present in the worksheet and
    return a dataframe consisting of the pvalue and test statistic for all the
    two sample t-tests present in that worksheet.
    """
    # Run the worksheet and get the results
    statistic_card_result = worksheet.run_worksheet()
    results = statistic_card_result.get_raw()['results']
    data=[]
    
    # Iterate over the results getting the pvalue and statistic for each t-test
    for result in results:
        if result['type'] == 'ttest_2samp':
            data.append({'pvalue': result['pvalue'], 'statistic': result['statistic']})
    if not data:
        # A missing test is a bad value/configuration, not a type problem
        raise ValueError('No two-sample t-tests were configured in the worksheet')
    return pd.DataFrame(data)
# ------------------------------------------------------------------------ #
# --------------   END HELPER FUNCTIONS ---------------------------------- #


# Get the worksheet that has the 2Sample T-Test and build a DataFrame with pvalue and statistic values present
worksheet = get_worksheet('TRIPS_prepared', 'Worksheet')
t_test_output_df = get_2sampttest_df(worksheet)
    
# Write recipe outputs
t_test_output = dataiku.Dataset("t_test_output")
t_test_output.write_with_schema(t_test_output_df)

I tested it on a contrived example on my DSS instance and it appeared to work fine. You will need to know your dataset name and the name of the worksheet containing the two-sample t-test. If a worksheet runs multiple two-sample t-tests, you will want to modify the code to keep track of which results correspond to which test.
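
A minimal sketch of that modification, which also adds the 0/1 flag asked for in the original question. It assumes the raw results are a list of dicts with the 'type', 'pvalue' and 'statistic' keys used in the recipe above. The 0.05 significance level, the function name summarize_ttests and the card_position column are illustrative choices, not part of the DSS API, and the example values are made up.

```python
ALPHA = 0.05  # assumed significance level; pick what suits your analysis


def summarize_ttests(results, alpha=ALPHA):
    """Return one row per two-sample t-test, with a 0/1 significance flag.

    `results` is assumed to be the list stored under 'results' in the raw
    worksheet output, as in the recipe above.
    """
    rows = []
    for position, result in enumerate(results):
        if result.get('type') != 'ttest_2samp':
            continue  # skip cards that are not two-sample t-tests
        rows.append({
            # Hypothetical identifier: position of the card in the results list
            'card_position': position,
            'pvalue': result['pvalue'],
            'statistic': result['statistic'],
            # 1 if the difference is significant at the chosen level, else 0
            'significant': 1 if result['pvalue'] < alpha else 0,
        })
    return rows


# Made-up values standing in for worksheet output (illustration only)
example_results = [
    {'type': 'ttest_2samp', 'pvalue': 0.01, 'statistic': 2.9},
    {'type': 'some_other_card'},
    {'type': 'ttest_2samp', 'pvalue': 0.40, 'statistic': -0.8},
]
rows = summarize_ttests(example_results)
print(rows)
```

Inside the recipe, pd.DataFrame(summarize_ttests(results)) could stand in for the DataFrame built in get_2sampttest_df, so the output dataset carries the flag alongside the p-value and statistic.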

Depending on how you plan to use this information, you may want to move this logic into a custom metric/check on your dataset, or into a step in a scenario (or define a check on the dataset and then reference that check in a scenario).
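
If the logic does move into a check, the core of it is mapping the p-value to a status. Below is a rough sketch of that mapping only, assuming a 0.05 significance level and assuming the check should return a status string such as 'OK' or 'WARNING' together with a message. The function name and messages are hypothetical, and the exact contract for custom checks should be confirmed in the DSS documentation.

```python
def pvalue_to_check_outcome(pvalue, alpha=0.05):
    """Map a p-value to a (status, message) pair.

    The 'OK' / 'WARNING' status strings are an assumption about what the
    calling check expects; adjust them to the contract you actually use.
    """
    if pvalue < alpha:
        return 'OK', 'Difference is significant (p={:.4f} < {})'.format(pvalue, alpha)
    return 'WARNING', 'Difference is not significant (p={:.4f} >= {})'.format(pvalue, alpha)


# Made-up p-values for illustration
print(pvalue_to_check_outcome(0.01))
print(pvalue_to_check_outcome(0.40))
```

Keeping the threshold logic in one small function means the recipe, a check, or a scenario step can all reuse the same rule.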

sebastien_zinn · 12-03-2020

@tim-wright thank you so much for your help, it worked like a charm!

I just had to edit the code slightly and replace

proj = client.get_default_project()

with the following line:

proj = client.get_project(dataiku.default_project_key())

The result of your code is a table containing the p-value and the t-statistic of my test, exactly what I needed 🙂

I needed it in a table because I want to display an alert message in Tableau (our BI tool) if the test is not statistically significant. I am now able to do that.

Thanks again,

Sébastien
