Hello,
Is it possible to publish the result of a statistics card to the flow?
I am performing a two-sample t-test to see whether the difference between the means of two populations is significant. I would like to create a new table in my flow, composed of a single column and a single row, that is populated with a 0 or a 1 depending on the result of the statistical test.
Is that feasible? Are there any alternative techniques?
Thank you,
Sébastien Zinn
@sebastien_zinn Can I ask what you are trying to do? I am not aware of an easy way to publish a statistics card to a flow dataset. In any event, here is a Python recipe that will get you close to what you want.
For the Python recipe, I used the dataset on which the statistics are calculated (TRIPS_prepared) as the input and created a new dataset as the output. This makes the relationship between the two datasets visible on the flow, although the recipe never actually reads the input dataset.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Instantiate client and get this project
client = dataiku.api_client()
proj = client.get_default_project()

# ------------------------------------------------------------------------ #
# ------------- START HELPER FUNCTIONS --------------------------------- #
def get_worksheet(dataset_name, worksheet_name):
    """
    Helper function to return the Worksheet object for a particular dataset and worksheet
    """
    dataset = proj.get_dataset(dataset_name)
    worksheets = dataset.list_statistics_worksheets()

    # Get the worksheet you want (by name)
    my_worksheet = None
    for worksheet in worksheets:
        ws_settings = worksheet.get_settings()
        if ws_settings.get_raw()['name'] == worksheet_name:
            my_worksheet = worksheet
            break

    if not my_worksheet:
        raise ValueError('Worksheet by name of \'{}\' not found for dataset: {}'.format(worksheet_name, dataset_name))
    return my_worksheet


def get_2sampttest_df(worksheet):
    """
    Helper function to run the statistical analyses present in the worksheet and
    return a dataframe consisting of the pvalue and test statistic for all the
    two sample t-tests present in that worksheet.
    """
    # Run the worksheet and get the results
    statistic_card_result = worksheet.run_worksheet()
    results = statistic_card_result.get_raw()['results']

    data = []
    # Iterate over the results getting the pvalue and statistic for each t-test
    for result in results:
        if result['type'] == 'ttest_2samp':
            data.append({'pvalue': result['pvalue'], 'statistic': result['statistic']})

    if len(data) == 0:
        raise TypeError('No 2Sample T Tests were configured in the worksheet')
    else:
        return pd.DataFrame(data)
# ------------------------------------------------------------------------ #
# -------------- END HELPER FUNCTIONS ---------------------------------- #

# Get the worksheet that has the 2Sample T-Test and build a DataFrame with pvalue and statistic values present
worksheet = get_worksheet('TRIPS_prepared', 'Worksheet')
t_test_output_df = get_2sampttest_df(worksheet)

# Write recipe outputs
t_test_output = dataiku.Dataset("t_test_output")
t_test_output.write_with_schema(t_test_output_df)
I tested it on a contrived example on my DSS instance and it worked. You will need to know your dataset name and the name of the worksheet containing the two-sample t-test. If a worksheet runs multiple two-sample t-tests, you will want to modify the code to keep track of which results correspond to which test.
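One possible way to keep track of which row belongs to which test is to record each result's position in the worksheet's results list. This is only a sketch: it assumes every t-test entry carries the 'type', 'pvalue' and 'statistic' keys used in the recipe, and the function name `ttest_rows`, the `card_index` column and the sample list are hypothetical, made up for illustration.

```python
import pandas as pd

def ttest_rows(results):
    """Collect two-sample t-test results, remembering each result's position."""
    rows = []
    for card_index, result in enumerate(results):
        # Skip anything that is not a two-sample t-test
        if result.get('type') == 'ttest_2samp':
            rows.append({
                'card_index': card_index,  # position in the results list
                'pvalue': result['pvalue'],
                'statistic': result['statistic'],
            })
    return pd.DataFrame(rows, columns=['card_index', 'pvalue', 'statistic'])

# Made-up results list, for illustration only
example_results = [
    {'type': 'other_card'},
    {'type': 'ttest_2samp', 'pvalue': 0.03, 'statistic': 2.2},
    {'type': 'ttest_2samp', 'pvalue': 0.40, 'statistic': -0.8},
]
example_df = ttest_rows(example_results)
```

In the recipe, the same loop could replace the body of get_2sampttest_df, with the list taken from statistic_card_result.get_raw()['results'].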
Depending on how you plan to use this information, you may want to move this logic into a custom metric or check on your dataset, or into a step in a scenario (or define a check on the dataset and then reference that check in a scenario).
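If the output really needs to be a single row and a single column holding 0 or 1, as in the original question, the p-value can be turned into a flag before writing. The sketch below assumes a significance level of 0.05 and a DataFrame with a 'pvalue' column like the one the recipe builds; the names `ALPHA`, `significance_flag` and `significant` are hypothetical, and the sample values are invented.

```python
import pandas as pd

ALPHA = 0.05  # assumed significance level; choose the one that fits your analysis

def significance_flag(t_test_df, alpha=ALPHA):
    """Return a 1x1 DataFrame: 1 if the first test's p-value is below alpha, else 0."""
    pvalue = t_test_df['pvalue'].iloc[0]
    return pd.DataFrame({'significant': [int(pvalue < alpha)]})

# Made-up p-value and statistic, for illustration only
flag_df = significance_flag(pd.DataFrame([{'pvalue': 0.012, 'statistic': 2.51}]))
```

In the recipe, significance_flag(t_test_output_df) could then be passed to write_with_schema in place of t_test_output_df.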
@tim-wright thank you so much for your help, it worked like a charm!
I just had to edit the code slightly and replace
proj = client.get_default_project()
with the following line:
proj = client.get_project(dataiku.default_project_key())
The result of your code is a table containing the p-value and the t-statistic of my test, exactly what I needed.
The reason I needed it in a table is that I want to display an alert message in Tableau (our BI tool) if the test is not statistically significant. I am now able to do that.
Thanks again,
Sébastien