Publish result of statistics card to flow
Hello,
Is it possible to publish the result of a statistics card to the flow?
I am performing a two-sample t-test to see whether the difference between the means of two populations is significant. I would like to create a new table in my flow, composed of a single column and a single row, that gets populated with a 0 or 1 depending on the result of the statistical test.
Is that feasible? Are there any alternative techniques?
Thank you,
Sébastien Zinn
Best Answer
tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner
@sebastien_zinn
Can I ask what it is you are trying to do? I am not aware of an easy way to publish a statistics card to a flow dataset. In any event, here is a Python recipe that will get you close to what you want. For the Python recipe I used the dataset on which the statistics are calculated (TRIPS_prepared) as the input and created a new dataset as the output. This makes the relationship between the source dataset and the resulting dataset easier to see on the flow, but the recipe never actually reads the input dataset.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np

# Instantiate client and get this project
client = dataiku.api_client()
proj = client.get_default_project()

# ------------------------------------------------------------------------ #
# ------------- START HELPER FUNCTIONS --------------------------------- #

def get_worksheet(dataset_name, worksheet_name):
    """
    Helper function to return the Worksheet object for a particular dataset and worksheet
    """
    dataset = proj.get_dataset(dataset_name)
    worksheets = dataset.list_statistics_worksheets()

    # Get the worksheet you want (by name)
    my_worksheet = None
    for worksheet in worksheets:
        ws_settings = worksheet.get_settings()
        if ws_settings.get_raw()['name'] == worksheet_name:
            my_worksheet = worksheet
            break

    if not my_worksheet:
        raise ValueError('Worksheet by name of \'{}\' not found for dataset: {}'.format(worksheet_name, dataset_name))

    return my_worksheet


def get_2sampttest_df(worksheet):
    """
    Helper function to run the statistical analyses present in the worksheet and return
    a dataframe consisting of the pvalue and test statistic for all the two sample t-tests
    present in that worksheet.
    """
    # Run the worksheet and get the results
    statistic_card_result = worksheet.run_worksheet()
    results = statistic_card_result.get_raw()['results']

    data = []
    # Iterate over the results getting the pvalue and statistic for each t-test
    for result in results:
        if result['type'] == 'ttest_2samp':
            data.append({'pvalue': result['pvalue'], 'statistic': result['statistic']})

    if len(data) == 0:
        raise TypeError('No 2Sample T Tests were configured in the worksheet')
    else:
        return pd.DataFrame(data)

# ------------------------------------------------------------------------ #
# -------------- END HELPER FUNCTIONS ---------------------------------- #

# Get the worksheet that has the 2Sample T-Test and build a DataFrame with pvalue and statistic values present
worksheet = get_worksheet('TRIPS_prepared', 'Worksheet')
t_test_output_df = get_2sampttest_df(worksheet)

# Write recipe outputs
t_test_output = dataiku.Dataset("t_test_output")
t_test_output.write_with_schema(t_test_output_df)
I tested it on a contrived example on my DSS instance and it appeared to work just fine. You will need to know your dataset name and the name of your worksheet containing the 2 sample t-test. If you have a worksheet that runs multiple 2-sample t-tests, you will want to modify the code to keep track of which results correspond to which test.
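If what you ultimately need is the single 0/1 cell from the original question, the p-value column can be collapsed into a flag before writing the output. The following is only a minimal sketch: the 0.05 threshold, the function name `to_significance_flag` and the column name `significant` are assumptions made for illustration, and the example DataFrame uses made-up numbers in place of `t_test_output_df`.

```python
import pandas as pd

ALPHA = 0.05  # assumed significance level; pick whatever your analysis calls for


def to_significance_flag(t_test_df, alpha=ALPHA):
    """Return a one-row, one-column DataFrame holding 1 if pvalue < alpha, else 0.

    Expects a DataFrame with a 'pvalue' column and exactly one row,
    i.e. a worksheet that contains a single two-sample t-test.
    """
    if len(t_test_df) != 1:
        raise ValueError(
            "Expected exactly one t-test result, got {}".format(len(t_test_df))
        )
    pvalue = float(t_test_df["pvalue"].iloc[0])
    return pd.DataFrame({"significant": [int(pvalue < alpha)]})


# Stand-in for t_test_output_df, with made-up values
example_df = pd.DataFrame([{"pvalue": 0.03, "statistic": 2.21}])
flag_df = to_significance_flag(example_df)
```

In the recipe you would pass `t_test_output_df` to this function and write the returned DataFrame with `write_with_schema` instead of the raw results.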
Depending on how you plan to use this information, you may want to move this logic into a custom metric/check on your dataset, or into a step in a scenario (or a check on the dataset that you then reference in a scenario).
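To give a rough idea of the check variant, the decision itself can be kept in a small pure-Python function that turns a p-value into a status and a message. This is a sketch under assumptions: the function name `ttest_check_outcome` and the 0.05 threshold are hypothetical, the choice to map a non-significant result to a warning is arbitrary, and the 'OK' / 'WARNING' / 'ERROR' strings are meant to mirror the outcome names DSS uses for checks. How you wire it into a custom check or a scenario step is not shown here.

```python
def ttest_check_outcome(pvalue, alpha=0.05):
    """Map a t-test p-value to a (status, message) pair.

    alpha is an assumed significance level. A missing p-value is treated
    as an error, a significant result as OK, anything else as a warning.
    """
    if pvalue is None:
        return "ERROR", "No p-value available for the t-test"
    if pvalue < alpha:
        return "OK", "Difference is significant (p={:.4f} < {})".format(pvalue, alpha)
    return "WARNING", "Difference is not significant (p={:.4f} >= {})".format(pvalue, alpha)


# Example with made-up p-values
status_sig, message_sig = ttest_check_outcome(0.012)
status_not_sig, message_not_sig = ttest_check_outcome(0.4)
```

Keeping the decision in one function like this means the same logic can be reused whether it ends up in a recipe, a check or a scenario.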
Answers
sebastien_zinn Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 2 Partner
@tim-wright
Thank you so much for your help, it worked like a charm! I just had to edit the code slightly and replace
proj = client.get_default_project()
with the following line:
proj = client.get_project(dataiku.default_project_key())
The result of your code is a table containing the p-value and the t-statistic of my test, exactly what I needed.
The reason I needed it in a table is that I want to display an alert message in Tableau (our BI tool) if the test is not statistically significant. I am now able to do that.
Thanks again,
Sébastien