Using recipes

ttoropov
Level 2
Using recipes

Good day!

I'm struggling with choosing the proper recipe or even the algorithm itself to solve my problem. Could you please give any advice? 

Look, I have a transitional dataset with names, conditions and values (on the top of the picture). It was made as a sum of values per each condition in each name from initial dataset (the same as at the bottom of the picture but with values and without target column). FYI: in initial dataset Conditions can be the same in different names but values per each condition and each name are different.

Based on this transitional dataset I need to prepare another one with target values per each condition (in the middle of the picture). Target is an average value per each condition. I was trying to use group/pivot/prepare types of recipes, but every time in the result I can't get rid of "names" or  extra columns or something like that. Could you please recommend which type of recipe suits here the best? 

And further which recipe would you recommend to use to transfer the defined targets to initial dataset per each condition? Because later on I will need to run Machine learning for target column. 

Or maybe you would recommend a completely different solution for this problem?

Capture.JPG

0 Kudos
3 Replies
SarinaS
Dataiker

Hi @ttoropov,

While it would certainly be possible to do this with visual recipes, since the conditions are a bit complex, it would be hard to do in a single recipe. In a case like this, I would suggest using a Python recipe instead. Here is an example Python code that will take your input and transform it as desired. 

import dataiku
import pandas as pd, numpy as np

input_data = dataiku.Dataset("INPUT_DATASET_NAME")
input_data_df = input_data.get_dataframe()

# replace with the real values of your condition names 
condition_columns = ['condition_1', 'condition_2', 'condition_3', 'condition_4', 'condition_5'] 

# this will compute the "averages" per condition:
averages = input_data_df[condition_columns].mean()

# now in a separate step, let's melt the conditions into a separate row per condition:
melted_df = pd.melt(input_data_df, id_vars=['name'], var_name='condition').dropna()

# now we simply can replace the value column with the corresponding average:
for condition_number in melted_df['condition'].unique():
    melted_df.loc[melted_df.condition == condition_number, 'value'] = averages[condition_number]

 

Screenshot 2023-07-28 at 5.59.24 PM.png
If you were to do this with a visual recipe you indeed would probably need 2-3 recipes to mimic the 3 translations here. 

I hope that helps! 

Thanks,
Sarina

0 Kudos
ttoropov
Level 2
Author

Good day, @SarinaS!

Thank you for your help!

I actually tried and succeed with visual recipes. But the problem appears when I try to detect a train model for "KPI" (target value) column in data_with_KPI. It shows AUC = 1.0. Which I guess happens because I split the dataset for computing KPIs.

Capture1.JPG

That's why I'm thinking of computing KPI in the main dataset by prepare recipe. But I don't know if it's possible to write a formula for this case. Can you suggest any formulas for that?(see below) Or is it impossible here? Or should I do it in Python recipe?

I need to fill the certain rows of "KPI_hrs" column, which meet the required conditions on top of the picture, with the following values: they should be equal to the average value of sums of hours per each "name".

Which means: If same conditions conditions as per picture bellow are true, I need to sum all these values in rows in "hours" column. But it should be done separately per each 'name' in "name" column (for name1 the sum will be equal X hours, for name 2 - Y hours, name 3 - Z hours etc). And after that I need to find the average of [X,Y,Z,...] and put this value to KPI_hrs column in those rows only which follows the same conditions below (it will be same value for all 'names'). 

Capture.JPG

Capture2.JPG

0 Kudos
SarinaS
Dataiker

Hi @ttoropov,

Indeed I don't think this is possible to do in a prepare recipe, as you need to group by the conditions to obtain your sum of hours per group (if I understand correctly). 

Note that sum within a formula sums anything in the formula (i.e. sum(1,2,3) -> 6), so indeed it won't work to sum up rows. 

I think you are effectively looking to instead do a group by recipe that sums over a case when statement (e.g: https://learnsql.com/blog/case-when-with-sum/). In DSS, you can create computed columns in your group by recipe with your "if" conditions, and then sum the computed column to get the sum of the computed column(s). 

if(contains(Phase Code 'clean') && contains(Task, 'runcom') && contains(Code 'p'), hours, 0)


And then you can "sum" on your the computed column in your group by recipe to get the sum of hours that meet these conditions. 

Then, you can join back your final group by recipe result table to your main dataset joining on the conditions referenced. 

Since this is a little complex, indeed doing this in Python may be simpler. 

Thanks,
Sarina


0 Kudos