Need help with Formula language in Custom Python functions in the Prepare Recipe
Hello,
I am using the Custom Python functions in the Prepare Recipe and want to use the Formula language.
'''
def process(row):
    if row['Age'] is None and str(row['Pclass']) == '1':
        return 111  # I want to actually include the mean age of all rows that have a Pclass=1 in the dataset
    elif row['Age'] is None and str(row['Pclass']) == '2':
        return 222  # I want to actually include the mean age of all rows that have a Pclass=2 in the dataset
    elif row['Age'] is None and str(row['Pclass']) == '3':
        return 333  # I want to actually include the mean age of all rows that have a Pclass=3 in the dataset
    else:
        return row['Age']
'''
How do I get this to work?
Answers
-
Hi,
I wouldn't recommend doing it this way: a custom Python function in a Prepare recipe is applied row by row, so it cannot see the other rows it would need to compute a mean.
The best way, IMO, is to use a Python recipe and do this with standard Python (pandas) functions.
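To make the Python recipe suggestion concrete, here is a minimal pandas sketch of the imputation itself. The function name `fill_age_with_class_mean` and the sample data are made up for illustration; only the `Age` and `Pclass` column names come from the question. In an actual recipe the DataFrame would come from the input dataset rather than being built by hand.

```python
import pandas as pd


def fill_age_with_class_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where missing Age values are replaced
    by the mean Age of the rows sharing the same Pclass."""
    out = df.copy()
    # transform("mean") returns one value per row: the mean Age of that
    # row's Pclass group. Missing ages are ignored when computing the mean.
    class_mean = out.groupby("Pclass")["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(class_mean)
    return out


# Illustrative data only (not taken from the real dataset)
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3],
    "Age": [40.0, 20.0, None, 30.0, None, None],
})
filled = fill_age_with_class_mean(sample)
```

Note that a class with no known ages at all (Pclass 3 in the sample) has no mean to fall back on, so its missing values stay empty.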
If you really want to use the Prepare recipe, you will have some work to do first. Compute the mean age for each class (with a Group recipe, for example). Then you can use the custom Python function as described here https://community.dataiku.com/t5/General-Discussion/Custom-Python-function-in-prepare-recipe/m-p/28538, by "joining" the two datasets.
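The two-step logic can be sketched in plain Python as follows. This only illustrates the idea (means computed up front, then looked up per row); it is not DSS-specific code. The helpers `mean_age_by_class` and `make_process` and the sample rows are hypothetical; in DSS the means would come from the Group recipe output instead.

```python
from collections import defaultdict


def mean_age_by_class(rows):
    """Step 1: compute the mean Age per Pclass, skipping rows with no Age."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row["Age"] is not None:
            key = str(row["Pclass"])
            totals[key] += float(row["Age"])
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def make_process(means):
    """Step 2: build a row-level function that only looks up a precomputed mean."""
    def process(row):
        if row["Age"] is None:
            # Returns None when the class has no known mean
            return means.get(str(row["Pclass"]))
        return row["Age"]
    return process


# Illustrative rows only
rows = [
    {"Pclass": 1, "Age": 40},
    {"Pclass": 1, "Age": 20},
    {"Pclass": 1, "Age": None},
    {"Pclass": 2, "Age": 30},
    {"Pclass": 3, "Age": None},
]
means = mean_age_by_class(rows)
process = make_process(means)
```

The row-level function never touches the other rows; it only reads the lookup table, which is why the means have to be computed in a separate step.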
Hope this helps.
-
Hi @SohilRaceQuant,
The Formula language is not suitable for your use case.
One way to achieve what you want is to create a custom metric that calculates the average age for each 'Pclass' group excluding the empty age rows:
def process(dataset, partition_id):
    df = dataset.get_dataframe()
    # Mean age per Pclass; missing ages are skipped by pandas when averaging
    mean_age = df.groupby('Pclass')['Age'].mean()
    # One metric per class, named Pclass_1, Pclass_2, ...
    d = {}
    for pclass, value in mean_age.items():
        d["Pclass_" + str(pclass)] = float(value)
    return d
After that, you can access this metric from a custom Python function through the Python API. To use the dataikuapi package in a custom Python function of a Prepare recipe, you need to enable the option "Use a real Python process (instead of Jython)".
This example code shows how to retrieve the custom metric value:
import dataikuapi

host = "http://localhost:11200"
apiKey = "***********************"

# Fetch the metric values once when the code is loaded, not once per row
client = dataikuapi.DSSClient(host, apiKey)
project = client.get_project("COM31578")
dataset = project.get_dataset("input")
source_metrics = dataset.get_last_metric_values()

def process(row):
    if row['Age'] is None:
        # str() so the lookup also works when Pclass is not a string
        metric_id = "python:Pclass_" + str(row['Pclass']) + ":avg"
        return source_metrics.get_metric_by_id(metric_id)['lastValues'][0]['value']
    else:
        return row['Age']
I hope this helps.