
Need help with Formula language in Custom Python functions in the Prepare Recipe

SohilRaceQuant
Level 1

Hello,

I am using the Custom Python functions in the Prepare Recipe and want to use the Formula language.

'''
def process(row):
    if row['Age'] is None and str(row['Pclass']) == '1':
        return 111  # I want to actually include the mean age of all rows that have a Pclass=1 in the dataset
    elif row['Age'] is None and str(row['Pclass']) == '2':
        return 222  # I want to actually include the mean age of all rows that have a Pclass=2 in the dataset
    elif row['Age'] is None and str(row['Pclass']) == '3':
        return 333  # I want to actually include the mean age of all rows that have a Pclass=3 in the dataset
    else:
        return row['Age']
'''

How do I get this to work?

2 Replies
FlorentD
Developer Advocate

Hi,

I wouldn't recommend doing it this way: a custom Python function in the Prepare recipe is applied row by row, so it cannot see the other rows it would need to compute a mean.

The best way, IMO, is to use a Python recipe and do this with standard Python functions.
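
As a rough illustration of the Python recipe approach, here is a minimal pandas sketch that fills each missing Age with the mean Age of the row's Pclass. The column names come from the question; the function name `fill_age_with_class_mean` and the sample data are made up for this example. In an actual recipe the DataFrame would typically be read with `dataiku.Dataset(...).get_dataframe()` and written back with `write_with_schema(...)`, which is left out here so the snippet runs on its own.

```python
import pandas as pd


def fill_age_with_class_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing Age values with the mean Age of the row's Pclass."""
    out = df.copy()
    # transform("mean") returns one value per row: the mean Age of that row's
    # Pclass group, computed over the non-missing ages only.
    class_mean = out.groupby("Pclass")["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(class_mean)
    return out


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3],
    "Age": [40.0, None, 20.0, 30.0, None, None],
})
filled = fill_age_with_class_mean(sample)
print(filled)
```

Note that a class with no known ages at all (Pclass 3 in the sample) keeps its missing values, since there is nothing to average.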

If you really want to use the Prepare recipe, you will have some work to do first. Compute the mean age for each of your classes (with a Group recipe, for example). Then you can use the custom Python function as described here https://community.dataiku.com/t5/General-Discussion/Custom-Python-function-in-prepare-recipe/m-p/285... by "joining" the two datasets.
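
One possible reading of this, sketched under assumptions: the per-class means have already been joined onto the input (with a Join recipe on Pclass, for example), so every row carries its class mean in an extra column. The column name `Age_avg` and the helper `_is_missing` are hypothetical and should be adapted to whatever the Group and Join recipes actually produce.

```python
def _is_missing(value):
    """Treat None and empty/whitespace-only strings as missing (an assumption)."""
    return value is None or str(value).strip() == ""


def process(row):
    # 'Age_avg' is a hypothetical column holding the mean Age of the row's
    # Pclass, added to the dataset by the earlier group + join steps.
    if _is_missing(row.get("Age")):
        return row["Age_avg"]
    return row["Age"]
```

With this layout the function only reads values already present on the row, so it needs no access to the rest of the dataset.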

Hope this helps.

CatalinaS
Dataiker

Hi @SohilRaceQuant,

Formula language is not suitable for your use case. 

One way to achieve what you want is to create a custom metric that calculates the average age for each 'Pclass' group excluding the empty age rows:

def process(dataset, partition_id):
    df = dataset.get_dataframe()
    # Mean age per Pclass; pandas skips missing (NaN) ages by default
    mean_age = df.groupby('Pclass')['Age'].mean()

    # One metric per class, named e.g. "Pclass_1"
    d = {}
    for pclass, age in mean_age.items():
        d["Pclass_" + str(pclass)] = float(age)

    return d

After that, you can access this metric in a custom Python function using the Python API. To use the dataikuapi package in a custom Python function of a Prepare recipe, you need to enable the option "Use a real Python process (instead of Jython)".

Here is example code that shows how to retrieve the custom metric value:

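import dataikuapi

host = "http://localhost:11200"
apiKey = "***********************"

# Cache the metric values so DSS is queried once, not once per row
source_metrics = None

def process(row):
    global source_metrics
    if source_metrics is None:
        client = dataikuapi.DSSClient(host, apiKey)
        project = client.get_project("COM31578")
        dataset = project.get_dataset("input")
        source_metrics = dataset.get_last_metric_values()
    if row['Age'] is None:
        # str() in case Pclass is not read as a string
        metric_id = "python:Pclass_" + str(row['Pclass']) + ":avg"
        return source_metrics.get_metric_by_id(metric_id)['lastValues'][0]['value']
    else:
        return row['Age']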

I hope this helps.
