Need help with Formula language in Custom Python functions in the Prepare Recipe

SohilRaceQuant
SohilRaceQuant Registered Posts: 1

Hello,

I am using the Custom Python functions in the Prepare Recipe and want to use the Formula language.

'''
def process(row):
    if row['Age'] is None and str(row['Pclass']) == '1':
        return 111  # I want to actually include the mean age of all rows that have a Pclass=1 in the dataset
    elif row['Age'] is None and str(row['Pclass']) == '2':
        return 222  # I want to actually include the mean age of all rows that have a Pclass=2 in the dataset
    elif row['Age'] is None and str(row['Pclass']) == '3':
        return 333  # I want to actually include the mean age of all rows that have a Pclass=3 in the dataset
    else:
        return row['Age']
'''

How do I get this to work?

Answers

  • FlorentD
    FlorentD Dataiker, Dataiku DSS Core Designer, Registered Posts: 25 Dataiker

    Hi,

    I wouldn't recommend this approach: a custom Python function is applied row by row, so it cannot compute a mean over the other rows.

    The best way, IMO, is to use a Python recipe and do this with standard Python functions.
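
    A minimal sketch of what such a Python recipe could compute, assuming pandas is used and the columns are named Age and Pclass as in the question. The toy DataFrame is invented for illustration; in a real recipe the frame would come from the input dataset.

```python
import pandas as pd

# Toy stand-in for the input dataset (invented values). In a DSS Python
# recipe the frame would typically be read from the recipe's input dataset.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3],
    "Age": [40.0, None, 20.0, 30.0, None, None, 10.0],
})

# Mean age per class, aligned back to every row; mean() skips missing values.
class_mean_age = df.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages with the mean of their own class.
df["Age"] = df["Age"].fillna(class_mean_age)

print(df)
```

    Because transform("mean") returns one value per original row, no explicit join between the per-class means and the dataset is needed.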

    If you really want to use the Prepare recipe, you will have some work to do first. Compute the mean for each class (with a Group recipe, for example). Then you can use a custom Python function as described here https://community.dataiku.com/t5/General-Discussion/Custom-Python-function-in-prepare-recipe/m-p/28538, by "joining" the two datasets.
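
    As a rough illustration of the lookup idea (not necessarily the method from the linked thread), the per-class means can be held in a dictionary that the row function consults. The dictionary name and the numbers are hypothetical placeholders; in practice the values would come from the output of the Group recipe.

```python
# Hypothetical per-class mean ages, e.g. copied from the output of a Group
# recipe on Pclass. The numbers are placeholders, not real results.
MEAN_AGE_BY_PCLASS = {"1": 38.0, "2": 30.0, "3": 25.0}

def process(row):
    """Return the row's Age, or the mean age of its Pclass when Age is missing."""
    age = row["Age"]
    if age is None or age == "":
        # Returns None when the class has no precomputed mean.
        return MEAN_AGE_BY_PCLASS.get(str(row["Pclass"]))
    return age
```

    Hardcoding the means keeps the function simple, but the values go stale whenever the input data changes.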

    Hope this helps.

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    edited July 2024

    Hi @SohilRaceQuant,

    Formula language is not suitable for your use case.

    One way to achieve what you want is to create a custom metric that calculates the average age for each 'Pclass' group excluding the empty age rows:

    def process(dataset, partition_id):
        df = dataset.get_dataframe()
        # Mean age per Pclass; mean() skips rows where Age is empty (NaN).
        mean_age = df.groupby('Pclass')['Age'].mean()

        # One metric per class, e.g. {"Pclass_1": ..., "Pclass_2": ...}
        d = {}
        for pclass, age in mean_age.items():
            d["Pclass_" + str(pclass)] = float(age)

        return d

    After that, you can access this metric in a custom Python function using the Python API. To use the dataikuapi package in a custom Python function of a Prepare recipe, you need to enable the option "Use a real Python process (instead of Jython)".

    This example code shows how to retrieve the custom metric value:

    import dataikuapi

    host = "http://localhost:11200"
    apiKey = "***********************"

    # Fetch the metrics once when the code is loaded, not once per row.
    client = dataikuapi.DSSClient(host, apiKey)
    project = client.get_project("COM31578")
    dataset = project.get_dataset("input")
    source_metrics = dataset.get_last_metric_values()

    def process(row):
        if row['Age'] is None:
            # str() guards against Pclass arriving as a number.
            metric_id = "python:Pclass_" + str(row['Pclass']) + ":avg"
            return source_metrics.get_metric_by_id(metric_id)['lastValues'][0]['value']
        else:
            return row['Age']

    I hope this helps.
