Custom Python function in prepare recipe

Usersyed
Usersyed Partner, Registered Posts: 29 Partner

I have a dataset on which I want to use a prepare recipe. For a given value, I need to check which slab it falls in. Refer to the sample datasets below.

For example, when the age is 17, the corresponding value is "Teenager", and when the weight is 44, the value from Dataset 2 is "Good".

Dataset 2 will be in CSV format, and Dataset 1 will come from a live SQL database connection.


Dataset 1

ID    Age  Weight  age_output  weight_output
ABC   17   44      Teenager    Good
XYZ   35   52      Adult       Average
MNP   10   25      Child       Good

Dataset 2

Variable  lower  higher  Value
Age       0      10      Child
Age       11     18      Teenager
Age       19     100     Adult
Weight    0      50      Good
Weight    50     70      Average
Weight    71     100     Bulky

How do I achieve this using the custom Python function available in the "Prepare" recipe?


Best Answers

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    edited July 17 Answer ✓

    Hi @Usersyed,

    You can use a Prepare recipe with a custom Python function if you enable the option "Use a real Python process (instead of Jython)", which allows you to use the dataikuapi package.


    Below is example Python code that classifies each row based on age and weight.

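    import dataikuapi

    host = "http://localhost:11200"
    apiKey = "*********************"

    # Connect and load the slab rules once, instead of once per processed row.
    client = dataikuapi.DSSClient(host, apiKey)
    project = client.get_project("DATES")
    dataset = project.get_dataset("Dataset2")
    # Each rule is [Variable, lower, higher, Value].
    rules = [r[:4] for r in dataset.iter_rows()]


    def process(row):
        for variable, lower, higher, value in rules:
            if variable not in row:
                print("Missing column for attribute: " + variable)
                continue
            # Bounds are inclusive; cast in case values arrive as strings.
            if int(lower) <= int(row[variable]) <= int(higher):
                row[variable + "_output"] = value
        return row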

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    Answer ✓
    Can you try disabling SSL verification by setting client._session.verify = False right after creating the client?

    client = dataikuapi.DSSClient(host, apiKey)
    client._session.verify = False

Answers

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker

    Hi @Usersyed,

    Please confirm whether the classification data will be stored in a dataset and whether it changes over time. If so, it is better to do a left join between Dataset 1 and Dataset 2 with multiple conditions:

    Dataset1.Gender = Dataset2.Gender and Dataset1.Age <= Dataset2.higher and Dataset1.Age >= Dataset2.lower. If no rule matches, the left join leaves the value field NULL.

    If you use a custom Python function, it is executed row by row, which means running a select on Dataset2 for each row of Dataset1. This will very likely lead to lower performance.

    If the rules of Dataset2 can be hard-coded, then you can use a Prepare recipe with either a custom function or a custom Python function.
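
    With hard-coded rules, a custom Python function could look roughly like the sketch below. It assumes the processor runs in row mode and passes each row as a dict-like object, as in the accepted answer; RULES is a hypothetical name, and the thresholds are copied by hand from Dataset 2.

```python
# Slab rules copied by hand from Dataset 2: (lower, higher, value), bounds inclusive.
# RULES is a hypothetical name used only for this sketch.
RULES = {
    "Age": [(0, 10, "Child"), (11, 18, "Teenager"), (19, 100, "Adult")],
    "Weight": [(0, 50, "Good"), (50, 70, "Average"), (71, 100, "Bulky")],
}


def process(row):
    """Add a <Variable>_output column for every variable that has rules."""
    for variable, slabs in RULES.items():
        raw = row.get(variable)
        if raw is None or raw == "":
            continue  # nothing to classify for this row
        number = float(raw)  # cell values may arrive as strings
        for lower, higher, value in slabs:
            if lower <= number <= higher:
                row[variable + "_output"] = value
                break  # first matching slab wins
    return row
```

    Note that a weight of exactly 50 falls in two slabs of Dataset 2; this sketch keeps the first match.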

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    The classification data will be stored as a CSV (the values will be hard-coded). In this case, how do I go about writing the custom Python function?

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    @CatalinaS


    I have updated my post. Please let me know your thoughts on this.

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    @CatalinaS

    I tried the approach of joining the datasets that you mentioned, but I was getting duplicate rows since there are similar slabs for multiple variables.

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker

    I noticed you added a second variable called Weight. In this case you can add one additional join per variable. Each join needs a condition on the variable name (Dataset_2.variable = 'variable_name'). This way you should get only one row for each input row.
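
    The effect of the variable-name condition can be reproduced outside DSS with a small SQL sketch. The example below uses Python's built-in sqlite3 module and the sample data from the question. The table names and aliases (dataset1, dataset2, d, a, w) are illustrative only; in DSS the same conditions would be configured in the join recipe rather than written by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset1 (ID TEXT, Age INTEGER, Weight INTEGER);
CREATE TABLE dataset2 (Variable TEXT, lower INTEGER, higher INTEGER, Value TEXT);
INSERT INTO dataset1 VALUES ('ABC', 17, 44), ('XYZ', 35, 52), ('MNP', 10, 25);
INSERT INTO dataset2 VALUES
    ('Age', 0, 10, 'Child'), ('Age', 11, 18, 'Teenager'), ('Age', 19, 100, 'Adult'),
    ('Weight', 0, 50, 'Good'), ('Weight', 50, 70, 'Average'), ('Weight', 71, 100, 'Bulky');
""")

# One left join per variable; the Variable = '...' condition keeps Age rules
# from matching Weight values (and vice versa), which is what causes duplicates.
query = """
SELECT d.ID, d.Age, d.Weight, a.Value AS age_output, w.Value AS weight_output
FROM dataset1 AS d
LEFT JOIN dataset2 AS a
       ON a.Variable = 'Age' AND d.Age BETWEEN a.lower AND a.higher
LEFT JOIN dataset2 AS w
       ON w.Variable = 'Weight' AND d.Weight BETWEEN w.lower AND w.higher
ORDER BY d.ID
"""
rows = conn.execute(query).fetchall()
for r in rows:
    print(r)
```

    With the sample data this yields one row per ID. A value that sits on a shared boundary (for example a weight of exactly 50) would still match two slabs, so the slabs themselves should not overlap.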

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    Ok, got it.
    Thanks for this

    But is there a way of doing this using a custom Python function?
    I was able to do it with a custom Python function, but the problem was that I had to hard-code the if/else conditions.

    Is there a way to reference values from another dataset in a custom Python function?
    I tried loading Dataset 2 in my custom function and using its values, but it did not work.

    Any alternate approaches for this?

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker

    Hi @Usersyed,

    It does not seem possible to use a Prepare recipe with a custom Python function without hard-coding Dataset2, since you will not be able to retrieve the values from another dataset.

    An alternative would be to use a Python recipe with both datasets as input.
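
    A Python recipe would read both inputs and apply the rules in one pass. The sketch below shows only the core logic on in-memory data: the rules are parsed from CSV text with the standard csv module, and the rows are plain dicts. Reading the inputs and writing the output through the dataiku package is omitted here, and load_rules and classify are hypothetical helper names.

```python
import csv
import io

# Stand-in for the contents of the Dataset 2 CSV file.
RULES_CSV = """Variable,lower,higher,Value
Age,0,10,Child
Age,11,18,Teenager
Age,19,100,Adult
Weight,0,50,Good
Weight,50,70,Average
Weight,71,100,Bulky
"""


def load_rules(text):
    """Parse the rules CSV into {variable: [(lower, higher, value), ...]}."""
    rules = {}
    for rec in csv.DictReader(io.StringIO(text)):
        slab = (float(rec["lower"]), float(rec["higher"]), rec["Value"])
        rules.setdefault(rec["Variable"], []).append(slab)
    return rules


def classify(rows, rules):
    """Return copies of rows with a <variable>_output key per rule variable."""
    out = []
    for row in rows:
        new_row = dict(row)
        for variable, slabs in rules.items():
            if variable not in row:
                continue
            number = float(row[variable])
            # First matching slab wins; None when no slab matches.
            new_row[variable.lower() + "_output"] = next(
                (v for lo, hi, v in slabs if lo <= number <= hi), None
            )
        out.append(new_row)
    return out


dataset1 = [
    {"ID": "ABC", "Age": 17, "Weight": 44},
    {"ID": "XYZ", "Age": 35, "Weight": 52},
    {"ID": "MNP", "Age": 10, "Weight": 25},
]
for labelled in classify(dataset1, load_rules(RULES_CSV)):
    print(labelled)
```

    Because the rules are loaded once and reused for every row, this avoids the per-row lookup cost of the Prepare recipe approach.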

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    What would be the values of API key and host?

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker

    API key and host are required when you want to use Dataiku APIs outside the platform.

    Host is your DSS URL that consists of DSS_HOST and DSS_PORT. For example the home URL "http://localhost:11200" uses the DSS_HOST localhost and port 11200.

    API key is your own personal API key for use on all projects. API keys can be created in Profile & Settings > API keys.

    This is explained in the tutorial "The APIs outside Dataiku DSS".

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    Ok, thanks for this.
    I was able to create the API key. I am running Dataiku on a custom URL, so will that custom URL be my host?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Yes, the host is the URL you use to connect to DSS.

    However, since you are running this code inside DSS, you don't need to specify the host or an API key. You can just use:

    import dataiku
    client = dataiku.api_client()

    https://doc.dataiku.com/dss/latest/python-api/client.html#creating-a-client-from-inside-dss


  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    I tried this approach, but I am ending up with this error:

    class 'requests.exceptions.SSLError' : [Errno None] None: 'None'

    Any idea why this is happening?

  • Usersyed
    Usersyed Partner, Registered Posts: 29 Partner

    Thanks, it worked!
