
Custom Python function in prepare recipe

Solved!
Usersyed
Level 3

I have a dataset on which I want to use a Prepare recipe. For a given value, I need to check which slab it falls in. See the sample datasets below.

For example, when the age is 17, the value corresponding to this age is "Teenager", and when the weight is 44, the value from Dataset 2 is "Good".

Dataset 2 will be in CSV format, and Dataset 1 will come from a live SQL database connection.


Dataset 1

ID   Age  Weight  age_output  weight_output
ABC  17   44      Teenager    Good
XYZ  35   52      Adult       Average
MNP  10   25      Child       Good

 

Dataset 2

Variable  lower  higher  Value
Age       0      10      Child
Age       11     18      Teenager
Age       19     100     Adult
Weight    0      50      Good
Weight    50     70      Average
Weight    71     100     Bulky

 

 

How do I achieve this using the custom Python function available in the "Prepare" recipe?


15 Replies
CatalinaS
Dataiker

Hi @Usersyed,

Please confirm whether the classification data will be stored in a dataset and whether it changes over time. If so, it is better to do a left join between Dataset 1 and Dataset 2 with multiple conditions:

Dataset1.Gender = Dataset2.Gender and Dataset1.Age <= Dataset2.higher and Dataset1.Age >= Dataset2.lower. If no rule matches, the left join leaves the value field NULL.

A custom Python function is executed row by row, which means running a select on Dataset2 for each row of Dataset1. That will very likely perform worse.

If the rules of Dataset2 can be hard-coded, then you can use a Prepare recipe with either a custom function or a custom Python function.
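
If the slabs can be hard-coded, a custom Python function in row mode might look like the sketch below. It assumes each row arrives as a dict keyed by column name, that the columns are named Age and Weight as in Dataset 1, and that the slab bounds are inclusive. The SLABS name and the lower-cased output column names are illustrative choices, not something DSS requires.

```python
# Hard-coded copy of Dataset 2: variable -> list of (lower, higher, value).
# Bounds are treated as inclusive; the first matching slab wins.
SLABS = {
    "Age": [(0, 10, "Child"), (11, 18, "Teenager"), (19, 100, "Adult")],
    "Weight": [(0, 50, "Good"), (50, 70, "Average"), (71, 100, "Bulky")],
}


def process(row):
    """Add <variable>_output columns to one row (a dict of column -> value)."""
    for variable, slabs in SLABS.items():
        raw = row.get(variable)
        if raw is None or raw == "":
            continue  # column missing or empty: leave the output unset
        value = float(raw)  # cell values may arrive as strings
        for lower, higher, label in slabs:
            if lower <= value <= higher:
                row[variable.lower() + "_output"] = label
                break
    return row
```

Note that the sample slabs overlap at a weight of 50 (both "Good" and "Average"); with first-match-wins, this sketch would return "Good" there.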

Usersyed
Level 3
Author

The classification data will be stored as a CSV (the values will be hard-coded). In this case, how do I go about writing the custom Python function?

Usersyed
Level 3
Author

@CatalinaS 
I tried the approach of joining the datasets that you mentioned, but I was getting duplicate rows because there are similar slabs for multiple variables.

CatalinaS
Dataiker

I noticed you added a second variable called Weight. In this case you can add an additional join for each variable. Each join needs a condition on the variable name (Dataset_2.variable = 'variable_name'). This way you should get only one row for each input row.
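
The duplicate rows come from a range condition that ignores the Variable column: an age of 17 also falls inside the Weight slab 0 to 50. A small plain-Python sketch of the idea (not the DSS join recipe itself; `RULES` and `matches` are illustrative names) shows how adding the variable-name condition leaves one match per variable:

```python
# Dataset 2 as (variable, lower, higher, value) tuples.
RULES = [
    ("Age", 0, 10, "Child"),
    ("Age", 11, 18, "Teenager"),
    ("Age", 19, 100, "Adult"),
    ("Weight", 0, 50, "Good"),
    ("Weight", 50, 70, "Average"),
    ("Weight", 71, 100, "Bulky"),
]


def matches(value, variable=None):
    """Return the labels whose inclusive range contains value.

    With variable=None only the range condition is applied (the join that
    produced duplicates); otherwise the rule must also be for that variable.
    """
    return [
        label
        for var, lower, higher, label in RULES
        if lower <= value <= higher and (variable is None or var == variable)
    ]


print(matches(17))         # range only: an Age rule and a Weight rule both match
print(matches(17, "Age"))  # range plus variable name: a single match
```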

Usersyed
Level 3
Author

Ok, got it.
Thanks for this

But is there a way of doing this with a custom Python function?
I was able to do it with a custom Python function, but the problem was that I had to hard-code the if/else conditions.

Is there a way to reference values from another dataset in a custom Python function?
I tried loading Dataset 2 in my custom function and using its values, but it did not work.

Any alternate approaches for this? 

CatalinaS
Dataiker

Hi @Usersyed,

It does not seem possible to use a Prepare recipe with a custom Python function without hard-coding Dataset2, since the function cannot retrieve values from another dataset.

Another alternative would be to use a python recipe with both datasets as input. 
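
A Python recipe could take roughly the following shape. This is a sketch in plain Python so that it can run anywhere: the rules are parsed from CSV text with the header `Variable,lower,higher,Value` (an assumption based on the Dataset 2 sample), and `load_rules` and `classify` are hypothetical helper names. Inside DSS the two inputs would be read from the recipe's input datasets instead of the inline sample data.

```python
import csv
import io

# Inline copy of Dataset 2; the header names are assumed from the sample.
RULES_CSV = """Variable,lower,higher,Value
Age,0,10,Child
Age,11,18,Teenager
Age,19,100,Adult
Weight,0,50,Good
Weight,50,70,Average
Weight,71,100,Bulky
"""


def load_rules(csv_text):
    """Parse the rules CSV into {variable: [(lower, higher, value), ...]}."""
    rules = {}
    for rec in csv.DictReader(io.StringIO(csv_text)):
        bounds = (float(rec["lower"]), float(rec["higher"]), rec["Value"])
        rules.setdefault(rec["Variable"], []).append(bounds)
    return rules


def classify(rows, rules):
    """Yield a copy of each row with one <variable>_output column per variable."""
    for row in rows:
        out = dict(row)
        for variable, slabs in rules.items():
            if variable not in row:
                continue
            value = float(row[variable])
            # First slab whose inclusive range contains the value, else None
            out[variable.lower() + "_output"] = next(
                (label for lo, hi, label in slabs if lo <= value <= hi), None
            )
        yield out


# Inline copy of Dataset 1 (input columns only).
dataset1 = [
    {"ID": "ABC", "Age": 17, "Weight": 44},
    {"ID": "XYZ", "Age": 35, "Weight": 52},
    {"ID": "MNP", "Age": 10, "Weight": 25},
]
for result in classify(dataset1, load_rules(RULES_CSV)):
    print(result)
```

Because the rules are parsed once and reused for every row, this avoids a per-row lookup against Dataset 2.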

CatalinaS
Dataiker

Hi @Usersyed ,

There is a way to use a Prepare recipe with a custom Python function: enable the option "Use a real Python process (instead of Jython)", which allows you to use the dataikuapi package.


The example Python code below does the classification based on age and weight.

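import dataikuapi

host = "http://localhost:11200"
apiKey = "*********************"


def process(row):
    # Connect to DSS and open the lookup dataset (this runs once per input row)
    client = dataikuapi.DSSClient(host, apiKey)
    project = client.get_project("DATES")
    dataset = project.get_dataset("Dataset2")
    # Each lookup row r is [Variable, lower, higher, Value]
    for r in dataset.iter_rows():
        if r[0] in row:
            value = int(row[r[0]])
            # int() guards against the bounds arriving as strings
            if int(r[1]) <= value <= int(r[2]):
                row[r[0] + "_output"] = r[3]
            print(r)  # debug output
        else:
            print("Missing column for attribute:  " + r[0])
    return row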
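
The function above opens a client and streams Dataset2 again for every input row. One possible refinement, sketched below under assumptions, is to fetch the rules once and reuse them, assuming the Python process stays alive across rows. `fetch_rule_rows` is a hypothetical stand-in that returns hard-coded rows so the sketch runs outside DSS; in the recipe it would wrap the `dataset.iter_rows()` call from the code above.

```python
from functools import lru_cache


def fetch_rule_rows():
    """Hypothetical stand-in for iterating over Dataset2 through the API.

    Each row is [Variable, lower, higher, Value], as in the sample dataset.
    """
    return [
        ["Age", "0", "10", "Child"],
        ["Age", "11", "18", "Teenager"],
        ["Age", "19", "100", "Adult"],
        ["Weight", "0", "50", "Good"],
        ["Weight", "50", "70", "Average"],
        ["Weight", "71", "100", "Bulky"],
    ]


@lru_cache(maxsize=1)
def get_rules():
    """Load and convert the rules once; later calls reuse the cached tuple."""
    return tuple(
        (name, int(lower), int(higher), label)
        for name, lower, higher, label in fetch_rule_rows()
    )


def process(row):
    # Same matching logic as above, but without a fetch per input row
    for name, lower, higher, label in get_rules():
        if name in row and lower <= int(row[name]) <= higher:
            row[name + "_output"] = label
    return row
```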

 

Usersyed
Level 3
Author

What would be the values of API key and host?

CatalinaS
Dataiker

An API key and a host are required when you want to use the Dataiku APIs from outside the platform.

The host is your DSS URL, which consists of DSS_HOST and DSS_PORT. For example, the home URL "http://localhost:11200" uses the DSS_HOST localhost and the DSS_PORT 11200.

The API key is your own personal API key, usable on all projects. API keys can be created in Profile & Settings > API keys.

This is explained in the tutorial "The APIs outside Dataiku DSS".

 

Usersyed
Level 3
Author

Ok, thanks for this.
I was able to create the API key. I am running Dataiku on a custom URL, so will that custom URL be my host?

AlexT
Dataiker

Yes, the URL would be the hostname you use to connect to DSS.

However, since you are running this code inside DSS, you don't need to specify the host. You can just use:

import dataiku

client = dataiku.api_client()

https://doc.dataiku.com/dss/latest/python-api/client.html#creating-a-client-from-inside-dss


Usersyed
Level 3
Author

I tried this approach, but I end up with this error:

class 'requests.exceptions.SSLError' : [Errno None] None: 'None'

Any idea why this is happening?

CatalinaS
Dataiker
Can you try disabling SSL verification by setting client._session.verify = False?

client = dataikuapi.DSSClient(host, apiKey)
client._session.verify = False

Usersyed
Level 3
Author

Thanks, it worked!

Usersyed
Level 3
Author

@CatalinaS 

I have updated my post. Please let me know your thoughts on this.
