Custom Python function in prepare recipe
I have a dataset on which I want to use a Prepare recipe. Based on a given value, I need to check which slab it falls in; refer to the sample datasets below.
For example, when Age is 17, the value corresponding to this age is "Teenager", and when Weight is 44, the value from Dataset 2 is "Good".
Dataset 2 will be in CSV format, and Dataset 1 will come from a live SQL database connection.
Dataset 1
| ID  | Age | Weight | age_output | weight_output |
| ABC | 17  | 44     | Teenager   | Good          |
| XYZ | 35  | 52     | Adult      | Average       |
| MNP | 10  | 25     | Child      | Good          |
Dataset 2
| Variable | lower | higher | Value    |
| Age      | 0     | 10     | Child    |
| Age      | 11    | 18     | Teenager |
| Age      | 19    | 100    | Adult    |
| Weight   | 0     | 50     | Good     |
| Weight   | 50    | 70     | Average  |
| Weight   | 71    | 100    | Bulky    |
How do I achieve this using the custom Python function processor in the Prepare recipe?
Best Answers
-
Hi @Usersyed,
There is a way to use a Prepare recipe with a custom Python function if you enable the option "Use a real Python process (instead of Jython)", which will allow you to use the dataikuapi package.
Below is example Python code that does the classification based on age and weight:

import dataikuapi

host = "http://localhost:11200"
apiKey = "*********************"

def process(row):
    client = dataikuapi.DSSClient(host, apiKey)
    project = client.get_project("DATES")
    dataset = project.get_dataset("Dataset2")
    # Each r is (Variable, lower, higher, Value); values come back as strings
    for r in dataset.iter_rows():
        if r[0] in row:
            if int(r[1]) <= int(row[r[0]]) <= int(r[2]):
                row[r[0] + "_output"] = r[3]
        else:
            print("Missing column for attribute: " + r[0])
    return row
-
Can you try disabling SSL verification by setting client._session.verify = False?
client = dataikuapi.DSSClient(host, apiKey)
client._session.verify = False
Answers
-
Hi @Usersyed,
Please confirm whether the classification data will be stored in a dataset and whether it can change. In that case it is better to do a left join between Dataset 1 and Dataset 2 with multiple conditions:
Dataset2.Variable = 'Age' and Dataset1.Age >= Dataset2.lower and Dataset1.Age <= Dataset2.higher. If no rules match, the left join leaves the value field NULL.
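As a rough sketch, that multi-condition left join for the Age variable could be written in pandas as follows (the sample data is copied from the datasets above; since merge cannot express range conditions directly, a cross merge plus a filter is used):

```python
import pandas as pd

# Sample data copied from Dataset 1 and Dataset 2 above
df1 = pd.DataFrame({"ID": ["ABC", "XYZ", "MNP"],
                    "Age": [17, 35, 10],
                    "Weight": [44, 52, 25]})
df2 = pd.DataFrame({"Variable": ["Age", "Age", "Age", "Weight", "Weight", "Weight"],
                    "lower":    [0, 11, 19, 0, 50, 71],
                    "higher":   [10, 18, 100, 50, 70, 100],
                    "Value":    ["Child", "Teenager", "Adult", "Good", "Average", "Bulky"]})

# Restrict Dataset 2 to the 'Age' rows, cross-join, then keep the slab
# whose [lower, higher] range contains the age
age_rules = df2[df2["Variable"] == "Age"]
matched = df1.merge(age_rules, how="cross")
matched = matched[(matched["Age"] >= matched["lower"]) & (matched["Age"] <= matched["higher"])]

# Left join back so unmatched rows keep a NULL age_output
result = df1.merge(matched[["ID", "Value"]].rename(columns={"Value": "age_output"}),
                   on="ID", how="left")
```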
If you use the custom Python function, it will be executed on a per-row basis, which means executing a select on Dataset 2 for each row of Dataset 1, and that will very likely lead to lower performance.
If the rules of Dataset 2 can be hard-coded, then you can use a Prepare recipe with either a formula or a custom Python function.
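A minimal sketch of such a hard-coded custom Python function for the Prepare recipe, assuming the processor receives each row as a dict of string values and using the slab values from Dataset 2 above:

```python
# Slab rules hard-coded from Dataset 2: (lower, higher, value) per variable
RULES = {
    "Age": [(0, 10, "Child"), (11, 18, "Teenager"), (19, 100, "Adult")],
    "Weight": [(0, 50, "Good"), (50, 70, "Average"), (71, 100, "Bulky")],
}

def process(row):
    # row is the Prepare-recipe row, a dict whose values are strings
    for variable, slabs in RULES.items():
        if variable in row and row[variable] not in (None, ""):
            value = int(row[variable])
            for lower, higher, label in slabs:
                if lower <= value <= higher:
                    row[variable.lower() + "_output"] = label
                    break
    return row
```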
-
So the classification data will be stored as a CSV (values will be hard-coded). In this case, how do I go about writing the custom Python function?
-
@CatalinaS
I have updated my post. Please let me know your thoughts on this.
-
@CatalinaS
I tried the join approach you mentioned, but I was getting duplicate rows since there are similar slabs for multiple variables.
-
I noticed you added a second variable called Weight. In this case you can add an additional join for each distinct variable, each time adding a join condition on the variable name (Dataset_2.Variable = 'variable_name'). This way you should get only one row for each.
-
Ok, got it, thanks for this.
But is there a way of doing this using a custom Python function?
I was able to do it with a custom Python function, but the problem was that I had to hard-code the if/else conditions.
Is there a way to reference values from another dataset in a custom Python function?
I tried loading Dataset 2 in my custom function and using its values, but it did not work.
Any alternate approaches for this?
-
Hi @Usersyed,
It does not seem possible to use a Prepare recipe with a custom Python function without hard-coding Dataset 2, since you will not be able to retrieve the values from another dataset there.
Another alternative would be to use a Python recipe with both datasets as input.
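A rough sketch of that Python-recipe alternative, with the range lookup factored into a plain pandas function (the dataset names in the commented DSS part are placeholders for your own):

```python
import pandas as pd

def classify(df1, rules):
    """Add a <variable>_output column to df1 for each variable in the slab table."""
    out = df1.copy()
    for var in rules["Variable"].unique():
        r = rules[rules["Variable"] == var]
        # Cross join then range filter emulates the multi-condition left join
        m = out[["ID", var]].merge(r, how="cross")
        m = m[(m[var] >= m["lower"]) & (m[var] <= m["higher"])]
        out[var.lower() + "_output"] = out["ID"].map(m.set_index("ID")["Value"])
    return out

# Inside DSS, the recipe would read and write through the dataiku package, e.g.:
#   import dataiku
#   df1 = dataiku.Dataset("Dataset1").get_dataframe()
#   rules = dataiku.Dataset("Dataset2").get_dataframe()
#   dataiku.Dataset("Dataset1_classified").write_with_dataframe(classify(df1, rules))
```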
-
What would be the values of API key and host?
-
API key and host are required when you want to use the Dataiku APIs outside the platform.
Host is your DSS URL, consisting of DSS_HOST and DSS_PORT. For example, the home URL "http://localhost:11200" uses the DSS_HOST localhost and port 11200.
The API key is your own personal API key for use on all projects. API keys can be created under Profile & Settings > API keys.
This is explained in the tutorial "The APIs outside Dataiku DSS".
-
Ok, thanks for this.
I was able to create the API key. I am running Dataiku at a custom URL, so will that custom URL be my host?
-
Alexandru (Dataiker):
Yes, the URL would be the hostname you use to connect to DSS.
However, since you are running this code inside DSS, you don't need to specify the host; you can just use:
client = dataiku.api_client()
https://doc.dataiku.com/dss/latest/python-api/client.html#creating-a-client-from-inside-dss
-
I tried this approach but I am ending up with this error:
<class 'requests.exceptions.SSLError'>: [Errno None] None: 'None'
Any idea why this is happening?
-
Thanks, it worked!