Get unique values of a column from the Python API in a webapp without loading the entire dataset in memory

yjagger
Level 2

Hi,

In my flow, a Python recipe processes the data in two datasets and stores the output in another dataset.

Column1, Column2, Column3...Column N

From my webapp in Dataiku, I want to get the list of unique values of Column1 without calling the get_dataframe() API, which loads the entire dataframe into memory, and without using the streaming API, which would require me to implement the logic of iterating over the rows and collecting the unique values of Column1 myself.

Can you suggest a way to do it?


Operating system used: Windows

3 Replies
JakeA
Dataiker

Hello!

If you only need a single column and you know which one, you could use a Prepare recipe to create a separate dataset that contains only that column. You can then load that dataset as a far smaller dataframe: N rows by 1 column instead of the original N rows by M columns.

Alternatively, you could use SQL to select only the specific column you need, which is another lightweight solution.

Doing this purely in Python would be cumbersome and is not recommended. The only approach that might work is chunking, for which you may find some documentation here.
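
For reference, the chunking idea can be sketched roughly as below. This is a minimal sketch, not a definitive implementation: the helper name `unique_values_from_chunks` is hypothetical, and it accepts any iterable of chunks that support `chunk[column]` (pandas dataframes, or plain dicts as in the demo). The commented-out lines assume Dataiku's `Dataset.iter_dataframes()` with its `chunksize` and `columns` parameters, and the dataset name is a placeholder.

```python
from typing import Any, Hashable, Iterable, List


def unique_values_from_chunks(chunks: Iterable[Any], column: str) -> List[Hashable]:
    """Collect the distinct values of `column`, one chunk at a time.

    Only the current chunk and the distinct values seen so far are held
    in memory. Values are returned in order of first appearance.
    """
    seen = {}  # a dict preserves insertion order
    for chunk in chunks:
        for value in chunk[column]:
            seen.setdefault(value, None)
    return list(seen)


# Inside DSS this might be used as follows (assumed usage, placeholder names):
#   import dataiku
#   ds = dataiku.Dataset("my_dataset")
#   chunks = ds.iter_dataframes(chunksize=50000, columns=["Column1"])
#   values = unique_values_from_chunks(chunks, "Column1")

# Stand-alone demo with plain dicts standing in for dataframe chunks.
demo_chunks = [
    {"Column1": ["a", "b", "a"]},
    {"Column1": ["b", "c"]},
]
demo_values = unique_values_from_chunks(demo_chunks, "Column1")
```

Memory use then scales with the chunk size plus the number of distinct values rather than with the full dataset. Missing values (NaN) may need separate handling, since NaN does not compare equal to itself.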

yjagger
Level 2
Author

Hi Jake,

Thanks for your answer. I have created a SQL recipe that stores the distinct values of the desired column in another dataset, which works for this situation, although I was hoping to avoid creating an extra dataset.

When you suggest using SQL to select only the data I need, I am curious how I can execute the SQL query from the webapp (a Dash webapp running in Dataiku) to get the result.

If possible, could you share any resource or package I could use for this?

yjagger
Level 2
Author

Just replying for anyone else who needs this. Fetching the results from the dataset with a SQL query in the Python code (recipe or webapp) looked like the best approach.

Below is a link that gives an example. Using the Dataiku SQL executor described there, results can be returned as a DataFrame.

https://doc.dataiku.com/dss/latest/python-api/sql.html
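
For anyone who wants a starting point, here is a rough sketch of that approach. It assumes the dataset is stored on a SQL connection and uses `SQLExecutor2` and its `query_to_df()` method from the `dataiku` package. The function names, the dataset name, and the table name are hypothetical placeholders, and the double-quoted identifiers assume a database that follows standard SQL quoting (MySQL, for example, uses backticks by default).

```python
def build_distinct_query(table: str, column: str) -> str:
    """Build a SELECT DISTINCT statement with double-quoted identifiers."""
    def quote(identifier: str) -> str:
        # Escape embedded double quotes by doubling them (standard SQL).
        return '"' + identifier.replace('"', '""') + '"'

    return f"SELECT DISTINCT {quote(column)} FROM {quote(table)}"


def fetch_distinct_values(dataset_name: str, table: str, column: str) -> list:
    """Run the query on the dataset's SQL connection and return the values."""
    # Imported here because the dataiku package is only available inside DSS.
    import dataiku
    from dataiku import SQLExecutor2

    executor = SQLExecutor2(dataset=dataiku.Dataset(dataset_name))
    df = executor.query_to_df(build_distinct_query(table, column))
    # Read the first column by position, since the database may change
    # the case of the returned column name.
    return df.iloc[:, 0].tolist()


# Example call from a webapp backend (placeholder names):
#   values = fetch_distinct_values("my_dataset", "my_table", "Column1")
```

The database computes the distinct values, so only the result (one row per unique value) is transferred to the webapp.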
