
Selective reading from a dataset into a pandas dataframe

SuhailS7
Level 2

I have a large Dataiku dataset in the Flow which I am reading into a pandas dataframe using the get_dataframe() method. The problem is that I have to read 4 such datasets into memory, which slows down the processing. I don't need the entire dataset; only a subset has to be read into the pandas dataframe. Is there a way I can do this?


Currently, I am reading the entire dataset into pandas and then slicing the resulting dataframe, which I would like to avoid.


Thanks!

3 Replies
nmadhu20

Hi @SuhailS7 ,

Yes, you can easily read the dataset in chunks of a desired size with the help of the code below.

import dataiku

mydataset = dataiku.Dataset("myname")

for df in mydataset.iter_dataframes(chunksize=10000):
    # df is a dataframe of at most 10,000 rows
    ...

You can refer to the following doc for more info - https://doc.dataiku.com/dss/latest/python-api/datasets-data.html
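To show how the chunks can be filtered and recombined without ever holding the full dataset in memory, here is a minimal runnable sketch. A plain generator stands in for `iter_dataframes()` (which needs a DSS connection), and the `value` column and threshold are hypothetical:

```python
import pandas as pd

# Stand-in for Dataset.iter_dataframes(): yields the data in fixed-size chunks.
# Inside DSS this loop would be:
#     for chunk in mydataset.iter_dataframes(chunksize=10000):
def iter_chunks(df, chunksize):
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]

data = pd.DataFrame({"value": range(10)})

# Keep only the matching rows from each chunk, then combine the small pieces.
filtered = pd.concat(
    [chunk[chunk["value"] >= 7] for chunk in iter_chunks(data, chunksize=4)],
    ignore_index=True,
)
```

Only the filtered pieces are accumulated, so peak memory is roughly one chunk plus the matching rows, rather than the whole dataset.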

Best,

Madhuleena 

SuhailS7
Level 2
Author

Thanks @nmadhu20, but what if I need specific rows for my processing? For example, I need only the rows that contain a specific value. If the Dataiku dataset has 1,000,000 records, there may be only 500 rows that satisfy the condition I am looking for, and I don't want to read all 1,000,000 rows into a pandas dataframe for this purpose.

Currently I am performing 2 operations, loading the data into memory and then slicing the dataframe. I was hoping it would be possible to do this in one operation.

nmadhu20

To my knowledge, you can use the iter_rows() function of the Dataset object to iterate over the rows without loading the entire dataframe into memory, since you would not need get_dataframe(). You can then apply the required filter, but note that the processing happens at the level of each individual row.

iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

import dataiku

mydataset = dataiku.Dataset("myname")

for row in mydataset.iter_rows():
    # apply the required filter on a row basis
    ...

This returns each row as a dictionary. More info in this link - https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html
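As a concrete illustration of the row-level filtering described above, here is a small runnable sketch. A list of dicts stands in for the rows yielded by `iter_rows()` (which needs a DSS connection), and the `status` column and its values are hypothetical:

```python
import pandas as pd

# Stand-in for Dataset.iter_rows(), which yields one dict-like object per row.
# Inside DSS this would be:
#     for row in mydataset.iter_rows():
rows = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "open"},
]

# Collect only the rows that satisfy the condition, then build a small
# dataframe from just those matches.
matching = [row for row in rows if row["status"] == "open"]
subset = pd.DataFrame(matching)
```

The dataframe is built only from the matching rows (the 500 out of 1,000,000 in the example above), so the full dataset is never materialized in pandas.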

Hope this helps!

Madhuleena 
