Selective reading from a dataset into a pandas dataframe

SuhailS7
SuhailS7 Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 4

I have a big dataiku dataset in the flow which I am then reading into a pandas dataframe using the get_dataframe() method. The problem happens when I have to read 4 such datasets into memory which slows down the processing. I don't need to read the entire dataset, only a subset has to be read into the pandas dataframe. Is there a way I can do this?

Currently, I am reading the entire data into pandas and then slicing the pandas dataframe and I would like to avoid doing this.

Thanks!

Answers

  • nmadhu20
    nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron
    edited July 17

    Hi @SuhailS7,

    Yes, you can read the dataset in chunks of a chosen size with the help of the below code.

    from dataiku import Dataset

    mydataset = Dataset("myname")

    for df in mydataset.iter_dataframes(chunksize=10000):
        # df is a pandas DataFrame of at most 10,000 rows
        ...

    You can refer to the following doc for more info - https://doc.dataiku.com/dss/latest/python-api/datasets-data.html
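    For instance, you can filter each chunk as it is read and keep only the matching rows, so at most one chunk plus the accumulated matches sit in memory at a time. Below is a minimal sketch of that pattern; the chunk iterator is simulated with plain pandas so it runs anywhere (in DSS you would use mydataset.iter_dataframes(chunksize=...) instead), and the "status" column and "active" value are made-up placeholders:

```python
import pandas as pd

def iter_chunks():
    # Stand-in for mydataset.iter_dataframes(chunksize=...):
    # yields small pandas DataFrames one at a time.
    for start in range(0, 30, 10):
        yield pd.DataFrame({
            "id": range(start, start + 10),
            "status": ["active" if i % 5 == 0 else "inactive"
                       for i in range(start, start + 10)],
        })

# Keep only the matching rows from each chunk, then combine them.
matches = [chunk[chunk["status"] == "active"] for chunk in iter_chunks()]
result = pd.concat(matches, ignore_index=True)
print(len(result))  # 6 matching rows out of 30
```

    This way the full dataset is never materialized in memory; only the rows that pass the filter are concatenated at the end.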

    Best,

    Madhuleena

  • SuhailS7
    SuhailS7 Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 4

    Thanks @nmadhu20, but what if I only need specific rows for my processing? For example, I need only the rows that contain a specific value. If the Dataiku dataset has 1,000,000 records, there may be only 500 rows that satisfy the condition I am looking for, and I don't want to read all 1,000,000 rows into a pandas dataframe for this purpose.

    Currently I am performing two operations, loading the data into memory and then slicing the dataframe, and I was hoping it would be possible to do this in one operation.

  • nmadhu20
    nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron
    edited July 17

    To my knowledge, you can use the iter_rows() method of the Dataset object to iterate over your rows without loading the entire dataframe into memory, so you would not need get_dataframe(). You can then apply the required filter, although this means the processing happens at the row level.

    iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

    from dataiku import Dataset

    mydataset = Dataset("myname")

    for row in mydataset.iter_rows():
        # apply the required filter on each row
        ...

    This returns each row as a dictionary. More info in this link - https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html
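    As a concrete sketch of that pattern, you can collect just the rows that pass the filter and build a small dataframe from those matches only. The row iterator below is simulated with plain dicts so the snippet runs anywhere (in DSS you would use mydataset.iter_rows() instead), and the "status" column and "active" value are made-up placeholders:

```python
import pandas as pd

def iter_rows():
    # Stand-in for mydataset.iter_rows(): yields each row as a
    # dictionary, as described above.
    for i in range(1000):
        yield {"id": i, "status": "active" if i % 200 == 0 else "inactive"}

# Collect only the rows that satisfy the condition, then build
# a DataFrame from just those matches.
wanted = [row for row in iter_rows() if row["status"] == "active"]
df = pd.DataFrame(wanted)
print(len(df))  # only 5 of the 1,000 rows are kept
```

    The trade-off versus iter_dataframes() is that the filtering is pure Python and therefore slower per row, but the memory footprint is only the matching rows.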

    Hope this helps!

    Madhuleena
