I have a big Dataiku dataset in the flow, which I read into a pandas dataframe using the get_dataframe() method. The problem is that I have to read 4 such datasets into memory, which slows down the processing. I don't need the entire dataset; only a subset has to be read into the pandas dataframe. Is there a way I can do this?
Currently, I am reading the entire data into pandas and then slicing the pandas dataframe and I would like to avoid doing this.
Hi @SuhailS7 ,
Yes, you can read the dataset in chunks of a fixed number of rows instead of loading it all at once, using the code below.
from dataiku import Dataset
mydataset = Dataset("myname")
for df in mydataset.iter_dataframes(chunksize=10000):
    # df is a dataframe of at most 10K rows.
    pass  # process or filter df here
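Building on the chunked read above, here is a minimal sketch of how each chunk could be filtered as it arrives, so that only the matching rows are kept in memory. The helper filter_in_chunks, the column name "status" and the value "error" are hypothetical names chosen for illustration; the demo chunks stand in for what iter_dataframes() would yield, so the sketch also runs outside DSS.

```python
import pandas as pd


def filter_in_chunks(chunks, column, value):
    """Concatenate only the rows of each chunk where `column` equals `value`."""
    matches = [chunk[chunk[column] == value] for chunk in chunks]
    if not matches:
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)


# Inside DSS the chunks would come from the dataset (names are hypothetical):
#   from dataiku import Dataset
#   chunks = Dataset("myname").iter_dataframes(chunksize=10000)
#   subset = filter_in_chunks(chunks, "status", "error")

# Stand-in chunks so the sketch runs outside DSS.
demo_chunks = [
    pd.DataFrame({"id": [1, 2], "status": ["ok", "error"]}),
    pd.DataFrame({"id": [3, 4], "status": ["error", "ok"]}),
]
subset = filter_in_chunks(demo_chunks, "status", "error")
```

The whole dataset is still streamed once, but at any moment only one chunk plus the rows kept so far are held in memory.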
You can refer to the following doc for more info - https://doc.dataiku.com/dss/latest/python-api/datasets-data.html
Thanks @nmadhu20, but what if I need only specific rows for my processing? For example, I need only the rows that contain a specific value. If the Dataiku dataset has 1,000,000 records, there may be only 500 rows that satisfy the condition I am looking for, and I don't want to read all 1,000,000 rows into a pandas dataframe for this purpose.
Currently I am performing two operations: loading the data into memory and then slicing the dataframe. I was hoping it would be possible to do this in one operation.
To my knowledge, you can use the iter_rows() method of the Dataset object to iterate over the rows without loading the entire dataset into memory, so you would not need get_dataframe(). You can then apply the required filter, but the processing has to happen row by row.
iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)
from dataiku import Dataset
mydataset = Dataset("myname")
for row in mydataset.iter_rows():
    # required filter on a row basis
    pass
This yields each row as a dictionary-like object. More info in this link - https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html
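As a rough sketch of the row-level filter described above, the generator below keeps only the rows whose column matches a value and copies the wanted fields into plain dicts. The helper select_rows and the column names "id" and "status" are hypothetical; it only assumes that each row can be indexed by column name, which is why plain dicts can stand in for the rows here.

```python
def select_rows(rows, column, value, keep_columns):
    """Yield a plain dict of `keep_columns` for each row where `column` equals `value`."""
    for row in rows:
        if row[column] == value:
            yield {name: row[name] for name in keep_columns}


# Inside DSS the rows would come from the dataset (names are hypothetical):
#   from dataiku import Dataset
#   rows = Dataset("myname").iter_rows()
#   selected = list(select_rows(rows, "status", "error", ["id", "status"]))

# Stand-in rows so the sketch runs outside DSS.
demo_rows = [
    {"id": "1", "status": "ok", "note": "a"},
    {"id": "2", "status": "error", "note": "b"},
    {"id": "3", "status": "error", "note": "c"},
]
selected = list(select_rows(demo_rows, "status", "error", ["id", "status"]))
```

The resulting list of dicts can be passed to pandas.DataFrame(selected) if a dataframe is needed afterwards. This still scans every row of the dataset, but only the matching rows are ever held in memory.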
Hope this helps!