How to read a sample of a very big file using a Python code recipe

Marella
Level 1

Hi all,

I have quite a huge file (> 8 GB, on a VM with only 8 GB of memory) that I would like to read in a DSS Python code recipe. Of course, the whole file will not fit in memory. Could I read only a sample of the file instead, or is this impossible if the file is bigger than the memory of my VM?

Looking forward to your answer!
Marella

4 Replies
duphan
Dataiker

Hello, 

When reading a file into a pandas dataframe with the dataiku API, you basically have 2 options to deal with data larger than RAM:

* Sampling: you can load just the first n rows of your dataset, for example: https://doc.dataiku.com/dss/latest/python-api/datasets.html#sampling

* Chunked reading: you can iteratively read n rows of your dataset at a time, process them, then continue to the next chunk (see the sketch after this list): https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
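
A minimal sketch of both options, assuming a hypothetical input dataset named "my_big_dataset" (adjust the name, sample size, and chunk size to your case):

```python
import dataiku

dataset = dataiku.Dataset("my_big_dataset")  # hypothetical dataset name

# Option 1: sampling -- read only the first 100,000 rows into memory
sample_df = dataset.get_dataframe(sampling="head", limit=100000)

# Option 2: chunked reading -- iterate 10,000 rows at a time,
# so only one chunk is held in memory at any moment
for chunk_df in dataset.iter_dataframes(chunksize=10000):
    # process each chunk here
    print(chunk_df.shape)
```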

Cheers, 

Du

Marella
Level 1
Author

Thanks a lot, duphan!
I suppose these two options are not possible if I do not use pandas, right?

duphan
Dataiker

Hi, 

These options (and the dataiku API itself) are indeed based on the pandas package. I'm not sure what you have in mind otherwise?

Alex_Combessie
Dataiker Alumni

Hi,

If your file can be read as a DSS Dataset, then you can use chunked reading/writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas

With a small enough chunk size or row-by-row processing, you should be able to overcome your memory limitation for reading.
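
For example, here is a rough chunk-by-chunk sketch; the dataset names "my_big_dataset" and "my_output_dataset" are placeholders, and the output dataset's schema is assumed to be already defined:

```python
import dataiku

input_dataset = dataiku.Dataset("my_big_dataset")
output_dataset = dataiku.Dataset("my_output_dataset")

with output_dataset.get_writer() as writer:
    for chunk_df in input_dataset.iter_dataframes(chunksize=5000):
        # apply only per-chunk transformations here
        processed = chunk_df.dropna()
        # append the processed chunk to the output dataset
        writer.write_dataframe(processed)
```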

However, please note that if you have any operation requiring a full-sample computation, you will hit the memory limitation again.

Cheers,

Alex
