How to read a sample of a very big file using a Python code recipe
Hi all,
I have a quite huge file (> 8 GB, on a VM with only 8 GB of memory) that I would like to read within a DSS Python code recipe. Of course, the whole file will not fit in memory. Could I instead read only a sample of the file, or is this impossible if the file is bigger than the memory of my VM?
Looking forward to your answer!
Marella
Answers
-
Hello,
When reading a file into a pandas dataframe with the Dataiku API, you basically have two options to deal with data larger than RAM (both are sketched in the example below):
* Sampling: you can just load the first n rows of your dataset, for example: https://doc.dataiku.com/dss/latest/python-api/datasets.html#sampling
* Chunked reading: you can iteratively read just n rows of your dataset, process them, then continue with the next chunk: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
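Here is a minimal sketch of both approaches, assuming a DSS dataset named "mydataset" (the dataset name is hypothetical; adapt it to your flow):

```python
import dataiku

dataset = dataiku.Dataset("mydataset")  # hypothetical dataset name

# Option 1: sampling - load only the first 100,000 rows into memory
sample_df = dataset.get_dataframe(sampling='head', limit=100000)

# Option 2: chunked reading - iterate over the dataset 10,000 rows at a time
for chunk_df in dataset.iter_dataframes(chunksize=10000):
    # each chunk_df is a regular pandas dataframe, so process it here
    print(chunk_df.shape)
```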
Cheers,
Du
-
Hi,
If your file can be read as a DSS Dataset, then you can use chunked reading/writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
With a small enough chunk size or row-by-row processing, you should be able to overcome your memory limitation for reading.
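As a minimal sketch of the chunked read/write pattern from that documentation page, assuming hypothetical input and output datasets named "input_dataset" and "output_dataset":

```python
import dataiku

# Hypothetical dataset names; adapt to your flow
input_dataset = dataiku.Dataset("input_dataset")
output_dataset = dataiku.Dataset("output_dataset")

# Stream the data chunk by chunk so that only one chunk is held in memory
# at a time. The output schema should be defined before writing, e.g. with
# output_dataset.write_schema_from_dataframe(first_chunk_df).
with output_dataset.get_writer() as writer:
    for chunk_df in input_dataset.iter_dataframes(chunksize=10000):
        processed_df = chunk_df  # replace with your per-chunk processing
        writer.write_dataframe(processed_df)
```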
However, please note that if you have any operation requiring full-sample computation, you will hit the memory limitation again.
Cheers,
Alex
-
Thanks a lot, duphan!
I suppose these two options are not possible if I do not use pandas, right?
-
Hi,
These options (and the Dataiku API per se) are indeed based on the pandas package. I'm not sure what you have in mind otherwise?