How to read a sample of a very big file using a Python code recipe
Hi all,
I have a quite huge file (> 8 GB, on a VM with only 8 GB of memory) that I would like to read within a DSS Python code recipe. Of course, the whole file will not fit in memory. Could I instead read only a sample of the file, or is this impossible if the file is bigger than the memory of my VM?
Looking forward to your answer!
Marella
Answers
-
Hello,
When reading a file into a pandas dataframe with the Dataiku API, you basically have two options to deal with data larger than RAM (both are sketched in the example below):
* Sampling: you can just load the first n rows of your dataset, for example: https://doc.dataiku.com/dss/latest/python-api/datasets.html#sampling
* Chunked reading: you can iteratively read just n rows of your dataset, process them, then continue with the next chunk: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
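Here is a minimal sketch of both approaches, assuming a DSS dataset named "mydataset" (the dataset name is hypothetical; adapt it to your flow):

```python
import dataiku

dataset = dataiku.Dataset("mydataset")  # hypothetical dataset name

# Option 1: sampling - load only the first 100,000 rows into memory
sample_df = dataset.get_dataframe(sampling='head', limit=100000)

# Option 2: chunked reading - iterate over the dataset 10,000 rows at a time
for chunk_df in dataset.iter_dataframes(chunksize=10000):
    # each chunk_df is a regular pandas dataframe, so process it here
    print(chunk_df.shape)
```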
Cheers,
Du
-
Hi,
If your file can be read as a DSS Dataset, then you can use chunked reading/writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
With a small enough chunk size or row-by-row processing, you should be able to overcome your memory limitation for reading.
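As a minimal sketch of the chunked read/write pattern from that documentation page, assuming hypothetical input and output datasets named "input_dataset" and "output_dataset":

```python
import dataiku

# Hypothetical dataset names; adapt to your flow
input_dataset = dataiku.Dataset("input_dataset")
output_dataset = dataiku.Dataset("output_dataset")

# Stream the data chunk by chunk so that only one chunk is held in memory
# at a time. The output schema should be defined before writing, e.g. with
# output_dataset.write_schema_from_dataframe(first_chunk_df).
with output_dataset.get_writer() as writer:
    for chunk_df in input_dataset.iter_dataframes(chunksize=10000):
        processed_df = chunk_df  # replace with your per-chunk processing
        writer.write_dataframe(processed_df)
```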
However, please note that if you have any operation requiring full-sample computation, you will hit the memory limitation again.
Cheers,
Alex
-
Thanks a lot, duphan!
I suppose these two options are not possible if I do not use pandas, right?
-
Hi,
These options (and the Dataiku API per se) are indeed based on the pandas package. I'm not sure what you have in mind otherwise?