Hi all,
I have quite a huge file (> 8 GB, on a VM with only 8 GB of memory) that I would like to read within a DSS Python code recipe. Of course, the whole file will not fit in memory. Could I instead read only a sample of the file, or is this impossible if the file as a whole is bigger than the memory of my VM?
Looking forward to your answer!
Marella
Hello,
When reading a file into a pandas dataframe with the Dataiku API, you basically have two options for dealing with data larger than RAM:
* Sampling: you can load just the first n rows of your dataset (see the sketch below), for example: https://doc.dataiku.com/dss/latest/python-api/datasets.html#sampling
* Chunked reading: you can iteratively read just n rows of your dataset, process them, then continue with the next chunk: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
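For illustration, here is a minimal sketch of both options inside a Python recipe. The dataset name "my_huge_dataset" and the row counts are placeholders, not something from your project, and the exact parameters may need adjusting to your DSS version:

```python
import dataiku

# Placeholder dataset name -- replace with your own input dataset
dataset = dataiku.Dataset("my_huge_dataset")

# Option 1: sampling -- load only the first 100,000 rows into memory
sample_df = dataset.get_dataframe(sampling="head", limit=100000)
print(sample_df.shape)

# Option 2: chunked reading -- iterate over the dataset in chunks of 50,000 rows,
# so only one chunk is held in memory at a time
for chunk_df in dataset.iter_dataframes(chunksize=50000):
    # process chunk_df here
    print(len(chunk_df))
```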
Cheers,
Du
Thanks a lot, duphan!
I suppose these two options are not possible if I do not use pandas, right?
Hi,
These options (and the Dataiku API itself) are indeed based on the pandas package. I'm not sure what you have in mind otherwise?
Hi,
If your file can be read as a DSS Dataset, then you can use chunked reading/writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
With a small enough chunk size or row-by-row processing, you should be able to overcome your memory limitation for reading.
However, please note that if you have any operation requiring full-sample computation, you will hit the memory limitation again.
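As an illustration, here is a minimal chunked read/process/write sketch for a recipe; the dataset names "input_dataset" and "output_dataset" and the chunk size are placeholder assumptions:

```python
import dataiku

input_ds = dataiku.Dataset("input_dataset")    # placeholder name
output_ds = dataiku.Dataset("output_dataset")  # placeholder name

# Stream the input in chunks and write each processed chunk to the output,
# so memory usage stays bounded by the chunk size
with output_ds.get_writer() as writer:
    for chunk_df in input_ds.iter_dataframes(chunksize=50000):
        # Per-chunk processing goes here; it must not require the full dataset
        processed_df = chunk_df
        writer.write_dataframe(processed_df)
```

Depending on your flow, you may need to set the output schema before writing (for example with write_schema_from_dataframe on the first chunk); see the linked documentation for the exact pattern.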
Cheers,
Alex