I have quite a huge file (> 8 GB, on a VM with only 8 GB of memory) that I would like to read within a DSS Python code recipe. Of course, the whole file will not fit in memory. Could I instead read only a sample of the file, or is this impossible when the file is bigger than the memory of my VM?
Looking forward to your answer!
When reading a file into a pandas dataframe with the Dataiku API, you basically have two options for dealing with data larger than RAM:
* Sampling: you can load just the first n rows of your dataset, for example: https://doc.dataiku.com/dss/latest/python-api/datasets.html#sampling
* Chunked reading: you can iteratively read n rows of your dataset at a time, process them, then move on to the next chunk: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
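Both options can be illustrated with plain pandas (the Dataiku API wraps similar functionality, see the links above). This is only a sketch: the in-memory CSV, column names, and chunk size are illustrative assumptions, not part of the original thread.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the large file (illustrative only).
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# Option 1: sampling -- read only the first n rows, ignore the rest.
sample_df = pd.read_csv(csv_data, nrows=2)
print(len(sample_df))  # 2

# Option 2: chunked reading -- iterate over the file n rows at a time,
# so only one chunk is ever held in memory.
csv_data.seek(0)
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["a"].sum()  # process each chunk, then discard it
print(total)
```

With the Dataiku API the equivalents are the sampling arguments of `Dataset.get_dataframe` and the `Dataset.iter_dataframes` iterator documented at the links above.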
If your file can be read as a DSS Dataset, then you can use chunked reading/writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
With a small enough chunk size, or even row-by-row processing, you should be able to overcome your memory limitation for reading.
However, please note that if any operation requires a full-sample computation, you will hit the memory limitation again.
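To make that distinction concrete: some aggregates (sum, count, mean) can be accumulated chunk by chunk, while others (median, full sort, deduplication across the whole file) inherently need all rows at once. A minimal sketch of a streaming mean, using plain pandas and illustrative data:

```python
import io
import pandas as pd

# Illustrative stand-in for a file too large to load at once.
csv_data = io.StringIO("value\n10\n20\n30\n40\n")

# A global mean only needs a running sum and count, so it fits
# in constant memory regardless of file size.
running_sum, running_count = 0, 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    running_sum += chunk["value"].sum()
    running_count += len(chunk)

mean = running_sum / running_count
print(mean)  # 25.0
```

A median, by contrast, cannot be computed this way from per-chunk medians; that is the kind of full-sample operation where the memory limit bites again.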