Solutions for faster reading/writing of tabular datasets

Marek · January 2020

Hi,

I have noticed that it takes several minutes to read a dataset of a few hundred MB stored as a .csv.gz file in a regular managed filesystem. This is very inefficient for iterative dataset reading/writing. Is it because of Gzip, which has to extract the file contents first on each dkuReadDataset() call?

I would appreciate any hints on how to read, write, or connect to tabular datasets significantly faster.

Clément_Stenac · January 2020

Hi,

dkuReadDataset doesn't use fread underneath, even though fread is the fastest method for reading CSV files. We made this choice because we found that fread tends to sacrifice compatibility and safety for the sake of speed.

What you may want to try is to put your csv.gz file in a managed folder instead of a dataset, and then use our managed folder API to read the file directly from R without going through the Dataiku compatibility layer.

Please see https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#usage-in-r for more details
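As a rough illustration of the managed folder approach, here is a minimal sketch. The folder name "my_folder" and the file name "dataset.csv.gz" are hypothetical placeholders. It assumes the managed folder is stored on the local filesystem of the DSS server, so that dkuManagedFolderPath can return a usable path, and that a recent data.table is installed together with R.utils, which fread needs to read .gz files directly. Please check the documentation linked above for the exact API available in your DSS version.

```r
library(dataiku)     # Dataiku R package, available inside DSS
library(data.table)  # provides fread

# Hypothetical names: replace with your own managed folder and file.
folder_name <- "my_folder"
file_name <- "dataset.csv.gz"

# Resolve the folder's location on disk. This assumes the managed folder
# lives on the local filesystem of the DSS server.
folder_path <- dkuManagedFolderPath(folder_name)
file_path <- file.path(folder_path, file_name)

# Recent data.table versions can read .gz files directly when the
# R.utils package is installed.
dt <- fread(file_path)

# Quick look at the column types fread inferred.
str(dt)
```

Since fread guesses column types on its own, it is worth checking the result of str(dt) against the schema you expect before relying on it in an iterative workflow.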
