Solutions for faster reading/writing of tabular datasets

Marek
Marek Registered Posts: 17 ✭✭✭✭

Hi,

I have noticed that it takes several minutes to read a dataset of a few hundred MB stored as a .csv.gz file in a regular managed filesystem. This is very inefficient for iterative dataset reading and writing. Is it because of Gzip, so that each dkuReadDataset() call has to extract the file contents first?

I would appreciate any hints on how to read, write, or connect to tabular datasets significantly faster.

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker

    Hi,

    dkuReadDataset doesn't use fread underneath, even though fread is the fastest method for reading CSV files. We made this choice because we found that fread tends to sacrifice compatibility and safety for the sake of speed.

    What you may want to try is to put your csv.gz file in a managed folder instead of a dataset, and then use our managed folder API to read the file directly from R, without going through the Dataiku compatibility layer.

    Please see https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#usage-in-r for more details
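
As a rough illustration of reading the file directly, here is a minimal sketch in base R. The helper name `read_csv_gz`, the folder id and the file name are hypothetical. The commented-out call to `dkuManagedFolderPath` is an assumption about the managed folder API that should be checked against the documentation linked above, and it would presumably only apply to folders stored on the local filesystem.

```r
# Read a gzip-compressed CSV directly from disk with base R.
# read.csv() opens the unopened gzfile() connection and closes it when done.
read_csv_gz <- function(path) {
  read.csv(gzfile(path), stringsAsFactors = FALSE)
}

# Stand-alone demonstration on a temporary file, so the sketch runs anywhere.
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
write.csv(data.frame(id = 1:3, value = c("a", "b", "c")), con, row.names = FALSE)
close(con)

df <- read_csv_gz(tmp)

# Inside DSS, the path would come from the managed folder API instead.
# Assumption, to be verified against the linked documentation:
# folder_path <- dkuManagedFolderPath("my_folder_id")            # hypothetical id
# df <- read_csv_gz(file.path(folder_path, "my_data.csv.gz"))   # hypothetical file
```

If fread's trade-offs are acceptable for the data at hand, the body of `read_csv_gz` could be swapped for a call to `data.table::fread`, which in recent versions can read .gz files when the R.utils package is installed.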
