
How to use gzip, bz2, lzma compression while writing pickled file in DSS managed folder?

tg_ca
Level 2

I am trying to write a large (>3.7 GB) pickled file to a DSS managed folder from within a Jupyter notebook. However, because of the connection time-out configured for the file writer API, Dataiku fails to write files to the managed folder once they exceed a certain size. To get around that, I would like to compress the file as it is written, but I am not sure what the appropriate syntax is when working with the managed folder's file writer API.

For the normal pickle dump, I am using the following code:

import dataiku
import pickle

dir_api = dataiku.Folder('foldername')
with dir_api.get_writer('filename') as w:
    pickle.dump(df, w)

If I were writing to the local file system instead of to the DSS managed folder through the API, I would compress the file like this:

import gzip
import pickle

with gzip.open("filename.gz", "wb") as f:
    pickle.dump(df, f)

How do I get the same compression with the Dataiku file writer API for the managed folder?

Thanks

tgb417
Neuron

@tg_ca 

Although I've not worked with a single dataset in the 3 GB size range, I'm wondering: are you using the Community (Free) version or a paid version of DSS?

If you are using the paid version, you might take a look at Partitioning a Dataset. I'm not sure whether that will resolve your problem. However, it is my sense that one of the reasons to partition a dataset in Dataiku DSS is to handle larger datasets.

Just my $0.02.  Hope it helps.

--Tom
tg_ca
Level 2
Author

@tgb417 

Thanks for your feedback. For more context, we are using Enterprise DSS 5.x. The dataset contains complex objects (multi-dimensional arrays), so we are trying to serialize it to a compressed file.

tgb417
Neuron

@tg_ca 

Thanks for sharing more details. They really help when trying to work out what might be going on.

If you have paid support, I'd definitely reach out to the Dataiku support team. They are a great bunch of folks and, in my experience, willing to dig into even really hard problems.

That said, given the age of your instance (V5.x), I would not be surprised if they suggest an upgrade. There have been lots of changes since V5.

Let us know how it goes for you.

cc: @ATsao 

--Tom
tg_ca
Level 2
Author

cc: @tgb417 

Turns out the solution was fairly straightforward: all we need to do is pass the writer returned by the managed folder API to the compressor:

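import bz2
import pickle

# 'file' is the object being pickled; the bz2 wrapper compresses into the folder writer w
with dir_api.get_writer(fname) as w:
    with bz2.BZ2File(w, 'wb') as f:
        pickle.dump(file, f)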
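
For anyone wanting gzip or lzma instead of bz2, the same wrapping pattern should carry over. Below is a minimal sketch rather than a tested DSS recipe: `dump_compressed` is a hypothetical helper name, and an in-memory `io.BytesIO` stands in for the managed folder writer so the snippet runs outside DSS. It assumes the object returned by `get_writer` behaves like a writable binary file object, which the bz2 example above suggests but which I have not verified for gzip or lzma.

```python
import bz2
import gzip
import io
import lzma
import pickle


def dump_compressed(obj, raw_writer, codec="gzip"):
    """Pickle obj into raw_writer through a compressing wrapper.

    raw_writer is any writable binary file object. In DSS this would be
    the writer from dir_api.get_writer(...) (assumption: it exposes write()).
    """
    if codec == "gzip":
        # GzipFile takes an existing file object through the fileobj argument
        wrapper = gzip.GzipFile(fileobj=raw_writer, mode="wb")
    elif codec == "bz2":
        wrapper = bz2.BZ2File(raw_writer, "wb")
    elif codec == "lzma":
        wrapper = lzma.LZMAFile(raw_writer, "wb")
    else:
        raise ValueError("unsupported codec: %r" % (codec,))
    # Closing the wrapper flushes the compressed stream but leaves
    # raw_writer open, so the caller stays responsible for closing it.
    with wrapper as f:
        pickle.dump(obj, f)


# Stand-in demo: BytesIO plays the role of the managed folder writer.
sample = {"values": [1, 2, 3]}
buffer = io.BytesIO()
dump_compressed(sample, buffer, codec="gzip")
restored = pickle.loads(gzip.decompress(buffer.getvalue()))
```

Inside DSS, the call would go in the body of `with dir_api.get_writer(fname) as w:`, passing `w` as `raw_writer`.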
