How to use gzip, bz2, lzma compression while writing pickled file in DSS managed folder?

tg_ca
tg_ca Registered Posts: 3 ✭✭✭✭
edited July 16 in Using Dataiku

I am currently trying to write a large (>3.7 GB) pickled file to a DSS managed folder from within a Jupyter notebook. However, because of the file writer API's connection time-out constraints, Dataiku fails to write files to the managed folder once they exceed a certain size. To work around that, I want to compress the file as it is written, but I am not sure what the appropriate syntax is when working with the managed folder's file writer API.

For the normal pickle dump, I am using the following code:

import dataiku, pickle

dir_api = dataiku.Folder('foldername')  # handle to the managed folder
with dir_api.get_writer('filename') as w:
    pickle.dump(df, w)

If, instead of writing to the DSS managed folder through the API, I were writing to the local file system, I would use compression like this:

import gzip, pickle

with gzip.open("filename.gz", "wb") as f:
    pickle.dump(df, f)

How do I accomplish such compression with the Dataiku file writer API for the managed folder?

Thanks

Answers

  • tgb417
    tgb417 Posts: 1,598 Neuron

    @tg_ca

    Although I've not worked with a single dataset in the 3 GB size range, I'm wondering:

    Are you using the Community (Free) version or a paid version of DSS?

    If you are using the paid version, you might take a look at partitioning a dataset. I'm not certain that will resolve your problem, but my sense is that one of the reasons to partition a dataset in Dataiku DSS is to handle larger datasets.

    Just my $0.02. Hope it helps.

  • tg_ca
    tg_ca Registered Posts: 3 ✭✭✭✭

    @tgb417

    Thanks for your feedback. For more context, we are using Enterprise DSS 5.x. The dataset contains complex objects (multi-dimensional arrays), so we are trying to serialize it to a compressed file.

  • tgb417
    tgb417 Posts: 1,598 Neuron

    @tg_ca

    Thanks for sharing more details. That's really helpful for figuring out how we can help.

    If you have paid support, I'd definitely reach out to the Dataiku support team. They are a great bunch of folks who, in my experience, are willing to dig into even really hard problems.

    That said, given the age of your instance (v5.x), I would not be surprised if they suggest an upgrade. There have been lots of changes since v5.

    Let us know how it goes for you.

    cc: @ATsao

  • tg_ca
    tg_ca Registered Posts: 3 ✭✭✭✭

    cc: @tgb417

    Turns out the solution was fairly straightforward; all we needed to do was pass the API writer handle to the compressor:

    import bz2, pickle

    with dir_api.get_writer(fname) as w:
        with bz2.BZ2File(w, 'wb') as f:  # wrap the managed-folder writer in a bz2 stream
            pickle.dump(df, f)
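
    The same wrapping pattern should work for the other compressors in the title. Here is a minimal sketch, assuming the same dir_api folder handle, fname, and df as above (the '.gz'/'.xz' suffixes are just for illustration); gzip.GzipFile takes the writer through its fileobj argument, while lzma.LZMAFile accepts it directly:

    import gzip, lzma, pickle

    # gzip: wrap the managed-folder writer as the underlying file object
    with dir_api.get_writer(fname + '.gz') as w:
        with gzip.GzipFile(fileobj=w, mode='wb') as f:
            pickle.dump(df, f)

    # lzma (xz): LZMAFile also accepts an existing file-like object
    with dir_api.get_writer(fname + '.xz') as w:
        with lzma.LZMAFile(w, 'wb') as f:
            pickle.dump(df, f)

    Reading should be symmetric: wrap the folder's download stream instead of the writer (again a sketch, not verified on DSS 5.x):

    # read a compressed pickle back from the managed folder
    with dir_api.get_download_stream(fname + '.xz') as r:
        with lzma.LZMAFile(r, 'rb') as f:
            df = pickle.load(f)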
