How to use gzip, bz2, lzma compression while writing pickled file in DSS managed folder?

tg_ca
tg_ca Registered Posts: 3 ✭✭✭✭
edited July 16 in Using Dataiku

I am currently trying to write a large (>3.7 GB) pickled file to a DSS managed folder from within a Jupyter notebook. However, because of the file writer API's connection time-out constraints, Dataiku fails to write files to the managed folder once they exceed a certain size. To work around that, I want to compress the file as it is written, but I am not sure what the appropriate syntax is when working with the managed folder's file writer API.

For the normal pickle dump, I am using the following code:

import dataiku, pickle

dir_api = dataiku.Folder('foldername')  # handle to the managed folder
with dir_api.get_writer('filename') as w:
    pickle.dump(df, w)

If, instead of writing to the DSS managed folder through the API, I were writing to the local file system, I would use compression like this:

import gzip, pickle

with gzip.open("filename.gz", "wb") as f:
    pickle.dump(df, f)

How do I accomplish such compression with the Dataiku file writer API for the managed folder?

Thanks

Answers

  • tgb417
    tgb417 Posts: 1,598 Neuron

    @tg_ca

    Although I've not worked with a single dataset in the 3 GB size range, I'm wondering:

    Are you using the Community (Free) version or a paid version of DSS?

    If you are using the paid version, you might take a look at partitioning a dataset. I'm not certain that will resolve your problem, but my sense is that one of the reasons to partition a dataset in Dataiku DSS is to handle larger datasets.

    Just my $0.02. Hope it helps.

  • tg_ca
    tg_ca Registered Posts: 3 ✭✭✭✭

    @tgb417

    Thanks for your feedback. For more context, we are using Enterprise DSS 5.x. The dataset contains complex objects (multi-dimensional arrays), so we are trying to serialize it to a compressed file.

  • tgb417
    tgb417 Posts: 1,598 Neuron

    @tg_ca

    Thanks for sharing more details. That's really helpful for figuring out how we can help.

    If you have paid support, I'd definitely reach out to the Dataiku support team. They are a great bunch of folks who, in my experience, are willing to dig into even really hard problems.

    That said, given the age of your instance (v5.x), I would not be surprised if they suggest an upgrade. There have been lots of changes since v5.

    Let us know how it goes for you.

    cc: @ATsao

  • tg_ca
    tg_ca Registered Posts: 3 ✭✭✭✭

    cc: @tgb417

    Turns out the solution was fairly straightforward; all we needed to do was pass the API writer handle to the compressor:

    import bz2, pickle

    with dir_api.get_writer(fname) as w:
        with bz2.BZ2File(w, 'wb') as f:  # wrap the managed-folder writer in a bz2 stream
            pickle.dump(df, f)
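
    The same wrapping pattern should work for the other compressors in the title. Here is a minimal sketch, assuming the same dir_api folder handle, fname, and df as above (the '.gz'/'.xz' suffixes are just for illustration); gzip.GzipFile takes the writer through its fileobj argument, while lzma.LZMAFile accepts it directly:

    import gzip, lzma, pickle

    # gzip: wrap the managed-folder writer as the underlying file object
    with dir_api.get_writer(fname + '.gz') as w:
        with gzip.GzipFile(fileobj=w, mode='wb') as f:
            pickle.dump(df, f)

    # lzma (xz): LZMAFile also accepts an existing file-like object
    with dir_api.get_writer(fname + '.xz') as w:
        with lzma.LZMAFile(w, 'wb') as f:
            pickle.dump(df, f)

    Reading should be symmetric: wrap the folder's download stream instead of the writer (again a sketch, not verified on DSS 5.x):

    # read a compressed pickle back from the managed folder
    with dir_api.get_download_stream(fname + '.xz') as r:
        with lzma.LZMAFile(r, 'rb') as f:
            df = pickle.load(f)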
