I am trying to write a large (>3.7 GB) pickled file to a DSS managed folder from within a Jupyter notebook. However, because of the connection time-out constraints of the file writer API, Dataiku fails to write files to the managed folder once they exceed a certain size. To get around that, I would like to compress the file as it is written, but I am not sure of the appropriate syntax when working with the managed folder's file writer API.
For the normal pickle dump, I am using the following code:
dir_api = dataiku.Folder('foldername')
with dir_api.get_writer('filename') as w:
    pickle.dump(df, w)
If I were writing to the local file system instead of to the DSS managed folder through the API, I would have compressed the file like this:
with gzip.open("filename.gz", "wb") as f:
    pickle.dump(df, f)
How do I accomplish the same compression with the Dataiku file writer API for the managed folder?
Although I've not worked with a single dataset in the 3 GB size range, I'm wondering: are you using the Community (Free) version or a paid version of DSS?
If you are using the paid version, you might take a look at Partitioning a Dataset. I'm not sure whether that will resolve your problem. However, my sense is that one of the reasons to partition a dataset in Dataiku DSS is to handle larger datasets.
Just my $0.02. Hope it helps.
Thanks for sharing more details; they make it much easier for us to help where we can.
If you have paid support, I'd definitely reach out to the Dataiku support team. They are a great bunch of folks and, in my experience, willing to dig into even really hard problems.
That said, given the age of your instance (V5.x), I would not be surprised if they suggest an upgrade. There have been lots of changes since V5.
Let us know how it goes for you.
Turns out the solution was fairly straightforward; all we need to do is pass the writer handle from the API to the compressor:
with dir_api.get_writer(fname) as w:
    with bz2.BZ2File(w, 'wb') as f:
        pickle.dump(df, f)
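For anyone landing here later, the same idea can be sketched as a small pair of helpers. This is only a sketch under some assumptions: `dump_compressed` and `load_compressed` are hypothetical names (not part of the Dataiku API), `io.BytesIO` stands in for the managed folder writer so the snippet runs outside DSS, and only the bz2 path was confirmed above. The gzip variant relies on `gzip.GzipFile(fileobj=...)` from the standard library and has not been checked against the DSS writer.

```python
import bz2
import gzip
import io
import pickle


def dump_compressed(obj, writer, codec="bz2"):
    """Pickle obj into an already-open binary writer, compressing on the fly.

    Hypothetical helper: `writer` is any object with a write() method.
    """
    if codec == "bz2":
        compressor = bz2.BZ2File(writer, "wb")
    elif codec == "gzip":
        compressor = gzip.GzipFile(fileobj=writer, mode="wb")
    else:
        raise ValueError("unsupported codec: %r" % (codec,))
    # Leaving the with-block flushes the compressor's trailer into `writer`
    # but leaves `writer` itself open for the caller to close.
    with compressor as f:
        pickle.dump(obj, f)


def load_compressed(reader, codec="bz2"):
    """Inverse of dump_compressed for any readable binary stream."""
    if codec == "bz2":
        decompressor = bz2.BZ2File(reader, "rb")
    elif codec == "gzip":
        decompressor = gzip.GzipFile(fileobj=reader, mode="rb")
    else:
        raise ValueError("unsupported codec: %r" % (codec,))
    with decompressor as f:
        return pickle.load(f)


# Stand-in for the managed folder writer (assumption: it behaves like a
# writable binary file object, as the bz2 solution above suggests).
buffer = io.BytesIO()
dump_compressed({"rows": list(range(1000))}, buffer, codec="gzip")
buffer.seek(0)
restored = load_compressed(buffer, codec="gzip")
```

Inside DSS the call would presumably look like `with dir_api.get_writer(fname) as w: dump_compressed(df, w)`. Compression shrinks the upload but does not remove the time-out itself, so a payload that compresses poorly could still fail.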