How to append data to a dataset partition in Python
I'm processing a flow of data which I dispatch into partitions. Some of my Python code runs from scenarios so that I can properly switch between reading and writing partitions on the fly.
The data is stored on Azure Blob Storage in a CSV-like format.
When I have to write additional data to a partition, I can't find a way to do it as an append by simply adding a file to that partition.
For example, I'm also running a continuous Kafka sync recipe, which does exactly what I want: when I list its partitions, I can see that each sync simply adds new files alongside the existing ones.
By contrast, my Python script in a scenario only generates one file each time, so I have to reload all the data from the partition into memory and rewrite everything together with the additional data.
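To illustrate, the current workaround looks roughly like this (a rough sketch; the dataset name, partition value and columns are placeholders):

    import dataiku
    import pandas as pd

    # Current workaround (sketch): reload the whole partition into memory
    # and rewrite it just to add a handful of new rows.
    dataset = dataiku.Dataset("my_partitioned_dataset")
    dataset.add_read_partitions("2024-01-15")       # partition to reload
    existing = dataset.get_dataframe()              # loads everything into memory

    new_rows = pd.DataFrame({"id": [101], "value": ["extra"]})  # data to append

    dataset.set_write_partition("2024-01-15")
    dataset.write_with_schema(pd.concat([existing, new_rows]))  # full rewrite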
Since I'm switching partitions, I cannot use a Python script in a recipe and simply click the "Append" option.
I just want a simple way to tell a writer to put data into a specific new file at a specific partition location. Why is that so difficult?
And no, pandas-based answers are not valid since they are too time-consuming.
Any help?
Answers
Alexandru (Dataiker)
Hi @Alka,
What type of dataset are you writing to: local filesystem or cloud storage?
You could create a managed folder pointing to that dataset's path, use the folder as the output, and construct the file paths in your code.
That way, you can use get_writer and specify the path (your actual partition location in this case).
https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Writing-into-Managed-Folder/td-p/23807
The folder can also be partitioned if you need to use it downstream, or you can add a "Files in folder" dataset on top of the managed folder.
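For example, something along these lines should work (a minimal sketch, assuming a managed folder named "partitioned_output" exists and points at the same storage path; the folder name, partition value and file name are placeholders):

    import dataiku

    # Sketch: write a brand-new file inside the partition's directory through
    # a managed folder, instead of rewriting the whole partition.
    folder = dataiku.Folder("partitioned_output")

    partition = "2024-01-15"
    new_file_path = "/{}/extra_chunk_0001.csv".format(partition)
    csv_payload = "id,value\n101,extra\n"

    # get_writer returns a writable handle for that path inside the folder;
    # writing to it creates the new file without touching the existing ones.
    with folder.get_writer(new_file_path) as writer:
        writer.write(csv_payload.encode("utf-8"))

Each call adds one more file alongside whatever is already in the partition folder, which should match the append behaviour you see from the Kafka sync recipe.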
Thanks