Writing to dataset iteratively

gjoseph
Level 2

I've got a job that is IO-bound and memory intensive, and I need to write the result(s) iteratively. The job is essentially parsing data from Excel files: filtering, aggregating, feature engineering, etc.

Source: 1 billion records; result: 1.2 million records.

I'm using a Python recipe with a multi-threaded asyncio function on the source, and the Kubernetes job fails with an out-of-memory error. I'm thinking of writing to the dataset iteratively. How do I do that? A streaming writer?
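
Something like the sketch below is what I have in mind (untested; the dataset name "aggregated_results" and the iter_result_chunks() function are placeholders standing in for my real output dataset and Excel parsing/aggregation code):

```python
import dataiku

# Placeholders: "aggregated_results" is the output dataset,
# iter_result_chunks() stands in for the Excel parsing/aggregation code
# and yields pandas DataFrames one batch at a time.
output = dataiku.Dataset("aggregated_results")

chunks = iter_result_chunks()
first = next(chunks)
output.write_schema_from_dataframe(first)  # the schema must exist before rows are written

with output.get_writer() as writer:
    writer.write_dataframe(first)
    for chunk_df in chunks:
        # Appends each batch without holding the full result in memory.
        # Note: only one writer can be open on a dataset at a time,
        # so this loop cannot be run from multiple threads.
        writer.write_dataframe(chunk_df)
```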


Operating system used: Windows 11

Turribeach

With regards to the memory error in Kubernetes, there isn't much I can say with the information you provided, other than that in general you don't run multi-threaded processes in containers. The whole idea of a container is to isolate the processing to the minimum possible and have different containers perform different roles. In other words, with containers you don't scale vertically (adding more hardware or running more processes on the same hardware); you scale horizontally by adding more containers that run side by side. Whether those containers are on the same hardware doesn't really matter to you.

With regards to writing to the dataset iteratively, I would suggest you don't do it. First of all, Dataiku has no methods to do this in parallel, so you would have to handle that yourself. Secondly, you are likely to get better performance with a different approach.

I worked on a project where I had to process millions of XML files. Visual recipes were out of the question, since Java (which Dataiku's backend runs on) is very slow for direct file system IO given its high level of abstraction. Even Python wasn't up to the job, as it can struggle to deal with hundreds of thousands of files. In the end I used Shell script recipes for the initial uncompressing, filtering and basic processing, which split the files into 3 different folders for different types of data. Each folder had its own branch in my flow, which meant I was effectively parallelising the load, as Dataiku could run the 3 flow branches at the same time. After some Python recipes, the 3 branches wrote their output as CSV files back to a single folder (in effect merging the 3 flow branches). Finally, a single Python recipe consolidated all the CSV files and wrote them as a single dataset. This design was very efficient and had great performance.

While you certainly can't match the scalability of Kubernetes with flow branches (as DSS itself has limited resources for jobs and activities), I mention that last part because you can use it as an idea for your dataset-writing issue: have your containers write separate CSV files to a high-IO file system, then once all your container processing is done, merge the CSV files with a single Python recipe and do a single write to the output dataset. Python can handle 1.2m rows fairly easily.
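
A rough sketch of what that final consolidation recipe could look like (the managed folder id "merged_csv_output" and the dataset name "consolidated_results" are just placeholders for your own flow):

```python
import dataiku
import pandas as pd

# Placeholders: point these at your own managed folder and output dataset.
csv_folder = dataiku.Folder("merged_csv_output")
output = dataiku.Dataset("consolidated_results")

# Read every CSV the containers dropped into the folder.
frames = []
for path in csv_folder.list_paths_in_partition():
    if not path.lower().endswith(".csv"):
        continue
    with csv_folder.get_download_stream(path) as stream:
        frames.append(pd.read_csv(stream))

# 1.2m rows fits comfortably in memory, so a single write is fine here.
final_df = pd.concat(frames, ignore_index=True)
output.write_with_schema(final_df)
```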

gjoseph
Level 2
Author

Thank you for your reply @Turribeach 

The winning idea we have is to extract the sheet of interest as Parquet files and use Spark to read them super efficiently. That is beside the point of my ask, though.

We can parallelize code easily, effectively and successfully, hence the ask to write iteratively to a dataset. I want to compare the performance and cost of different approaches to ingesting the data.

UPDATE: Turns out it wasn't a memory issue but too many execution threads being requested.
