New visual recipe: output dataset to stream

The new Dataiku streaming support is awesome. In addition to ingesting data in real time from pub/sub networks, which is a great use-case, I'd like to leverage it to process large but finite datasets.

Basically, if I want to use Python today to process a large dataset, then without some special configuration, that dataset needs to fit in system memory as a Python dataframe. This means I could find myself waiting for tens of millions of records to move from my database into DSS before I get past the first couple of lines of code, and if I run out of memory, I've lost quite a bit of time. Additionally, most of my Python code is already designed to work on an internal queue and doesn't care about the source of the data. In many cases, I've set up my code to be dataset-agnostic, and I end up copying the same Python recipe just to read from different datasets. Usually, my use-case is pre-processing and ingestion into an unsupported database.
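To illustrate the dataset-agnostic pattern described above, here's a minimal sketch: the processing function consumes any iterable of records and yields results one at a time, so it never needs the full dataset in memory and doesn't care where the records come from. The `clean` transformation and the dict-shaped records are hypothetical examples, not part of any Dataiku API.

```python
from typing import Iterable, Iterator

def clean(record: dict) -> dict:
    """Example per-record transformation (hypothetical)."""
    return {k: str(v).strip() for k, v in record.items()}

def process(records: Iterable[dict]) -> Iterator[dict]:
    """Dataset-agnostic: consumes any iterable of records and yields
    results one at a time, instead of materializing the whole dataset
    as an in-memory dataframe."""
    for record in records:
        yield clean(record)

# Any source that yields dicts works: a list, a database cursor,
# or a row iterator supplied by the platform.
rows = [{"name": " Ada "}, {"name": "Grace "}]
print(list(process(rows)))  # [{'name': 'Ada'}, {'name': 'Grace'}]
```

The same `process` function could then be pointed at a stream subscription rather than copied per-dataset.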

If I could create a recipe that sends records from a database into a message queue, then subscribe a Python recipe to that message queue, I'd be able to start processing data right away instead of waiting for a long dataframing operation. I'd also gain the benefit of publishing multiple datasets to the same stream, letting my recipe optimize its throughput and take on additional datasets as needed.
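The producer/consumer flow above can be sketched with the standard library, standing in for the proposed recipe-to-queue-to-recipe wiring. The doubling transformation and the queue size are placeholder assumptions; the point is that the consumer starts working on the first record immediately, long before the producer has finished.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def producer(records, q):
    """Stands in for a recipe publishing database rows to a message queue."""
    for record in records:
        q.put(record)
    q.put(SENTINEL)

def consumer(q, results):
    """The subscribed Python recipe: begins processing as soon as the
    first record arrives, with no full dataframe load required."""
    while True:
        record = q.get()
        if record is SENTINEL:
            break
        results.append(record * 2)  # placeholder transformation

q = queue.Queue(maxsize=100)  # bounded queue applies backpressure
results = []
t = threading.Thread(target=consumer, args=(q, results))
t.start()
producer(range(5), q)
t.join()
print(results)  # [0, 2, 4, 6, 8]
```

A bounded queue also caps memory use: if the consumer falls behind, the producer blocks rather than buffering millions of records.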

Then, if restarting the stream is something I can trigger via the API, I can essentially configure an automated ETL pipeline for large datasets into unsupported databases.

Depending on the functionality added to the API, I could even get advanced with it, configuring the stream to only send records or files that are missing from my target database.
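One way that "only send what's missing" logic might look, sketched as a simple key diff: fetch the set of keys already present in the target, then stream only the source rows whose key is absent. The `id` key and the up-front key fetch are assumptions about how such a configuration could work, not an existing Dataiku feature.

```python
def missing_records(source_rows, target_keys, key="id"):
    """Yield only the rows whose key is absent from the target database.
    Assumes the target's existing keys can be fetched cheaply up front."""
    for row in source_rows:
        if row[key] not in target_keys:
            yield row

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
already_loaded = {1, 3}  # keys already present in the target database
print(list(missing_records(source, already_loaded)))  # [{'id': 2, 'v': 'b'}]
```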

Another use-case, once more visual recipes are added for processing streams, would be optimized data transformations that neither store intermediate datasets nor depend on nested SQL queries (which can often exceed the database's capabilities).

You could also imagine blending real-time and static data, for example performing continuous joins to enrich streamed data before passing it on for further processing, whether that's outputting to another message queue publisher, a Python script, or a dataset. This sort of processing could also enable use-cases like real-time alerting (using explicit logic or perhaps a classifier trained in Dataiku), triggering conditional pipeline flows based on real-time data, selectively recording stream data that meets particular criteria, or streaming in the latest values from a database when messages arrive from a message queue.
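A continuous join plus explicit-logic alerting, as imagined above, can be sketched with generators. The customer lookup table, the event shape, and the gold-tier/amount rule are all illustrative assumptions; in the proposed feature the lookup would come from a static dataset and the events from a stream.

```python
# Static reference data, e.g. loaded once from a dataset.
customers = {101: {"tier": "gold"}, 102: {"tier": "basic"}}

def enrich(stream, lookup):
    """Continuous join: attach static attributes to each streamed event."""
    for event in stream:
        yield {**event, **lookup.get(event["customer_id"], {})}

def alerts(stream, threshold=100):
    """Explicit-logic alerting; a trained classifier could sit here instead."""
    for event in stream:
        if event.get("tier") == "gold" and event["amount"] > threshold:
            yield event

events = [
    {"customer_id": 101, "amount": 250},
    {"customer_id": 102, "amount": 500},
    {"customer_id": 101, "amount": 50},
]
flagged = list(alerts(enrich(iter(events), customers)))
print(flagged)  # only the gold-tier event over the threshold
```

Because both stages are generators, each event flows through the join and the alert check as it arrives, with no intermediate dataset written between them.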

Basically, I propose a simple new visual recipe similar to the existing continuous sync recipe, except that the input is a dataset or SQL query. Streams could be represented in the flow with a new shape, and the flow could visualize all the subscriptions to a stream with the same DAG visualizer already in use for datasets. Other recipes, such as joining a stream to a dataset, could then be added for even more advanced functionality.