Processing a continuous stream of data

Gustavo_Brian
Level 2

Hi,



I have a service that continuously emits data. You start receiving data as soon as you open a TCP connection, and the stream never stops until you terminate the connection.



I'd like to develop a custom plugin to process that data in Dataiku. How can I do that, given that the data never ends?
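One way to make a never-ending stream usable in a batch-oriented tool is to read it in bounded windows: connect, read for at most N seconds or M bytes, disconnect, and hand the finite batch to the next step. Here is a minimal sketch of that idea; the host, port, and newline-delimited record format are assumptions, not part of the original service description.

```python
import socket
import time

# Hypothetical endpoint of the streaming service -- adjust to yours.
HOST, PORT = "stream.example.com", 9000


def split_records(buffer: bytes):
    """Split a raw byte buffer into newline-delimited records.

    Returns (records, leftover): the complete records, plus the
    trailing partial record to carry over into the next read.
    """
    *records, leftover = buffer.split(b"\n")
    return [r for r in records if r], leftover


def read_batch(max_seconds=60, max_bytes=1_000_000):
    """Read from the endless stream for a bounded window, then stop.

    This turns the continuous stream into finite micro-batches that
    a recipe or plugin can process one at a time.
    """
    records, leftover = [], b""
    deadline = time.monotonic() + max_seconds
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        total = 0
        while time.monotonic() < deadline and total < max_bytes:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            total += len(chunk)
            new, leftover = split_records(leftover + chunk)
            records.extend(new)
    return records
```

Because each call returns a finite list, a scheduled build only ever processes one window of data rather than trying to read "everything".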



Also, will the "build" log overload the server?



Thanks



 



UPDATE:



We are loading data from a flight metasearch service. It exposes a data stream that we consume by polling a TCP connection (https://github.com/gbrian/Flightmate-Stream). We plan to use Dataiku to parse, sanitize, etc. the data and then drop it into Hadoop, in addition to applying the corresponding analysis and lab 😉



@alexander Hope this helps

Alex_Combessie (Dataiker)
Re: Processing a continuous stream of data
Hi Gustavo,

For this type of use case, we would advise performing the data ingestion outside of Dataiku DSS, with a streaming engine such as Flume or Kafka.
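Following that advice, the ingestion piece can be a small standalone bridge that reads the TCP stream and publishes each record to a Kafka topic, which DSS can then consume. The sketch below assumes newline-delimited records and uses the kafka-python client; the host, port, broker addresses, and topic name are placeholders.

```python
import socket


def iter_lines(sock, chunk_size=4096):
    """Yield newline-delimited records from a socket as they arrive."""
    leftover = b""
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # connection closed by the server
            break
        leftover += chunk
        while b"\n" in leftover:
            line, leftover = leftover.split(b"\n", 1)
            if line:
                yield line


def bridge(host, port, brokers, topic):
    """Forward every record from the TCP stream into a Kafka topic."""
    # Imported here so the pure parsing above runs without Kafka installed.
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers=brokers)
    with socket.create_connection((host, port)) as sock:
        for record in iter_lines(sock):
            producer.send(topic, record)
```

Running this bridge as a long-lived process (e.g. under systemd) keeps the never-ending connection outside DSS, so DSS only ever sees the buffered topic.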

Once the data is ingested, you can perform data transformation and machine learning modelling in DSS in a micro-batch way, using partitions to avoid recomputing on the whole data: https://doc.dataiku.com/dss/latest/partitions/index.html
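To make the micro-batches line up with a time-based partition dimension, the ingestion side can simply drop each batch into an hourly folder; a DSS dataset partitioned on that time pattern then rebuilds only the latest hour. This is an illustrative sketch, with a hypothetical root path and an assumed one-hour partition scheme.

```python
import os
from datetime import datetime, timezone


def partition_path(root, ts):
    """Folder for an hourly partition, e.g. <root>/2020-05-18-14/."""
    return os.path.join(root, ts.strftime("%Y-%m-%d-%H"))


def write_batch(records, root="/data/flightmate/in"):
    """Append one micro-batch of byte records to the current hourly
    partition folder, so only that partition needs recomputing."""
    now = datetime.now(timezone.utc)
    folder = partition_path(root, now)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "batch-%s.jsonl" % now.strftime("%H%M%S"))
    with open(path, "ab") as f:
        for rec in records:
            f.write(rec + b"\n")
    return path
```

The folder layout `%Y-%m-%d-%H` is just one choice of partitioning pattern; the same idea works with daily folders if hourly granularity is overkill.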

Cheers,

Alex