
Sampling Big Data into Python Webapp

Level 4

For the majority of our plugins, we have output tables containing 1 to 50+ million records. In my experience, this is far too much for Python and our compute instance to handle.

Do you have any recommendations for this type of workflow? Right now we are adding an additional aggregated output solely for the purpose of the webapp.

It would be nice if we could do the aggregation in the Bokeh script and not require the extra table in the flow.  

1 Reply
Dataiker

Hi @gblack686 

Given that there are several options for how and where the data resides, as well as the output you need to display in your webapp, there's no single answer that will work in all situations.

So I'll give you a few questions I would consider in my design, which would then inform the solution implemented.

- Do you need to show all 50M records at all times? If so, the load on the front end would be heavy and likely to crash the browser. But more than that, does it actually serve any use case or give users a benefit? I'd argue that it's unlikely.

- Do users instead need to apply filters and view a subset of the full 50M records? If so, your webapp needs to take in some parameters and submit them to a backend; the backend then queries the data source and sends back the results.
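As a minimal sketch of that filter-to-backend pattern, the snippet below uses sqlite3 as a stand-in for whatever database actually holds the data; the table, column names, and `query_records` helper are all hypothetical:

```python
import sqlite3

# In-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (region TEXT, value REAL)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("us", 10.0), ("eu", 20.0), ("us", 30.0)])

def query_records(region, limit=1000):
    # The ? placeholders keep user input out of the SQL string (no
    # injection), and LIMIT caps what the front end ever has to render.
    return conn.execute(
        "SELECT region, value FROM records WHERE region = ? LIMIT ?",
        (region, limit),
    ).fetchall()

print(query_records("us"))  # [('us', 10.0), ('us', 30.0)]
```

The key point is that only the filtered subset crosses the wire to the webapp; the 50M-row table never leaves the database.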

- If we're looking to display a filtered subset of the data, we still need to consider the amount of data displayed (1M points is still a lot): at what resolution is the data visually helpful to our users?
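On the resolution point: a screen is only a few thousand pixels wide, so even a filtered series rarely needs more than a few thousand plotted points. Here is a deliberately simple bucket-mean downsampler to illustrate the idea; production apps often use more sophisticated schemes (e.g. largest-triangle-three-buckets) or let the database do the binning:

```python
def downsample(values, target=2000):
    """Reduce a numeric series to at most `target` points by averaging
    equal-size buckets. A sketch, not a production decimator."""
    n = len(values)
    if n <= target:
        return list(values)
    step = n / target
    out = []
    for i in range(target):
        bucket = values[int(i * step):int((i + 1) * step)]
        out.append(sum(bucket) / len(bucket))
    return out

# Eight points collapsed to four bucket means:
print(downsample([1, 2, 3, 4, 5, 6, 7, 8], target=4))  # [1.5, 3.5, 5.5, 7.5]
```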

- We've (potentially) introduced search, so how and where does this happen? If it's in a database, then it would be appropriate to have indices on the queried fields. But if the data lives in a distributed filesystem, then querying it in real time with Spark or similar tools will lead to a poor user experience, as such jobs are not meant for real-time workloads.
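To make the indexing point concrete, here is a small sqlite3 sketch (table and index names are illustrative); without the index the `WHERE` clause forces a full table scan, with it the database can seek straight to the matching rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(10_000)])

# Index the field the webapp filters on.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# SQLite's query planner now reports an index search, not a scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()
print(plan)
```

The same principle applies to any SQL backend: index the columns your webapp's filters hit, or every request pays for a scan of the full table.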

- If we don't have access to fast retrieval, perhaps we can create a queuing system where the user submits their filters and is notified when the results are ready. This would involve additional elements in our flow that deal with stored queries, results, and so on. The project itself then becomes more complex and perhaps harder to manage.
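The submit-then-poll idea can be sketched in a few lines with an in-process queue and a worker thread; in a real deployment the queue, the worker, and the results store would be separate services, and all names here are illustrative:

```python
import queue
import threading
import time

jobs = queue.Queue()   # user-submitted filter requests
results = {}           # job_id -> status/result, what the user polls

def worker():
    while True:
        job_id, filters = jobs.get()
        results[job_id] = {"status": "running"}
        time.sleep(0.01)  # stand-in for a slow Spark/warehouse query
        results[job_id] = {"status": "done", "rows": [filters]}
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# The webapp enqueues the query and returns immediately...
jobs.put(("job-1", {"region": "us"}))
# ...and the user polls for status; here we just block until done.
jobs.join()
print(results["job-1"]["status"])  # done
```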

I hope these few points help you weigh the elements your system might require, balancing the complex trade-offs between user experience, design, and budget!
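Finally, on your specific question about doing the aggregation in the Bokeh script rather than in an extra flow table: the aggregation can often be pushed down as a SQL query issued from the webapp backend, so only the grouped rows ever reach Python. A minimal sketch, using sqlite3 as a stand-in for the real connection (in DSS the same SQL could go through a SQL executor against the output dataset; the `events` table and its columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0), ("b", 5.0)],
)

# Aggregate in the database: only the handful of grouped rows reach
# Python, not the millions of raw records.
rows = conn.execute(
    "SELECT category, COUNT(*), SUM(amount) "
    "FROM events GROUP BY category ORDER BY category"
).fetchall()
print(rows)  # [('a', 2, 3.0), ('b', 3, 12.0)]
```

This keeps the flow free of the extra materialized table, at the cost of the webapp paying for the aggregation on each request (or caching it itself).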
