Very big dataset


I have a very large dataset: 16.8 billion records and about 8 TB. Any operation on the data takes days, and the project owner wants to use all the data, not a subset. Dataiku and S3 run into memory errors after several hours of running. I am looking for some general guidelines on how to handle this situation.

Thank you.

4 Replies

S3 is not a good platform for running operations on that amount of data. I suggest you look at technologies designed to query data at that scale quickly, such as Snowflake, Databricks or Google BigQuery. Amazon has Redshift, but everyone I know who used it has moved away from it.

Failing that, you will need to use something like Spark, which uses parallel compute to process the data faster:



I will check out the ideas in your link. I had configured a PySpark recipe to use Spark configurations for the run, but still hit the memory issues. Thanks for the ideas.
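
As a rough illustration of why a job like this may still run out of memory, the sketch below estimates the average record size and how many partitions 8 TB would need if each partition is kept near 128 MB. The 128 MB target is an assumed rule of thumb, not a figure from this thread, and the function names are hypothetical. The size stored in S3 may also differ from the in-memory size if the files are compressed.

```python
import math

TOTAL_BYTES = 8 * 10**12        # about 8 TB, from the original post (decimal TB assumed)
TOTAL_RECORDS = 16_800_000_000  # 16.8 billion records, from the original post


def avg_record_bytes(total_bytes: int, total_records: int) -> float:
    """Average size of one record in bytes."""
    return total_bytes / total_records


def partitions_needed(total_bytes: int, target_partition_bytes: int) -> int:
    """Number of partitions so that each holds at most the target size."""
    return math.ceil(total_bytes / target_partition_bytes)


if __name__ == "__main__":
    target = 128 * 10**6  # assumed target of roughly 128 MB per partition
    size = avg_record_bytes(TOTAL_BYTES, TOTAL_RECORDS)
    print(f"average record size: {size:.0f} bytes")
    print(f"partitions at 128 MB each: {partitions_needed(TOTAL_BYTES, target):,}")
```

With these inputs the estimate is roughly 476 bytes per record and 62,500 partitions. If a shuffle of the whole dataset ran with the default `spark.sql.shuffle.partitions` of 200, each partition would be about 40 GB, which is one common way to hit memory errors.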


Spark by itself only gives you the parallel compute capability. It is still up to you to decide where to offload your Spark jobs and to configure that compute capacity for your needs. You can run Spark jobs in many different places, but running them on your DSS instance certainly won't help much. Databricks can create large compute clusters that you can use to run large Spark jobs. See these pages for more info:

But you can also run Spark jobs on Kubernetes in EKS, or on EMR, the AWS Spark platform:



I have been running Spark jobs on Kubernetes in EKS.
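
When the jobs already run on a cluster, one thing that may be worth checking is how much memory each task actually gets. The following is a minimal sketch of that arithmetic, assuming Spark's documented defaults (300 MB reserved on the executor heap and `spark.memory.fraction` of 0.6). The executor size and core count are made-up example values, not settings from this thread, and the function name is hypothetical.

```python
RESERVED_MB = 300      # memory Spark reserves on each executor heap
MEMORY_FRACTION = 0.6  # default value of spark.memory.fraction


def unified_memory_per_task_mb(executor_memory_mb: int, executor_cores: int,
                               memory_fraction: float = MEMORY_FRACTION) -> float:
    """Approximate fair share of execution+storage memory per task.

    Spark shares (heap - reserved) * memory_fraction among the tasks that
    run concurrently on an executor, which is one task per core by default.
    """
    usable_mb = (executor_memory_mb - RESERVED_MB) * memory_fraction
    return usable_mb / executor_cores


if __name__ == "__main__":
    # Example values only: 16 GB executor heap (spark.executor.memory), 4 cores.
    per_task = unified_memory_per_task_mb(16 * 1024, 4)
    print(f"about {per_task:.0f} MB per task when all cores are busy")
```

If the partitions each task processes are much larger than this share, Spark has to spill to disk and may still fail, so more (smaller) partitions or fewer cores per executor are the usual things to try first.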
