Very big dataset

TheMLEngineer · February 2024

I have a very large dataset: 16.8 billion records and about 8 TB. Any operation on the data takes days, and the project owner wants to use all of the data rather than a subset. Dataiku and S3 run into memory errors after several hours of running. I am looking for some general guidelines on how to handle this situation.

Thank you.

Turribeach · February 2024

S3 is not a good platform for running operations on that amount of data. I suggest you look at technologies designed to query that much data quickly, such as Snowflake, Databricks or Google BigQuery. Amazon has Redshift, but everyone I know who used it has moved away from it.

Failing that, you will need to use something like Spark, which uses parallel compute to process the data faster:

https://towardsdatascience.com/leveraging-apache-spark-to-execute-billions-of-operations-on-aws-s3-2f62930d19fd
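
To get a feel for why a single machine struggles here, this is a quick back-of-envelope sketch using the numbers from the question. The 128 MiB partition size is Spark's default for `spark.sql.files.maxPartitionBytes`; the 100 MB/s read throughput per task is purely an assumed round figure, not a measurement, and `scan_hours` is a hypothetical helper name.

```python
# Figures from the original question.
TOTAL_BYTES = 8 * 10**12          # "about 8TB", taken as decimal terabytes
TOTAL_RECORDS = 16_800_000_000    # 16.8 billion records

# Assumption: 128 MiB per input partition (Spark's default for
# spark.sql.files.maxPartitionBytes).
PARTITION_BYTES = 128 * 1024**2

bytes_per_record = TOTAL_BYTES / TOTAL_RECORDS
num_partitions = -(-TOTAL_BYTES // PARTITION_BYTES)  # ceiling division


def scan_hours(parallel_tasks: int, mb_per_sec_per_task: float = 100.0) -> float:
    """Hours to read the whole dataset once.

    The per-task throughput is an assumed figure; measure your own storage.
    """
    bytes_per_sec = parallel_tasks * mb_per_sec_per_task * 10**6
    return TOTAL_BYTES / bytes_per_sec / 3600


print(f"{bytes_per_record:.0f} bytes per record on average")
print(f"{num_partitions} input partitions at 128 MiB each")
print(f"one task: {scan_hours(1):.1f} h, 200 parallel tasks: {scan_hours(200):.2f} h")
```

Under these assumptions, a single sequential pass over the data already takes most of a day before any actual processing happens, which is why spreading the read across many tasks matters more than tuning one machine.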

TheMLEngineer · February 2024

I will check out the ideas in your link. I had configured a PySpark recipe to use Spark configs when running, but still hit the memory issues. Thanks for the ideas.

Turribeach · February 2024

Spark by itself only gives you the parallel compute capability. It is still up to you to decide where to offload your Spark jobs and to size that compute for your needs. You can run Spark jobs in many different places, but running them on your DSS instance itself certainly won't help much. Databricks can create large compute clusters which you can use to run large Spark jobs. See these pages for more info:

https://blog.dataiku.com/databricks-integration

https://doc.dataiku.com/dss/latest/connecting/sql/databricks.html

You can also run Spark jobs on Kubernetes in EKS or on EMR, the AWS Spark platform:

https://doc.dataiku.com/dss/latest/containers/eks/index.html

https://aws.amazon.com/eks/

https://doc.dataiku.com/dss/latest/hadoop/distributions/emr.html

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html
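
As a rough illustration of what sizing the compute for your needs can look like once you have a cluster, here is a minimal sketch that turns a cluster shape into a starting set of Spark properties. The property names are real Spark settings, but the function `spark_conf_for`, the example cluster (20 nodes, 16 cores, 64 GB each) and the rules of thumb (5 cores per executor, 1 core and 1 GB reserved per node, about 10% of executor memory kept as overhead, 3 shuffle partitions per core) are all assumptions to tune, not recommendations from Dataiku or AWS.

```python
def spark_conf_for(nodes: int, cores_per_node: int, mem_gb_per_node: int,
                   cores_per_executor: int = 5) -> dict[str, str]:
    """Derive a starting Spark configuration from a cluster shape.

    All ratios here are rule-of-thumb assumptions, not fixed requirements.
    """
    # Reserve 1 core and 1 GB per node for the OS and node daemons.
    executors_per_node = (cores_per_node - 1) // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("nodes are too small for the requested executor size")
    container_gb = (mem_gb_per_node - 1) // executors_per_node

    # Keep roughly 10% of each container for off-heap overhead.
    heap_gb = container_gb * 9 // 10
    overhead_gb = container_gb - heap_gb

    total_executors = nodes * executors_per_node
    total_cores = total_executors * cores_per_executor

    return {
        "spark.executor.instances": str(total_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_gb}g",
        "spark.executor.memoryOverhead": f"{overhead_gb}g",
        "spark.sql.shuffle.partitions": str(total_cores * 3),
    }


# Hypothetical cluster: 20 nodes with 16 cores and 64 GB of RAM each.
for key, value in spark_conf_for(20, 16, 64).items():
    print(f"{key} = {value}")
```

The resulting key/value pairs are the kind of settings you would put into the Spark configuration used by the recipe. Memory errors on a job this size often come from too few, too large partitions, so the shuffle partition count is usually the first value worth experimenting with.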