Very big dataset
I have a very large dataset: 16.8 billion records, about 8 TB. It takes days to do any operation on the data, and the project owner wants to use all of the data, not a subset. Dataiku and S3 run into memory errors after several hours of running. Looking for some general guidelines on how to handle this situation.
Thank you.
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,979 Neuron
S3 is not a good platform for trying to do operations on that amount of data. I suggest you look at technologies that have been designed for querying that amount of data quickly, like Snowflake, Databricks or Google BigQuery. Amazon has Redshift, but everyone I know who used it has moved away from it.
Failing that, you will need to use something like Spark, which uses parallel compute to process the information faster.
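To make that concrete, here is a minimal PySpark sketch of the pattern; the bucket path and column name are hypothetical placeholders, not values from this project:

# Minimal sketch: the s3a path and column name below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("large_dataset_aggregation").getOrCreate()

# Spark reads the files lazily and splits the scan across executors,
# so the full 8 TB never has to fit in one machine's memory.
df = spark.read.parquet("s3a://my-bucket/events/")

# The aggregation runs in parallel on the executors; only the small
# result is materialised.
daily_counts = df.groupBy("event_date").agg(F.count("*").alias("row_count"))

# Write the result out rather than collecting 16.8B rows to the driver.
daily_counts.write.mode("overwrite").parquet("s3a://my-bucket/output/daily_counts/")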
-
I will check out the ideas in your link. I had configured a PySpark recipe to leverage Spark configs for running, but still hit the memory issues. Thanks for the ideas.
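For anyone hitting the same out-of-memory errors, this is a sketch of the kind of recipe-level Spark settings that usually get tuned; all of the values are illustrative assumptions, not what was actually used here:

# Illustrative values only: executor sizing and partition counts depend
# entirely on the cluster; none of these numbers come from this thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large_dataset_recipe")
    .config("spark.executor.memory", "16g")          # per-executor heap
    .config("spark.executor.memoryOverhead", "4g")   # off-heap headroom
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "4000")  # more, smaller shuffle partitions
    .getOrCreate()
)

# Hypothetical path and column; repartitioning before wide operations
# keeps individual tasks small enough to avoid executor OOM.
df = spark.read.parquet("s3a://my-bucket/events/").repartition(4000, "event_date")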
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,979 Neuron
Spark by itself only gives you the parallel compute capability; it is still up to you to decide where to offload your Spark jobs and to configure that compute capability to your needs. You can run Spark jobs in many different places, but certainly running them on your DSS instance won't help much. Databricks can create large compute clusters which you can use to run large Spark jobs. See these pages for more info:
https://blog.dataiku.com/databricks-integration
https://doc.dataiku.com/dss/latest/connecting/sql/databricks.html
But you can also run Spark jobs on Kubernetes in EKS, or on EMR, AWS's managed Spark platform (a rough configuration sketch follows these links):
https://doc.dataiku.com/dss/latest/containers/eks/index.html
https://doc.dataiku.com/dss/latest/hadoop/distributions/emr.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html
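As a rough illustration of what pointing a Spark session at an EKS cluster looks like; the API server address, image, namespace and sizing are placeholders, and in DSS these settings normally live in the managed Spark / containerized execution configuration rather than in recipe code:

# All values are hypothetical placeholders; in DSS this is usually set
# in the Spark configuration attached to the recipe, not in code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_on_eks")
    .master("k8s://https://EKS_API_SERVER:443")                        # EKS API endpoint
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "my-registry/spark:3.5")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "50")                          # scale out, not up
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)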
-
I have been running Spark jobs on Kubernetes in EKS.