All,
I've been working with a file-partitioned dataset to which I'm adding hourly partitions. Over the past 2.5 months the dataset has grown to roughly 2.8 million records, and with the way I'm currently using it, it will continue to grow.
When I started with this approach, the build time for a batch of 100,000 or so records was ~8 minutes. That build time has since grown to ~16 minutes.
Over the long term this will not be sustainable. Over the short term I don't have the resources to spin up a Spark cluster. One option would be to move the file-based store to a Postgres database. Some questions for discussion:
Open to thoughts and discussion on these points.
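To make the Postgres option concrete, here is a minimal sketch of what an append-only hourly load could look like. This is an assumption on my part, not a working setup: the table name, columns, and partition-key format are all hypothetical, and I'm using SQLite here purely so the snippet is self-contained; against Postgres the SQL and loading pattern would be essentially the same, just with a different driver and `%s` placeholders.

```python
import sqlite3  # stand-in for Postgres so the sketch runs anywhere;
                # with psycopg2 the statements are nearly identical

# Hypothetical schema for the hourly-partitioned records
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_hour TEXT NOT NULL,   -- partition key, e.g. '2023-06-01-14'
        payload    TEXT
    )
""")
# Index on the partition key so per-hour reads don't scan the whole table
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_hour ON events (event_hour)")

def load_partition(conn, hour, rows):
    """Append one hourly partition. Only the new rows are written,
    so load time stays roughly flat as the table grows, instead of
    scaling with total dataset size the way a full rebuild does."""
    conn.executemany(
        "INSERT INTO events (event_hour, payload) VALUES (?, ?)",
        [(hour, r) for r in rows],
    )
    conn.commit()

load_partition(conn, "2023-06-01-14", ["a", "b", "c"])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 3
```

The key point of the design is that each hourly load touches only its own rows, which is what I'd hope would keep build times from growing with the overall dataset.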
Operating system used: macOS Ventura