I've been working with a file-partitioned dataset that I add hourly partitions to. Over the past 2.5 months it has grown to roughly 2.8 million records, and given how I'm using it, it will keep growing.
When I started with this approach, the build time for 100,000 or so records was ~8 minutes; it has since grown to ~16 minutes.
Over the long term this will not be sustainable, and over the short term I don't have the resources to spin up a Spark cluster. I could try moving the file-based store to a Postgres database. Some questions for discussion:
How are folks managing their partitioned datasets?
Is there an advantage to moving a partitioned dataset to a Postgres database rather than keeping it on the file system?
The partitions currently exist so we can make incremental pulls of data from our rather slow external datasource. How do folks manage such data sources?
How do you prune such datasets while maintaining consistency with the upstream datasource?