Performance of Disk-based partitioned datasets
All,
I've been working with a file-partitioned dataset that I'm adding hourly partitions to. Over the past 2.5 months the dataset has grown to ~2.8 million records, and with the way I'm currently using it, it will continue to grow.
When I started with this approach, the build for 100,000 or so records took ~8 minutes. The build time has since grown to ~16 minutes.
Over the long term this will not be sustainable. Over the short term I don't have the resources to spin up a Spark cluster. I could try to move the file-based store to a Postgres database. Some questions for discussion:
- How are folks managing their partitioned datasets?
- Is there an advantage to moving to a Postgres database rather than the file system for a partitioned dataset?
- The partitions are currently created so we can make incremental pulls of data from our rather slow external data source. How are folks managing such data sources?
- How do you prune such data sources while maintaining consistency with the upstream data source?
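
To make the questions above concrete, here is a minimal Python sketch of the general pattern (hourly file partitions, incremental pulls, and pruning). It is not my actual flow and does not use any Dataiku API: the one-CSV-per-hour layout, the function names, and the `fetch_rows` callable standing in for the slow external source are all assumptions for illustration.

```python
import csv
from datetime import datetime, timedelta
from pathlib import Path


def partition_path(root: Path, hour: datetime) -> Path:
    """Hypothetical layout: one CSV file per hour, e.g. root/2023-05-01-13.csv."""
    return root / f"{hour:%Y-%m-%d-%H}.csv"


def missing_hours(root: Path, start: datetime, end: datetime) -> list:
    """List the hours in [start, end) that have no partition file yet."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    missing = []
    while hour < end:
        if not partition_path(root, hour).exists():
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


def build_partition(root: Path, hour: datetime, fetch_rows) -> int:
    """Pull a single hour from the source and write only that partition.

    fetch_rows(start, end) is a hypothetical callable returning a list of rows.
    """
    rows = fetch_rows(hour, hour + timedelta(hours=1))
    root.mkdir(parents=True, exist_ok=True)
    with partition_path(root, hour).open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)


def prune_partitions(root: Path, keep_after: datetime) -> list:
    """Delete partition files older than keep_after; return the removed names."""
    removed = []
    for path in sorted(root.glob("*.csv")):
        if datetime.strptime(path.stem, "%Y-%m-%d-%H") < keep_after:
            path.unlink()
            removed.append(path.name)
    return removed
```

In this sketch each build touches only the hours returned by `missing_hours`, so the cost of a build should depend on the number of new partitions rather than on the total size of the store. Pruning by timestamp alone does not check consistency with the upstream source, which is the part I'm least sure how to handle.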
Open to thoughts and discussion on these points.
Operating system used: macOS Ventura