Performance of Disk based partitioned datasets

tgb417 · March 2023

All,

I've been working with a file-based partitioned dataset that I'm adding hourly partitions to. Over the past 2.5 months the dataset has grown to ~2.8 million records, and with the way I'm currently using it, it will continue to grow.

When I started with this approach the build time for 100,000 or so records was taking ~8 minutes. The build time has grown to ~16 minutes.

Over the long term this will not be sustainable. Over the short term I don't have the resources to spin up a Spark Cluster. I could try to move the file based store to a Postgres database. Some questions for discussion:

How are folks managing their partitioned datasets?
Is there an advantage to moving to a Postgres database rather than the file system for a partitioned dataset?
The partitions are currently being created so we can make incremental pulls of data from our rather slow external data source. How are folks managing such data sources?
- How do you prune such data sources while maintaining consistency with the upstream data source?
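
To make the pruning question concrete, here is a minimal sketch of the kind of thing I have in mind. It assumes the hourly partitions are identified by strings of the form `YYYY-MM-DD-HH`; that naming, the function name `partitions_to_prune`, and the `keep_days` parameter are all assumptions for illustration, not how my setup necessarily works.

```python
from datetime import datetime, timedelta


def partitions_to_prune(partition_ids, now, keep_days):
    """Return the hourly partition ids that fall before the retention cutoff.

    Assumption: each id looks like 'YYYY-MM-DD-HH' (hypothetical naming).
    """
    cutoff = now - timedelta(days=keep_days)
    stale = []
    for pid in partition_ids:
        # Parse the partition id into the hour it represents.
        stamp = datetime.strptime(pid, "%Y-%m-%d-%H")
        if stamp < cutoff:
            stale.append(pid)
    return sorted(stale)


if __name__ == "__main__":
    ids = ["2023-01-01-00", "2023-02-28-23", "2023-03-01-12"]
    print(partitions_to_prune(ids, now=datetime(2023, 3, 1, 13), keep_days=1))
```

This only picks candidate partitions by age. It does not check whether the upstream source still holds or has changed those records, which is the consistency part I'm unsure how to handle.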

Open to thoughts and discussion on these points.

Operating system used: macOS Ventura
