Performance of disk-based partitioned datasets


I've been working with a file-based partitioned dataset that I'm adding hourly partitions to.  Over the past 2.5 months the dataset has grown to about 2.8 million records, and with the way I'm currently using it, it will continue to grow.

When I started with this approach, the build for 100,000 or so records took ~8 minutes.  The build time has since grown to ~16 minutes.

Over the long term this will not be sustainable.  Over the short term I don't have the resources to spin up a Spark cluster.  I could try to move the file-based store to a Postgres database.  Some questions for discussion:

  • How are folks managing their partitioned datasets?
  • Is there an advantage to moving to a Postgres database rather than the file system for a partitioned dataset?
  • The partitions are currently being created so we can make incremental pulls of data from our rather slow external data source.  How are folks managing such data sources?
    • How do you prune such data sources while maintaining consistency with the upstream data source?
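
For reference, here is a minimal Python sketch of the incremental-pull pattern described above. It is not my actual project code. It assumes one directory per hourly partition and CSV files inside it; the `fetch_hour` callable, the directory naming, and the function names are all hypothetical stand-ins for the slow external data source and the real storage layout. The pruning step only shows time-based retention and does not by itself guarantee consistency with upstream.

```python
import csv
import shutil
from datetime import datetime, timedelta
from pathlib import Path

PARTITION_FORMAT = "%Y-%m-%d-%H"  # assumed naming: one directory per hour


def partition_dir(root: Path, hour: datetime) -> Path:
    """Directory that holds the partition for a given hour."""
    return root / hour.strftime(PARTITION_FORMAT)


def missing_hours(root: Path, start: datetime, end: datetime) -> list:
    """Hours in [start, end) that have no partition directory on disk yet."""
    hours = []
    current = start
    while current < end:
        if not partition_dir(root, current).exists():
            hours.append(current)
        current += timedelta(hours=1)
    return hours


def write_partition(root: Path, hour: datetime, rows) -> None:
    """Write rows to a temporary directory, then rename it into place,
    so a failed pull never leaves a half-written partition behind."""
    target = partition_dir(root, hour)
    tmp = target.with_name(target.name + ".tmp")
    if tmp.exists():
        shutil.rmtree(tmp)
    tmp.mkdir(parents=True)
    with open(tmp / "part.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)
    tmp.rename(target)  # target does not exist yet: only missing hours are written


def sync(root: Path, start: datetime, end: datetime, fetch_hour) -> list:
    """Pull only the hours that are not stored yet; return the hours pulled.
    fetch_hour(hour) is a hypothetical callable wrapping the external source."""
    root.mkdir(parents=True, exist_ok=True)
    pulled = missing_hours(root, start, end)
    for hour in pulled:
        write_partition(root, hour, fetch_hour(hour))
    return pulled


def prune_before(root: Path, cutoff: datetime) -> list:
    """Delete partitions older than cutoff; return the names removed."""
    removed = []
    for path in sorted(root.iterdir()):
        if not path.is_dir() or path.name.endswith(".tmp"):
            continue
        if datetime.strptime(path.name, PARTITION_FORMAT) < cutoff:
            shutil.rmtree(path)
            removed.append(path.name)
    return removed
```

With this layout, each run costs one pull per missing hour rather than a pull over the whole history, which is the property I'd want to keep if the store moved to Postgres.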

Open to thoughts and discussion on these points.

Operating system used: macOS Ventura

