Performance of disk-based partitioned datasets

tgb417

All,

I've been working with a file-based partitioned dataset that I'm adding hourly partitions to. Over the past 2.5 months the dataset has grown to about 2.8 million records, and the way I'm currently using it, it will continue to grow.

When I started with this approach, the build time for 100,000 or so records was ~8 minutes. It has since grown to ~16 minutes.

Over the long term this will not be sustainable. Over the short term I don't have the resources to spin up a Spark cluster. I could try to move the file-based store to a Postgres database. Some questions for discussion:

  • How are folks managing their partitioned datasets?
  • Is there an advantage to moving to a Postgres database rather than the file system for a partitioned dataset?
  • The partitions are currently being created so we can make incremental pulls of data from our rather slow external data source. How are folks managing such data sources?
    • How do you prune such data sources while maintaining consistency with the upstream data source?
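
To make the pruning question a bit more concrete, here is a minimal sketch of the kind of selection step I have in mind. It is plain Python, not a DSS API call, and it rests on assumptions: the function name `partitions_to_prune`, the `keep_hours` retention window, and the idea that hourly partition ids look like `2023-03-01-14` are all for illustration. It only picks which partitions fall outside the window; it does not delete anything and does not address consistency with the upstream source.

```python
from datetime import datetime, timedelta

# Assumed hourly partition id format, e.g. "2023-03-01-14".
HOUR_FORMAT = "%Y-%m-%d-%H"


def partitions_to_prune(partition_ids, now, keep_hours):
    """Return the partition ids older than the retention window, oldest first.

    Hypothetical helper: it only selects candidates and deletes nothing.
    """
    cutoff = now - timedelta(hours=keep_hours)
    stale = []
    for pid in partition_ids:
        try:
            stamp = datetime.strptime(pid, HOUR_FORMAT)
        except ValueError:
            continue  # skip anything that is not an hourly partition id
        if stamp < cutoff:
            stale.append(pid)
    # Zero-padded ids sort chronologically when sorted as strings.
    return sorted(stale)


if __name__ == "__main__":
    ids = ["2023-03-01-00", "2023-03-01-12", "2023-03-02-00", "not-a-partition"]
    print(partitions_to_prune(ids, now=datetime(2023, 3, 2, 6), keep_hours=24))
```

The list of ids would have to come from wherever the partitions are actually stored, and the returned ids would then be handed to whatever mechanism clears or drops them.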

Open to thoughts and discussion on these points.


Operating system used: Mac OS Ventura
