
Disk-based partitioned dataset performance


Dear all,

I am adding hourly partitions to a file-partitioned dataset. Over the past 2.5 months it has grown to about 2.8 million records, and I plan to keep using it this way.

When I started using this approach, building 100,000 records took eight minutes; the build time has since doubled to 16 minutes.

This will not be sustainable in the long run. I do not have the resources to spin up a Spark cluster in the short term. It might be possible to move the file-based store to a Postgres database. Here are some questions for discussion:
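To make the Postgres idea concrete, here is a rough sketch of the pattern I have in mind: one table with a partition-key column, where rebuilding an hourly partition is a delete-then-insert in a single transaction. This is only a sketch — it uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere, and the table and column names are hypothetical:

```python
import sqlite3

# Stand-in for the Postgres table: a single table with a partition-key
# column. sqlite3 is used here only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (partition_hour TEXT, payload TEXT)")

def rebuild_partition(conn, hour, rows):
    """Rebuild one hourly partition: delete the old rows, insert the new."""
    with conn:  # one transaction per partition keeps rebuilds idempotent
        conn.execute("DELETE FROM events WHERE partition_hour = ?", (hour,))
        conn.executemany(
            "INSERT INTO events (partition_hour, payload) VALUES (?, ?)",
            [(hour, p) for p in rows],
        )

rebuild_partition(conn, "2023-06-01-13", ["a", "b"])
rebuild_partition(conn, "2023-06-01-13", ["a", "b", "c"])  # safe to re-run
count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE partition_hour = ?",
    ("2023-06-01-13",),
).fetchone()[0]
```

The appeal is that each hourly build touches only its own rows, so build time should stay flat as the table grows (given an index on the partition column).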

  • What is the best way to manage partitioned data sets?
  • Can a partitioned dataset be stored in a Postgres database instead of on a file system?
  • Partitions are being created to pull data incrementally from our relatively slow external data source. What is the best way to manage such data sources?
    • How do you prune such data sources while maintaining consistency with their upstream counterparts?
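To make the pruning question concrete, here is the kind of retention logic I am considering: drop hourly partitions older than a fixed window, identified by name. The `%Y-%m-%d-%H` naming and the 30-day window are my assumptions, not Dataiku defaults:

```python
from datetime import datetime, timedelta

def partitions_to_prune(partition_ids, now, retention_hours=24 * 30):
    """Return hourly partition ids older than the retention window.

    Partition ids are assumed to follow '%Y-%m-%d-%H' naming, which
    sorts chronologically, so plain string comparison is enough.
    """
    cutoff = now - timedelta(hours=retention_hours)
    keep_from = cutoff.strftime("%Y-%m-%d-%H")
    return sorted(p for p in partition_ids if p < keep_from)

now = datetime(2023, 6, 1, 12)
ids = ["2023-04-01-00", "2023-05-31-23", "2023-06-01-11"]
old = partitions_to_prune(ids, now)
```

The open question for me is the consistency part: if the upstream source can restate old hours, pruned partitions could silently diverge from it.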

I am open to your thoughts and discussion on these points.

Operating system used: 10
