Disk-based partitioned dataset performance
Dear all,
I am adding hourly partitions to a file-partitioned dataset. Over the past 2.5 months it has grown to about 2.8 million records, and I plan to keep using it this way.
When I started using this approach, building 100,000 records took eight minutes. The build time has increased to 16 minutes.
This will not be sustainable in the long run. I do not have the resources to spin up a Spark cluster in the short term. It might be possible to move the file-based store to a Postgres database. Here are some questions for discussion:
- What is the best way to manage partitioned data sets?
- Can a partitioned dataset be stored in a Postgres database instead of on a file system?
- Partitions are being created to pull data incrementally from our relatively slow external data source. What is the best way to manage such data sources?
- How do you prune such data sources while maintaining consistency with their upstream counterparts?
I am open to your thoughts and discussion on these points.
Operating system used: 10
Answers
Alexandru (Dataiker)
Hi @btsmerchandise,
When a dataset is partitioned, especially by date, you would normally build only the previous day or hour. Build time should therefore not increase as long as you always run a single partition containing a similar amount of data. If the build time has grown even though you are always building a single partition, the cause is unlikely to be partitioning itself but something else. If you have job logs from before and now, you can raise a support ticket so we can review further.
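As a minimal sketch of building just the latest hourly partition from outside DSS with the public Python API (the instance URL, API key, project key MYPROJECT, and dataset name hourly_events are all hypothetical placeholders):

```python
import dataikuapi
from datetime import datetime, timedelta

# Hypothetical instance URL and API key -- replace with your own.
client = dataikuapi.DSSClient("https://dss.example.com", "my-api-key")
project = client.get_project("MYPROJECT")  # hypothetical project key

# DSS hour-partition identifiers use the %Y-%m-%d-%H format.
last_hour = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")

# Build only that one partition instead of rebuilding the whole dataset.
job = project.new_job("NON_RECURSIVE_FORCED_BUILD")
job.with_output("hourly_events", partition=last_hour)  # hypothetical dataset name
job.start_and_wait()
```

Run from a scenario or a cron job, a build like this touches a roughly constant amount of data each hour, so its duration should stay flat as the dataset grows.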
If your whole flow is partitioned, or at least partitioned up to a certain point, you only really need to keep the partitions that are being built. You can always clear older partitions in a source dataset, for example along the lines of the sketch below.
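This is one possible cleanup sketch using the same hypothetical names as above; the 30-day retention window is an arbitrary example, and I am assuming clear() accepts a comma-separated partition spec:

```python
import dataikuapi
from datetime import datetime, timedelta

client = dataikuapi.DSSClient("https://dss.example.com", "my-api-key")
dataset = client.get_project("MYPROJECT").get_dataset("hourly_events")

# Keep the last 30 days of hourly partitions and clear the rest.
cutoff = datetime.utcnow() - timedelta(days=30)
old = [p for p in dataset.list_partitions()
       if datetime.strptime(p, "%Y-%m-%d-%H") < cutoff]
if old:
    # Pass the old partition IDs as a comma-separated spec.
    dataset.clear(partitions=",".join(old))
```

Just make sure the partitions you clear are no longer needed by any downstream recipe, or that they remain recoverable from the upstream source if you ever need to rebuild them.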