Disk-based partitioned dataset performance
Dear all,
I am adding hourly partitions to a file-partitioned dataset. Over the past 2.5 months it has grown to about 2.8 million records, and I plan to keep using it this way.
When I started using this approach, building 100,000 records took eight minutes. The build time has increased to 16 minutes.
This will not be sustainable in the long run. I do not have the resources to spin up a Spark cluster in the short term. It might be possible to move the file-based store to a Postgres database. Here are some questions for discussion:
- What is the best way to manage partitioned data sets?
- Can a partitioned dataset be stored in a Postgres database instead of on a file system?
- Partitions are being created to pull data incrementally from our relatively slow external data source. What is the best way to manage such data sources?
- How do you prune such data sources while maintaining consistency with their upstream counterparts?
I am open to your thoughts and discussion on these points.
Operating system used: 10
Answers
Alexandru (Dataiker)
Hi @btsmerchandise,
When a dataset is partitioned, especially by date, you would normally build only the previous day or hour. Build time should therefore not increase as long as you always run a single partition containing a similar amount of data. If the build time has grown even though you are always building a single partition, the cause is unlikely to be partitioning itself but something else. If you have job logs from before and now, you can raise a support ticket so we can review further.
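As a minimal sketch of building just the latest hourly partition from outside DSS with the public Python API (the instance URL, API key, project key MYPROJECT, and dataset name hourly_events are all hypothetical placeholders):

```python
import dataikuapi
from datetime import datetime, timedelta

# Hypothetical instance URL and API key -- replace with your own.
client = dataikuapi.DSSClient("https://dss.example.com", "my-api-key")
project = client.get_project("MYPROJECT")  # hypothetical project key

# DSS hour-partition identifiers use the %Y-%m-%d-%H format.
last_hour = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")

# Build only that one partition instead of rebuilding the whole dataset.
job = project.new_job("NON_RECURSIVE_FORCED_BUILD")
job.with_output("hourly_events", partition=last_hour)  # hypothetical dataset name
job.start_and_wait()
```

Run from a scenario or a cron job, a build like this touches a roughly constant amount of data each hour, so its duration should stay flat as the dataset grows.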
If your whole flow is partitioned, or at least partitioned up to a certain point, you only really need to keep the partitions that are being built. You can always clear older partitions in a source dataset, for example along the lines of the sketch below.
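This is one possible cleanup sketch using the same hypothetical names as above; the 30-day retention window is an arbitrary example, and I am assuming clear() accepts a comma-separated partition spec:

```python
import dataikuapi
from datetime import datetime, timedelta

client = dataikuapi.DSSClient("https://dss.example.com", "my-api-key")
dataset = client.get_project("MYPROJECT").get_dataset("hourly_events")

# Keep the last 30 days of hourly partitions and clear the rest.
cutoff = datetime.utcnow() - timedelta(days=30)
old = [p for p in dataset.list_partitions()
       if datetime.strptime(p, "%Y-%m-%d-%H") < cutoff]
if old:
    # Pass the old partition IDs as a comma-separated spec.
    dataset.clear(partitions=",".join(old))
```

Just make sure the partitions you clear are no longer needed by any downstream recipe, or that they remain recoverable from the upstream source if you ever need to rebuild them.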