Disk-based partitioned dataset performance

btsmerchandise
Level 1

Dear all,

I am adding hourly partitions to a file-based partitioned dataset. Over the past 2.5 months, the dataset has grown to about 2.8 million records, and I plan to keep using it the same way.

When I started using this approach, building 100,000 records took eight minutes. The build time has increased to 16 minutes.

This will not be sustainable in the long run. I do not have the resources to spin up a Spark cluster in the short term. It might be possible to move the file-based store to a Postgres database. Here are some questions for discussion:

  • What is the best way to manage partitioned data sets?
  • Can a partitioned dataset be stored in a Postgres database instead of on the file system?
  • Partitions are being created to pull data incrementally from our relatively slow external data source. What is the best way to manage such data sources?
    • How do you prune such data sources while maintaining consistency with their upstream counterparts?

I am open to your thoughts and discussion on these points.

Operating system used: 10

AlexT
Dataiker

Hi @btsmerchandise ,
When dealing with partitions, especially when the dataset is partitioned by date, you would normally build only the most recent partition (the previous day or hour, for example). Build time should therefore not increase, provided you always build a single partition and each partition holds a similar amount of data. If the build time has grown even though you are always building a single partition, the partitioning itself is unlikely to be the cause. If you have job logs from both before and now, you can raise a support ticket so they can be reviewed further.
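To make "build only the previous hour" concrete, here is a minimal sketch that computes the identifier of the last fully elapsed hour. It assumes hourly partition identifiers of the form `YYYY-MM-DD-HH`; the function name `previous_hour_partition` is hypothetical, so check the format against your own dataset before relying on it.

```python
from datetime import datetime, timedelta


def previous_hour_partition(now: datetime) -> str:
    """Return the identifier of the last fully elapsed hour.

    Assumption: hourly partitions are named YYYY-MM-DD-HH.
    """
    # Truncate to the start of the current hour, then step back one hour.
    last_full_hour = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return last_full_hour.strftime("%Y-%m-%d-%H")


if __name__ == "__main__":
    # Example: a job that runs at 22:19 would target the 21:00 partition.
    print(previous_hour_partition(datetime(2023, 5, 19, 22, 19)))
```

A scheduled build that passes only this identifier touches one partition per run, so its duration should depend on the size of that hour's data rather than on the total size of the dataset.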

If your whole flow is partitioned, or at least partitioned up to a certain point, you only really need to keep the partitions that are still being built. You can always clear older partitions in a source dataset:

Screenshot 2023-05-19 at 10.19.42 PM.png
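If you would rather script the cleanup than do it by hand, the selection step could look like the sketch below. It only decides which partitions fall outside a retention window; the `YYYY-MM-DD-HH` identifier format, the function name `partitions_to_clear`, and the `keep_hours` parameter are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List

# Assumption: hourly partitions are named YYYY-MM-DD-HH.
PARTITION_FORMAT = "%Y-%m-%d-%H"


def partitions_to_clear(partition_ids: List[str], now: datetime, keep_hours: int) -> List[str]:
    """Return the partition identifiers that start before the retention cutoff."""
    cutoff = now - timedelta(hours=keep_hours)
    return sorted(
        pid for pid in partition_ids
        if datetime.strptime(pid, PARTITION_FORMAT) < cutoff
    )


if __name__ == "__main__":
    existing = ["2023-05-19-20", "2023-05-19-21", "2023-05-18-21", "2023-05-17-05"]
    # Keep the last 24 hours, list everything older.
    print(partitions_to_clear(existing, datetime(2023, 5, 19, 22, 0), keep_hours=24))
```

The resulting list can then be handed to whichever clearing mechanism you use. The Python API client appears to expose a `clear(partitions=...)` method on datasets for this, but verify that against the documentation for your version first.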

