I am adding hourly partitions to a file-partitioned dataset. Over the past 2.5 months the dataset has grown to about 2.8 million records, and I plan to keep using it the same way.
When I started using this approach, building 100,000 records took eight minutes. That build time has since doubled to 16 minutes.
This will not be sustainable in the long run. I do not have the resources to spin up a Spark cluster in the short term, though it might be possible to move the file-based store to a Postgres database. Here are some questions for discussion:
I am open to your thoughts and discussion on these points.
Operating system used: 10
Hi @btsmerchandise ,
When a dataset is partitioned, especially by date, you typically build only the previous day or hour. Build time should therefore not increase as long as you always build a single partition containing a similar amount of data. If the build time has grown even though you are always building a single partition, the cause is likely something other than partitioning. If you have job logs from before and now, you can raise a support ticket to review further.
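To illustrate the "build only the previous hour" pattern, here is a minimal sketch that computes the partition identifier for the hour just before a given time. It assumes hourly partition IDs formatted as `%Y-%m-%d-%H` (e.g. `2024-06-01-13`); your actual partition naming may differ.

```python
from datetime import datetime, timedelta

def previous_hour_partition(now=None):
    """Return the partition identifier for the previous hour.

    Assumes hourly partitions named like '2024-06-01-13'
    (strftime format %Y-%m-%d-%H) -- adjust to your layout.
    """
    now = now or datetime.now()
    # Truncate to the top of the current hour, then step back one hour.
    prev = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return prev.strftime("%Y-%m-%d-%H")

# Crossing a day boundary is handled automatically:
print(previous_hour_partition(datetime(2024, 6, 1, 0, 5)))  # → 2024-05-31-23
```

Passing this single partition ID to each hourly build keeps the amount of work per run roughly constant, regardless of how large the dataset grows overall.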
If your whole flow is partitioned (or at least up to a certain point), you only really need to keep the partitions that are being built. You can always clear older partitions in a source dataset:
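As a rough illustration of clearing older partitions, here is a sketch that prunes hourly partition directories older than a retention window. It assumes a hypothetical on-disk layout of `root/<%Y-%m-%d-%H>/` per partition; in practice you would use your platform's own clear-partition action rather than deleting files by hand.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

def clear_old_partitions(root, keep_hours=48, now=None):
    """Delete partition directories older than keep_hours.

    Assumes the hypothetical layout root/<%Y-%m-%d-%H>/ for
    hourly partitions; directories that don't match the pattern
    are left untouched. Returns the names of removed partitions.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(hours=keep_hours)
    removed = []
    for d in Path(root).iterdir():
        if not d.is_dir():
            continue
        try:
            stamp = datetime.strptime(d.name, "%Y-%m-%d-%H")
        except ValueError:
            continue  # not a partition directory; skip it
        if stamp < cutoff:
            shutil.rmtree(d)
            removed.append(d.name)
    return sorted(removed)
```

Run on a schedule (e.g. after each successful build), this keeps the source dataset bounded to the partitions the flow still needs.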