Dear all,
I am adding hourly partitions to a file-partitioned dataset. Over the past 2.5 months it has grown to about 2.8 million records, and I plan to keep using the dataset in the same way.
When I started using this approach, building 100,000 records took eight minutes; the build time has since increased to 16 minutes.
This will not be sustainable in the long run. I do not have the resources to spin up a Spark cluster in the short term, but it might be possible to move the file-based store to a Postgres database. Here are some questions for discussion:
I am open to your thoughts and discussion on these points.
Operating system used: 10
Hi @btsmerchandise ,
When a dataset is partitioned, especially by date, you would normally build only the previous day or hour, and so on. Build time should then stay roughly constant, since each run processes a similar amount of data within a single partition. If the build time has grown even though you are always building a single partition, the cause is unlikely to be the partitioning itself. If you have job logs from before and now, you can raise a support ticket to investigate further.
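As an illustration of the single-partition pattern described above, here is a minimal sketch of how a scheduled job could target only the hour that just finished. The `YYYY-MM-DD-HH` partition id format is an assumption for illustration, not a specific product's API:

```python
from datetime import datetime, timedelta, timezone

def previous_hour_partition(now=None):
    """Return the partition id for the hour that just finished,
    e.g. '2024-05-01-13' for a build running at 14:05 UTC."""
    now = now or datetime.now(timezone.utc)
    prev = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return prev.strftime("%Y-%m-%d-%H")

# A scheduled job would build only this partition, so the work per run
# stays proportional to one hour of data, not the whole 2.8M-record dataset.
partition_id = previous_hour_partition(
    datetime(2024, 5, 1, 14, 5, tzinfo=timezone.utc)
)
# partition_id == '2024-05-01-13'
```

Because each run touches a fixed slice of data, build time no longer grows with the total size of the dataset.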
If your whole flow is partitioned (or at least partitioned up to a point), you only really need to keep the partitions that are still being built. You can always clear older partitions in a source dataset:
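For a file-based store, clearing old partitions can be as simple as deleting directories outside a retention window. A minimal sketch, assuming a hypothetical on-disk layout of one directory per hourly partition named `dt=YYYY-MM-DD-HH`:

```python
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path

def clear_old_partitions(root, keep_hours=72, now=None):
    """Delete partition directories named dt=YYYY-MM-DD-HH that fall
    outside the retention window. Returns the deleted partition ids."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=keep_hours)
    deleted = []
    for part in Path(root).glob("dt=*"):
        try:
            ts = datetime.strptime(part.name, "dt=%Y-%m-%d-%H")
            ts = ts.replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # skip directories that don't match the naming scheme
        if ts < cutoff:
            shutil.rmtree(part)
            deleted.append(part.name)
    return sorted(deleted)
```

Run on a schedule (for example after each successful build), this keeps the source dataset bounded to the retention window rather than growing indefinitely.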