maintaining time series dataset - adding data file every month

vivekkumar · January 2024

We have a use case where we need to maintain monthly data in a Hive table for reporting:

- Every month, a data file is sourced manually

- The data file has a date column (month-end date)

- The requirement is to store the monthly data in a Hive table

- The Hive table should be partitioned by date (month-end date)

- It's like stacking new data onto the Hive table each month

There is also a requirement to occasionally override a month's data if a monthly override data file arrives.

Please suggest a suitable solution.

Alexandru · January 2024

Hi @vivekkumar,
Sounds like you could just use redispatch.

Simply add your new data files to the input dataset and add a sync recipe with "redispatch" mode; the output dataset will then be partitioned by month.

Re-run the sync recipe every month after adding your manually sourced files to the input dataset's list of files. For the input you can either use a folder, or add a new file to (or edit an existing file in) an existing dataset.

https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-redispatch.html
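
To make the idea concrete, here is a minimal plain-Python sketch of what redispatching by month amounts to. It is only an illustration, not the Dataiku API: the function names, the `month_end_date` column name, and the dict standing in for the partitioned table are all hypothetical. It also assumes that writing a partition replaces its previous contents, which is what makes the override case work; check the append/overwrite setting on your actual recipe and output dataset.

```python
from collections import defaultdict
from datetime import date


def month_partition(d: date) -> str:
    """Build a month partition identifier such as '2023-12' from a date."""
    return f"{d.year:04d}-{d.month:02d}"


def redispatch(table: dict, rows: list, date_column: str = "month_end_date") -> list:
    """Route each row to the partition of its month-end date.

    `table` is a dict standing in for the partitioned output
    (partition id -> list of rows). Assumption: every partition that
    appears in `rows` is fully replaced, not appended to.
    Returns the sorted list of partitions that were written.
    """
    incoming = defaultdict(list)
    for row in rows:
        incoming[month_partition(row[date_column])].append(row)

    for partition, partition_rows in incoming.items():
        table[partition] = partition_rows  # overwrite the whole partition

    return sorted(incoming)


# Regular monthly loads: two files, two partitions.
table = {}
redispatch(table, [
    {"month_end_date": date(2023, 11, 30), "value": 10},
    {"month_end_date": date(2023, 12, 31), "value": 20},
])

# An override file for December replaces only that partition.
touched = redispatch(table, [
    {"month_end_date": date(2023, 12, 31), "value": 25},
])
```

After the override run, only the December partition has changed and November is left untouched, which is the "stacking plus occasional override" behaviour asked for in the question.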

Kind Regards,
