Maintaining a time series dataset - adding a data file every month
We have a use case where we need to maintain monthly data in a Hive table for reporting.
- Every month, a data file is sourced manually
- The data file has a column holding the date (month-end date)
- The requirement is to store the monthly data in a Hive table
- The Hive table should be partitioned by date (month-end date)
- It's like stacking each new month's data onto the Hive table
There is also a requirement to occasionally override a month's data if an override data file for that month arrives.
Please suggest a suitable solution.
Answers
Alexandru (Dataiker)
Hi @vivekkumar,
It sounds like you could just use redispatch.
Add your new data files to the input dataset, then add a Sync recipe in "redispatch" mode; the output dataset will be partitioned by month.
Re-run the Sync recipe every month after adding your manually sourced file to the input dataset's list of files. For the input you can either use a folder, or add a file to (or edit a file in) an existing dataset.
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-redispatch.html
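To make the idea concrete, here is a rough conceptual sketch in plain Python of what redispatching by month amounts to. It is not the Dataiku API: the function names `redispatch` and `load_file`, the column name `month_end_date`, and the sample rows are all made up for illustration. It also assumes that each run handles one file and overwrites the partitions it writes, which is how an override file would replace an earlier month.

```python
from collections import defaultdict
from datetime import date


def redispatch(rows, date_column="month_end_date"):
    """Group rows into partitions keyed by 'YYYY-MM', read from date_column.

    Hypothetical helper: mimics the idea of redispatching, not Dataiku's code.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[date_column].strftime("%Y-%m")].append(row)
    return dict(partitions)


def load_file(table, rows):
    """Write one file into the table (assumed semantics: overwrite).

    Every partition present in the file replaces the stored partition of the
    same month; partitions not present in the file are left untouched.
    """
    table.update(redispatch(rows))
    return table


# In-memory stand-in for the partitioned output table: {"YYYY-MM": [rows]}
table = {}

# Made-up sample files, one per month, plus an override for January
jan = [{"month_end_date": date(2024, 1, 31), "value": 10}]
feb = [{"month_end_date": date(2024, 2, 29), "value": 20}]
jan_override = [{"month_end_date": date(2024, 1, 31), "value": 15}]

for monthly_file in (jan, feb, jan_override):
    load_file(table, monthly_file)

print(sorted(table))                 # ['2024-01', '2024-02']
print(table["2024-01"][0]["value"])  # 15
```

The monthly files stack up as separate partitions, and the override only touches the January partition. In the real flow, whether an override replaces or adds to a month depends on how the input files and the recipe are set up, so it is worth checking on a test dataset first.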
Kind Regards,