Add missing dates
I have a dataset of package orders in a warehouse and I have missing dates because there were 0 orders that day, but I want to have all the dates and the count of orders of 0. Does anyone know how to add missing dates?
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,112 Neuron
That is called data densification and there are many ways to do that in Dataiku depending on what technologies you have available. My preference would be to create the date range using pandas in a output only Python recipe:
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE # -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE All_Dates_df = pd.DataFrame(pd.date_range(start='2020-01-01', end='2024-12-31'), columns=['pd_date']) # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE # Write recipe outputs All_Dates = dataiku.Dataset("All_Dates") All_Dates.write_with_schema(All_Dates_df)
Then use a Join recipe to join back this new All_Dates dataset to your dataset using the date column as the joining column and select the join as Outer join to get all matches plus the missing dates from All_Dates. Finally in the Post-join computed columns section of the Join recipe create a new computed column to "merge" the original_date and the All_Dates column into a one column:
if(isNull(original_date), pd_date, original_date)
which you can call "combined_date". Result: you now have a dataset with a combined_date column which densified.