Add missing dates

Santi
Santi Registered Posts: 5

I have a dataset of package orders in a warehouse and I have missing dates because there were 0 orders that day, but I want to have all the dates and the count of orders of 0. Does anyone know how to add missing dates?

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,112 Neuron
    edited July 17

    That is called data densification and there are many ways to do that in Dataiku depending on what technologies you have available. My preference would be to create the date range using pandas in a output only Python recipe:

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    All_Dates_df = pd.DataFrame(pd.date_range(start='2020-01-01', end='2024-12-31'), columns=['pd_date'])
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write recipe outputs
    All_Dates = dataiku.Dataset("All_Dates")
    All_Dates.write_with_schema(All_Dates_df)

    Then use a Join recipe to join back this new All_Dates dataset to your dataset using the date column as the joining column and select the join as Outer join to get all matches plus the missing dates from All_Dates. Finally in the Post-join computed columns section of the Join recipe create a new computed column to "merge" the original_date and the All_Dates column into a one column:

    if(isNull(original_date), pd_date, original_date)

    which you can call "combined_date". Result: you now have a dataset with a combined_date column which densified.

Setup Info
    Tags
      Help me…