How to Handle Missing Data for Seasonal Analysis in Dataiku?

raihanhd
raihanhd Registered Posts: 7 ✭✭✭
edited January 6 in General Discussion

Hi Dataiku Community,

I’m working on a dataset containing daily commodity prices over multiple years (2021-2024). However, there are significant gaps in the data, which are affecting my ability to analyze trends.

Here are the details:

  1. Daily Data
    • The dataset records daily prices for various commodities, but many days are missing.
    • Example :
      Commodity | Date| Price (Rp)
      Example A | 2021-03-17 | 13,000
      Example A | 2021-12-26 | 12,440
      Example A | 2023-02-14 | 12,480
      Example A | 2024-04-18 | 17,030
      • Some commodities have no data at all for certain years (e.g., Example B has no data for 2021-2022).
  2. Yearly Average Data
    • I calculated yearly averages for each commodity to summarize the data.
    • Example:
      Commodity | Year | Average Price (Rp) | Days Counted
      Example A | 2021 | NULL | 365
      Example A | 2022 | NULL | 365
      Example A | 2023 | 13,500 | 365
      Example A | 2024 | NULL | 365
      • Even with yearly averages, some commodities still have null values due to completely missing raw data for certain years.

My Questions:

  1. What is the best method to handle missing seasonal data like this in Dataiku?
  2. Are there any built-in features or plugins in Dataiku that can help fill missing data based on seasonal trends or interpolation?
  3. If interpolation or seasonal averages aren’t feasible due to limited data points, how can I effectively flag and handle these missing values for further analysis?

I’d really appreciate your insights and recommendations. Thanks in advance! 😊

Answers

  • Yasmine_T
    Yasmine_T Registered Posts: 4

    Hi raihanhd,

    I hope that you are doing well.

    Let's cover your questions one by one here.

    What is the best method to handle missing seasonal data like this in Dataiku?

    This would depend on your use case.

    a. If you would like to remove all rows containing missing dates you can use the remove/keep rows where cell is empty processor ( https://doc.dataiku.com/dss/latest/preparation/processors/remove-empty.html ) within the prepare recipe.

    b. See below

    Are there any built-in features or plugins in Dataiku that can help fill missing data based on seasonal trends or interpolation?

    Yes, The time series preparation plugin (https://www.dataiku.com/product/plugins/timeseries-preparation/) should help to fulfill your needs here. (See: https://doc.dataiku.com/dss/latest/time-series/time-series-preparation/index.html )

    Before using time series data for analysis or forecasting, it is often necessary to perform one or more preparation steps on the data:

    • Use the which would parse strings containing dates in any format into the standard ISO 8601 format (yyyy-MM-ddTHH:mm:ss.SSSZ) to work with them in DSS. You can use the Smart Dates Parser to get semi-automatic date parsing with the assistance of DSS.

    Note that this would be a requirement here as we would need the column to be of date type.

    The processor and it's capabilities is covered in: https://doc.dataiku.com/dss/latest/time-series/time-series-preparation/decomposition.html

    If interpolation or seasonal averages aren’t feasible due to limited data points, how can I effectively flag and handle these missing values for further analysis?

    If the above doesn't satisfy your needs could you detail your request here and provide the following information :

    • How are you creating/handling your dataset within the flow?
    • When you report the missing date are you referring dates missing from the original dataset?

    Note: One thing that is important to note here is that in the visual data preparation, all transformation steps that you define are executed on a sample of your dataset (by default ⇐10k rows) and the results are presented to you right away. When you export the processed data or create a recipe to insert your preparation script in your data flow, the whole input dataset is processed. If you would like for the whole dataset to be considered/handled in the visual data preparation you would need to specify this as it might impact the results your are seeing doing the above.

    Let me know if you also need me to dive deeper/cover the information regarding resampling or any of the provided information above in more in details. If that is the case could you please give me more details regarding how you would like to handle missing values so we can look into the best option for you here.

    Best regards,

    Yasmine

  • raihanhd
    raihanhd Registered Posts: 7 ✭✭✭

    Hi yasmin thank you for helping. do you want to check the results that I have done? is it correct or not? or if you want to give suggestions that should be corrected, yes

Setup Info
    Tags
      Help me…