Create a new Date column

Options
Alexé
Alexé Registered Posts: 2 ✭✭✭✭

Hi!

Would you please tell me how can I create a new date column containing values from a specified start to end date in Dataiku?

Thanks

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Alexé

    Welcome to the Dataiku Community.

    I'm not clear what you mean when you say "new date column containing values from a specified start to end date"

    Here are some of my questions:

    • Do you want to concatenate the start and end date into a single column? For display purposes?
    • Do you have the Start and end date already in columns and want to do date arithmetic to calculate age or something like that?

    Can you clarify a bit more what you are trying to do?

    • What data do you currently have?
    • How would you like to transform that data?
    • Once transformed what would the data ideally look like?
  • Alexé
    Alexé Registered Posts: 2 ✭✭✭✭
    Options

    Hi @tgb417
    !

    thank you!

    In fact, I have a start date for instance "2005-01" and an end date "now" and I would like to add a Date column to my dataset containing timestamps from the start date to the end date with format "yyyy-MM".

    2005-01

    2005-02

    2005-03

    ...........

    2021-04

    .

  • StanG
    StanG Dataiker, Registered Posts: 52 Dataiker
    Options

    Hi,
    If you have a column with values corresponding to the number of month after your start date, you can create a date column in a prepare recipe with the Formula processor using the formula inc("2005-01", month_number, "month"). To get a specific date format, you can use the Format date with custom format processor and enter yyyy-MM as the Date format.

    Input:
    month_number
    0
    1
    2

    Output:
    month_number, date
    0, 2005-01
    1, 2005-02
    2, 2005-03

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Alexé
    ,

    @StanG
    answer could be useful to you. However, I have to say, I'm still personally not completely clear what the table you are looking for, looks like.

    In reading what you have said I'm envisioning a table like the one below. Are you wanting to create a table that sort of looks like this?

    Start_DateNew_Date_Column
    2021-01[2021-01, 2021-02,2021-03, 2021-04]
    2021-02[2021-02,2021-03, 2021-04]
    2021-03[2021-03, 2021-04]
    2021-04[2021-04]

    And in May of 2021 when "now" would looks like this 2021-05 the table will be like this?

    Start_DateNew_Date_Column
    2021-01[2021-01, 2021-02,2021-03, 2021-04, 2021-05]
    2021-02[2021-02,2021-03, 2021-04, 2021-05]
    2021-03[2021-03, 2021-04, 2021-05]
    2021-04[2021-04, 2021-05]
    2021-05[2021-05]

    Based on your description this is what I currently understand.

    What also might be helpful is a bit more context about how you plan to use the data.

    Let us know.

  • VitaliyD
    VitaliyD Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 102 Dataiker
    edited July 17
    Options

    Hi Alexé,

    It is still a bit confusing what you are trying to achieve. Let's assume you have a dataset that has a Date column in 'yyyy-MM' format and you don't have all Date values from the start date until today in your dataset. Example below.

    ValueDateValue 2
    12005-011000
    22006-031001
    32007-031002
    42020-011003

    If you want to modify the dataset to include all Dates from the start Date until today (insert missing Dates into dataset), then one of the ways to achieve this is to use the Python Code recipe.

    Screenshot 2021-04-22 at 00.50.28.png

    Below is the code I wrote to complete this task.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd
    from datetime import datetime
    
    # Read recipe inputs
    dates = dataiku.Dataset("input_dataset") #input dataset
    dates_df = dates.get_dataframe()
    
    #sort by date
    sorted_df = dates_df.sort_values(by='Date', ascending=True)
    #get start date
    start_date = sorted_df.iloc[0]['Date']
    
    #generate date range
    year_month_range = pd.date_range(start_date, datetime.today().strftime("%Y-%m"), freq='MS')
    # create DataFrame
    df_year_month_range = pd.DataFrame(year_month_range)
    #rename column to merge two datasets
    df_year_month_range.rename(columns = {0:'Date'}, inplace = True)
    #format date to the same fromat for merging
    df_year_month_range['Date'] = df_year_month_range['Date'].dt.strftime('%Y-%m')
    #merge two DataFrames
    df_merged = pd.merge(df_year_month_range, dates_df, how="left", on=["Date"])
    
    # Write recipe outputs
    dates_added = dataiku.Dataset("output_dataset")
    dates_added.write_with_schema(df_merged)

    The above code reads input dataset, then gets start Date from column Date, then creates date range DataFrame with all Dates from the start Date until today in 'yyyy-MM' format, then merge it with DataFrame created from input dataset, and then saves merged DataFrame into output dataset.

    As the result, you will get a dataset that will include all dates. To better understand what has been done, you can find input and output datasets attached in form of CSV files.

    I hope this will be useful,

    Vitaliy

Setup Info
    Tags
      Help me…