Create a new Date column
Hi!
Would you please tell me how can I create a new date column containing values from a specified start to end date in Dataiku?
Thanks
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Welcome to the Dataiku Community.
I'm not clear what you mean when you say "new date column containing values from a specified start to end date"
Here are some of my questions:
- Do you want to concatenate the start and end date into a single column? For display purposes?
- Do you have the Start and end date already in columns and want to do date arithmetic to calculate age or something like that?
Can you clarify a bit more what you are trying to do?
- What data do you currently have?
- How would you like to transform that data?
- Once transformed what would the data ideally look like?
-
Hi @tgb417
!thank you!
In fact, I have a start date for instance "2005-01" and an end date "now" and I would like to add a Date column to my dataset containing timestamps from the start date to the end date with format "yyyy-MM".
2005-01
2005-02
2005-03
...........
2021-04
.
-
Hi,
If you have a column with values corresponding to the number of month after your start date, you can create a date column in a prepare recipe with the Formula processor using the formula inc("2005-01", month_number, "month"). To get a specific date format, you can use the Format date with custom format processor and enter yyyy-MM as the Date format.
Input:
month_number
0
1
2
Output:
month_number, date
0, 2005-01
1, 2005-02
2, 2005-03 -
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
@Alexé
,@StanG
answer could be useful to you. However, I have to say, I'm still personally not completely clear what the table you are looking for, looks like.In reading what you have said I'm envisioning a table like the one below. Are you wanting to create a table that sort of looks like this?
Start_Date New_Date_Column 2021-01 [2021-01, 2021-02,2021-03, 2021-04] 2021-02 [2021-02,2021-03, 2021-04] 2021-03 [2021-03, 2021-04] 2021-04 [2021-04] And in May of 2021 when "now" would looks like this 2021-05 the table will be like this?
Start_Date New_Date_Column 2021-01 [2021-01, 2021-02,2021-03, 2021-04, 2021-05] 2021-02 [2021-02,2021-03, 2021-04, 2021-05] 2021-03 [2021-03, 2021-04, 2021-05] 2021-04 [2021-04, 2021-05] 2021-05 [2021-05] Based on your description this is what I currently understand.
What also might be helpful is a bit more context about how you plan to use the data.
Let us know.
-
Hi Alexé,
It is still a bit confusing what you are trying to achieve. Let's assume you have a dataset that has a Date column in 'yyyy-MM' format and you don't have all Date values from the start date until today in your dataset. Example below.
Value Date Value 2 1 2005-01 1000 2 2006-03 1001 3 2007-03 1002 4 2020-01 1003 If you want to modify the dataset to include all Dates from the start Date until today (insert missing Dates into dataset), then one of the ways to achieve this is to use the Python Code recipe.
Below is the code I wrote to complete this task.
# -*- coding: utf-8 -*- import dataiku import pandas as pd from datetime import datetime # Read recipe inputs dates = dataiku.Dataset("input_dataset") #input dataset dates_df = dates.get_dataframe() #sort by date sorted_df = dates_df.sort_values(by='Date', ascending=True) #get start date start_date = sorted_df.iloc[0]['Date'] #generate date range year_month_range = pd.date_range(start_date, datetime.today().strftime("%Y-%m"), freq='MS') # create DataFrame df_year_month_range = pd.DataFrame(year_month_range) #rename column to merge two datasets df_year_month_range.rename(columns = {0:'Date'}, inplace = True) #format date to the same fromat for merging df_year_month_range['Date'] = df_year_month_range['Date'].dt.strftime('%Y-%m') #merge two DataFrames df_merged = pd.merge(df_year_month_range, dates_df, how="left", on=["Date"]) # Write recipe outputs dates_added = dataiku.Dataset("output_dataset") dates_added.write_with_schema(df_merged)
The above code reads input dataset, then gets start Date from column Date, then creates date range DataFrame with all Dates from the start Date until today in 'yyyy-MM' format, then merge it with DataFrame created from input dataset, and then saves merged DataFrame into output dataset.
As the result, you will get a dataset that will include all dates. To better understand what has been done, you can find input and output datasets attached in form of CSV files.
I hope this will be useful,
Vitaliy