Collect all months between 2 dates in a string column
Hi,
Each line of my data set contain an subscription with a start and end date.
I want to create another column which contains all the month between these 2 dates in string format with a separator (; for example).
For example if I have start_date = 01/07/2021 and end_date = 18/04/2022.
The new column should contain:
"202107;202108;202109;202110;202111;202112;202201;202202;202203;202204"
Do you have an idea about how to achieve that.
The ultimate goal is to be able to verify, for each given month, if the subscription is active.
The cross-join will be to expensive on the server so I think I can do a formula to search with regular expression to achieve the goal.
Thanks a lot for your help.
Operating system used: Window
Answers
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi @lnguyen
,You can create a python code for this, check below code for your reference
import pandas as pd df = your_dataframe for index, row in df.iterrows(): calculate_dates = pd.date_range(row['start_date'], row['end_date'], freq='D') dates_series = pd.Series(calculate_dates .format()) dates_series_refined = [';'.join(ele.split()) for ele in dates_series] dates_final = ';'.join(dates_series_refined ) df.loc[index, 'dates_generate'] = dates_final
Please let me know if you are looking for something else.
Best Regards,
Shubham Jamwal
-
Hello,
The custom Python processor in a Prepare recipe is probably the best option here.
I have adapted the code from the following Stack Overflow post, which is essentially your requirement:
https://stackoverflow.com/a/34898764
from datetime import datetime, timedelta from collections import OrderedDict def process(row): # In 'cell' mode, the process function must return # a single cell value for each row, # which will be affected to a new column. # The 'row' argument is a dictionary of columns of the row dates = [row['start_date'], row['end_date']] start, end = [datetime.strptime(_, "%d/%m/%Y") for _ in dates] months = OrderedDict(((start + timedelta(_)).strftime(r"%Y%m"), None) for _ in range((end - start).days)).keys() return ';'.join(list(months))
I hope that helps.
-
Hi,
Thanks a lot for your answer,
My start and end dates have a parsed date format so I get this error:
'2020-02-29T23:00:00.000Z' does not match format '%d/%m/%Y'
Can you help me to fix it?
Sorry I don't really master Python.
Thanks a lot for your help.
-
Hi Shubham,
Thanks a lot for your answer.
As I shared in my question about my ultimate goal, I don't know if you can think of a better solution than creating a column containing all the month.
I have 2 tables, One contains subscription numbers with an start and end date. One is just a calendar which contain one month in each line (202104, 202201, etc.). My goal is for each month in the calendar table, calculate the number of active subscription so I will have to, for each month, count distinct subscription numbers which have start month <= that month and (end month > that month or end month is null).
Like I said, the cross join is too expensive and I don't know how to join these 2 tables without any similar key.
Thanks a lot for your help.
-
You can check the format codes here:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
A parsed date would match this format:
"%Y-%m-%dT%H:%M:%S.%fZ"
-
Hi All,
I need a similar solution it is just that I need those dates between my start date and end date in a separate column against that row. Is there anything for it
-
I do not understand what you mean. Could you maybe give an example?