Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Hi,
Each line of my data set contain an subscription with a start and end date.
I want to create another column which contains all the month between these 2 dates in string format with a separator (; for example).
For example if I have start_date = 01/07/2021 and end_date = 18/04/2022.
The new column should contain:
"202107;202108;202109;202110;202111;202112;202201;202202;202203;202204"
Do you have an idea about how to achieve that.
The ultimate goal is to be able to verify, for each given month, if the subscription is active.
The cross-join will be to expensive on the server so I think I can do a formula to search with regular expression to achieve the goal.
Thanks a lot for your help.
Operating system used: Window
Hi @lnguyen ,
You can create a python code for this, check below code for your reference
import pandas as pd
df = your_dataframe
for index, row in df.iterrows():
calculate_dates = pd.date_range(row['start_date'], row['end_date'], freq='D')
dates_series = pd.Series(calculate_dates .format())
dates_series_refined = [';'.join(ele.split()) for ele in dates_series]
dates_final = ';'.join(dates_series_refined )
df.loc[index, 'dates_generate'] = dates_final
Please let me know if you are looking for something else.
Best Regards,
Shubham Jamwal
Hi Shubham,
Thanks a lot for your answer.
As I shared in my question about my ultimate goal, I don't know if you can think of a better solution than creating a column containing all the month.
I have 2 tables, One contains subscription numbers with an start and end date. One is just a calendar which contain one month in each line (202104, 202201, etc.). My goal is for each month in the calendar table, calculate the number of active subscription so I will have to, for each month, count distinct subscription numbers which have start month <= that month and (end month > that month or end month is null).
Like I said, the cross join is too expensive and I don't know how to join these 2 tables without any similar key.
Thanks a lot for your help.
Hello,
The custom Python processor in a Prepare recipe is probably the best option here.
I have adapted the code from the following Stack Overflow post, which is essentially your requirement:
https://stackoverflow.com/a/34898764
from datetime import datetime, timedelta
from collections import OrderedDict
def process(row):
# In 'cell' mode, the process function must return
# a single cell value for each row,
# which will be affected to a new column.
# The 'row' argument is a dictionary of columns of the row
dates = [row['start_date'], row['end_date']]
start, end = [datetime.strptime(_, "%d/%m/%Y") for _ in dates]
months = OrderedDict(((start + timedelta(_)).strftime(r"%Y%m"), None) for _ in range((end - start).days)).keys()
return ';'.join(list(months))
I hope that helps.
Hi,
Thanks a lot for your answer,
My start and end dates have a parsed date format so I get this error:
Can you help me to fix it?
Sorry I don't really master Python.
Thanks a lot for your help.
You can check the format codes here:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
A parsed date would match this format:
"%Y-%m-%dT%H:%M:%S.%fZ"