Submit your use case or success story to the 2023 edition of the Dataiku Frontrunner Awards ENTER YOUR SUBMISSION

Collect all months between 2 dates in a string column

Level 2
Collect all months between 2 dates in a string column


Each line of my data set contain an subscription with a start and end date.

I want to create another column which contains all the month between these 2 dates in string format with a separator (; for example).

For example if I have start_date = 01/07/2021 and end_date = 18/04/2022.

The new column should contain:


Do you have an idea about how to achieve that.

The ultimate goal is to be able to verify, for each given month, if the subscription is active.

The cross-join will be to expensive on the server so I think I can do a formula to search with regular expression to achieve the goal.

Thanks a lot for your help.

Operating system used: Window

0 Kudos
5 Replies

Hi @lnguyen ,


You can create a python code for this, check below code for your reference

import pandas as pd

df = your_dataframe

for index, row in df.iterrows():
    calculate_dates = pd.date_range(row['start_date'], row['end_date'], freq='D')
    dates_series = pd.Series(calculate_dates .format())
    dates_series_refined = [';'.join(ele.split()) for ele in dates_series]
    dates_final = ';'.join(dates_series_refined )
    df.loc[index, 'dates_generate'] = dates_final


Please let me know if you are looking for something else.

Best Regards,

Shubham Jamwal

Level 2

Hi Shubham,

Thanks a lot for your answer.

As I shared in my question about my ultimate goal, I don't know if you can think of a better solution than creating a column containing all the month.

I have 2 tables, One contains subscription numbers with an start and end date. One is just a calendar which contain one month in each line (202104, 202201, etc.). My goal is for each month in the calendar table, calculate the number of active subscription so I will have to, for each month, count distinct subscription numbers which have start month <= that month and (end month > that month or end month is null).

Like I said, the cross join is too expensive and I don't know how to join these 2 tables without any similar key.

Thanks a lot for your help.

0 Kudos


The custom Python processor  in a Prepare recipe is probably the best option here.



I have adapted the code from the following Stack Overflow post, which is essentially your requirement:

from datetime import datetime, timedelta
from collections import OrderedDict

def process(row):
    # In 'cell' mode, the process function must return
    # a single cell value for each row,
    # which will be affected to a new column.
    # The 'row' argument is a dictionary of columns of the row
    dates = [row['start_date'], row['end_date']]
    start, end = [datetime.strptime(_, "%d/%m/%Y") for _ in dates]
    months = OrderedDict(((start + timedelta(_)).strftime(r"%Y%m"), None) for _ in range((end - start).days)).keys()   
    return ';'.join(list(months))

I hope that helps.

Level 2


Thanks a lot for your answer,

My start and end dates have a parsed date format so I get this error:

'2020-02-29T23:00:00.000Z' does not match format '%d/%m/%Y'

Can you help me to fix it?

Sorry I don't really master Python.

Thanks a lot for your help.

0 Kudos

You can check the format codes here:

A parsed date would match this format:



0 Kudos