Want to Stop Rebuilding "Expensive" Parts of your Flow? Explicit Builds are the Answer!READ MORE

Collect all months between 2 dates in a string column

lnguyen
Level 2
Collect all months between 2 dates in a string column

Hi,

Each line of my data set contain an subscription with a start and end date.

I want to create another column which contains all the month between these 2 dates in string format with a separator (; for example).

For example if I have start_date = 01/07/2021 and end_date = 18/04/2022.

The new column should contain:

"202107;202108;202109;202110;202111;202112;202201;202202;202203;202204"

Do you have an idea about how to achieve that.

The ultimate goal is to be able to verify, for each given month, if the subscription is active.

The cross-join will be to expensive on the server so I think I can do a formula to search with regular expression to achieve the goal.

Thanks a lot for your help.


Operating system used: Window

0 Kudos
5 Replies
sj0071992
Neuron
Neuron

Hi @lnguyen ,

 

You can create a python code for this, check below code for your reference

import pandas as pd

df = your_dataframe

for index, row in df.iterrows():
    calculate_dates = pd.date_range(row['start_date'], row['end_date'], freq='D')
    dates_series = pd.Series(calculate_dates .format())
    dates_series_refined = [';'.join(ele.split()) for ele in dates_series]
    dates_final = ';'.join(dates_series_refined )
    df.loc[index, 'dates_generate'] = dates_final

 

Please let me know if you are looking for something else.

Best Regards,

Shubham Jamwal

lnguyen
Level 2
Author

Hi Shubham,

Thanks a lot for your answer.

As I shared in my question about my ultimate goal, I don't know if you can think of a better solution than creating a column containing all the month.

I have 2 tables, One contains subscription numbers with an start and end date. One is just a calendar which contain one month in each line (202104, 202201, etc.). My goal is for each month in the calendar table, calculate the number of active subscription so I will have to, for each month, count distinct subscription numbers which have start month <= that month and (end month > that month or end month is null).

Like I said, the cross join is too expensive and I don't know how to join these 2 tables without any similar key.

Thanks a lot for your help.

0 Kudos
JuanE
Dataiker
Dataiker

Hello,

The custom Python processor  in a Prepare recipe is probably the best option here.

dates.PNG

 

I have adapted the code from the following Stack Overflow post, which is essentially your requirement:

https://stackoverflow.com/a/34898764

from datetime import datetime, timedelta
from collections import OrderedDict

def process(row):
    # In 'cell' mode, the process function must return
    # a single cell value for each row,
    # which will be affected to a new column.
    # The 'row' argument is a dictionary of columns of the row
    dates = [row['start_date'], row['end_date']]
    start, end = [datetime.strptime(_, "%d/%m/%Y") for _ in dates]
    months = OrderedDict(((start + timedelta(_)).strftime(r"%Y%m"), None) for _ in range((end - start).days)).keys()   
    return ';'.join(list(months))

I hope that helps.

lnguyen
Level 2
Author

Hi,

Thanks a lot for your answer,

My start and end dates have a parsed date format so I get this error:

'2020-02-29T23:00:00.000Z' does not match format '%d/%m/%Y'

Can you help me to fix it?

Sorry I don't really master Python.

Thanks a lot for your help.

0 Kudos
JuanE
Dataiker
Dataiker

You can check the format codes here:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

A parsed date would match this format:

"%Y-%m-%dT%H:%M:%S.%fZ"

 

0 Kudos