Collect all months between 2 dates in a string column

lnguyen
lnguyen Registered Posts: 7 ✭✭✭✭

Hi,

Each line of my data set contain an subscription with a start and end date.

I want to create another column which contains all the month between these 2 dates in string format with a separator (; for example).

For example if I have start_date = 01/07/2021 and end_date = 18/04/2022.

The new column should contain:

"202107;202108;202109;202110;202111;202112;202201;202202;202203;202204"

Do you have an idea about how to achieve that.

The ultimate goal is to be able to verify, for each given month, if the subscription is active.

The cross-join will be to expensive on the server so I think I can do a formula to search with regular expression to achieve the goal.

Thanks a lot for your help.


Operating system used: Window

Tagged:

Answers

  • sj0071992
    sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
    edited July 17

    Hi @lnguyen
    ,

    You can create a python code for this, check below code for your reference

    import pandas as pd
    
    df = your_dataframe
    
    for index, row in df.iterrows():
        calculate_dates = pd.date_range(row['start_date'], row['end_date'], freq='D')
        dates_series = pd.Series(calculate_dates .format())
        dates_series_refined = [';'.join(ele.split()) for ele in dates_series]
        dates_final = ';'.join(dates_series_refined )
        df.loc[index, 'dates_generate'] = dates_final

    Please let me know if you are looking for something else.

    Best Regards,

    Shubham Jamwal

  • JuanE
    JuanE Dataiker, Registered Posts: 45 Dataiker
    edited July 17

    Hello,

    The custom Python processor in a Prepare recipe is probably the best option here.

    dates.PNG

    I have adapted the code from the following Stack Overflow post, which is essentially your requirement:

    https://stackoverflow.com/a/34898764

    from datetime import datetime, timedelta
    from collections import OrderedDict
    
    def process(row):
        # In 'cell' mode, the process function must return
        # a single cell value for each row,
        # which will be affected to a new column.
        # The 'row' argument is a dictionary of columns of the row
        dates = [row['start_date'], row['end_date']]
        start, end = [datetime.strptime(_, "%d/%m/%Y") for _ in dates]
        months = OrderedDict(((start + timedelta(_)).strftime(r"%Y%m"), None) for _ in range((end - start).days)).keys()   
        return ';'.join(list(months))

    I hope that helps.

  • lnguyen
    lnguyen Registered Posts: 7 ✭✭✭✭

    Hi,

    Thanks a lot for your answer,

    My start and end dates have a parsed date format so I get this error:

    '2020-02-29T23:00:00.000Z' does not match format '%d/%m/%Y'

    Can you help me to fix it?

    Sorry I don't really master Python.

    Thanks a lot for your help.

  • lnguyen
    lnguyen Registered Posts: 7 ✭✭✭✭

    Hi Shubham,

    Thanks a lot for your answer.

    As I shared in my question about my ultimate goal, I don't know if you can think of a better solution than creating a column containing all the month.

    I have 2 tables, One contains subscription numbers with an start and end date. One is just a calendar which contain one month in each line (202104, 202201, etc.). My goal is for each month in the calendar table, calculate the number of active subscription so I will have to, for each month, count distinct subscription numbers which have start month <= that month and (end month > that month or end month is null).

    Like I said, the cross join is too expensive and I don't know how to join these 2 tables without any similar key.

    Thanks a lot for your help.

  • JuanE
    JuanE Dataiker, Registered Posts: 45 Dataiker
    edited July 17

    You can check the format codes here:

    https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

    A parsed date would match this format:

    "%Y-%m-%dT%H:%M:%S.%fZ"

  • gauravgarg
    gauravgarg Registered Posts: 6 ✭✭✭

    Hi All,

    I need a similar solution it is just that I need those dates between my start date and end date in a separate column against that row. Is there anything for it

  • JuanE
    JuanE Dataiker, Registered Posts: 45 Dataiker

    I do not understand what you mean. Could you maybe give an example?

Setup Info
    Tags
      Help me…