How to truncate a column header

Options
Dataiku_33
Dataiku_33 Registered Posts: 7 ✭✭✭

Hi all,

I'm manipulating forecasts data. They're rolling forecasts over the next 24 month.

Input dataset contains folowwing columns :

Part_number / forecasts / period_number

The latter is between 0 and 23, 0 is forecast for this month, whereas 23 is forecast in 2 years.

I've transformed the period_number in Year + Month (of forecast_period). This gives me 2 columns (Year / Month)

I've now used the pivot function to get the forecasts in columns (1st column is forecast for 10/2021, 2nd column is forecast for 11/2021, etc.).

BUT...

The columns headers are now like "2021_10_forecast_sum" / "2021_11_forecast_sum"...

How can I truncate the column header to get only "2021_10" / "2021_11"...

And ideally, how can I get this back to 10_2021 / 11_2021...

Thanks for your help.

Best regards.

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    Options

    Hi @Dataiku_33
    ,

    If I understand correctly, you have a number of columns that now have the format YYYY_MM_forecast_sum and want to convert all of these into the format MM_YYYY. If that understanding is correct, I think you can use the regex feature of the "rename" option within a dataset to bulk change your column names back to the MM_YYYY format.

    Here's an example:

    To start, I have a sample schema that is similar to your current schema:
    Screen Shot 2021-10-28 at 2.28.52 PM.png

    1. From the dataset Settings tab you can select all columns and then click on the Actions menu > Rename:

    Screen Shot 2021-10-28 at 2.29.00 PM.png

    From there you'll see several options, and can select the "Apply a regular expression" to alter the column names by regular expression.
    2. Enter in an expression like the following "(\d\d\d\d)_(\d\d)(_forecast_sum)". This example will match the YYYY in the first group (represented by $1) and MM in the second group (represented by $2). The third group will match _forecast_sum, and I'll discard this group. In my "Replace by" field, I'll swap the order of the month and year, to then get MM_YYYY in my final output:

    Screen Shot 2021-10-28 at 2.32.47 PM.png

    After clicking "Ok" on the above, these are my final columns. If you make a mistake you can simply not click "Save" and then restart.

    Screen Shot 2021-10-28 at 2.33.12 PM.png

    Please let me know if you have any questions about this!

    Thank you,
    Sarina 

  • Dataiku_33
    Dataiku_33 Registered Posts: 7 ✭✭✭
    Options

    Thanks for the tip!

    I'm going to test this and come back to you to share the result.

  • Dataiku_33
    Dataiku_33 Registered Posts: 7 ✭✭✭
    Options

    Hi,

    it partially worked! I had forgotten to mention one point indeed.

    For months from january to september, the input format is not MM, but simply M instead. Thus I still have YYYY_M_forecast_sum for these months.

    Any tip to have a single "Regular expression", which would apply to both M and MM month format?

    An additional question:

    When I get back to the "schema" in the dataset setting, it comes back to default settings and the "Regular expression" has disappeared.

    Considering my dataset is a rolling forecast, does it means I need to set the "Regular expression" back each month (as a one shot solution) or will this be set "defintively"?

    Thanks for your help.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Dataiku_33

    As your regular expression search you might try:

    "(\d\d\d\d)_(\d+)(_forecast_sum)"

    The + here represents any number of the previous character in this case \d a digit.

    or even better.

    "(\d\d\d\d)_(\d{1,2})(_forecast_sum)"

    {1,2} says take the previous \d (digit) and find it exactly 1 or two times.

    The replace with in the dialog box would stay unchanged.

    Hope this might help.

    There are sites on the internet like https://regex101.com/ that can help with figuring these regular expressions which can be a bit tricky.

    In the most recent versions of DSS 9.0.4+, in some places where Regex is used there is a bit of a Regex helper. This is super nice.

  • Dataiku_33
    Dataiku_33 Registered Posts: 7 ✭✭✭
    Options

    Hi,

    It works perfectly as expected! Thanks a lot!

    Still one question, is it a "one shot" solution or will it be applied automatically every month upon each new batch of input data, since these are "rolling forecasts" ?

    Thanks for your help.

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    edited July 17
    Options

    Hi @Dataiku_33
    ,

    Unfortunately there isn't a way to automate the method we discussed. I think the easiest way to automate this would be through a scenario custom Python step that runs on a regular schema to ensure that your dataset columns are updated as expected.

    You can use a similar regex structure as @tgb417
    provided, and do something like this in your :

    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd
    import re
    
    mydataset = dataiku.Dataset("YOUR_DATASET")
    
    # get the dataset schema
    schema = mydataset.read_schema()
    
    # set up the regex to match the column names to
    matching = re.compile("(\d{4})_(\d{1,2})(_forecast_sum)")
    updated_schema = []
    
    # for each column, if it matches the pattern, switch the column name to MM_YYYY, otherwise leave column as is
    for col in schema:
        col_name = col['name']
        col_match = re.match(matching, col_name)
        if len(list(col_match.groups())) > 1:
            col['name'] = col_match[2] + "_" + col_match[1]
    
    # finally, update the column schema
    mydataset.write_schema(schema)

    Let me know if you have any questions about this approach.

    Thanks,
    Sarina

  • Dataiku_33
    Dataiku_33 Registered Posts: 7 ✭✭✭
    Options

    Hi,

    Many thanks for this new tip. I'm gonna try this python recipe.

    Thanks a lot!

Setup Info
    Tags
      Help me…