How to truncate a column header

Dataiku_33
Level 2
How to truncate a column header

Hi all,

 

I'm manipulating forecasts data. They're rolling forecasts over the next 24 month.

Input dataset contains folowwing columns :

Part_number / forecasts / period_number

The latter is between 0 and 23, 0 is forecast for this month, whereas 23 is forecast in 2 years.

I've transformed the period_number in Year + Month (of forecast_period). This gives me 2 columns (Year / Month)

I've now used the pivot function to get the forecasts in columns (1st column is forecast for 10/2021, 2nd column is forecast for 11/2021, etc.).

BUT...

The columns headers are now like "2021_10_forecast_sum" / "2021_11_forecast_sum"...

How can I truncate the column header to get only "2021_10" / "2021_11"...

And ideally, how can I get this back to 10_2021 / 11_2021...

 

Thanks for your help.

Best regards.

 

 

 

 

 

 

0 Kudos
7 Replies
SarinaS
Dataiker

Hi @Dataiku_33,

If I understand correctly, you have a number of columns that now have the format YYYY_MM_forecast_sum and want to convert all of these into the format MM_YYYY. If that understanding is correct, I think you can use the regex feature of the "rename" option within a dataset to bulk change your column names back to the MM_YYYY format. 

Here's an example:

To start, I have a sample schema that is similar to your current schema:
Screen Shot 2021-10-28 at 2.28.52 PM.png

1. From the dataset Settings tab you can select all columns and then click on the Actions menu > Rename:

Screen Shot 2021-10-28 at 2.29.00 PM.png

From there you'll see several options, and can select the "Apply a regular expression" to alter the column names by regular expression. 
2. Enter in an expression like the following "(\d\d\d\d)_(\d\d)(_forecast_sum)". This example will match the YYYY in the first group (represented by $1) and MM in the second group (represented by $2). The third group will match _forecast_sum, and I'll discard this group. In my "Replace by" field, I'll swap the order of the month and year, to then get MM_YYYY in my final output: 

Screen Shot 2021-10-28 at 2.32.47 PM.png

After clicking "Ok" on the above, these are my final columns. If you make a mistake you can simply not click "Save" and then restart.  

Screen Shot 2021-10-28 at 2.33.12 PM.png

Please let me know if you have any questions about this!

Thank you,
Sarinaโ€ƒ

โ€ƒ

0 Kudos
Dataiku_33
Level 2
Author

Thanks for the tip!

I'm going to test this and come back to you to share the result.

Dataiku_33
Level 2
Author

Hi, 

it partially worked! I had forgotten to mention one point indeed.

For months from january to september, the input format is not MM, but simply M instead. Thus I still have YYYY_M_forecast_sum for these months.

Any tip to have a single "Regular expression", which would apply to both M and MM month format?

An additional question:

When I get back to the "schema" in the dataset setting, it comes back to default settings and the "Regular expression" has disappeared.

Considering my dataset is a rolling forecast, does it means I need to set the "Regular expression" back each month (as a one shot solution) or will this be set "defintively"?

Thanks for your help.

 

0 Kudos
tgb417

@Dataiku_33 
As your regular expression search you might try:

"(\d\d\d\d)_(\d+)(_forecast_sum)"

The + here represents any number of the previous character in this case \d a digit.

or even better.

"(\d\d\d\d)_(\d{1,2})(_forecast_sum)"

{1,2} says take the previous \d (digit) and find it exactly 1 or two times. 

The replace with in the dialog box would stay unchanged.

Hope this might help. 

There are sites on the internet like https://regex101.com/ that can help with figuring these regular expressions which can be a bit tricky.

In the most recent versions of DSS 9.0.4+, in some places where Regex is used there is a bit of a Regex helper.  This is super nice.  

--Tom
0 Kudos
Dataiku_33
Level 2
Author

Hi,

It works perfectly as expected! Thanks a lot!

Still one question, is it a "one shot" solution or will it be applied automatically every month upon each new batch of input data, since these are "rolling forecasts" ?

Thanks for your help.

0 Kudos
SarinaS
Dataiker

Hi @Dataiku_33,

Unfortunately there isn't a way to automate the method we discussed. I think the easiest way to automate this would be through a scenario custom Python step that runs on a regular schema to ensure that your dataset columns are updated as expected. 

You can use a similar regex structure as @tgb417 provided, and do something like this in your :

 

import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import re

mydataset = dataiku.Dataset("YOUR_DATASET")

# get the dataset schema
schema = mydataset.read_schema()

# set up the regex to match the column names to
matching = re.compile("(\d{4})_(\d{1,2})(_forecast_sum)")
updated_schema = []

# for each column, if it matches the pattern, switch the column name to MM_YYYY, otherwise leave column as is
for col in schema:
    col_name = col['name']
    col_match = re.match(matching, col_name)
    if len(list(col_match.groups())) > 1:
        col['name'] = col_match[2] + "_" + col_match[1]

# finally, update the column schema
mydataset.write_schema(schema)

 

Let me know if you have any questions about this approach. 

Thanks,
Sarina

0 Kudos
Dataiku_33
Level 2
Author

Hi,

Many thanks for this new tip. I'm gonna try this python recipe.

Thanks a lot!