How to truncate a column header
Hi all,
I'm manipulating forecasts data. They're rolling forecasts over the next 24 month.
Input dataset contains folowwing columns :
Part_number / forecasts / period_number
The latter is between 0 and 23, 0 is forecast for this month, whereas 23 is forecast in 2 years.
I've transformed the period_number in Year + Month (of forecast_period). This gives me 2 columns (Year / Month)
I've now used the pivot function to get the forecasts in columns (1st column is forecast for 10/2021, 2nd column is forecast for 11/2021, etc.).
BUT...
The columns headers are now like "2021_10_forecast_sum" / "2021_11_forecast_sum"...
How can I truncate the column header to get only "2021_10" / "2021_11"...
And ideally, how can I get this back to 10_2021 / 11_2021...
Thanks for your help.
Best regards.
Answers
-
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @Dataiku_33
,
If I understand correctly, you have a number of columns that now have the format YYYY_MM_forecast_sum and want to convert all of these into the format MM_YYYY. If that understanding is correct, I think you can use the regex feature of the "rename" option within a dataset to bulk change your column names back to the MM_YYYY format.Here's an example:
To start, I have a sample schema that is similar to your current schema:
1. From the dataset Settings tab you can select all columns and then click on the Actions menu > Rename:From there you'll see several options, and can select the "Apply a regular expression" to alter the column names by regular expression.
2. Enter in an expression like the following "(\d\d\d\d)_(\d\d)(_forecast_sum)". This example will match the YYYY in the first group (represented by $1) and MM in the second group (represented by $2). The third group will match _forecast_sum, and I'll discard this group. In my "Replace by" field, I'll swap the order of the month and year, to then get MM_YYYY in my final output:After clicking "Ok" on the above, these are my final columns. If you make a mistake you can simply not click "Save" and then restart.
Please let me know if you have any questions about this!
Thank you,
Sarina -
Thanks for the tip!
I'm going to test this and come back to you to share the result.
-
Hi,
it partially worked! I had forgotten to mention one point indeed.
For months from january to september, the input format is not MM, but simply M instead. Thus I still have YYYY_M_forecast_sum for these months.
Any tip to have a single "Regular expression", which would apply to both M and MM month format?
An additional question:
When I get back to the "schema" in the dataset setting, it comes back to default settings and the "Regular expression" has disappeared.
Considering my dataset is a rolling forecast, does it means I need to set the "Regular expression" back each month (as a one shot solution) or will this be set "defintively"?
Thanks for your help.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
@Dataiku_33
As your regular expression search you might try:"(\d\d\d\d)_(\d+)(_forecast_sum)"
The + here represents any number of the previous character in this case \d a digit.
or even better.
"(\d\d\d\d)_(\d{1,2})(_forecast_sum)"
{1,2} says take the previous \d (digit) and find it exactly 1 or two times.
The replace with in the dialog box would stay unchanged.
Hope this might help.
There are sites on the internet like https://regex101.com/ that can help with figuring these regular expressions which can be a bit tricky.In the most recent versions of DSS 9.0.4+, in some places where Regex is used there is a bit of a Regex helper. This is super nice.
-
Hi,
It works perfectly as expected! Thanks a lot!
Still one question, is it a "one shot" solution or will it be applied automatically every month upon each new batch of input data, since these are "rolling forecasts" ?
Thanks for your help.
-
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @Dataiku_33
,
Unfortunately there isn't a way to automate the method we discussed. I think the easiest way to automate this would be through a scenario custom Python step that runs on a regular schema to ensure that your dataset columns are updated as expected.
You can use a similar regex structure as @tgb417
provided, and do something like this in your :import dataiku from dataiku import pandasutils as pdu import pandas as pd import re mydataset = dataiku.Dataset("YOUR_DATASET") # get the dataset schema schema = mydataset.read_schema() # set up the regex to match the column names to matching = re.compile("(\d{4})_(\d{1,2})(_forecast_sum)") updated_schema = [] # for each column, if it matches the pattern, switch the column name to MM_YYYY, otherwise leave column as is for col in schema: col_name = col['name'] col_match = re.match(matching, col_name) if len(list(col_match.groups())) > 1: col['name'] = col_match[2] + "_" + col_match[1] # finally, update the column schema mydataset.write_schema(schema)
Let me know if you have any questions about this approach.
Thanks,
Sarina -
Hi,
Many thanks for this new tip. I'm gonna try this python recipe.
Thanks a lot!