Community Conundrum 25: Feature Visualization is now live! Read More

Creating a (categorical) column based on another column values?

Level 3
Creating a (categorical) column based on another column values?

Hi, 

I'm trying to create a column based on another column values. I'm aware I could easily achieve this by the builtin processor through the following steps:

- copy/duplicate the existing column,

- and do find-and-replace on the new created column according to the previous column values.

But I wish to do this with the formula processor instead, with the "if-statement" approach.  Perhaps I may illustrate them like the following :

existing_columnnew_column
BMW1
Toyota2
Honda3

 

I'd like to categorize the manufacturer brand on left column to values on the right column. So far, I've been doing it the new_column formula creation like the following:

if (existing_column == "BMW", "1", 
else
    existing_column == "Toyota", "2",
else
   existing_column == "Honda", "3",
end)

But the code didn't work, could anyone point me to a somewhat similar examples? Been looking for it for awhile, and any help would be greatly appreciated.

 

Thanks,

GLN

 

 

6 Replies
Dataiker
Dataiker

Hi @gerryleonugroho,

In your case, you need to use nested if-statements.

Here would be the correct usage for your example.

if(existing_column == "BMW", 1, 
if(existing_column == "Toyota", 2,
if(existing_column == "Honda", 3,"")))

 Have a nice day!

Level 3
Author

Hi @dimitri, thanks for the help, the solution worked out great! Though I'am just wondering, would there be a more "elegant" way to solve this? Assuming I have a little over then a hundred list of car manufacturers to match, since that would create a lot of nested closing brackets at the end.

Thanks,

GLN

Dataiker
Dataiker

Hi @gerryleonugroho,

An alternative way to achieve that in a more convenient manner when you have a lot of values would be to use a join recipe with a "reference table".

join recipe.png

The "reference table" could be an editable dataset that you create in DSS, with two columns, one containing all the values of the original column, and the second containing all the corresponding values.

editable dataset.png

 

Then create a join recipe (using either a left or a right join depending on the order of the original dataset), and you will be able to generate a new dataset having a new column containing the matching values.

recipe details.png

 

result.png

 

Have a nice day!

Level 3
Author

Hi @dimitri,

This is getting interesting, especially with the second step of having a reference table, where you can directly create in DSS, without having to upload them in the first place and have them edited/updated according to your requirements.

I didn't even know the feature exist, do you mind showing the steps to create one? I haven't been able to find them in the documentation/academy sections of the Dataiku's site. While for the join recipe, it's something I had in mind previously, it just that I took a different route to accomplish the journey.

 

Cheers,

GLN

Dataiker
Dataiker

Hi @gerryleonugroho,

You can create an editable dataset right from the flow by clicking on "+ dataset" then "editable".

new editable dataset.png

 

Select whether you want to start from scratch, or whether you want to prepopulate it with existing values from a file or another dataset, set its name and then click on "Create".

create editable dataset.png

 

Once created, when you open this dataset there is a new tab "Edit", from which you can edit the columns and the values. (Right-click on an existing column to insert a new one.)

edit dataset.png

 

Have a nice day!

Level 3
Author

Hi @dimitri,

This is looking good! Thanks for all the explanation, really appreciate all the efforts. You too, have a nice day, and please be safe during the pandemic. 

🙂

 

Cheers,

GLN

Labels (2)