Normalization of Columns by using Group By

fbakis · ‎10-20-2021

Hello,

I am trying to do normalization of columns in Dataiku. In my dataset I have a categoric value, I want to do normalization of columns according to each group. Can anyone help me?

Thank you in advance.

Fatma

tgb417 · ‎10-20-2021

@fbakis ,

Welcome to the Dataiku community.

I’m not clear that I understand your question. Given that you are working with a categorical variable. Let’s say for a particular group the variable has the following values A, B, C, A, A, B, C, C, C, D in each of ten rows in this example group.

What value would you expect for the rows within this group? The minimum of the values as A, or the most common C. Or the Maximun value D. Or some other kind of transform.

If I’m completely of target related to the scenario you are considering. (Which I think could be very likely.) Could you provide the community an example or what source data might look like and your expected results after the transformation?

Looking forward to hearing more. And others please jump in if I don’t see the response.

--Tom

fbakis · ‎10-21-2021

Let me clarify the situation better.

Let's say I have two columns one being categorical like A,A,B,B,B,B,C,C,C and other is a numerical column like 10,22,9,3,5,12,10,2,14. My goal is to normalize the values in the second column but I want to normalize the values between 0-1 range considering their categories. A's normalize within themselves, B's within themselves and so on. Desirable output would be something like: 0,1,0.66,0,0.22,1,0.66.0.1 I know Dataiku can normalize a column as a whole but I'm not sure how to do the thing I want. Hope this example made the situation clear.

Thanks in advance.

Fatma

tgb417 · ‎10-23-2021

@fbakis ,

Good to hear from you again.

I'm wondering if your data is stored in a SQL server. I know that I run a PostgreSQL server on my local laptop and servers for similar use cases. The idea is to push of to the SQL server some of the more database-like functions.

One of the possible ways to use a window visual recipe, on a table stored in SQL and create a custom aggregation written in SQL.

I've not specifically tried the following. However, I found this chunk of SQL that seems to do what you want on StackOverflow.

SELECT a,b,
    1.0 * c / CASE WHEN SUM(c) OVER(PARTITION BY a,b) = 0 THEN 1
                   ELSE SUM(c) OVER(PARTITION BY a,b) END AS c
FROM tab

I'm wondering if it could be adapted into a custom aggregation of a windowed visual recipe. Here is another conversation in the community that is related.

If you don't have access to a SQL server, you could likely do this is python or R in a notebook, or in Python in a visual recipe step.

Good luck, let us know how you get along.

--Tom

ruben80mdq · ‎06-09-2022

Hi @fbakis ,

I was wondering if you managed to find a solution as I am having the same need ATM and have not find a solution yet.

I have built a Colour Table in Dataiku and need to develop a normalised column (for each data category so that I can then use it within the colour variable of the heatmap as noted in the DSS documentation https://doc.dataiku.com/dss/latest/visualization/charts-tables.html)

Any assistance will be uch appreciated.

Thanks in advance!

Ruben

Sign up to take part

Normalization of Columns by using Group By

Normalization of Columns by using Group By