I am trying to do normalization of columns in Dataiku. In my dataset I have a categoric value, I want to do normalization of columns according to each group. Can anyone help me?
Thank you in advance.
Welcome to the Dataiku community.
I’m not clear that I understand your question. Given that you are working with a categorical variable. Let’s say for a particular group the variable has the following values A, B, C, A, A, B, C, C, C, D in each of ten rows in this example group.
What value would you expect for the rows within this group? The minimum of the values as A, or the most common C. Or the Maximun value D. Or some other kind of transform.
If I’m completely of target related to the scenario you are considering. (Which I think could be very likely.) Could you provide the community an example or what source data might look like and your expected results after the transformation?
Looking forward to hearing more. And others please jump in if I don’t see the response.
Let me clarify the situation better.
Let's say I have two columns one being categorical like A,A,B,B,B,B,C,C,C and other is a numerical column like 10,22,9,3,5,12,10,2,14. My goal is to normalize the values in the second column but I want to normalize the values between 0-1 range considering their categories. A's normalize within themselves, B's within themselves and so on. Desirable output would be something like: 0,1,0.66,0,0.22,1,0.66.0.1 I know Dataiku can normalize a column as a whole but I'm not sure how to do the thing I want. Hope this example made the situation clear.
Thanks in advance.
Good to hear from you again.
I'm wondering if your data is stored in a SQL server. I know that I run a PostgreSQL server on my local laptop and servers for similar use cases. The idea is to push of to the SQL server some of the more database-like functions.
One of the possible ways to use a window visual recipe, on a table stored in SQL and create a custom aggregation written in SQL.
I've not specifically tried the following. However, I found this chunk of SQL that seems to do what you want on StackOverflow.
SELECT a,b, 1.0 * c / CASE WHEN SUM(c) OVER(PARTITION BY a,b) = 0 THEN 1 ELSE SUM(c) OVER(PARTITION BY a,b) END AS c FROM tab
I'm wondering if it could be adapted into a custom aggregation of a windowed visual recipe. Here is another conversation in the community that is related.
If you don't have access to a SQL server, you could likely do this is python or R in a notebook, or in Python in a visual recipe step.
Good luck, let us know how you get along.