Data transformation
Hi All,
This is a mock data of my input :
Input | |||
Rule | Account | Entity | Audit |
R1 | a1,a2 | e1 | |
R2 | e2 | -c2 |
This has to be split and transformed to this form :
Rule | Dimension | Value |
R1 | Account | a1 |
R1 | Account | a2 |
R1 | Entity | e1 |
R2 | Entity | e2 |
R2 | Audit | -c2 |
Then have to search a databse, say "db1" and find the levels :
Rule | Dimension | Value | Level |
R1 | Account | a1 | L2 |
R1 | Account | a2 | L3 |
R1 | Entity | e1 | L1 |
R2 | Entity | e2 | L2 |
R2 | Audit | -c2 | L3 |
Then for the final output, data should look like :
Table_Dim_Account | Rule | L1 | L2 | L3 | Exclude |
R1 | a1 | N | |||
R1 | a2 | N | |||
R2 | N | ||||
Table_Dim_Entity | Rule | L1 | L2 | L3 | Exclude |
R1 | e1 | N | |||
R2 | e2 | N | |||
Table_Dim_Audit | Rule | L1 | L2 | L3 | Exclude |
R1 | c2 | Y | |||
R1 | N |
Exclude column is just a yes or no..if there is any negative value, exclude is set to Y or else N
how can we achieve the solution? open to both python recipe solution as well as visual recipe solutions
Thanks in advance
Operating system used: windows
Operating system used: windows
Answers
-
@shahas71
How many unique column values will you have? Second question is, are they fixed and known in advance or are they dynamic? -
Hi @info-rchitect
,Not quite sure if i get your question right. But the column, for eg, account might have 2 or 3 values eperated by a ',' delimiter. that has to be split into seperate rows. these values will be given to us & just have to find the level from seperate databse & transform it into the final form
-
Hi @shahas71
I would assume you don't know when CSV data will show up vs a single value in each of the three columns. This is valid Snowflake SQL that will accomplish the task:
create or replace temporary table mytable(rule varchar(50), account varchar(50), entity varchar(50), audit varchar(50)); insert into mytable values ('R1', 'a1,a2', 'e1', null), ('R2', null, 'e2', '-c2'); with account_data as ( select distinct rule, split(account, ',') as account_array from mytable ), account_data_flattened as ( select rule, 'Account' as dimension, to_char(a1.value) as value from account_data, table(flatten(account_data.account_array)) a1 ), entity_data as ( select distinct rule, split(entity, ',') as entity_array from mytable ), entity_data_flattened as ( select rule, 'Entity' as dimension, to_char(a1.value) as value from entity_data, table(flatten(entity_data.entity_array)) a1 ), audit_data as ( select distinct rule, split(audit, ',') as audit_array from mytable ), audit_data_flattened as ( select rule, 'Audit' as dimension, to_char(a1.value) as value from audit_data, table(flatten(audit_data.audit_array)) a1 ) select * from account_data_flattened union all select * from entity_data_flattened union all select * from audit_data_flattened;
-
I take that back, this is much better:
with unpivoted as ( select * from mytable unpivot(value for dimension in (Account, Entity, Audit)) ), formatted as ( select rule, dimension, split(value, ',') as value_array from unpivoted ) select rule, dimension, to_char(a1.value) as value from formatted, table(flatten(formatted.value_array)) a1;
-
and even better:
with unpivoted as ( select * from mytable unpivot(value for dimension in (Account, Entity, Audit)) ) select rule, dimension, records.value::varchar as value from unpivoted, lateral flatten (input => split(value, ','), outer => true) records;
-
@info-rchitect
Thanks for the reply.let me make it more clear
Input Rule Account Entity Audit R1 a1,a2 e1 R2 e2 -c2 from this table, we have to transform the table into this form first:
Rule Account Entity Audit R1 a1 e1 R1 a2 e1 R2 e2 -c2 after that, the table needs to expanded. all the columns, except Rule (this is just a mock data, in reality many columns will be present) will become under a new columns called Dimension, ie, dimension will contain account, entity, audit etc. The values a1,a2,e1 etc will come under new column named as value. so the output has to be like this :
Rule Dimension Value R1 Account a1 R1 Account a2 R1 Entity e1 R2 Entity e2 R2 Audit -c2 Hope its clear
-
@shahas71
So the middle table will actually be used and people/processes will select data from it? Also, do you only want these manipulations done in Python and not on the database itself? We can use Python to manipulate the database if needed (what I typically prefer).One more question: Why dos the middle table have to exist before the final table? It is much easier to use the SQL I posted above to make the final table first and then make the middle table from the final table.