Survey banner
The Dataiku Community is moving to a new home! Some short term disruption starting next week: LEARN MORE

Data transformation

shahas71
Level 2
Data transformation

Hi All,

This is a mock data of my input :

Input   
RuleAccountEntityAudit
R1a1,a2e1 
R2 e2-c2

 

This has to be split and transformed to this form :

RuleDimensionValue
R1Accounta1
R1Accounta2
R1Entitye1
R2Entitye2
R2Audit-c2

 

Then have to search a databse, say "db1" and find the levels :

RuleDimensionValueLevel
R1Accounta1L2
R1Accounta2L3
R1Entitye1L1
R2Entitye2L2
R2Audit-c2L3

 

Then for the final output, data should look like :

Table_Dim_AccountRuleL1L2L3Exclude
 R1 a1 N
 R1  a2N
 R2   N
      
Table_Dim_EntityRuleL1L2L3Exclude
 R1e1  N
 R2 e2 N
      
Table_Dim_AuditRuleL1L2L3Exclude
 R1  c2Y
 R1   N

 

Exclude column is just a yes or no..if there is any negative value, exclude is set to Y or else N

how can we achieve the solution? open to both python recipe solution as well as visual recipe solutions

Thanks in advance


Operating system used: windows


Operating system used: windows

0 Kudos
7 Replies
info-rchitect
Level 6

@shahas71 How many unique column values will you have?  Second question is, are they fixed and known in advance or are they dynamic?

0 Kudos
shahas71
Level 2
Author

Hi @info-rchitect ,

Not quite sure if i get your question right. But the column, for eg, account might have 2 or 3 values eperated by a ',' delimiter. that has to be split into seperate rows. these values will be given to us & just have to find the level from seperate databse & transform it into the final form

0 Kudos
info-rchitect
Level 6

Hi @shahas71

I would assume you don't know when CSV data will show up vs a single value in each of the three columns.  This is valid Snowflake SQL that will accomplish the task:

 

create or replace temporary table mytable(rule varchar(50), account varchar(50), entity varchar(50), audit varchar(50));

insert into mytable values
    ('R1', 'a1,a2', 'e1', null),
    ('R2', null, 'e2', '-c2');
    
with account_data as (
    select distinct rule,
        split(account, ',') as account_array
    from mytable
),
account_data_flattened as (
    select rule,
        'Account' as dimension,
        to_char(a1.value) as value
    from account_data, table(flatten(account_data.account_array)) a1
),
entity_data as (
    select distinct rule, split(entity, ',') as entity_array
    from mytable 
),
entity_data_flattened as (
    select rule,
        'Entity' as dimension,
        to_char(a1.value) as value
    from entity_data, table(flatten(entity_data.entity_array)) a1
),
audit_data as (
    select distinct rule, split(audit, ',') as audit_array
    from mytable 
),
audit_data_flattened as (
    select rule,
        'Audit' as dimension,
        to_char(a1.value) as value
    from audit_data, table(flatten(audit_data.audit_array)) a1
)
select * from account_data_flattened
union all
select * from entity_data_flattened
union all
select * from audit_data_flattened;
0 Kudos
info-rchitect
Level 6

I take that back, this is much better:

 

with unpivoted as (
    select * from mytable
    unpivot(value for dimension in (Account, Entity, Audit))
),
formatted as (
    select rule,
        dimension,
        split(value, ',') as value_array
    from unpivoted
)
select rule,
    dimension,
    to_char(a1.value) as value
from formatted, table(flatten(formatted.value_array)) a1;
0 Kudos
info-rchitect
Level 6

and even better:

with unpivoted as (
    select * from mytable
    unpivot(value for dimension in (Account, Entity, Audit))
)
select rule, dimension, records.value::varchar as value
from unpivoted, lateral flatten (input => split(value, ','), outer => true) records;
0 Kudos
shahas71
Level 2
Author

@info-rchitect Thanks for the reply.

let me make it more clear

Input   
RuleAccountEntityAudit
R1a1,a2e1 
R2 e2-c2

 

from this table, we have to transform the table into this form first:

RuleAccountEntityAudit
R1a1e1 
R1a2e1 
R2 e2-c2
    
    

 

after that, the table needs to expanded. all the columns, except Rule (this is just a mock data, in reality many columns will be present) will become under a new columns called Dimension, ie, dimension will contain account, entity, audit etc. The values a1,a2,e1 etc will come under new column named as value. so the output has to be like this :

RuleDimensionValue
R1Accounta1
R1Accounta2
R1Entitye1
R2Entitye2
R2Audit-c2

 

Hope its clear

0 Kudos
info-rchitect
Level 6

@shahas71 So the middle table will actually be used and people/processes will select data from it?  Also, do you only want these manipulations done in Python and not on the database itself?  We can use Python to manipulate the database if needed (what I typically prefer).

 

One more question:  Why dos the middle table have to exist before the final table?  It is much easier to use the SQL I posted above to make the final table first and then make the middle table from the final table.

0 Kudos

Labels

?

Setup info

?
A banner prompting to get Dataiku