Data transformation

shahas71
shahas71 Dataiku DSS Core Designer, Registered Posts: 12

Hi All,

This is a mock data of my input :

Input
RuleAccountEntityAudit
R1a1,a2e1
R2 e2-c2

This has to be split and transformed to this form :

RuleDimensionValue
R1Accounta1
R1Accounta2
R1Entitye1
R2Entitye2
R2Audit-c2

Then have to search a databse, say "db1" and find the levels :

RuleDimensionValueLevel
R1Accounta1L2
R1Accounta2L3
R1Entitye1L1
R2Entitye2L2
R2Audit-c2L3

Then for the final output, data should look like :

Table_Dim_AccountRuleL1L2L3Exclude
R1 a1 N
R1 a2N
R2 N
Table_Dim_EntityRuleL1L2L3Exclude
R1e1 N
R2 e2 N
Table_Dim_AuditRuleL1L2L3Exclude
R1 c2Y
R1 N

Exclude column is just a yes or no..if there is any negative value, exclude is set to Y or else N

how can we achieve the solution? open to both python recipe solution as well as visual recipe solutions

Thanks in advance


Operating system used: windows


Operating system used: windows

Answers

  • info-rchitect
    info-rchitect Registered Posts: 180 ✭✭✭✭✭✭

    @shahas71
    How many unique column values will you have? Second question is, are they fixed and known in advance or are they dynamic?

  • shahas71
    shahas71 Dataiku DSS Core Designer, Registered Posts: 12

    Hi @info-rchitect
    ,

    Not quite sure if i get your question right. But the column, for eg, account might have 2 or 3 values eperated by a ',' delimiter. that has to be split into seperate rows. these values will be given to us & just have to find the level from seperate databse & transform it into the final form

  • info-rchitect
    info-rchitect Registered Posts: 180 ✭✭✭✭✭✭
    edited July 17

    Hi @shahas71

    I would assume you don't know when CSV data will show up vs a single value in each of the three columns. This is valid Snowflake SQL that will accomplish the task:

    create or replace temporary table mytable(rule varchar(50), account varchar(50), entity varchar(50), audit varchar(50));
    
    insert into mytable values
        ('R1', 'a1,a2', 'e1', null),
        ('R2', null, 'e2', '-c2');
        
    with account_data as (
        select distinct rule,
            split(account, ',') as account_array
        from mytable
    ),
    account_data_flattened as (
        select rule,
            'Account' as dimension,
            to_char(a1.value) as value
        from account_data, table(flatten(account_data.account_array)) a1
    ),
    entity_data as (
        select distinct rule, split(entity, ',') as entity_array
        from mytable 
    ),
    entity_data_flattened as (
        select rule,
            'Entity' as dimension,
            to_char(a1.value) as value
        from entity_data, table(flatten(entity_data.entity_array)) a1
    ),
    audit_data as (
        select distinct rule, split(audit, ',') as audit_array
        from mytable 
    ),
    audit_data_flattened as (
        select rule,
            'Audit' as dimension,
            to_char(a1.value) as value
        from audit_data, table(flatten(audit_data.audit_array)) a1
    )
    select * from account_data_flattened
    union all
    select * from entity_data_flattened
    union all
    select * from audit_data_flattened;
  • info-rchitect
    info-rchitect Registered Posts: 180 ✭✭✭✭✭✭
    edited July 17

    I take that back, this is much better:

    with unpivoted as (
        select * from mytable
        unpivot(value for dimension in (Account, Entity, Audit))
    ),
    formatted as (
        select rule,
            dimension,
            split(value, ',') as value_array
        from unpivoted
    )
    select rule,
        dimension,
        to_char(a1.value) as value
    from formatted, table(flatten(formatted.value_array)) a1;
  • info-rchitect
    info-rchitect Registered Posts: 180 ✭✭✭✭✭✭
    edited July 17

    and even better:

    with unpivoted as (
        select * from mytable
        unpivot(value for dimension in (Account, Entity, Audit))
    )
    select rule, dimension, records.value::varchar as value
    from unpivoted, lateral flatten (input => split(value, ','), outer => true) records;
  • shahas71
    shahas71 Dataiku DSS Core Designer, Registered Posts: 12

    @info-rchitect
    Thanks for the reply.

    let me make it more clear

    Input
    RuleAccountEntityAudit
    R1a1,a2e1
    R2 e2-c2

    from this table, we have to transform the table into this form first:

    RuleAccountEntityAudit
    R1a1e1
    R1a2e1
    R2 e2-c2

    after that, the table needs to expanded. all the columns, except Rule (this is just a mock data, in reality many columns will be present) will become under a new columns called Dimension, ie, dimension will contain account, entity, audit etc. The values a1,a2,e1 etc will come under new column named as value. so the output has to be like this :

    RuleDimensionValue
    R1Accounta1
    R1Accounta2
    R1Entitye1
    R2Entitye2
    R2Audit-c2

    Hope its clear

  • info-rchitect
    info-rchitect Registered Posts: 180 ✭✭✭✭✭✭

    @shahas71
    So the middle table will actually be used and people/processes will select data from it? Also, do you only want these manipulations done in Python and not on the database itself? We can use Python to manipulate the database if needed (what I typically prefer).

    One more question: Why dos the middle table have to exist before the final table? It is much easier to use the SQL I posted above to make the final table first and then make the middle table from the final table.

Setup Info
    Tags
      Help me…