Rolling up columns into JSON column# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy

Registered Posts: 7

Operating system used: Apple


Operating system used: AppleHi there

First post, relative newbie to Dataiku and trying to get my head around Python.

[Before you ask!I know that I can do this as a recipe, but I am trying to upskill to Python ]

So...I have a dataset called "TEMP1" which I want to create a new JSON column from 3 of the columns.

I can't do it in a loop as it's too slow.

So the output would be

Col1, Col2, Col3, Col4, Col5, Col6, JSON of Col1/2/3.

I'm sure I'm on the right lines, but struggling as it's erroring as the JSON isn't columnar data.

If anyone can straighten me out, this will be a huge help.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TABLE1 = dataiku.Dataset("77504_TABLE1")
TABLE1_df = TABLE1.get_dataframe()

TEMP1_df = TABLE1_df[["COL1", "COL2", "COL3"]]
TEMP1_df['TEMP1_df'] = df.to_json(orient='records', lines=True).splitlines()

# Write recipe outputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1.write_with_schema(df)

Welcome!

It looks like you're new here. Sign in or register to get started.

Best Answer

  • Registered Posts: 7
    edited July 2024 Answer ✓

    Had to change it to:

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    TEMP1 = dataiku.Dataset("TEMP1")
    TEMP1_df = TEMP1.get_dataframe()
    
    # Compute recipe outputs from inputs
    #TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
    TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)
    
    # Write recipe outputs
    TEMP2 = dataiku.Dataset("TEMP2")
    TEMP2.write_with_schema(TEMP1_df)

    but works like a dream.

    Thank you - you've opened up a whole new world to me

Answers

  • Registered Posts: 7

    By the way, there's an error in the code. 77504_TABLE1 should read TABLE1.
    My browser isn't letting me update the code.

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,253 Dataiker
    edited July 2024

    Hi Martin,

    Can you try to use something like this instead?

    df['d'] = df.apply(lambda row: json.dumps(dict(row)), axis=1)


    If that doesn't work, can you share the exact error you see?

  • Registered Posts: 7
    edited July 2024

    Thanks Alex

    Still not there. My own lack of knowledge.

    I've rewritten the code inside a redacted version so this is EXACTLY what I've got in Dataiku.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    # Read recipe inputs
    TEMP1 = dataiku.Dataset("TEMP1")
    TEMP1_df = TEMP1.get_dataframe()
    
    # Compute recipe outputs from inputs
    TEMP2_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
    
    # Write recipe outputs
    TEMP2 = dataiku.Dataset("TEMP2")
    TEMP2.write_with_schema(TEMP2_df)

    The error becomes:


    Job failed: Error in Python process: At line 15: <class 'NameError'>: name 'TEMP2_df' is not defined

    Thanks again
    Martin

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,253 Dataiker
    edited July 2024

    Hi Martin,

    I think you need to do this instead, you never defined TEMP2_df you can't use the column 'd' from it.


    TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
    
    
    When you write you can just write :
    
    TEMP2.write_with_schema(TEMP1_df)
    

  • Registered Posts: 7

    Hi Alex

    Thank you so much for this. It runs without error! One final problem is that I need to "JSON up" Col1, Col2, Col3, but leave Col4 as is.

    This wraps up all 4 columns.

    So close...

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,253 Dataiker

    YOu can just specify which columns to json dump :

    TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'a': row['a'], 'b': row['b'], 'c': row['c']}), axis=1)

    Hope that helps!

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.