Rolling up columns into JSON column# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy

Martin
Martin Registered Posts: 7

Operating system used: Apple


Operating system used: AppleHi there

First post, relative newbie to Dataiku and trying to get my head around Python.

[Before you ask!I know that I can do this as a recipe, but I am trying to upskill to Python ]

So...I have a dataset called "TEMP1" which I want to create a new JSON column from 3 of the columns.

I can't do it in a loop as it's too slow.

So the output would be

Col1, Col2, Col3, Col4, Col5, Col6, JSON of Col1/2/3.

I'm sure I'm on the right lines, but struggling as it's erroring as the JSON isn't columnar data.

If anyone can straighten me out, this will be a huge help.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TABLE1 = dataiku.Dataset("77504_TABLE1")
TABLE1_df = TABLE1.get_dataframe()

TEMP1_df = TABLE1_df[["COL1", "COL2", "COL3"]]
TEMP1_df['TEMP1_df'] = df.to_json(orient='records', lines=True).splitlines()

# Write recipe outputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1.write_with_schema(df)

Best Answer

  • Martin
    Martin Registered Posts: 7
    edited July 17 Answer ✓

    Had to change it to:

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    TEMP1 = dataiku.Dataset("TEMP1")
    TEMP1_df = TEMP1.get_dataframe()
    
    # Compute recipe outputs from inputs
    #TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
    TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)
    
    # Write recipe outputs
    TEMP2 = dataiku.Dataset("TEMP2")
    TEMP2.write_with_schema(TEMP1_df)

    but works like a dream.

    Thank you - you've opened up a whole new world to me

Answers

  • Martin
    Martin Registered Posts: 7

    By the way, there's an error in the code. 77504_TABLE1 should read TABLE1.
    My browser isn't letting me update the code.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
    edited July 17

    Hi Martin,

    Can you try to use something like this instead?

    df['d'] = df.apply(lambda row: json.dumps(dict(row)), axis=1)


    If that doesn't work, can you share the exact error you see?

  • Martin
    Martin Registered Posts: 7
    edited July 17

    Thanks Alex

    Still not there. My own lack of knowledge.

    I've rewritten the code inside a redacted version so this is EXACTLY what I've got in Dataiku.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    # Read recipe inputs
    TEMP1 = dataiku.Dataset("TEMP1")
    TEMP1_df = TEMP1.get_dataframe()
    
    # Compute recipe outputs from inputs
    TEMP2_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
    
    # Write recipe outputs
    TEMP2 = dataiku.Dataset("TEMP2")
    TEMP2.write_with_schema(TEMP2_df)

    The error becomes:


    Job failed: Error in Python process: At line 15: <class 'NameError'>: name 'TEMP2_df' is not defined

    Thanks again
    Martin

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
    edited July 17

    Hi Martin,

    I think you need to do this instead, you never defined TEMP2_df you can't use the column 'd' from it.


    TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
    
    
    When you write you can just write :
    
    TEMP2.write_with_schema(TEMP1_df)
    
    

  • Martin
    Martin Registered Posts: 7

    Hi Alex

    Thank you so much for this. It runs without error! One final problem is that I need to "JSON up" Col1, Col2, Col3, but leave Col4 as is.

    This wraps up all 4 columns.

    So close...

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker

    YOu can just specify which columns to json dump :

    TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'a': row['a'], 'b': row['b'], 'c': row['c']}), axis=1)

    Hope that helps!

Setup Info
    Tags
      Help me…