Rolling up columns into JSON column# -- coding: utf-8 -- import dataiku import pandas as pd, numpy

Martin · March 2023

Operating system used: Apple

Operating system used: AppleHi there

First post, relative newbie to Dataiku and trying to get my head around Python.

[Before you ask!I know that I can do this as a recipe, but I am trying to upskill to Python ]

So...I have a dataset called "TEMP1" which I want to create a new JSON column from 3 of the columns.

I can't do it in a loop as it's too slow.

So the output would be

Col1, Col2, Col3, Col4, Col5, Col6, JSON of Col1/2/3.

I'm sure I'm on the right lines, but struggling as it's erroring as the JSON isn't columnar data.

If anyone can straighten me out, this will be a huge help.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TABLE1 = dataiku.Dataset("77504_TABLE1")
TABLE1_df = TABLE1.get_dataframe()

TEMP1_df = TABLE1_df[["COL1", "COL2", "COL3"]]
TEMP1_df['TEMP1_df'] = df.to_json(orient='records', lines=True).splitlines()

# Write recipe outputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1.write_with_schema(df)

Martin · March 2023

Had to change it to:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()

# Compute recipe outputs from inputs
#TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)

# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP1_df)

but works like a dream.

Thank you - you've opened up a whole new world to me

Martin · March 2023

By the way, there's an error in the code. 77504_TABLE1 should read TABLE1.
My browser isn't letting me update the code.

Alexandru · March 2023

Hi Martin,

Can you try to use something like this instead?

df['d'] = df.apply(lambda row: json.dumps(dict(row)), axis=1)

If that doesn't work, can you share the exact error you see?

Martin · March 2023

Thanks Alex

Still not there. My own lack of knowledge.

I've rewritten the code inside a redacted version so this is EXACTLY what I've got in Dataiku.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()

# Compute recipe outputs from inputs
TEMP2_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)

# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP2_df)

The error becomes:

Job failed: Error in Python process: At line 15: <class 'NameError'>: name 'TEMP2_df' is not defined

Thanks again
Martin

Alexandru · March 2023

Hi Martin,

I think you need to do this instead, you never defined TEMP2_df you can't use the column 'd' from it.

TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)


When you write you can just write :

TEMP2.write_with_schema(TEMP1_df)

Martin · March 2023

Hi Alex

Thank you so much for this. It runs without error! One final problem is that I need to "JSON up" Col1, Col2, Col3, but leave Col4 as is.

This wraps up all 4 columns.

So close...

Alexandru · March 2023

YOu can just specify which columns to json dump :

TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'a': row['a'], 'b': row['b'], 'c': row['c']}), axis=1)

Hope that helps!

Rolling up columns into JSON column# -- coding: utf-8 -- import dataiku import pandas as pd, numpy

Tags

Welcome!

Best Answer

Answers

Welcome!

Welcome!

Quick Links

Categories

Sign up to take part

Rolling up columns into JSON column# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy

Tags

Welcome!

Best Answer

Answers

Welcome!

Welcome!

Quick Links

Categories

Rolling up columns into JSON column# -- coding: utf-8 -- import dataiku import pandas as pd, numpy