Submit your innovative use case or inspiring success story to the 2023 Dataiku Frontrunner Awards! LET'S GO

Rolling up columns into JSON column# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy

Solved!
Martin
Level 2
Rolling up columns into JSON column# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy

Operating system used: Apple


Operating system used: AppleHi there

First post, relative newbie to Dataiku and trying to get my head around Python.

[Before you ask!I know that I can do this as a recipe, but I am trying to upskill to Python πŸ™‚ ]

So...I have a dataset called "TEMP1" which I want to  create a new JSON column from 3 of the columns.

I can't do it in a loop as it's too slow.

So the output would be

Col1, Col2, Col3, Col4, Col5, Col6, JSON of Col1/2/3.

I'm sure I'm on the right lines, but struggling as it's erroring as the JSON isn't columnar data.

If anyone can straighten me out, this will be a huge help.

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TABLE1 = dataiku.Dataset("77504_TABLE1")
TABLE1_df = TABLE1.get_dataframe()

TEMP1_df = TABLE1_df[["COL1", "COL2", "COL3"]]
TEMP1_df['TEMP1_df'] = df.to_json(orient='records', lines=True).splitlines()

# Write recipe outputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1.write_with_schema(df)

 

 

0 Kudos
1 Solution
Martin
Level 2
Author

Had to change it to:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()

# Compute recipe outputs from inputs
#TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)

# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP1_df)

but works like a dream. 

Thank you - you've opened up a whole new world to me πŸ™‚

View solution in original post

0 Kudos
7 Replies
Martin
Level 2
Author

By the way, there's an error in the code.  77504_TABLE1 should read TABLE1.
My browser isn't letting me update the code.

0 Kudos
AlexT
Dataiker

Hi Martin,

Can you try to use something like this instead? 

df['d'] = df.apply(lambda row: json.dumps(dict(row)), axis=1)

 
If that doesn't work, can you share the exact error you see? 

0 Kudos
Martin
Level 2
Author

Thanks Alex

Still not there.  My own lack of knowledge.

I've rewritten the code inside a redacted version so this is EXACTLY what I've got in Dataiku.

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()

# Compute recipe outputs from inputs
TEMP2_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)

# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP2_df)

 

The error becomes:


Job failed: Error in Python process: At line 15: <class 'NameError'>: name 'TEMP2_df' is not defined

Thanks again
Martin πŸ™‚

0 Kudos
AlexT
Dataiker

Hi Martin,

I think you need to do this instead, you never defined TEMP2_df you can't use the column 'd' from it.


TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)


When you write you can just write :

TEMP2.write_with_schema(TEMP1_df)

 

0 Kudos
Martin
Level 2
Author

Hi Alex

Thank you so much for this. It runs without error! πŸ™‚

One final problem is that I need to "JSON up" Col1, Col2, Col3, but leave Col4 as is.

This wraps up all 4 columns.

So close...

0 Kudos
AlexT
Dataiker

YOu can just specify which columns to json dump : 

TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'a': row['a'], 'b': row['b'], 'c': row['c']}), axis=1)

Hope that helps!

0 Kudos
Martin
Level 2
Author

Had to change it to:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()

# Compute recipe outputs from inputs
#TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)

# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP1_df)

but works like a dream. 

Thank you - you've opened up a whole new world to me πŸ™‚

0 Kudos

Labels

?

Setup info

?
A banner prompting to get Dataiku