Rolling up columns into JSON column# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy
Operating system used: Apple
Operating system used: AppleHi there
First post, relative newbie to Dataiku and trying to get my head around Python.
[Before you ask!I know that I can do this as a recipe, but I am trying to upskill to Python
So...I have a dataset called "TEMP1" which I want to create a new JSON column from 3 of the columns.
I can't do it in a loop as it's too slow.
So the output would be
Col1, Col2, Col3, Col4, Col5, Col6, JSON of Col1/2/3.
I'm sure I'm on the right lines, but struggling as it's erroring as the JSON isn't columnar data.
If anyone can straighten me out, this will be a huge help.
# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs TABLE1 = dataiku.Dataset("77504_TABLE1") TABLE1_df = TABLE1.get_dataframe() TEMP1_df = TABLE1_df[["COL1", "COL2", "COL3"]] TEMP1_df['TEMP1_df'] = df.to_json(orient='records', lines=True).splitlines() # Write recipe outputs TEMP1 = dataiku.Dataset("TEMP1") TEMP1.write_with_schema(df)
Best Answer
-
Had to change it to:
# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs TEMP1 = dataiku.Dataset("TEMP1") TEMP1_df = TEMP1.get_dataframe() # Compute recipe outputs from inputs #TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1) TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1) # Write recipe outputs TEMP2 = dataiku.Dataset("TEMP2") TEMP2.write_with_schema(TEMP1_df)
but works like a dream.
Thank you - you've opened up a whole new world to me
Answers
-
By the way, there's an error in the code. 77504_TABLE1 should read TABLE1.
My browser isn't letting me update the code. -
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi Martin,
Can you try to use something like this instead?df['d'] = df.apply(lambda row: json.dumps(dict(row)), axis=1)
If that doesn't work, can you share the exact error you see? -
Thanks Alex
Still not there. My own lack of knowledge.
I've rewritten the code inside a redacted version so this is EXACTLY what I've got in Dataiku.# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs TEMP1 = dataiku.Dataset("TEMP1") TEMP1_df = TEMP1.get_dataframe() # Compute recipe outputs from inputs TEMP2_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1) # Write recipe outputs TEMP2 = dataiku.Dataset("TEMP2") TEMP2.write_with_schema(TEMP2_df)
The error becomes:
Job failed: Error in Python process: At line 15: <class 'NameError'>: name 'TEMP2_df' is not definedThanks again
Martin -
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi Martin,
I think you need to do this instead, you never defined TEMP2_df you can't use the column 'd' from it.TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1) When you write you can just write : TEMP2.write_with_schema(TEMP1_df)
-
Hi Alex
Thank you so much for this. It runs without error!
One final problem is that I need to "JSON up" Col1, Col2, Col3, but leave Col4 as is. This wraps up all 4 columns.
So close...
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
YOu can just specify which columns to json dump :
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'a': row['a'], 'b': row['b'], 'c': row['c']}), axis=1)
Hope that helps!