Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Operating system used: Apple
Operating system used: AppleHi there
First post, relative newbie to Dataiku and trying to get my head around Python.
[Before you ask!I know that I can do this as a recipe, but I am trying to upskill to Python 🙂 ]
So...I have a dataset called "TEMP1" which I want to create a new JSON column from 3 of the columns.
I can't do it in a loop as it's too slow.
So the output would be
Col1, Col2, Col3, Col4, Col5, Col6, JSON of Col1/2/3.
I'm sure I'm on the right lines, but struggling as it's erroring as the JSON isn't columnar data.
If anyone can straighten me out, this will be a huge help.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
TABLE1 = dataiku.Dataset("77504_TABLE1")
TABLE1_df = TABLE1.get_dataframe()
TEMP1_df = TABLE1_df[["COL1", "COL2", "COL3"]]
TEMP1_df['TEMP1_df'] = df.to_json(orient='records', lines=True).splitlines()
# Write recipe outputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1.write_with_schema(df)
Had to change it to:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()
# Compute recipe outputs from inputs
#TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)
# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP1_df)
but works like a dream.
Thank you - you've opened up a whole new world to me 🙂
By the way, there's an error in the code. 77504_TABLE1 should read TABLE1.
My browser isn't letting me update the code.
Hi Martin,
Can you try to use something like this instead?
df['d'] = df.apply(lambda row: json.dumps(dict(row)), axis=1)
If that doesn't work, can you share the exact error you see?
Thanks Alex
Still not there. My own lack of knowledge.
I've rewritten the code inside a redacted version so this is EXACTLY what I've got in Dataiku.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()
# Compute recipe outputs from inputs
TEMP2_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP2_df)
The error becomes:
Job failed: Error in Python process: At line 15: <class 'NameError'>: name 'TEMP2_df' is not defined
Thanks again
Martin 🙂
Hi Martin,
I think you need to do this instead, you never defined TEMP2_df you can't use the column 'd' from it.
TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
When you write you can just write :
TEMP2.write_with_schema(TEMP1_df)
Hi Alex
Thank you so much for this. It runs without error! 🙂
One final problem is that I need to "JSON up" Col1, Col2, Col3, but leave Col4 as is.
This wraps up all 4 columns.
So close...
YOu can just specify which columns to json dump :
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'a': row['a'], 'b': row['b'], 'c': row['c']}), axis=1)
Hope that helps!
Had to change it to:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
TEMP1 = dataiku.Dataset("TEMP1")
TEMP1_df = TEMP1.get_dataframe()
# Compute recipe outputs from inputs
#TEMP1_df['d'] = TEMP1_df.apply(lambda row: json.dumps(dict(row)), axis=1)
TEMP1_df['json_abc'] = TEMP1_df.apply(lambda row: json.dumps({'COLUMN1': row['COLUMN1'], 'COLUMN2': row['COLUMN2'], 'COLUMN4': row['COLUMN4']}), axis=1)
# Write recipe outputs
TEMP2 = dataiku.Dataset("TEMP2")
TEMP2.write_with_schema(TEMP1_df)
but works like a dream.
Thank you - you've opened up a whole new world to me 🙂