Type after Python recipe
Hello all,
I have an issue in DSS. A dataset whose column types I forced in Settings > Schema is the input of a Python recipe.
I have tried the different suggestions from the developer guide, but I'm lost because I get an error each time.
I forced the type of the column "MASTER_ID", which is the reference of the object, to int. At this first level everything is OK.
I added a Python recipe to rename many columns in the dataset and to remove "\n". I used "infer_with_pandas = False" to keep the types, and also tried "write_from_dataframe".
I'm lost now because it never works. Here is the Python code:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
tmp_LAYER0_PCDB_DLOBJECT = dataiku.Dataset("tmp_LAYER0_PCDB_DLOBJECT")
tmp_LAYER0_PCDB_DLOBJECT_df = tmp_LAYER0_PCDB_DLOBJECT.get_dataframe(infer_with_pandas=True, bool_as_str=True)
#infer_with_pandas = True
tmp_FORMAT_DLOBJECT_df = tmp_LAYER0_PCDB_DLOBJECT_df # For this sample code, simply copy input to output
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: MARKDOWN
tmp_LAYER0_PCDB_DLOBJECT_df['MASTER_ID'] = tmp_LAYER0_PCDB_DLOBJECT_df['MASTER_ID'].fillna(0)
#tmp_FORMAT_DLOBJECT_df = tmp_FORMAT_DLOBJECT_df.fillna('0')
#tmp_FORMAT_DLOBJECT_df['ITERATION'] = tmp_FORMAT_DLOBJECT_df['ITERATION'].fillna('0')
#Fix "ID" columns to integer format
#tmp_FORMAT_DLOBJECT_df["ID"] = tmp_FORMAT_DLOBJECT_df["ID"].astype(int)
#Fix "REFERENCE" columns to string format
#tmp_FORMAT_DLOBJECT_df["REFERENCE"] = tmp_FORMAT_DLOBJECT_df["REFERENCE"].astype(str)
#Fix "ITERATION" columns to integer format
#tmp_FORMAT_DLOBJECT_df["ITERATION"] = tmp_FORMAT_DLOBJECT_df["ITERATION"].astype(int)
#Fix "REVISION" columns to integer format
#tmp_FORMAT_DLOBJECT_df["REVISION"] = tmp_FORMAT_DLOBJECT_df["REVISION"].astype(int)
#Replace newline characters with ""
tmp_FORMAT_DLOBJECT_df = tmp_FORMAT_DLOBJECT_df.applymap(lambda x: x.replace('\n', '') if isinstance(x, str) else x)
schema = [{'MASTER_ID': 'MASTER_ID', 'type': 'int'}]
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
tmp_FORMAT_DLOBJECT = dataiku.Dataset("tmp_FORMAT_DLOBJECT")
tmp_FORMAT_DLOBJECT.write_from_dataframe(tmp_FORMAT_DLOBJECT_df)
The type is not kept after the build.
Do you have a process or a solution? Thanks ;)
DSS V12
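One possible cause worth checking, sketched here with plain pandas outside DSS: a column that contains missing values cannot be stored as a NumPy integer column, so pandas holds it as float64, and a direct cast to int fails until the gaps are filled. The data below is made up for illustration; only the column name MASTER_ID comes from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name comes from the post.
df = pd.DataFrame({"MASTER_ID": [101.0, np.nan, 103.0]})

# A missing value forces the column to float64.
dtype_with_gaps = str(df["MASTER_ID"].dtype)

# Casting straight to int fails while NaN is present.
try:
    df["MASTER_ID"].astype("int64")
    direct_cast_failed = False
except ValueError:
    direct_cast_failed = True

# Option 1: replace missing values with a sentinel (0 here), then cast.
filled = df["MASTER_ID"].fillna(0).astype("int64")

# Option 2: pandas' nullable integer dtype keeps the gap as <NA>.
nullable = df["MASTER_ID"].astype("Int64")

print(dtype_with_gaps, direct_cast_failed, filled.tolist(), nullable.isna().sum())
```

Option 1 changes the data (missing IDs become 0), so it only makes sense if 0 is never a real ID. How DSS maps the nullable Int64 dtype on write is not covered in this thread, so treat option 2 as something to test rather than a known fix.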
Best Answer
-
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
tmp_LAYER0_PCDB_DLOBJECT = dataiku.Dataset("tmp_LAYER0_PCDB_DLOBJECT")
tmp_LAYER0_PCDB_DLOBJECT_df = tmp_LAYER0_PCDB_DLOBJECT.get_dataframe(infer_with_pandas=True, bool_as_str=True)
#infer_with_pandas = True
# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.
tmp_FORMAT_DLOBJECT_df = tmp_LAYER0_PCDB_DLOBJECT_df # For this sample code, simply copy input to output
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
tmp_FORMAT_DLOBJECT_df['MASTER_ID'] = tmp_FORMAT_DLOBJECT_df['MASTER_ID'].fillna(0)
tmp_FORMAT_DLOBJECT_df['MASTER_ID'] = tmp_FORMAT_DLOBJECT_df['MASTER_ID'].astype(int)
tmp_FORMAT_DLOBJECT_df['REFERENCE'] = tmp_FORMAT_DLOBJECT_df['REFERENCE'].astype(str)
tmp_FORMAT_DLOBJECT_df['PREV_REV_ID'] = tmp_FORMAT_DLOBJECT_df['PREV_REV_ID'].fillna('0')
tmp_FORMAT_DLOBJECT_df['PREV_REV_ID'] = tmp_FORMAT_DLOBJECT_df['PREV_REV_ID'].astype(int)
tmp_FORMAT_DLOBJECT_df['ISLASTREV'] = tmp_FORMAT_DLOBJECT_df['ISLASTREV'].fillna('0')
tmp_FORMAT_DLOBJECT_df['ISLASTREV'] = tmp_FORMAT_DLOBJECT_df['ISLASTREV'].astype(int)
tmp_FORMAT_DLOBJECT_df['ITERATION'] = tmp_FORMAT_DLOBJECT_df['ITERATION'].fillna('0')
tmp_FORMAT_DLOBJECT_df['ITERATION'] = tmp_FORMAT_DLOBJECT_df['ITERATION'].astype(int)
tmp_FORMAT_DLOBJECT_df['REVISION'] = tmp_FORMAT_DLOBJECT_df['REVISION'].fillna('0')
tmp_FORMAT_DLOBJECT_df['REVISION'] = tmp_FORMAT_DLOBJECT_df['REVISION'].astype(int)
#Replace newline characters with "_"
tmp_FORMAT_DLOBJECT_df = tmp_FORMAT_DLOBJECT_df.applymap(lambda x: x.replace('\n', '_') if isinstance(x, str) else x)
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
tmp_FORMAT_DLOBJECT = dataiku.Dataset("tmp_FORMAT_DLOBJECT")
tmp_FORMAT_DLOBJECT.write_with_schema(tmp_FORMAT_DLOBJECT_df)
I'm answering my own question. I found this solution and it works perfectly. With the value False (infer_with_pandas=False) it doesn't work.
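The pattern in this answer (fill missing values, cast, then write with write_with_schema so that the output schema is taken from the dataframe) can be wrapped in a small helper. This is a minimal sketch, not part of Dataiku's API: cast_and_write and FakeDataset are hypothetical names, and FakeDataset is a stand-in that only records the call so the snippet runs outside DSS. In a recipe you would pass dataiku.Dataset("tmp_FORMAT_DLOBJECT") instead.

```python
import numpy as np
import pandas as pd


class FakeDataset:
    """Hypothetical stand-in for a dataiku.Dataset; it only records what was written."""

    def __init__(self):
        self.written = None

    def write_with_schema(self, df):
        self.written = df


def cast_and_write(output_dataset, df, int_columns, fill_value=0):
    """Fill gaps in the given columns, cast them to int64, then write the result."""
    out = df.copy()
    for col in int_columns:
        out[col] = out[col].fillna(fill_value).astype("int64")
    output_dataset.write_with_schema(out)
    return out


# Made-up rows using column names from the thread.
sample = pd.DataFrame({
    "MASTER_ID": [1.0, np.nan],
    "ITERATION": [2.0, 3.0],
    "REFERENCE": ["a", "b"],
})
target = FakeDataset()
result = cast_and_write(target, sample, ["MASTER_ID", "ITERATION"])
print(result.dtypes)
```

Working on a copy keeps the input dataframe unchanged, which avoids the situation where one variable is filled and a different one is cast.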
Answers
-
Turribeach
The code above uses infer_with_pandas=True, not False as you said you did. Also, please use the code block (the </> icon) to post code, like this:
print("Hello")
-
Sorry, and thanks.
I left the various commented-out lines in after my tests, but I did use False.
-
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
tmp_LAYER0_PCDB_DLOBJECT = dataiku.Dataset("tmp_LAYER0_PCDB_DLOBJECT")
tmp_LAYER0_PCDB_DLOBJECT_df = tmp_LAYER0_PCDB_DLOBJECT.get_dataframe(infer_with_pandas=True, bool_as_str=True)
#infer_with_pandas = True
# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.
tmp_FORMAT_DLOBJECT_df = tmp_LAYER0_PCDB_DLOBJECT_df.head(1000) # For this sample code, simply copy input to output
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: MARKDOWN
tmp_LAYER0_PCDB_DLOBJECT_df['MASTER_ID'] = tmp_LAYER0_PCDB_DLOBJECT_df['MASTER_ID'].fillna(0)
tmp_FORMAT_DLOBJECT_df['MASTER_ID'] = tmp_FORMAT_DLOBJECT_df['MASTER_ID'].astype(int)
tmp_FORMAT_DLOBJECT_df['ITERATION'] = tmp_FORMAT_DLOBJECT_df['ITERATION'].fillna('0')
tmp_FORMAT_DLOBJECT_df['REVISION'] = tmp_FORMAT_DLOBJECT_df['REVISION'].fillna('0')
#Replace newline characters with "_"
tmp_FORMAT_DLOBJECT_df = tmp_FORMAT_DLOBJECT_df.applymap(lambda x: x.replace('\n', '_') if isinstance(x, str) else x)
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# Write recipe outputs
tmp_FORMAT_DLOBJECT = dataiku.Dataset("tmp_FORMAT_DLOBJECT")
tmp_FORMAT_DLOBJECT.write_from_dataframe(tmp_FORMAT_DLOBJECT_df)
-
Turribeach
You are still using True in your code snippet.
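A side note on the newline cleanup used in the recipes in this thread: DataFrame.applymap is deprecated in recent pandas releases (2.1 and later) in favor of DataFrame.map. A version-neutral sketch, assuming made-up data and a hypothetical helper name replace_newlines, applies Series.map column by column and leaves non-string cells untouched.

```python
import pandas as pd


def replace_newlines(df, replacement="_"):
    """Return a copy of df with newline characters replaced in string cells only."""
    def clean(value):
        return value.replace("\n", replacement) if isinstance(value, str) else value

    out = df.copy()
    for col in out.columns:
        out[col] = out[col].map(clean)
    return out


# Made-up rows; REFERENCE and MASTER_ID are column names from the thread.
sample = pd.DataFrame({"REFERENCE": ["a\nb", None], "MASTER_ID": [1, 2]})
cleaned = replace_newlines(sample)
print(cleaned)
```

Because only string cells are rewritten, integer columns keep their values, so the cleanup can run before or after the integer casts.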