How to prevent DSS replace NA with null?
Hi,
I'm using Python recipe to query and insert data to the output SQL Server dataset as below.
import dataiku import pandas as pd from dataiku import SQLExecutor2 # Read recipe inputs p787PDMItem = dataiku.Dataset("_P787PDMItem_src") p787PDMItem_df = p787PDMItem.get_dataframe() # Initialize an empty DataFrame to collect all results all_data = pd.DataFrame() # Loop through each item and generate SQL queries connection_name = "MyConnection" executor = SQLExecutor2(connection=connection_name) for item in p787PDMItem_df["ITEM"]: #print(f"ITEM: {item}") query = f""" SELECT Status, Value FROM MyTable WHERE PRODUCT_ID = '{item}' """ # Execute the SQL query and fetch the result query_result_df = executor.query_to_df(query) # Append the result to the all_data DataFrame all_data = all_data.append(query_result_df, ignore_index=True) # Write the collected data to the _P787PDM dataset p787PDM = dataiku.Dataset("_P787PDM") p787PDM.write_dataframe(all_data)
Here are the values of Status column: NA, WS001, WS001. However, the 'NA' value in the output dataset _P787PDM is replaced with 'null'.
The values in the 'Value' column are '004', '002', '001', etc. However, these values are also replaced with 4, 2, 1 in the output dataset.
Do you know how to prevent this replacement in Dataiku?
Best Answer
-
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 321 Neuron
Hi @kevinhoang84
,I don't think you can control how NA values are handled in the query_to_df method. I believe this is on Dataiku's list to add but you could it as a suggestion in the Product Ideas area of this site.
Try using infer_from_schema=True in the query_to_df method to see if that helps with the conversion of '004' to 4 etc. My experience is that it will solve this but your situation may be different.
Marlan
Answers
-
The values in the 'Value' column are '004', '002', '001', etc. However, these values are also replaced with 4, 2, 1 in the output dataset. How to prevent this replacement?
-
Thank you very much, Marlan. infer_from_schema=True helps to remain the leading zeros.
I have to call the 'fillna' method to revert null to 'NA' as below.
all_data['Status'].fillna('NA', inplace=True)