Schema changes after running Python script

aw30 · ‎10-14-2020

I have a simple Python script that I run after an initial dataset is pulled into my flow. It takes the name of each column, strips out blank spaces, and lowercases it. In some situations a column that is a string in the initial dataset is changed to a bigint after the script runs, for instance data with leading zeros such as 0003928. How do I prevent this from occurring?

I use the default code except for the one line I changed, the one that sets sn_problem_df.columns. Thanks for any help you may be able to provide!

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
pmoexportSetProblem = dataiku.Dataset("PMOExportSetProblem")
pmoexportSetProblem_df = pmoexportSetProblem.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.

sn_problem_df = pmoexportSetProblem_df # For this sample code, simply copy input to output
sn_problem_df.columns = sn_problem_df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '').str.replace('.', '_').str.replace('/', '_').str.replace('&', 'n').str.replace("'", '')

# Write recipe outputs
sn_problem = dataiku.Dataset("sn_problem")
sn_problem.write_with_schema(sn_problem_df)

Marlan · ‎10-14-2020

Hi @aw30,

Try adding infer_with_pandas=False to your get_dataframe call, i.e., get_dataframe(infer_with_pandas=False)

The get_dataframe method uses a pandas read method behind the scenes, and by default the data types of the resulting dataframe are inferred from the data rather than taken from the dataset's schema. Setting infer_with_pandas to False makes the dataframe use the data types defined in the dataset's schema, so a string column with leading zeros stays a string.
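In the recipe above, that means changing the read line to pmoexportSetProblem.get_dataframe(infer_with_pandas=False). To see the kind of inference involved without a DSS instance, here is a minimal sketch in plain pandas. It is only an analogy and not necessarily how get_dataframe is implemented internally; the sample data and the column names ticket_id and owner are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: an ID column whose values have leading zeros.
csv_text = "ticket_id,owner\n0003928,alice\n0000417,bob\n"

# Default behaviour: pandas infers the column type from the values,
# so the all-digit ID column is parsed as an integer.
inferred_df = pd.read_csv(io.StringIO(csv_text))

# Explicit type: the column is read as text, so the leading zeros survive.
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"ticket_id": str})

print(inferred_df["ticket_id"].dtype)     # int64
print(inferred_df["ticket_id"].tolist())  # [3928, 417]
print(typed_df["ticket_id"].tolist())     # ['0003928', '0000417']
```

Once the column has been inferred as an integer, the leading zeros are already gone. That is why the fix belongs on the read step rather than after the dataframe is loaded.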

Marlan

aw30 · ‎10-14-2020

Thank you for the solution - this just made my day!
