Schema changes after running Python script

aw30 Dataiku DSS & SQL, Registered Posts: 49 ✭✭✭✭✭

I have a simple Python script that I run after an initial dataset is pulled into my flow. It takes the name of each column, strips out blank spaces, and lowercases it. In some situations a column that is a string in the initial dataset is changed to a bigint after the Python script runs, for instance data with leading zeros such as 0003928. How do I prevent this from occurring?

I use the default code except for one line, the sn_problem_df.columns assignment below. Thanks for any help you may be able to provide!

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
pmoexportSetProblem = dataiku.Dataset("PMOExportSetProblem")
pmoexportSetProblem_df = pmoexportSetProblem.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.

sn_problem_df = pmoexportSetProblem_df # For this sample code, simply copy input to output
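# Normalize column names: trim, lowercase, then replace or drop special characters.
# regex=False treats every pattern as a literal string, so '(' and '.' are not read as regex syntax.
sn_problem_df.columns = (
    sn_problem_df.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('.', '_', regex=False)
    .str.replace('/', '_', regex=False)
    .str.replace('&', 'n', regex=False)
    .str.replace("'", '', regex=False)
)

# Write recipe outputs
sn_problem = dataiku.Dataset("sn_problem")
sn_problem.write_with_schema(sn_problem_df)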


Best Answer

  • Marlan (Neuron, Registered; Posts: 316)
    Answer ✓

    Hi @aw30

    Try adding infer_with_pandas=False to your get_dataframe call, i.e., get_dataframe(infer_with_pandas=False)

    The get_dataframe method uses a pandas read method behind the scenes. By default, the data types of the resulting dataframe are inferred from the data rather than taken from the dataset's schema, which is why a string column holding values like 0003928 can come back as a bigint. Setting infer_with_pandas to False makes the dataframe use the data types defined in the dataset's schema.
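
To illustrate the inference behavior described above, here is a minimal sketch using plain pandas rather than the dataiku package, so it can be run outside DSS. It is a stand-in for the idea, not a reproduction of what get_dataframe does internally. The sample data and the names csv_text, inferred_df, and as_text_df are made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample data: an "id" column whose values have leading zeros.
csv_text = "id,name\n0003928,alpha\n0000042,beta\n"

# Default behavior: pandas infers dtypes from the values, so the ids are
# parsed as integers and the leading zeros are lost.
inferred_df = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as text (analogous to relying on a schema that says
# "string") keeps the values exactly as written.
as_text_df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(inferred_df["id"].tolist())  # [3928, 42]
print(as_text_df["id"].tolist())   # ['0003928', '0000042']
```

In the recipe itself, the equivalent change is the one suggested above: pmoexportSetProblem.get_dataframe(infer_with_pandas=False), which relies on the column being typed as string in the input dataset's schema.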


