Python code is overwriting my schema settings!

mbillingham

Hello-

I have some Python code that reads in a dataset and then writes out an updated version of that dataset. It infers several columns as integers when I want them to be text, and every time I run it, it overwrites my schema settings. Any recommendations?

 

thanks!

Mindy


Operating system used: unix

ZachM
Dataiker

Hi @mbillingham,

The types are changing because pandas automatically infers the type of each column from its values when the data is loaded into a DataFrame.
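As a rough illustration of that inference in plain pandas (this is not Dataiku's own reader, and the CSV text and column names are made up for the example), a column of numeric-looking codes is read as integers unless a text type is requested:

```python
import io

import pandas as pd

# Hypothetical sample data: zip_code looks numeric but is meant to be text.
csv_text = "zip_code,name\n02134,Alice\n10001,Bob\n"

# Default behaviour: pandas infers zip_code as an integer column,
# which also drops the leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))

# Asking for text explicitly keeps the values exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(inferred["zip_code"].tolist())  # [2134, 10001]
print(as_text["zip_code"].tolist())   # ['02134', '10001']
```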

There are a few ways to prevent a Python recipe from changing the types depending on your code:

  • Prevent pandas from inferring the types of the input dataset. This will cause it to copy the types from your input dataset: 
    input_dataset = dataiku.Dataset("input_dataset")
    input_df = input_dataset.get_dataframe(infer_with_pandas=False)
  • Write data to the output dataset using write_dataframe() instead of write_with_schema(). This will prevent it from changing the schema, but it will fail if the existing schema isn't compatible: 
    output_dataset = dataiku.Dataset("output_dataset")
    output_dataset.write_dataframe(output_df)
  • Manually change the type of the columns that you want to change to string: 
    output_df = output_df.astype({"col1": str, "col2": str})
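
For the last option, here is a small self-contained sketch, with a made-up DataFrame standing in for output_df (the column names and values are hypothetical). Note that the cast happens after inference, so anything already lost at read time, such as leading zeros, is not restored:

```python
import pandas as pd

# Hypothetical stand-in for the recipe's output_df; col1 and col2 were
# already inferred as integers when the data was loaded.
output_df = pd.DataFrame(
    {"col1": [2134, 10001], "col2": [1, 2], "amount": [1.5, 2.5]}
)

# Cast only the columns that should be text; other columns keep their types.
output_df = output_df.astype({"col1": str, "col2": str})

print(output_df["col1"].tolist())  # ['2134', '10001']
```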

Thanks,

Zach

mbillingham

Using the infer_with_pandas=False flag worked exactly the way I needed it to. Thank you so much!
