Python code is overwriting my schema settings!

mbillingham
mbillingham Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 7

Hello-

I have some Python code that reads in a dataset and then writes out an updated version of that dataset. It infers several columns as integers when I want them to be text. Every time I run it, it overwrites my schema settings. Any recommendations?

thanks!

Mindy


Operating system used: unix

Answers

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker
    edited July 2024

    Hi @mbillingham,

    The type is being changed because pandas automatically infers the type of each column from its values when the dataset is loaded into a dataframe.

    There are a few ways to prevent a Python recipe from changing the types, depending on your code:

    • Prevent pandas from inferring the types of the input dataset. The dataframe will then take its column types from the input dataset's schema:
      input_dataset = dataiku.Dataset("input_dataset")
      input_df = input_dataset.get_dataframe(infer_with_pandas=False)
    • Write data to the output dataset using write_dataframe() instead of write_with_schema(). This leaves the output schema unchanged, but the write will fail if the dataframe isn't compatible with the existing schema:
      output_dataset = dataiku.Dataset("output_dataset")
      output_dataset.write_dataframe(output_df)
    • Manually cast the columns that you want stored as text to string before writing:
      output_df = output_df.astype({"col1": str, "col2": str})
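
To illustrate the difference between the first and third options, here is a minimal sketch in plain pandas. It uses pd.read_csv as a stand-in for loading a dataset, so it only approximates what happens inside a Dataiku recipe; the column names and sample values are hypothetical. It shows that casting to string after inference gives you text, but anything the inference step dropped (such as leading zeros) is not restored.

```python
import io

import pandas as pd

# Hypothetical sample data: an ID-like column whose values look numeric.
csv_text = "zip_code,city\n00123,Springfield\n04567,Shelbyville\n"

# Default behaviour: pandas infers the column type from the values.
inferred_df = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as text up front keeps the values exactly as written.
text_df = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

# Casting after inference yields strings, but the leading zeros are already gone.
cast_df = inferred_df.astype({"zip_code": str})

print(inferred_df["zip_code"].dtype)
print(text_df["zip_code"].tolist())
print(cast_df["zip_code"].tolist())
```

If your text columns hold codes or identifiers where the exact characters matter, avoiding inference on read is likely the safer choice; casting afterwards is fine when only the column type matters.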

    Thanks,

    Zach

  • mbillingham
    mbillingham Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 7

    Using the infer_with_pandas=False flag worked exactly the way I needed it to. Thank you so much!
