Write pandas dataframe to managed dataset using custom schema

brianj11011
brianj11011 Registered Posts: 3
edited July 16 in Using Dataiku

The issue we are working through pertains to Dataiku's auto-detection of column types when passing a Pandas dataframe to Snowflake. The schema Dataiku creates from the Pandas dataframe converts Object data types to VARCHAR, which is fine. The issue is the VARCHAR size. In our test example, all dataframe objects are given VARCHAR(4194304) and dataframe numerics are set to NUMBER(38,0). When I try to modify the data types, Dataiku will not allow a decrease in data type size, e.g. VARCHAR(4194304) to VARCHAR(20). Ideally we would like to create a custom schema and apply that schema to a Dataiku managed dataset linked to Snowflake.

The Jupyter Notebook code we are using:

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
builder.with_store_into("Snowflake-DDX_xxxxx_PROD_EBUSINESS")
builder.create()

# Write dataframe to dataset
dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_with_schema(df_sessions, dropAndCreate=True)

We are looking to modify the "Write dataframe to dataset" line to use a custom schema rather than the auto-detected schema the code above produces.


Operating system used: Windows

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron
    edited July 17

    Please always use code blocks when posting code. You can use the write_dataframe() method instead if you prefer:

    dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

    and the schema won't be changed. But there is no free lunch: if you do this, you will not be able to cascade schema changes across your flow and will have to maintain schema changes manually in your dataset. Since you are building the dataset in code, this might be desired.

    https://doc.dataiku.com/dss/latest/code_recipes/python.html
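
    A minimal sketch of maintaining the schema by hand before the write (the column names, types, and the maxLength values are illustrative assumptions, not from the original post):

    import dataiku

    dataset = dataiku.Dataset("zzz_bs_digital_engagements_app_test")

    # Set the schema explicitly: a freshly created dataset has no columns,
    # and with infer_schema=False the write will not add any.
    # maxLength is assumed here to drive the VARCHAR size on the SQL side.
    dataset.write_schema([
        {"name": "session_id", "type": "string", "maxLength": 20},
        {"name": "visit_count", "type": "int"},
    ])

    # Drop and recreate the table from the schema above, then load the data.
    dataset.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)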

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron

    Hi @brianj11011,

    @Turribeach's suggestion is part of the solution. It sounds like you also want to control the Snowflake data types. One idea would be to set the Table Create SQL (Dataset Settings > Advanced in the UI) when you create the dataset programmatically. This would let you control the data types used when creating the Snowflake table. I haven't tested this, so I don't know whether it works, but it's probably the first thing I would try.
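
    An untested sketch of that idea via the public API. The exact params key holding the create-table SQL override is an assumption, so inspect the raw settings of a dataset configured through the UI before relying on it:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    settings = project.get_dataset("zzz_bs_digital_engagements_app_test").get_settings()
    raw = settings.get_raw()

    # Locate where the UI stores the "Table creation" SQL override.
    print(raw["params"])

    # "customCreateStatement" is a guessed key name; the CREATE TABLE body
    # and table name below are illustrative only.
    raw["params"]["customCreateStatement"] = (
        'CREATE TABLE "ZZZ_BS_DIGITAL_ENGAGEMENTS_APP_TEST" ('
        '"SESSION_ID" VARCHAR(20), "VISIT_COUNT" NUMBER(10,0))'
    )
    settings.save()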

    Marlan

  • brianj11011
    brianj11011 Registered Posts: 3
    edited July 17

    Thanks for the reply. I did edit the post and insert a code block; thanks for the gentle reminder.
    I tried the suggestion and received an error.

    Here's the code I used:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
    builder.with_store_into("Snowflake-DDX_MDB-xxxx_PROD_EBUSINESS")
    builder.create()

    dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

    Below is the error:

    Exception: An error occurred during dataset write (3EW67Gx3Os): RuntimeException: Schema incompatibility: 31 columns in data, 0 columns in target dataset.

    Thanks for the quick response.
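
    Hedged note: this error is consistent with the target dataset having no schema at all. With infer_schema=False the write does not create one, so the newly built dataset has 0 columns against the dataframe's 31. Setting a schema first (e.g. with write_schema(), as sketched above) should resolve it:

    import dataiku

    dataset = dataiku.Dataset("zzz_bs_digital_engagements_app_test")

    # Assumed fix: give the empty dataset a schema before the write
    # (single illustrative column; use your real 31-column schema).
    dataset.write_schema([{"name": "session_id", "type": "string"}])
    dataset.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)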

  • brianj11011
    brianj11011 Registered Posts: 3

    The solution you provided fixed the dataset. However, I would like the Snowflake table to reflect the schema changes too. How do we cascade schema changes from the managed Dataiku dataset to the corresponding Snowflake table?

