Write pandas dataframe to managed dataset using custom schema

brianj11011
brianj11011 Registered Posts: 3
edited July 16 in Using Dataiku

The issue we are working through pertains to Dataiku's auto-detection of column types when passing a Pandas dataframe to Snowflake. The schema Dataiku creates from the Pandas dataframe converts Object data types to VARCHAR, which is fine. The issue is the VARCHAR size. In our test example, all dataframe objects are given VARCHAR(4194304) and dataframe numerics are set to NUMBER(38,0). When I try to modify the data types, Dataiku will not allow a decrease in data type size, e.g. VARCHAR(4194304) to VARCHAR(20). Ideally we would like to create a custom schema and apply that schema to a Dataiku managed dataset linked to Snowflake.

The Jupyter Notebook code we are using:

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
builder.with_store_into("Snowflake-DDX_xxxxx_PROD_EBUSINESS")
builder.create()

# Write dataframe to dataset
dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_with_schema(df_sessions, dropAndCreate=True)

We are looking to modify the "Write dataframe to dataset" line to use a custom schema rather than the auto-detected schema the code above produces.


Operating system used: Windows

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron
    edited July 17

    Please always use code blocks when posting code. You can use the write_dataframe() method instead if you prefer:

    dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

    and the schema won't be changed. But there is no free lunch: if you do this, you will not be able to cascade schema changes across your flow and will have to maintain schema changes manually in your dataset. Since you are building the dataset in code, this might be desired.

    https://doc.dataiku.com/dss/latest/code_recipes/python.html
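
    A minimal sketch of maintaining the schema by hand before the write (the column names, types, and the maxLength values are illustrative assumptions, not from the original post):

    import dataiku

    dataset = dataiku.Dataset("zzz_bs_digital_engagements_app_test")

    # Set the schema explicitly: a freshly created dataset has no columns,
    # and with infer_schema=False the write will not add any.
    # maxLength is assumed here to drive the VARCHAR size on the SQL side.
    dataset.write_schema([
        {"name": "session_id", "type": "string", "maxLength": 20},
        {"name": "visit_count", "type": "int"},
    ])

    # Drop and recreate the table from the schema above, then load the data.
    dataset.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)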

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron

    Hi @brianj11011,

    @Turribeach's suggestion is part of the solution. It sounds like you also want to control the Snowflake data types. One idea would be to set the Table Create SQL (Dataset Settings > Advanced in the UI) when you create the dataset programmatically. This would let you control the data types used when creating the Snowflake table. I haven't tested this, so I don't know whether it works, but it's probably the first thing I would try.
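
    An untested sketch of that idea via the public API. The exact params key holding the create-table SQL override is an assumption, so inspect the raw settings of a dataset configured through the UI before relying on it:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    settings = project.get_dataset("zzz_bs_digital_engagements_app_test").get_settings()
    raw = settings.get_raw()

    # Locate where the UI stores the "Table creation" SQL override.
    print(raw["params"])

    # "customCreateStatement" is a guessed key name; the CREATE TABLE body
    # and table name below are illustrative only.
    raw["params"]["customCreateStatement"] = (
        'CREATE TABLE "ZZZ_BS_DIGITAL_ENGAGEMENTS_APP_TEST" ('
        '"SESSION_ID" VARCHAR(20), "VISIT_COUNT" NUMBER(10,0))'
    )
    settings.save()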

    Marlan

  • brianj11011
    brianj11011 Registered Posts: 3
    edited July 17

    Thanks for the reply. I did edit the post and insert a code block; thanks for the gentle reminder.
    I tried the suggestion and received an error.

    Here's the code I used:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
    builder.with_store_into("Snowflake-DDX_MDB-xxxx_PROD_EBUSINESS")
    builder.create()

    dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

    Below is the error:

    Exception: An error occurred during dataset write (3EW67Gx3Os): RuntimeException: Schema incompatibility: 31 columns in data, 0 columns in target dataset.

    Thanks for the quick response.
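
    Hedged note: this error is consistent with the target dataset having no schema at all. With infer_schema=False the write does not create one, so the newly built dataset has 0 columns against the dataframe's 31. Setting a schema first (e.g. with write_schema(), as sketched above) should resolve it:

    import dataiku

    dataset = dataiku.Dataset("zzz_bs_digital_engagements_app_test")

    # Assumed fix: give the empty dataset a schema before the write
    # (single illustrative column; use your real 31-column schema).
    dataset.write_schema([{"name": "session_id", "type": "string"}])
    dataset.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)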

  • brianj11011
    brianj11011 Registered Posts: 3

    The solution you provided fixed the dataset. However, I would like the Snowflake table to reflect the schema changes too. How do we cascade schema changes from the managed Dataiku dataset to the corresponding Snowflake table?

