Write pandas dataframe to managed dataset using custom schema

brianj11011

The issue we are working through pertains to Dataiku's auto-detection of column types when passing a pandas dataframe to Snowflake. The schema Dataiku creates from the pandas dataframe converts object dtypes to VARCHAR, which is fine. The issue is the VARCHAR size: in our test example, all dataframe object columns are given VARCHAR(4194304) and dataframe numerics are set to NUMBER(38,0). When I try to modify the data types, Dataiku will not allow a decrease in size, e.g. VARCHAR(4194304) to VARCHAR(20). Ideally we would like to create a custom schema and apply it to a Dataiku managed dataset linked to Snowflake.

The Jupyter Notebook code we are using:

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
builder.with_store_into("Snowflake-DDX_xxxxx_PROD_EBUSINESS")
builder.create()

# Write dataframe to dataset
dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_with_schema(df_sessions, dropAndCreate=True)

We are looking to modify the "Write dataframe to dataset" code to use a custom schema rather than the auto-detect schema that the above code is using.


Operating system used: Windows

4 Replies
Turribeach

Please always use code blocks when posting code. You can use the write_dataframe() method instead if you prefer:

dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

and the schema won't be changed. But there are no free lunches: if you do this, you will not be able to cascade schema changes across your flow and will have to maintain schema changes manually in your dataset. As you are building the dataset in code, this might be desired.

https://doc.dataiku.com/dss/latest/code_recipes/python.html
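For completeness, a minimal sketch of the explicit-schema route: with infer_schema=False the target dataset must already have a schema, which you can set yourself with write_schema() before writing. The column names, types, and maxLength values below are illustrative assumptions, as is the expectation that maxLength maps to VARCHAR(n) on SQL backends like Snowflake; verify against your own dataset.

```python
# Sketch: define the schema by hand, then write without inference.
# Columns here are illustrative; "maxLength" is assumed to map to
# VARCHAR(n) on SQL backends such as Snowflake.
custom_schema = [
    {"name": "session_id", "type": "string", "maxLength": 20},
    {"name": "page_views", "type": "bigint"},
    {"name": "started_at", "type": "date"},
]

def write_sessions(df):
    # Requires a DSS runtime; dataiku is imported lazily here so the
    # schema definition above can be inspected outside DSS.
    import dataiku
    ds = dataiku.Dataset("zzz_bs_digital_engagements_app_test")
    ds.write_schema(custom_schema)  # set the schema first...
    ds.write_dataframe(df, infer_schema=False, dropAndCreate=True)  # ...then write
```

Setting the schema first is also why writing with infer_schema=False against a freshly created, empty dataset fails: there are zero columns in the target for the data to match.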

 

brianj11011

Thanks for the reply. I did edit the post and inserted a code block. Thanks for the gentle reminder.
I tried the suggestion and received an error.

Here's the code I used:

 

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
builder.with_store_into("Snowflake-DDX_MDB-xxxx_PROD_EBUSINESS")
builder.create()

dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

 


Below is the error:

Exception: An error occurred during dataset write (3EW67Gx3Os): RuntimeException: Schema incompatibility: 31 columns in data, 0 columns in target dataset.

Thanks for the quick response.

Marlan

Hi @brianj11011,

@Turribeach's suggestion is part of the solution. It sounds like you also want to control the Snowflake data types. One idea would be to set the Table Create SQL (Dataset Settings > Advanced in the UI) when you create the dataset programmatically. This would let you control the data types used when creating the Snowflake table. I haven't tested this so I don't know if it would work, but it's probably the first thing I would try.
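If it helps, a rough, untested sketch of what that might look like via the public API's dataset settings. Note that "customCreateStatement" is a guessed key for where the Table Create SQL lives in the raw settings, and the CREATE TABLE statement and table name are purely illustrative: configure one dataset in the UI first and dump settings.get_raw()["params"] to find the real field name.

```python
# Illustrative CREATE TABLE statement with explicit Snowflake types.
# Table and column names are placeholders, not from any real project.
create_sql = """
CREATE TABLE "MY_TABLE" (
    "session_id" VARCHAR(20),
    "page_views" NUMBER(10,0)
)
"""

def set_table_create_sql(dataset_name, sql):
    # UNTESTED sketch: push a custom Table Create SQL into the dataset's
    # raw settings. "customCreateStatement" is a GUESSED key; inspect
    # settings.get_raw()["params"] on a UI-configured dataset to confirm
    # the actual field name before relying on this.
    import dataiku
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    settings = project.get_dataset(dataset_name).get_settings()
    settings.get_raw()["params"]["customCreateStatement"] = sql
    settings.save()
```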

Marlan

brianj11011

Thanks for the quick response. The solution you provided fixed the dataset. However, I would like the Snowflake table to reflect the schema changes too. How do we cascade the schema changes from the managed Dataiku dataset to the corresponding Snowflake table?

