Write pandas dataframe to managed dataset using custom schema
The issue we are working through pertains to Dataiku's auto-detection of column types when passing a pandas dataframe to Snowflake. The schema Dataiku creates from the pandas dataframe converts Object data types to VARCHAR, which is fine. The issue is the VARCHAR size: in our test example, all dataframe objects are given VARCHAR(4194304) and dataframe numerics are set to NUMBER(38,0). When I try to modify the data types, Dataiku will not allow a decrease in data type size, e.g. VARCHAR(4194304) to VARCHAR(20). Ideally we would like to create a custom schema and apply it to a Dataiku managed dataset linked to Snowflake.
The Jupyter Notebook code we are using:
client = dataiku.api_client()
project = dataiku.api_client().get_project(dataiku.default_project_key())
builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
builder.with_store_into("Snowflake-DDX_xxxxx_PROD_EBUSINESS")
builder.create()

# Write dataframe to dataset
dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_with_schema(df_sessions, dropAndCreate=True)
We are looking to modify the "Write dataframe to dataset" code to use a custom schema rather than the auto-detect schema that the above code is using.
Operating system used: Windows
Answers
-
Turribeach
Please always use code blocks when posting code. You can use the write_dataframe() method instead if you prefer:
dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)
and the schema won't be changed. But there are no free lunches: if you do this, you will not be able to cascade schema changes across your flow and will have to maintain schema changes manually on your dataset. As you are building the dataset in code, this might be what you want.
https://doc.dataiku.com/dss/latest/code_recipes/python.html
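Putting that together, here is a minimal sketch of the manual-schema pattern. The column names, types, and maxLength values are illustrative assumptions, not the poster's real 31-column schema; on SQL datasets, DSS uses a string column's maxLength to size the VARCHAR, which is worth verifying on your DSS version:

import dataiku

ds = dataiku.Dataset("zzz_bs_digital_engagements_app_test")

# Define the schema explicitly; maxLength should map to VARCHAR(n) on Snowflake
custom_schema = [
    {"name": "session_id", "type": "string", "maxLength": 20},
    {"name": "page_views", "type": "int"},
    {"name": "session_start", "type": "date"},
]

ds.write_schema(custom_schema)  # set the dataset schema without writing data
ds.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

With infer_schema=False the dataframe is written against the schema you set, so the dataframe's columns need to line up with it (same names, compatible types).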
-
Marlan
Hi @brianj11011, @Turribeach's suggestion is part of the solution. It sounds like you also want to control the Snowflake data types. One idea would be to set the Table Create SQL (Dataset Settings, Advanced in the UI) when you create the dataset programmatically. This would enable you to control the data types used when creating the Snowflake table. I haven't tested this so I don't know if it would work, but it's probably the first thing I would try.

Marlan
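For the programmatic route, a sketch of reaching the dataset settings through the public API. I haven't verified which key in the raw settings backs the UI's Table Create SQL field, so the key name below is an assumption; the printout is there so you can confirm the real key before relying on it:

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
settings = project.get_dataset("zzz_bs_digital_engagements_app_test").get_settings()

# Inspect the raw params to find the field behind
# Dataset Settings > Advanced > Table Create SQL in the UI
print(settings.get_raw()["params"])

# Assumed key name -- confirm it against the printout above first
settings.get_raw()["params"]["customCreateStatement"] = (
    'CREATE TABLE "${projectKey}_zzz_bs_digital_engagements_app_test" '
    '("session_id" VARCHAR(20), "page_views" NUMBER(10,0))'
)
settings.save()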
-
Thanks for the reply. I edited the post and inserted a code block. Thanks for the gentle reminder.
Tried the suggestion and received an error.
Here's the code I used:

client = dataiku.api_client()
project = dataiku.api_client().get_project(dataiku.default_project_key())
builder = project.new_managed_dataset("zzz_bs_digital_engagements_app_test")
builder.with_store_into("Snowflake-DDX_MDB-xxxx_PROD_EBUSINESS")
builder.create()
dataiku.Dataset("zzz_bs_digital_engagements_app_test").write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)

And here's the error:

Exception: An error occurred during dataset write (3EW67Gx3Os): RuntimeException: Schema incompatibility: 31 columns in data, 0 columns in target dataset.
Thanks for the quick response.
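The "0 columns in target dataset" part of the error suggests the freshly created dataset has no schema yet, and with infer_schema=False write_dataframe will not create one from the dataframe. A likely fix, sketched with a schema list like the one outlined after @Turribeach's answer above, is to set the schema before writing:

ds = dataiku.Dataset("zzz_bs_digital_engagements_app_test")
ds.write_schema(custom_schema)  # custom_schema: all 31 columns, defined as in the earlier sketch
ds.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)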
-
Thanks for the quick response. The solution you provided fixed the dataset. However, I would like the Snowflake table to reflect the schema changes too. How do we cascade the schema changes from the managed Dataiku dataset to the corresponding Snowflake table?
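One way to push the updated schema down to Snowflake, sketched below, assumes that dropAndCreate=True drops and recreates the underlying table from the dataset's current schema: rewrite the data after changing the schema, then check what Snowflake actually created. The table name in the query is an assumption, as managed table names typically include the project key:

from dataiku import SQLExecutor2

ds = dataiku.Dataset("zzz_bs_digital_engagements_app_test")
ds.write_dataframe(df_sessions, infer_schema=False, dropAndCreate=True)  # recreates the table

# Verify the column types the recreated table received
executor = SQLExecutor2(dataset=ds)  # runs on the dataset's Snowflake connection
df_types = executor.query_to_df("""
    SELECT column_name, data_type, character_maximum_length
    FROM information_schema.columns
    WHERE table_name = 'ZZZ_BS_DIGITAL_ENGAGEMENTS_APP_TEST'
""")
print(df_types)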