Writing a pandas dataframe to Snowflake table using writer
Hi,
I am trying to write a pandas DataFrame containing 1.58 million records to a Snowflake table.
To make the process faster, I wanted to use the chunked writer functionality described on the "Datasets (reading and writing data)" page of the Dataiku DSS 11 documentation.
I have a DataFrame named "Out_df" containing about 1.5 million records, which I want to write to the Snowflake table "RM20_DATA".
When I execute the code below, I get the following error:
AttributeError: 'DataFrame' object has no attribute 'iter_dataframes'
I understand that iter_dataframes() is a method of a Dataiku dataset and does not exist on a pandas DataFrame.
How can I convert Out_df to a dataset that can be iterated with iter_dataframes() so that I can write the data?
----Code---
rm20_data = dataiku.Dataset("RM20_DATA")
rm20_data.write_schema_from_dataframe(Out_df)
with rm20_data.get_writer() as writer:
    for df in Out_df.iter_dataframes():
        # Process the df dataframe ...
        # Write the processed dataframe
        writer.write_dataframe(df)
Operating system used: Windows
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,239 Dataiker
Hi @Megha,
To increase the writing speed to Snowflake from a Python recipe, you should try to leverage fast-path. This should be considerably faster than writing through the regular writer.
https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#writing-data-into-snowflake
You will need to enable fast-path on the connection and have cloud storage (S3, GCS, or Azure Blob) in the same region, along with the other prerequisites detailed in the doc above.
Chunked reading/writing helps with the memory usage of the Python recipe but will not really help with the writing speed. If you can comfortably fit the dataset into memory, you don't need chunked reading/writing at all.
Your code is failing because iter_dataframes() is called on Out_df, which is a pandas DataFrame. That method belongs to dataiku.Dataset, so it can only be used to read from an input dataset. Please refer to the doc and syntax below:
https://doc.dataiku.com/dss/latest/python-api/datasets-data.html#chunked-reading-and-writing-with-pandas
inp = Dataset("input")
out = Dataset("output")
with out.get_writer() as writer:
    for df in inp.iter_dataframes():
        # Process the df dataframe ...
        # Write the processed dataframe
        writer.write_dataframe(df)
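If Out_df already sits in memory and you still want to feed it to the writer in pieces, one option is to slice it by row position yourself. The following is only a minimal sketch, not something taken from the Dataiku documentation: chunk_bounds, write_in_chunks, write_out_df and the chunk size of 100,000 rows are hypothetical names and values chosen for illustration. Only dataiku.Dataset, write_schema_from_dataframe, get_writer and write_dataframe come from the code in this thread, and write_out_df only works inside DSS, where the dataiku package can be imported.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row positions that cover n_rows in steps of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def write_in_chunks(df, writer, chunk_size=100_000):
    """Pass df to writer.write_dataframe one positional slice at a time.

    Returns the number of chunks written. The chunk size is an assumed
    default, not a recommended value.
    """
    n_chunks = 0
    for start, stop in chunk_bounds(len(df), chunk_size):
        # iloc slices by position, so this works whatever the index looks like
        writer.write_dataframe(df.iloc[start:stop])
        n_chunks += 1
    return n_chunks


def write_out_df(out_df, dataset_name="RM20_DATA"):
    """Hypothetical wrapper; runs only inside DSS where dataiku is importable."""
    import dataiku  # imported here so the helpers above work without DSS

    rm20_data = dataiku.Dataset(dataset_name)
    rm20_data.write_schema_from_dataframe(out_df)
    with rm20_data.get_writer() as writer:
        return write_in_chunks(out_df, writer)
```

In the recipe this would be called as write_out_df(Out_df). As noted above, this mainly affects how the data is handed to the writer and is not expected to make the Snowflake write itself faster. If the DataFrame fits in memory, a single rm20_data.write_with_schema(Out_df) call is the simpler choice.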
Thanks,