I am facing partial data loading while using partitions. I am loading 8 partitions in parallel into the same Snowflake table. When I export the dataset to CSV before the SQL load, the data is intact.
I have set the batch insert size on the table to 10K.
[2022/09/09-22:45:14.031] [qtp1016064173-915] [INFO] [dku.sql.generic] - Table PUBLIC.ARTIFACTS_DATA exists
[2022/09/09-22:45:14.217] [qtp1016064173-915] [DEBUG] [dku.connections.sql.provider] - Commit conn=SNF_TESTEXP_CONN-8qI2HBI
[2022/09/09-22:45:14.312] [qtp1016064173-915] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWriter] - Done initializing output writer
[2022/09/09-22:45:14.473] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:14,473 INFO Waiting for data to send ...
[2022/09/09-22:45:14.477] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:15.786] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:15.787] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 10000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:15.787] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 10000 rows
[2022/09/09-22:45:15.826] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 20000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 20000 rows
[2022/09/09-22:45:20.652] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:20,652 INFO Sending data (5000053)
[2022/09/09-22:45:20.653] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:20,653 INFO Waiting for data to send ...
[2022/09/09-22:45:20.672] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 30000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 30000 rows
[2022/09/09-22:45:21.436] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 40000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 40000 rows
[2022/09/09-22:45:26.693] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:26,692 INFO Sending data (5000052)
[2022/09/09-22:45:26.694] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:26,693 INFO Waiting for data to send ...
[2022/09/09-22:45:26.695] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:26.843] [qtp1016064173-42] [INFO] [dku.jobs] - Connects using API ticket
[2022/09/09-22:45:26.843] [qtp1016064173-42] [DEBUG] [dku.jobs] - Received command : /tintercom/datasets/wait-write-session
[2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 50000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 50000 rows
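For context, the log above corresponds to the regular (non-fast-path) write: rows are sent over the connection in 10K INSERT batches, matching the "appended 10000 rows" entries. A minimal sketch of that pattern with snowflake-connector-python is below; the connection parameters, column names, and table name are placeholders for illustration, not the actual DSS internals.

```python
import snowflake.connector

BATCH_SIZE = 10_000  # the batch insert size configured on the dataset

# Placeholder credentials; replace with your own connection details.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="your_wh", database="your_db", schema="PUBLIC",
)
cur = conn.cursor()

def insert_in_batches(rows):
    """rows: iterable of (col1, col2) tuples; flushes every BATCH_SIZE rows."""
    sql = "INSERT INTO ARTIFACTS_DATA (col1, col2) VALUES (%s, %s)"
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            cur.executemany(sql, batch)  # one INSERT batch, as in the log
            batch.clear()
    if batch:
        cur.executemany(sql, batch)  # flush the final partial batch
    conn.commit()
```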
Hi @muthu11 ,
Loading data into Snowflake is best done via cloud storage (S3, GCS, or Azure Blob) leveraging the fast-path; this is significantly faster than a regular SQL INSERT.
https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#writing-data-into-snowflake
This will give much better performance, but it requires the prerequisites mentioned in the documentation. A sketch of what this amounts to is shown below.
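Conceptually, the fast-path replaces row batches over the connection with a single bulk COPY from files already sitting in cloud storage. DSS manages this for you once the prerequisites are met; the snippet below is only a hedged illustration of the equivalent manual operation, and the stage name (my_s3_stage) and file path are hypothetical.

```python
import snowflake.connector

# Placeholder credentials; replace with your own connection details.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="your_wh", database="your_db", schema="PUBLIC",
)
cur = conn.cursor()

# Bulk-load CSV files from an external stage in one statement,
# instead of sending 10K-row INSERT batches over the connection.
cur.execute("""
    COPY INTO ARTIFACTS_DATA
    FROM @my_s3_stage/artifacts_data/
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")
conn.commit()
```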
Can you please confirm what storage your current CSV files are on, and whether the prerequisites for fast-path are met?
Thanks