concurrent loading via partition - partial data loading
I am facing partial data loading while using partitions. I am trying to do a parallel loading into the same table(snowflake). I have to 8 partitions. I tried to load the dataset into CSV before SQL load, the data is intact.
I have defined 10K as batch insert size in the table
[2022/09/09-22:45:14.031] [qtp1016064173-915] [INFO] [dku.sql.generic] - Table PUBLIC.ARTIFACTS_DATA exists [2022/09/09-22:45:14.217] [qtp1016064173-915] [DEBUG] [dku.connections.sql.provider] - Commit conn=SNF_TESTEXP_CONN-8qI2HBI [2022/09/09-22:45:14.312] [qtp1016064173-915] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWriter] - Done initializing output writer [2022/09/09-22:45:14.473] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:14,473 INFO Waiting for data to send ... [2022/09/09-22:45:14.477] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ... [2022/09/09-22:45:15.786] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done [2022/09/09-22:45:15.787] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 10000 rows, failedRows=0 rowsWithFailedCells=0 [2022/09/09-22:45:15.787] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 10000 rows [2022/09/09-22:45:15.826] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ... [2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done [2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 20000 rows, failedRows=0 rowsWithFailedCells=0 [2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 20000 rows [2022/09/09-22:45:20.652] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:20,652 INFO Sending data (5000053) [2022/09/09-22:45:20.653] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:20,653 INFO Waiting for data to send ... [2022/09/09-22:45:20.672] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ... [2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done [2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 30000 rows, failedRows=0 rowsWithFailedCells=0 [2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 30000 rows [2022/09/09-22:45:21.436] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ... [2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done [2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 40000 rows, failedRows=0 rowsWithFailedCells=0 [2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 40000 rows [2022/09/09-22:45:26.693] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:26,692 INFO Sending data (5000052) [2022/09/09-22:45:26.694] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:26,693 INFO Waiting for data to send ... [2022/09/09-22:45:26.695] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ... [2022/09/09-22:45:26.843] [qtp1016064173-42] [INFO] [dku.jobs] - Connects using API ticket [2022/09/09-22:45:26.843] [qtp1016064173-42] [DEBUG] [dku.jobs] - Received command : /tintercom/datasets/wait-write-session [2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done [2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 50000 rows, failedRows=0 rowsWithFailedCells=0 [2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 50000 rows
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,248 Dataiker
Hi @muthu11
,Loading data into Snowflake would best be done via Cloud Storage( S3, GCS, Azure Blob) and levering fast-path this will be exponentially faster then regular SQL INSERT or COPY.
https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#writing-data-into-snowflake
This will ensure better performance but will require the mentioned prerequisites in the documentation.
Can you please confirm what storage your current CSV files are and if the pre-requisites for fast-path are met?Thanks