Concurrent loading via partitions - partial data loading
I am facing partial data loading while using partitions. I am trying to do a parallel load into the same Snowflake table across 8 partitions. When I write the dataset to CSV before the SQL load, the data is intact.
I have defined 10K as the batch insert size on the table.
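For context, each "SQL insert batch" entry in the log below corresponds to one batch of 10,000 rows. Conceptually this is equivalent to a single multi-row INSERT like the hypothetical sketch here (the column names and values are made up for illustration; DSS actually sends the batch through a parameterized JDBC statement):

    -- Hypothetical sketch of one 10K-row insert batch
    INSERT INTO PUBLIC.ARTIFACTS_DATA (id, payload)
    VALUES
        (1, 'row 1'),
        (2, 'row 2'),
        -- ... 10,000 rows per batch ...
        (10000, 'row 10000');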
[2022/09/09-22:45:14.031] [qtp1016064173-915] [INFO] [dku.sql.generic] - Table PUBLIC.ARTIFACTS_DATA exists
[2022/09/09-22:45:14.217] [qtp1016064173-915] [DEBUG] [dku.connections.sql.provider] - Commit conn=SNF_TESTEXP_CONN-8qI2HBI
[2022/09/09-22:45:14.312] [qtp1016064173-915] [INFO] [com.dataiku.dip.dataflow.streaming.DatasetWriter] - Done initializing output writer
[2022/09/09-22:45:14.473] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:14,473 INFO Waiting for data to send ...
[2022/09/09-22:45:14.477] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:15.786] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:15.787] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 10000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:15.787] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 10000 rows
[2022/09/09-22:45:15.826] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 20000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:16.678] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 20000 rows
[2022/09/09-22:45:20.652] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:20,652 INFO Sending data (5000053)
[2022/09/09-22:45:20.653] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:20,653 INFO Waiting for data to send ...
[2022/09/09-22:45:20.672] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 30000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:21.398] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 30000 rows
[2022/09/09-22:45:21.436] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 40000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:22.188] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 40000 rows
[2022/09/09-22:45:26.693] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:26,692 INFO Sending data (5000052)
[2022/09/09-22:45:26.694] [null-err-721] [INFO] [dku.utils] - 2022-09-09 22:45:26,693 INFO Waiting for data to send ...
[2022/09/09-22:45:26.695] [qtp1016064173-915] [INFO] [dku.output.sql] - Executing SQL insert batch ...
[2022/09/09-22:45:26.843] [qtp1016064173-42] [INFO] [dku.jobs] - Connects using API ticket
[2022/09/09-22:45:26.843] [qtp1016064173-42] [DEBUG] [dku.jobs] - Received command : /tintercom/datasets/wait-write-session
[2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.output.sql] - SQL insert batch done
[2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.output.sql] - appended 50000 rows, failedRows=0 rowsWithFailedCells=0
[2022/09/09-22:45:27.468] [qtp1016064173-915] [INFO] [dku.datasets] - Emitted 50000 rows
Answers
Alexandru (Dataiker)
Hi @muthu11,
Loading data into Snowflake is best done via cloud storage (S3, GCS, or Azure Blob) while leveraging the fast-path; this will be significantly faster than regular SQL INSERT batches.
https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#writing-data-into-snowflake
This will give much better performance, but it requires meeting the prerequisites described in the documentation.
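Under the hood, the fast path stages the files in cloud storage and has Snowflake bulk-load them, rather than sending row batches over JDBC. A minimal sketch of that pattern, assuming an external stage named my_csv_stage pointing at your cloud storage (the stage name, path, and file format options here are hypothetical):

    -- Bulk-load staged CSV files into the target table
    COPY INTO PUBLIC.ARTIFACTS_DATA
    FROM @my_csv_stage/artifacts/
    FILE_FORMAT = (
        TYPE = CSV
        SKIP_HEADER = 1
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    )
    ON_ERROR = 'ABORT_STATEMENT';

Snowflake loads the staged files in parallel within a single atomic statement, which is why this approach outperforms the long runs of 10K-row INSERT batches visible in your log.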
Can you please confirm which storage your current CSV files are on, and whether the prerequisites for the fast-path are met?
Thanks