Optimize SQL Server uploads

pnaik1
Level 3

Hi all, 

Currently, when I write data (residing on S3) to SQL Server using a sync recipe, it takes approximately 60 minutes to write 1.8 million rows. Is there any way to optimize this and reduce the time?

Thanks!

Clément_Stenac

Hi,

Unfortunately, there is no specific capability in DSS that would significantly speed this up. Inserting data into SQL Server takes time.

If possible, we would recommend processing your data on S3 first in order to reduce it, so that you only load the minimal amount of data into SQL Server.

Marlan

Hi @pnaik1,

If you are OK working with Python and SQL, you may be able to develop a process in which the data is first written from S3 to a local text file and then loaded into SQL Server using a bulk load approach.

We have developed this sort of process for loading data into Netezza (where inserts are particularly slow and costly) and we do see a pretty big speed improvement.

Let me know if you are interested and I can share more information.

Marlan

 

pnaik1
Level 3
Author

Hi @Marlan Thanks for the reply. Sure, whatever works.

Pardeep

Marlan

Hi @pnaik1,

OK, here's what we are doing. I can't say what sort of speed improvement you'd see, but for us it was quite substantial. The speedup came from shifting from line-by-line inserts into the SQL table to using Netezza's bulk load functionality.

The approach involves two steps. The first step writes the source data to a text file in a DSS managed folder. How this is done depends on where the source data is stored. Given that yours is stored in S3, I'd think you could use a download recipe to do this.
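
If the download recipe doesn't pan out, a small Python recipe can do the same job. Here is a rough sketch, assuming your S3 data is already available as a DSS dataset, that it fits in memory, and that the dataset and folder names (placeholders here) match yours:

import os

import dataiku

# Read the S3-backed dataset and write it into the managed folder as a tab-delimited text file
src_dataset = dataiku.Dataset('S3_INPUT')        # placeholder name for your S3 dataset
output_folder = dataiku.Folder('Output_Files')   # same managed folder used by the load step below

df = src_dataset.get_dataframe()
output_path = os.path.join(output_folder.get_path(), 'output.csv')
df.to_csv(output_path, sep='\t', index=False, header=False)  # no header row, tab-delimited to match the load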

The second step executes bulk load from the text file into the SQL database. This step is handled by a Python recipe.

Note that this approach will only work if the SQL Server instance you are writing to is able to access the text file you downloaded in the first step. The easiest case is if the SQL Server instance can access the DSS managed folder -- this is what is assumed in the Python script given below. If not, you might be able to copy that file to a location that the SQL Server instance can access.

The following Python script shows how we have done this using Netezza's bulk load statement. The input for the recipe would be the managed folder that contains the text file (stored there in the first step) and the output would be the destination table in SQL Server. You would need to adjust the SQL part of this recipe to use SQL Server's bulk load statement.

import os

import dataiku
from dataiku.core.sql import SQLExecutor2

#
# Get input folder/file details 
#
input_folder = dataiku.Folder('Output_Files')
input_folder_path = input_folder.get_path()
input_path_and_name = os.path.join(input_folder_path, 'output.csv')

#
# Specify SQL statements that will execute data load
# 

# Get SQL table name of output dataset 
output_dataset = dataiku.Dataset('OUTPUT')
sql_table_name = output_dataset.get_location_info()['info']['table'] # projectKey already resolved

# Load SQL statements including dropping and recreating table
pre_queries = """
DROP TABLE {sql_table_name} IF EXISTS;
COMMIT;

CREATE TABLE {sql_table_name} AS
SELECT 
    COL1,
    COL2,
    COL3
FROM EXTERNAL '{input_path_and_name}'
    (COL1 INTEGER,
    COL2 DATE,
    COL3 INTEGER)
USING (
    DELIM 9 -- tab character
    QUOTEDVALUE DOUBLE
    REMOTESOURCE 'JDBC');
COMMIT -- required in Netezza so above statements are executed and not rolled back
"""

# Replace variables
pre_queries = pre_queries.format(sql_table_name=sql_table_name, input_path_and_name=input_path_and_name)
print(pre_queries)

# Pre queries must be a list (convert to one statement per list item) 
pre_queries = pre_queries.split(';')


#
# Specify final query (must always have one)
# 

# Use final query to check that all records got loaded
query = "SELECT COUNT(*) AS REC_CNT FROM {sql_table_name}".format(sql_table_name=sql_table_name)
print(query)


#
# Execute queries
# 

executor = SQLExecutor2(dataset='OUTPUT')
result_df = executor.query_to_df(query, pre_queries=pre_queries) 
print(result_df.at[0, 'REC_CNT'])
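
For SQL Server specifically, the pre_queries section above is the part you'd swap out. I haven't tested this against SQL Server myself, but a rough sketch of the equivalent would be a TRUNCATE followed by a BULK INSERT. This assumes the output table already exists with the right columns (e.g. created by DSS from the output dataset's schema) and that the file path is one the SQL Server machine itself can see:

# Untested sketch of a SQL Server equivalent of the load statements above
pre_queries = """
TRUNCATE TABLE {sql_table_name};

BULK INSERT {sql_table_name}
FROM '{input_path_and_name}'
WITH (
    FIELDTERMINATOR = '\\t',  -- tab-delimited file, matching the export step
    ROWTERMINATOR = '\\n',
    FIRSTROW = 1              -- bump to 2 if the file has a header row
);
"""

# Replace variables and split into one statement per list item, dropping empty entries
pre_queries = pre_queries.format(sql_table_name=sql_table_name, input_path_and_name=input_path_and_name)
pre_queries = [q for q in pre_queries.split(';') if q.strip()]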

 

While this approach adds an extra step of downloading the source data to a managed folder, the speed gained from the bulk load more than makes up for it.

Let me know if you have questions.

Marlan

pnaik1
Level 3
Author

Hi @Marlan 

Thanks a lot for your solution, really appreciated. But I am getting an error at the first step. The error is:

"Job failed: Failed to set modification time of path '/out-s0.csv.gz', caused by: IOException: Failed to send command to kernel, caused by: SocketException: Broken pipe (Write failed)"

error.png

Marlan

Hi @pnaik1,

We don't use S3, so unfortunately I won't be much help with this step. You should be able to get help on it (as it's a standard DSS recipe) either here on the community or perhaps by opening a ticket with DSS support.

That said, it looks like the files are compressed. Not too surprising, but it does mean we'd need to add a step to decompress them before the SQL Server bulk load step could be used. This can be done within Python, of course, with the right package. I don't see any reason why this wouldn't work, but it'll be some additional work to put together.
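
For what it's worth, the decompression itself is only a few lines using Python's standard library. A rough sketch, assuming the downloaded file is a gzipped CSV sitting in the same managed folder (the file names here are placeholders):

import gzip
import os
import shutil

import dataiku

folder = dataiku.Folder('Output_Files')
gz_path = os.path.join(folder.get_path(), 'out-s0.csv.gz')  # compressed file from the download step
csv_path = os.path.join(folder.get_path(), 'output.csv')    # plain text file for the bulk load step

# Stream the decompression so the whole file never has to sit in memory
with gzip.open(gz_path, 'rb') as f_in, open(csv_path, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)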

Marlan

allan
Level 3

One thing to note is that if there is no schema information available, string columns will default in DSS to NVARCHAR(MAX), which is slow and chews up storage. If you have definitive data sizes for your string columns, you should set them in the schema. The length is stored in the schema of the dataset: go to the "Schema" tab of the dataset settings, select your column, and in the right part of the screen update the "max length".
