Discover this year's submissions to the Dataiku Frontrunner Awards and give kudos to your favorite use cases and success stories!READ MORE

File from SFTP to Dataiku csv/excel on daily basis.

Solved!
vinayk
Level 2
Level 2
File from SFTP to Dataiku csv/excel on daily basis.

Hello Community,

I need a help in regards to one scenario, where in we will be updating the csv/excel file in SFTP server and that updated file has to uploaded to Dataiku on daily basis, what is the way to achieve this.

Any help would be much appreciated.

Thank you.

0 Kudos
1 Solution
Ignacio_Toledo

Hi @vinayk,

OK, so, three options:

  1. If you have a "Files-based" SCP/SFTP connection to the machine where this file lives in, create a new dataset using this connection and the "New SFTP dataset" option.
    1. Be sure to show the "advanced options" and explicitly select the file you want to use as the source of your new dataset.
    2. To be sure that the file gets to be updated on daily basis, I'd create a Sync recipe, and then be sure to make it run in your scenario.
  2. If your instance doesn't have a "SCP/SFTP" connection configured, but you have the url and credentials, use the "Download Recipe" to get the files from the remote server, with the "Download from HTTP or FTP URL".
    1. This will download the files into a managed folder, from where you will need to create a new dataset using the downloaded file
  3. Use a python recipe to read the csv/excel file from the remote server, use pandas to convert it into a dataframe, and store it in a dataset. Using the paramiko library,  this could be done like this:

 

import dataiku
​import paramiko
import pandas as pd

client_sftp = paramiko.SSHClient()
client_sftp.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client_sftp.connect('host.name',
               username='user',
               password='password')
sftp = client_sftp.open_sftp()

# if the file is csv:
tmp_df = pd.read_csv(sftp.open('path_to_the_csv_file/filename.csv'))

result_dataset = dataiku.Dataset("result_dataset")
result_dataset.write_with_schema(tmp_df)

 

Hope this helps!

View solution in original post

4 Replies
Ignacio_Toledo

Hi @vinayk,

Is it a single file or multiple files? Does the name change?

Using the download recipe might be a good option, but somehow I do prefer to use a python recipe to do the trick. 

While you give us some more details, maybe you can also find useful information in this post: Import dynamic dataset from SFTP 

Cheers!

0 Kudos
vinayk
Level 2
Level 2
Author

Thank you @Ignacio_Toledo, that is a single file and name remains same.

0 Kudos
Ignacio_Toledo

Hi @vinayk,

OK, so, three options:

  1. If you have a "Files-based" SCP/SFTP connection to the machine where this file lives in, create a new dataset using this connection and the "New SFTP dataset" option.
    1. Be sure to show the "advanced options" and explicitly select the file you want to use as the source of your new dataset.
    2. To be sure that the file gets to be updated on daily basis, I'd create a Sync recipe, and then be sure to make it run in your scenario.
  2. If your instance doesn't have a "SCP/SFTP" connection configured, but you have the url and credentials, use the "Download Recipe" to get the files from the remote server, with the "Download from HTTP or FTP URL".
    1. This will download the files into a managed folder, from where you will need to create a new dataset using the downloaded file
  3. Use a python recipe to read the csv/excel file from the remote server, use pandas to convert it into a dataframe, and store it in a dataset. Using the paramiko library,  this could be done like this:

 

import dataiku
​import paramiko
import pandas as pd

client_sftp = paramiko.SSHClient()
client_sftp.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client_sftp.connect('host.name',
               username='user',
               password='password')
sftp = client_sftp.open_sftp()

# if the file is csv:
tmp_df = pd.read_csv(sftp.open('path_to_the_csv_file/filename.csv'))

result_dataset = dataiku.Dataset("result_dataset")
result_dataset.write_with_schema(tmp_df)

 

Hope this helps!

vinayk
Level 2
Level 2
Author

Thank you @Ignacio_Toledo , this helps.