File from SFTP to Dataiku csv/excel on daily basis.

vinayk
vinayk Partner, Dataiku DSS Core Designer, Registered Posts: 12 Partner

Hello Community,

I need a help in regards to one scenario, where in we will be updating the csv/excel file in SFTP server and that updated file has to uploaded to Dataiku on daily basis, what is the way to achieve this.

Any help would be much appreciated.

Thank you.

Best Answer

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
    edited July 17 Answer ✓

    Hi @vinayk
    ,

    OK, so, three options:

    1. If you have a "Files-based" SCP/SFTP connection to the machine where this file lives in, create a new dataset using this connection and the "New SFTP dataset" option.
      1. Be sure to show the "advanced options" and explicitly select the file you want to use as the source of your new dataset.
      2. To be sure that the file gets to be updated on daily basis, I'd create a Sync recipe, and then be sure to make it run in your scenario.
    2. If your instance doesn't have a "SCP/SFTP" connection configured, but you have the url and credentials, use the "Download Recipe" to get the files from the remote server, with the "Download from HTTP or FTP URL".
      1. This will download the files into a managed folder, from where you will need to create a new dataset using the downloaded file
    3. Use a python recipe to read the csv/excel file from the remote server, use pandas to convert it into a dataframe, and store it in a dataset. Using the paramiko library, this could be done like this:

    import dataiku
    ​import paramiko
    import pandas as pd
    
    client_sftp = paramiko.SSHClient()
    client_sftp.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client_sftp.connect('host.name',
                   username='user',
                   password='password')
    sftp = client_sftp.open_sftp()
    
    # if the file is csv:
    tmp_df = pd.read_csv(sftp.open('path_to_the_csv_file/filename.csv'))
    
    result_dataset = dataiku.Dataset("result_dataset")
    result_dataset.write_with_schema(tmp_df)

    Hope this helps!

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @vinayk
    ,

    Is it a single file or multiple files? Does the name change?

    Using the download recipe might be a good option, but somehow I do prefer to use a python recipe to do the trick.

    While you give us some more details, maybe you can also find useful information in this post:

    Cheers!

  • vinayk
    vinayk Partner, Dataiku DSS Core Designer, Registered Posts: 12 Partner

    Thank you @Ignacio_Toledo
    , that is a single file and name remains same.

  • vinayk
    vinayk Partner, Dataiku DSS Core Designer, Registered Posts: 12 Partner

    Thank you @Ignacio_Toledo
    , this helps.

Setup Info
    Tags
      Help me…