Reading multiple large files from SFTP

SaschaS
SaschaS Registered Posts: 12 ✭✭✭✭

Hi,

I would like to combine data from an SFTP server into a dataset.
The files are available on the server as zipped CSV files.

A new file is added every day, named like "filename_YYYY-mm-dd.csv.zip".

Building the dataset takes a long time (several hours) because the files are very large.
Is there a way to import only a delta, so that not all files are fetched from the server every time and only the newest file is added to the dataset?

Best regards
Sascha

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
    Answer ✓

    Hi @SaschaS,

    You may be able to leverage partitioned folders in this case.

    If new files are added daily and their names all contain the date as YYYY-MM-DD, the folder can be partitioned by day, e.g.:

    [Screenshots: Screenshot 2022-08-10 at 12.31.05.png, Screenshot 2022-08-10 at 12.31.14.png]

    Here is a sample flow: a partitioned folder, a files-in-folder dataset, and a sync recipe into the partitioned dataset.
    When I build the latest partition via a scenario, e.g. LAST_DAY, it picks up only the files from the last day.

    [Screenshot: Screenshot 2022-08-10 at 12.32.28.png]
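
To make the idea concrete, here is a minimal sketch in plain Python (not Dataiku's API) of what day-based partitioning amounts to: each file name is mapped to a day identifier, and building the last day selects only the files of that day. The file name prefix comes from the question; the names `partition_of` and `files_for_last_day` are hypothetical, and the sketch assumes that "last day" means the day before the run date.

```python
import re
from datetime import date, timedelta

# Assumed naming scheme, taken from the question: filename_YYYY-MM-DD.csv.zip
FILE_PATTERN = re.compile(r"^filename_(\d{4})-(\d{2})-(\d{2})\.csv\.zip$")


def partition_of(file_name):
    """Return the day identifier (YYYY-MM-DD) of a file, or None if it does not match."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        # date() rejects impossible dates such as month 13
        return date(year, month, day).isoformat()
    except ValueError:
        return None


def files_for_last_day(file_names, today):
    """Keep only the files whose partition is the day before `today` (assumption)."""
    target = (today - timedelta(days=1)).isoformat()
    return [name for name in file_names if partition_of(name) == target]
```

With a listing that contains one file per day, `files_for_last_day(listing, date.today())` would return a single file name, which is why a build of one partition does not need to read the older files.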

Answers

  • Miguel Angel
    Miguel Angel Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118 Dataiker

    There is no need to fetch all the files from the SFTP server every day. For example, you can save the result of combining the files on the DSS instance host.

    Then you can use a "New SFTP dataset" or a "Download Recipe" into a "Managed Folder" to fetch only the new files from the remote server and combine them with the ones already processed on previous days.

    You can use a scenario to automate the process so that it runs daily.
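
The "fetch only the new files" step boils down to a set difference between what the server lists and what has already been processed. The following is a rough sketch in plain Python, not a Dataiku recipe; `select_new_files` and `mark_processed` are hypothetical helper names, and how the remote listing is obtained is left out.

```python
def select_new_files(remote_files, processed_files):
    """Return the remote file names that have not been processed yet, sorted by name."""
    processed = set(processed_files)
    return sorted(name for name in remote_files if name not in processed)


def mark_processed(processed_files, new_files):
    """Return the updated, sorted list of processed names after a successful run."""
    return sorted(set(processed_files) | set(new_files))
```

Because the date is part of each file name in ISO order, sorting by name also sorts the new files chronologically.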

  • SaschaS
    SaschaS Registered Posts: 12 ✭✭✭✭

    Hi MiguelangelC,

    I'm not sure I fully understand this.
    I create an SFTP dataset with all the data on the server, then a second one that contains only the last day.
    Tomorrow the second dataset would contain only yesterday's data, but the first SFTP dataset would still contain only the data up to the day before yesterday.

    Could you send me a screenshot with an example so that I can better understand what you mean?

  • SaschaS
    SaschaS Registered Posts: 12 ✭✭✭✭

    Maybe someone else can give me a hint on how to solve my problem.

  • SaschaS
    SaschaS Registered Posts: 12 ✭✭✭✭

    Hi @AlexT,

    Thanks for the help.

    Your explanation helped me.
    If I understand everything correctly, I first have to do a complete import of all partitions and can then use a scenario to import only yesterday's partition.
    I've experimented a bit and it seems to be working.
    Do you have any tips on what I should pay attention to that might not be obvious to me?

    Best regards
    Sascha
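
The two phases described in this post (one complete import, then a daily increment) can be expressed as a single rule: build every day up to yesterday that has not been built yet. This is only a sketch in plain Python under that assumption; `day_range` and `partitions_to_build` are hypothetical names, and how the list of already built partitions is obtained is not shown.

```python
from datetime import date, timedelta


def day_range(first_day, last_day):
    """All day identifiers (YYYY-MM-DD) from first_day to last_day, inclusive."""
    n_days = (last_day - first_day).days + 1
    return [(first_day + timedelta(days=i)).isoformat() for i in range(max(n_days, 0))]


def partitions_to_build(first_day, today, already_built):
    """Days from first_day up to yesterday that are not in already_built."""
    yesterday = today - timedelta(days=1)
    built = set(already_built)
    return [day for day in day_range(first_day, yesterday) if day not in built]
```

On the first run, with nothing built, this returns every day since `first_day`; on later runs it returns only yesterday, plus any earlier day that a failed or skipped run left out.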
