Reading multiple large files from SFTP

Solved!
SaschaS
Level 2

Hi,

I would like to combine data from an SFTP server into a dataset.
The files are available on the server as zipped CSV files.

A new file is added every day, named like "filename_YYYY-mm-dd.csv.zip".

Building the dataset takes a long time (several hours) because the files are very large.
Is there a way to import a kind of delta, so that not all files are fetched from the server every time and only the newest file is added to the dataset?

Best regards
Sascha

5 Replies
MiguelangelC
Dataiker

There is no need to fetch all the files from the SFTP server every day. For example, you can save the result of combining the files on the DSS instance host.

Then you can use a "New SFTP dataset" or a "Download recipe" into a managed folder to fetch only the new files from the remote server and combine them with the ones already processed on previous days.

You can use a scenario to automate the process so it runs on a daily basis.
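
For illustration, here is a minimal Python sketch of that fetch step (the host, credentials, and paths are placeholders, and it assumes the paramiko library rather than any DSS-specific API):

```python
# Minimal sketch: download only yesterday's file from the SFTP server.
# Host, credentials, and remote directory are placeholders.
import datetime
import paramiko

HOST = "sftp.example.com"   # placeholder
PORT = 22
USER = "sftp_user"          # placeholder
PASSWORD = "secret"         # placeholder
REMOTE_DIR = "/data"        # placeholder

# Derive yesterday's file name from the known naming pattern.
yesterday = datetime.date.today() - datetime.timedelta(days=1)
filename = "filename_{}.csv.zip".format(yesterday.strftime("%Y-%m-%d"))

transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USER, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)
try:
    # Fetch only the newest file instead of re-downloading everything.
    sftp.get(REMOTE_DIR + "/" + filename, filename)
finally:
    sftp.close()
    transport.close()
```

The files already processed stay on the DSS host, so each daily run only needs to transfer one new file.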

SaschaS
Level 2
Author

Hi MiguelangelC,

I'm not sure I fully understand this.
I create an SFTP dataset with all the data on the server, and then a second one which only contains the last day.
Tomorrow the second dataset would only contain yesterday's data, but the SFTP dataset would only contain the data up to the day before yesterday.

Can you maybe send me a screenshot with an example, so I can better understand what you mean?

SaschaS
Level 2
Author

Maybe someone else can give me a hint on how to solve my problem. 🙂

AlexT
Dataiker

Hi @SaschaS ,

You may be able to leverage partitioned folders in this case. 

If new files are added daily, and they all follow the pattern YYYY-MM-DD, e.g.:

[Screenshots: folder partitioning settings and the resulting list of date partitions]
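
In text form, a day-level partitioning pattern for these file names might look something like this (assuming Dataiku's %Y/%M/%D time-dimension tokens; the prefix is the one from your example):

```
filename_%Y-%M-%D.csv.zip
```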

Here is a sample flow: partitioned folder + files-in-folder dataset + sync recipe to the partitioned dataset.
When I build the latest partition via a scenario (e.g. LAST_DAY), it only picks up the files from the last day.
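
If you prefer a code-based step, a rough equivalent in a custom Python scenario step could look like this (the dataset name is a placeholder, and it assumes the dataiku.scenario API):

```python
# Sketch of a custom Python scenario step: build only yesterday's
# partition of the synced dataset. "partitioned_dataset" is a placeholder.
import datetime
from dataiku.scenario import Scenario

# Compute yesterday's partition id, matching the YYYY-MM-DD pattern.
yesterday = datetime.date.today() - datetime.timedelta(days=1)
partition_id = yesterday.strftime("%Y-%m-%d")

# Runs inside a scenario; builds only the requested partition.
Scenario().build_dataset("partitioned_dataset", partitions=partition_id)
```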

[Screenshot: sample flow with the partitioned folder, files-in-folder dataset, and sync recipe]

SaschaS
Level 2
Author

Hi @AlexT,

Thanks for the help.

Your explanation helped me.
If I understand everything correctly, I have to do a complete initial import of all partitions and can then use a scenario to import yesterday's partition.
I've experimented a bit, and it seems to be working.
Do you have any tips on what I should pay attention to that might not be obvious to me?

Best regards
Sascha
