Reading multiple large files from SFTP
Hi,
I would like to combine data from an SFTP server into a dataset.
The files are available on the server as zipped CSV files.
A new file named like "filename_YYYY-mm-dd.csv.zip" is added every day.
Building the dataset takes a long time (several hours) because the files are very large.
Is there a way to import only a delta, so that just the newest file is added to the dataset and not all files are fetched from the server every time?
Best regards
Sascha
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @SaschaS,
You may be able to leverage partitioned folders in this case, provided new files are added daily and they all follow the YYYY-MM-DD pattern.
Here is a sample flow: a partitioned folder, a "Files in Folder" dataset, and a Sync recipe into the partitioned dataset.
When I build the latest partition via a scenario (e.g. LAST_DAY), it picks up only the files from the last day.
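To illustrate the idea behind date partitioning, here is a minimal Python sketch of how a file name can be mapped to a day partition and how only one day's files get selected. It is plain Python, not DSS code; the regular expression and the helper names `partition_of`, `files_for_partition` and `previous_day` are assumptions made for this example, based on the "filename_YYYY-mm-dd.csv.zip" naming from the question.

```python
import re
from datetime import date, timedelta

# Assumed naming scheme from the question: "filename_YYYY-mm-dd.csv.zip".
FILE_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<day>\d{4}-\d{2}-\d{2})\.csv\.zip$")


def partition_of(file_name):
    """Return the day partition (YYYY-MM-DD) encoded in a file name, or None."""
    match = FILE_PATTERN.match(file_name)
    return match.group("day") if match else None


def files_for_partition(file_names, day):
    """Keep only the files that belong to the given day partition."""
    return [name for name in file_names if partition_of(name) == day]


def previous_day(today):
    """Return yesterday's partition identifier relative to `today`."""
    return (today - timedelta(days=1)).isoformat()


if __name__ == "__main__":
    # Hypothetical listing of the remote folder.
    listing = [
        "filename_2023-05-01.csv.zip",
        "filename_2023-05-02.csv.zip",
        "readme.txt",
    ]
    # Only the file for the day before 2023-05-03 is selected.
    print(files_for_partition(listing, previous_day(date(2023, 5, 3))))
```

Building a single partition works the same way conceptually: only the files whose name matches the requested day are read, so the older files are left alone.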
Answers
-
Miguel Angel Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118 Dataiker
There is no need to fetch all the files from the SFTP server every day. For example, you can save the combined result of the files on the DSS instance host.
Then you can use a "New SFTP dataset" or a "Download Recipe" into a "Managed Folder" to fetch just the new files from the remote server and combine them with the ones already processed on previous days.
You can use a scenario to automate the process so that it runs daily.
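As a rough sketch of the "only fetch what is new" logic described above, the following Python keeps a small manifest of file names that were already processed and compares it with the current remote listing. It is not DSS code and does not talk to an SFTP server; the manifest file and the helper names `load_processed`, `select_new_files` and `mark_processed` are assumptions made for this example.

```python
import json
from pathlib import Path


def load_processed(manifest_path):
    """Read the set of already processed file names (empty if no manifest yet)."""
    path = Path(manifest_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def select_new_files(remote_names, processed):
    """Return the zipped CSV files that have not been processed yet, sorted by name."""
    return sorted(
        name
        for name in remote_names
        if name.endswith(".csv.zip") and name not in processed
    )


def mark_processed(manifest_path, processed, new_names):
    """Add the newly handled files to the manifest and return the updated set."""
    updated = sorted(processed | set(new_names))
    Path(manifest_path).write_text(json.dumps(updated))
    return set(updated)
```

A daily run would list the remote folder, call `select_new_files`, download and append only those files, and then call `mark_processed`. Because the dates in the file names sort lexicographically, the sorted result is also in chronological order.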
-
Hi MiguelangelC,
I'm not sure I fully understand this.
I create an SFTP dataset with all the data on the server, then a second one that contains only the last day.
Tomorrow, the second dataset would contain only the data from yesterday, but the SFTP dataset would contain only the data up to the day before yesterday.
Could you maybe send me a screenshot with an example, so that I can better understand what you mean?
-
Maybe someone else can give me a hint on how to solve my problem.
-
Hi @AlexT,
Thanks for the help.
Your explanation helped me.
If I understand everything correctly, I have to do a complete initial import of all partitions and can then use a scenario to import yesterday's partition each day.
I've experimented a bit and it seems to be working.
Do you have any tips or hints on what I should pay attention to that might not be obvious to me?
Best regards
Sascha