Import dynamic dataset from SFTP

Solved!
clu320
Level 1

Hi Team,

I am a new user to Dataiku and here is my question:

I will have a daily incoming dataset that needs to be loaded into Dataiku; I then need to do some ETL on it and export the final dataset. The daily files are uploaded automatically to an SFTP server with date-based file names such as "0719" or "0810". How can I make Dataiku recognize that, although the file name changes every day, it only needs to read the newly uploaded file?

I would really appreciate all of your help and answers! Thank you so much!

1 Solution
adamnieto

Hi @clu320 , 

I am not sure whether you are already using a DSS managed folder, but I think one will come in handy here. Once you have the managed folder, you can create a Python recipe that takes it as input and decides, based on the current date and time, which files to place in an output folder. You can also leverage DSS project custom variables, as described in the DSS documentation (https://doc.dataiku.com/dss/latest/variables/index.html#python), to help with the Python logic. Your flow should look something like the flow_example.PNG screenshot below. Make sure the Python recipe deletes the current files in the output folder before it looks for new ones, especially if you are running this daily. Here is some pseudocode, with a concrete Python sketch after the list:

1. Look at your DSS project variables to see whether there are any files you haven't processed before. Alternatively, use today's date to find just the new files, based on their names or their creation dates if possible.

2. Put these new files in the output folder.

3. Continue your flow with whatever logic you need.
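
To make that concrete, here is a minimal sketch of what the Python recipe could look like. The managed folder names "incoming_sftp" and "todays_files" are placeholders for your own folders, and it assumes the file names start with the date as MMDD, like in your "0719"/"0810" example:

```python
# Minimal sketch of the Python recipe -- the folder names and the MMDD
# file-name convention are assumptions, adjust them to your setup.
import datetime
import dataiku

input_folder = dataiku.Folder("incoming_sftp")   # managed folder pointing at the SFTP location
output_folder = dataiku.Folder("todays_files")   # managed folder the rest of the flow reads

# Delete the current files in the output folder before looking for new ones
output_folder.clear()

# Files are named with the date, e.g. "0719..." for July 19, so build
# today's expected prefix and match file names against it
today_prefix = datetime.date.today().strftime("%m%d")

for path in input_folder.list_paths_in_partition():
    if path.lstrip("/").startswith(today_prefix):
        # Stream the matching file from the input folder to the output folder
        with input_folder.get_download_stream(path) as stream:
            output_folder.upload_stream(path, stream)
```

Streaming with get_download_stream/upload_stream keeps this working even when the folders are not on the local filesystem (for example, when the input folder points directly at the SFTP connection).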

[Screenshot: flow_example.PNG, the input managed folder feeding a Python recipe that writes to an output managed folder]
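
For step 1 of the pseudocode, here is a hedged sketch of tracking the last processed file in a project variable so the next run can skip files it has already seen. The variable name "last_processed_file" is just an example:

```python
# Sketch of remembering the last processed file in a project variable --
# the variable name "last_processed_file" is an assumption.
import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

variables = project.get_variables()
last_seen = variables["standard"].get("last_processed_file")

# ... after the recipe has copied a new file, record it for the next run ...
variables["standard"]["last_processed_file"] = "0810.csv"  # placeholder file name
project.set_variables(variables)
```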

In addition, you may want to leverage scenarios so that your DSS project runs daily. There is documentation on scenarios in the DSS reference docs, and an Academy course on scenarios as well.
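
A time-based trigger (e.g. every morning) is the simplest way to run the scenario daily. If the upload time varies, you could instead try a custom Python trigger that fires only once today's file has actually landed; a sketch under the same folder-name assumption:

```python
# Sketch of a custom Python trigger (Scenario > Triggers > Custom trigger)
# that fires when a file matching today's date appears in the folder.
# "incoming_sftp" is the same placeholder folder name as above.
import datetime
import dataiku
from dataiku.scenario import Trigger

folder = dataiku.Folder("incoming_sftp")
today_prefix = datetime.date.today().strftime("%m%d")

if any(p.lstrip("/").startswith(today_prefix) for p in folder.list_paths_in_partition()):
    Trigger().fire()
```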

Hope this helps! 

 
