Access FTP location from Python Code

Solved!
sagar_dubey
Level 1

Hi,

I have an FTP connection that is already configured in Dataiku. I am able to create a dataset based on the files, but I want to access the FTP location via a Python script. Is there any way I can access the location from Python and read files from a particular path, rather than reading from a full dataset?

Any help would be really appreciated.

Thanks,

Sagar


4 Replies
VitaliyD
Dataiker

Hi Sagar,

If I understood correctly and the requirement is to access some files in one FTP folder, archive them, and move them to another FTP folder, then it isn't DSS-specific but more of a Python exercise.

You could try using the ftplib module (https://docs.python.org/3/library/ftplib.html) or another similar Python package you can find with a quick search online.

For example, with ftplib the file can be downloaded to the local /tmp folder, removed from the original FTP folder, zipped, and then uploaded to another FTP folder.
Below is a test I ran in a Python notebook on my test instance:

from ftplib import FTP
import os

ftp = FTP()
ftp.connect('host.com', 21)
ftp.login('user', 'password')
ftp.cwd('in')
filename = "file.name"
local_filename = os.path.join("/tmp", filename)
print(local_filename)
# download the file to local /tmp storage
with open(local_filename, "wb") as lf:
    ftp.retrbinary("RETR " + filename, lf.write, 8*1024)
# remove the file from the original FTP folder
ftp.delete(filename)
# do whatever is needed with the local file:
# the dataiku API, pandas or any other Python package can be used here if you need
# to perform any data manipulation and/or create a dataset from the file
# upload the file to a different directory
ftp.cwd('../out')
with open(local_filename, 'rb') as lf:
    ftp.storbinary('STOR ' + filename, lf)
ftp.quit()
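
As an illustration of the comment in the middle of that snippet, the downloaded file could also be loaded into pandas and written to a DSS dataset. This is only a minimal sketch: it assumes the downloaded file is a CSV and that a dataset named "my_output_dataset" (a placeholder name) already exists in the flow.

import dataiku
import pandas as pd

# path of the file downloaded by the snippet above (placeholder)
local_filename = "/tmp/file.name"

# read the downloaded file into a DataFrame (assumes CSV content)
df = pd.read_csv(local_filename)

# ...any pandas transformations...

# write the DataFrame into an existing DSS dataset ("my_output_dataset" is a placeholder)
out = dataiku.Dataset("my_output_dataset")
out.write_with_schema(df)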

Hopefully, this helps. If not, and I misunderstood your question, please share more details about your use case so we can understand the requirements better.

Best Regards,

Vitaliy

sagar_dubey
Level 1
Author

Thanks @VitaliyD for the response. I was trying similar code as well, and it worked for me locally, but I'm facing an issue on Dataiku; maybe there is some issue with the configuration part.

VitaliyD
Dataiker

Hi @sagar_dubey, if your DSS FTP connection is working, it means the instance has access to the FTP server, so the above code should work. I tested it in a DSS Python notebook on my test instance, and it works fine for me.

Another approach you can try is to use the FTP connection you already have together with the Dataiku API.

As a prerequisite, you should have two managed folders set up using your FTP connection, with the files you want to process and move located in one managed folder, so you can process, zip, and move them to the other managed folder. Below is the code that I tested on my DSS instance using a Python notebook and Scenarios:

import os
import dataiku
from zipfile import ZipFile

input_folder = dataiku.Folder("managed_folder_in_id")
output_folder = dataiku.Folder("managed_folder_out_id")

for file in input_folder.list_paths_in_partition():
    # read the file from the input managed folder
    with input_folder.get_download_stream(file) as f:
        data = f.read()
    # delete the file from the input folder
    input_folder.delete_path(file)
    # temp files are stored in the user's home directory and deleted after processing
    home = os.path.expanduser("~")
    filename = file.split("/")[-1]  # get the file name
    local_filename = os.path.join(home, filename)  # path for local temp storage
    # save the file to local temp storage (binary mode, since the download stream returns bytes)
    with open(local_filename, "wb") as f:
        f.write(data)
    zip_file_name = filename.split(".")[0] + ".zip"  # generate the zipped file name from the filename
    zip_local_file_name = os.path.join(home, zip_file_name)  # path for the zipped file
    # save the zipped file in local temp storage
    with ZipFile(zip_local_file_name, 'w') as zipfile:
        zipfile.write(local_filename, os.path.basename(local_filename))
    # read the zipped file back (binary mode)
    with open(zip_local_file_name, "rb") as f:
        data = f.read()
    # write the zipped file into the output managed folder
    with output_folder.get_writer(zip_file_name) as w:
        w.write(data)
    # remove the temp files from local storage
    os.remove(local_filename)
    os.remove(zip_local_file_name)
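
As a side note, the round trip through local temp files could likely be avoided by building the zip archive in memory with io.BytesIO. A minimal sketch of that variant, using the same placeholder folder IDs as above:

import io
import dataiku
from zipfile import ZipFile

input_folder = dataiku.Folder("managed_folder_in_id")
output_folder = dataiku.Folder("managed_folder_out_id")

for file in input_folder.list_paths_in_partition():
    filename = file.split("/")[-1]
    zip_file_name = filename.split(".")[0] + ".zip"
    # read the file contents straight from the input managed folder
    with input_folder.get_download_stream(file) as f:
        data = f.read()
    # build the zip archive in an in-memory buffer instead of a local temp file
    buf = io.BytesIO()
    with ZipFile(buf, 'w') as zf:
        zf.writestr(filename, data)
    # write the in-memory archive directly into the output managed folder
    with output_folder.get_writer(zip_file_name) as w:
        w.write(buf.getvalue())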

Hopefully, this will help.

Regards,

Vitaliy

tgb417

@VitaliyD 

I'm working on a similar project.

I have Zip files available on an SFTP DSS connection that I want to get into a Dataframe and eventually into a PostgreSQL dataset.

I can see the files on the DSS connection.

input_folder = dataiku.Folder("Managed_Folder_ID")
for file in input_folder.list_paths_in_partition():
    print(file)

What I'm wondering is whether I have to download the files to local storage in order to unzip them, or if there is a way to get a stream of some sort to load into zipfile for decoding into a DataFrame?

--Tom
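
For reference, one way to avoid a local download is to read the download stream into an in-memory buffer and open it with zipfile directly. This is only a minimal sketch: it assumes the archive members are CSV files, and the folder ID is the same placeholder used above.

import io
import zipfile
import dataiku
import pandas as pd

input_folder = dataiku.Folder("Managed_Folder_ID")  # placeholder folder ID

for path in input_folder.list_paths_in_partition():
    # read the whole zip archive from the managed folder into an in-memory buffer
    with input_folder.get_download_stream(path) as stream:
        buf = io.BytesIO(stream.read())
    # open the archive directly from memory, no local copy needed
    with zipfile.ZipFile(buf) as zf:
        for member in zf.namelist():
            # assumes each member is a CSV; adjust the reader to the actual format
            with zf.open(member) as member_file:
                df = pd.read_csv(member_file)
            # df can then be written to a PostgreSQL dataset with the Dataiku API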