SFTP Site with .Zip files (with more than just data in the .zip file)

tgb417
SFTP Site with .Zip files (with more than just data in the .zip file)

I'm receiving data from an external partner.

They have setup an SFTP file server for me to get the data.

They .zip the .tsv files that I'm expecting.

However, they also add other documents in the .zip file that are not the data I need for my process.  Basicly a data dictionary for the data they are providing.

From this page:

https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html

I understand how to use an sftp connector.  And it is working OK.

From this page:

https://doc.dataiku.com/dss/latest/connecting/connections.html

I understand that DSS is able to open .zip files and get at the content.

However, I'm not clear if I can control which file/files in the .zip file will be treated as data.

If that will not work.  Are there other workarounds to automate the downloading of the .zip file extracting the needed data file? And getting the data into a DSS Data flow.

Love to hear folks' thoughts.  I'll try to report back if I end up creating a solution for this.

--Tom

 

--Tom
0 Kudos
4 Replies
AlexT
Dataiker

Hi @tgb417 ,

Exclusion rules for files within a zip file are not available. 

As you suggest using a python recipe with the SFTP managed folder should do the trick.  Unzip them locally or in memory, including or excluding files as needed. 

Here is a very basic example, using ByteIO, so no temp files are needed on disk. This example assumes you have a single data file ".csv" you want to open. 

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from zipfile import ZipFile
from io import BytesIO
import fnmatch


mf = dataiku.Folder("NP2gVabt")
mf_info = mf.get_info()
paths = mf.list_paths_in_partition()

#hardcoding file name for testing you can use paths to determine and loop trough all files in the managed folder

zipped_file = "my-zip.zip"
    
with mf.get_download_stream(zipped_file) as stream:
    zipObj = ZipFile(BytesIO(stream.read()))
    
    #list files in zip
    file_list = zipObj.namelist()
    
    #use fnmatch.filter to get the csv file
    csv_file = fnmatch.filter(file_list, "*.csv")
    
    #get the csv data
    data = zipObj.open(*csv_file)

#read into dataframe
df = pd.read_csv(data)

#write to dataset
py_recipe_output = dataiku.Dataset("my_dataset")
py_recipe_output.write_with_schema(df)

 

 

Screenshot 2021-10-27 at 11.54.23.png

Let me know if you have any questions. Hope this help!

tgb417
Author

@AlexT 

Over the weekend I had figured out much of this. But Iโ€™m dropping the file to the working directory, of either the python recipient or the Jupiter notebook,   I had not figured out how to do the BytesIO bit.  (Although I tried to do things like this, without success.). For now Iโ€™ve got the job done.  However, when I get a bit of time I will investigate this solution more completely.  Iโ€™ll let you know how I get on with this when I have a moment. For now I just want to say thank you for the response.   

--Tom
azamora
Level 2

Hi @tgb417 ,

If you are able to decompress all your .zip files content, then if you create a dataset (Dataset - Internal - Files from Folder) you should be able to select which files you want to use to create the dataset, see image below.

Hope this helps,

0 Kudos
tgb417
Author

@azamora

Thanks for your insight.  I see you posted over on this thread about something similar.  Which I responded to prior to seeing you posted here.

In my case, the Zip files had multiple files in them only some of these were actual data. So I could not just create a SFTP data object that would itself dig into the .zip file and pull out the data. I ended up with a flow like this.  The download step got multiple zip files /*.zip.  The python step used the .zipfile python library to root around in the zip file.  I did not take this further the project I was doing, is complete for now.

Dealing with Zip Files.png

 

 

--Tom
0 Kudos