SFTP Site with .Zip files (with more than just data in the .zip file)

I'm receiving data from an external partner.
They have set up an SFTP file server for me to get the data.
They .zip the .tsv files that I'm expecting.
However, they also add other documents to the .zip file that are not the data I need for my process: basically, a data dictionary for the data they are providing.
From this page:
https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html
I understand how to use an SFTP connector, and it is working OK.
From this page:
https://doc.dataiku.com/dss/latest/connecting/connections.html
I understand that DSS is able to open .zip files and get at the content.
However, I'm not clear whether I can control which file or files in the .zip file will be treated as data.
If that will not work, are there other workarounds to automate downloading the .zip file, extracting the needed data file, and getting the data into a DSS flow?
Love to hear folks' thoughts. I'll try to report back if I end up creating a solution for this.
--Tom
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,349 Dataiker
Hi @tgb417,
Exclusion rules for files within a zip file are not available.
As you suggest, using a Python recipe with the SFTP managed folder should do the trick. Unzip the files locally or in memory, including or excluding files as needed.
Here is a very basic example using BytesIO, so no temp files are needed on disk. This example assumes you have a single data file (".csv") that you want to open.
```python
import fnmatch
from io import BytesIO
from zipfile import ZipFile

import dataiku
import pandas as pd

mf = dataiku.Folder("NP2gVabt")
paths = mf.list_paths_in_partition()

# Hardcoding the file name for testing; you can use `paths` to
# loop through all files in the managed folder
zipped_file = "my-zip.zip"

# Read the whole zip into memory so ZipFile gets a seekable object
with mf.get_download_stream(zipped_file) as stream:
    zip_obj = ZipFile(BytesIO(stream.read()))

# List the files in the zip
file_list = zip_obj.namelist()

# Use fnmatch.filter to get the csv file
csv_files = fnmatch.filter(file_list, "*.csv")

# Open the single csv file and read it into a dataframe
with zip_obj.open(csv_files[0]) as data:
    df = pd.read_csv(data)

# Write to dataset
py_recipe_output = dataiku.Dataset("my_dataset")
py_recipe_output.write_with_schema(df)
```
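For the case in the original question, where the zip holds .tsv data files plus a data dictionary, the include/exclude step could look something like the sketch below. It is only an illustration: `select_members`, its default patterns, and the file names are hypothetical, and the zip is built in memory here so the snippet runs outside DSS. In a recipe, the bytes would come from `stream.read()` as in the example above.

```python
import fnmatch
from io import BytesIO
from zipfile import ZipFile


def select_members(zip_bytes, include="*.tsv", exclude=("*dictionary*",)):
    """Return {member name: raw bytes} for members that match `include`
    and none of the `exclude` patterns (hypothetical helper)."""
    selected = {}
    with ZipFile(BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not fnmatch.fnmatch(name, include):
                continue
            if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                continue
            selected[name] = zf.read(name)
    return selected


# Build a small zip in memory to stand in for the partner's file
# (made-up contents, for illustration only)
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("data.tsv", "id\tvalue\n1\ta\n")
    zf.writestr("data_dictionary.tsv", "column\tmeaning\n")
    zf.writestr("readme.pdf", "not data")

members = select_members(buffer.getvalue())
print(sorted(members))
```

Each value in `members` is the raw content of one data file, which can then be wrapped in `BytesIO` and passed to `pd.read_csv(..., sep="\t")`.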
Let me know if you have any questions. Hope this helps!
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,630 Neuron
Over the weekend I had figured out much of this. However, I'm dropping the file into the working directory of either the Python recipe or the Jupyter notebook; I had not figured out how to do the BytesIO bit (although I tried to do things like this, without success). For now I've got the job done. When I get a bit of time I will investigate this solution more completely, and I'll let you know how I get on. For now I just want to say thank you for the response.
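For comparison, the on-disk variant described here (unzipping into a working directory, but only the files that are wanted) might look roughly like this. It is a sketch under assumptions, not the code actually used: `extract_data_files`, the `*.tsv` pattern, and the file names are hypothetical, and a temporary directory stands in for the recipe's working directory.

```python
import fnmatch
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_data_files(zip_path, target_dir, pattern="*.tsv"):
    """Extract only the members matching `pattern` into `target_dir`
    and return the paths of the extracted files (hypothetical helper)."""
    extracted = []
    with ZipFile(zip_path) as zf:
        for name in fnmatch.filter(zf.namelist(), pattern):
            extracted.append(Path(zf.extract(name, path=target_dir)))
    return extracted


# Create a throwaway zip on disk to stand in for the downloaded file
work_dir = Path(tempfile.mkdtemp())
zip_path = work_dir / "partner.zip"
with ZipFile(zip_path, "w") as zf:
    zf.writestr("data.tsv", "id\tvalue\n1\ta\n")
    zf.writestr("readme.txt", "data dictionary")

files = extract_data_files(zip_path, work_dir / "unzipped")
print([f.name for f in files])
```

The trade-off against the BytesIO approach is that files are left on disk and need cleaning up afterwards.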
-
Hi @tgb417,
If you are able to decompress the content of all your .zip files, you can then create a dataset (Dataset - Internal - Files from Folder) and select which files you want to use to build it; see the image below.
Hope this helps,
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,630 Neuron
Thanks for your insight. I see you posted over on this thread about something similar, which I responded to before seeing that you had posted here.
In my case, the zip files had multiple files in them, and only some of these were actual data. So I could not just create an SFTP data object that would itself dig into the .zip file and pull out the data. I ended up with a flow like this. The download step got multiple zip files (/*.zip). The Python step used the zipfile Python library to root around in each zip file. I did not take this further; the project I was working on is complete for now.
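A Python step along these lines, looping over several zip files and keeping only the tab-separated data, could be sketched as follows. This is not the code from the flow above: `rows_from_zips`, `make_zip`, and the file names are hypothetical, and the zips are built in memory so the snippet runs on its own. In DSS, the `zips` mapping would presumably be filled by reading each path from `list_paths_in_partition()` through `get_download_stream`, as in the earlier answer.

```python
import csv
import fnmatch
import io
from zipfile import ZipFile


def rows_from_zips(zips, pattern="*.tsv"):
    """Yield one dict per data row from every member matching `pattern`
    in every zip. `zips` maps a zip file name to its raw bytes
    (hypothetical input shape)."""
    for zip_name, zip_bytes in zips.items():
        with ZipFile(io.BytesIO(zip_bytes)) as zf:
            for member in fnmatch.filter(zf.namelist(), pattern):
                with zf.open(member) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    for row in csv.DictReader(text, delimiter="\t"):
                        # Record where the row came from
                        row["source"] = f"{zip_name}/{member}"
                        yield row


def make_zip(files):
    """Build a zip in memory from {name: text content} (test helper)."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buffer.getvalue()


# Made-up stand-ins for the downloaded /*.zip files
zips = {
    "a.zip": make_zip({"part1.tsv": "id\tvalue\n1\ta\n", "dictionary.docx": "x"}),
    "b.zip": make_zip({"part2.tsv": "id\tvalue\n2\tb\n"}),
}
rows = list(rows_from_zips(zips))
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and written out with `write_with_schema`, assuming all the data files share the same columns.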