Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Added on October 22, 2021 3:03PM
Likes: 0
Replies: 4
I'm receiving data from an external partner.
They have setup an SFTP file server for me to get the data.
They .zip the .tsv files that I'm expecting.
However, they also add other documents in the .zip file that are not the data I need for my process. Basicly a data dictionary for the data they are providing.
From this page:
https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html
I understand how to use an sftp connector. And it is working OK.
From this page:
https://doc.dataiku.com/dss/latest/connecting/connections.html
I understand that DSS is able to open .zip files and get at the content.
However, I'm not clear if I can control which file/files in the .zip file will be treated as data.
If that will not work. Are there other workarounds to automate the downloading of the .zip file extracting the needed data file? And getting the data into a DSS Data flow.
Love to hear folks' thoughts. I'll try to report back if I end up creating a solution for this.
--Tom
Hi @tgb417
,
Exclusion rules for files within a zip file are not available.
As you suggest using a python recipe with the SFTP managed folder should do the trick. Unzip them locally or in memory, including or excluding files as needed.
Here is a very basic example, using ByteIO, so no temp files are needed on disk. This example assumes you have a single data file ".csv" you want to open.
import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu from zipfile import ZipFile from io import BytesIO import fnmatch mf = dataiku.Folder("NP2gVabt") mf_info = mf.get_info() paths = mf.list_paths_in_partition() #hardcoding file name for testing you can use paths to determine and loop trough all files in the managed folder zipped_file = "my-zip.zip" with mf.get_download_stream(zipped_file) as stream: zipObj = ZipFile(BytesIO(stream.read())) #list files in zip file_list = zipObj.namelist() #use fnmatch.filter to get the csv file csv_file = fnmatch.filter(file_list, "*.csv") #get the csv data data = zipObj.open(*csv_file) #read into dataframe df = pd.read_csv(data) #write to dataset py_recipe_output = dataiku.Dataset("my_dataset") py_recipe_output.write_with_schema(df)
Let me know if you have any questions. Hope this help!
Over the weekend I had figured out much of this. But I’m dropping the file to the working directory, of either the python recipient or the Jupiter notebook, I had not figured out how to do the BytesIO bit. (Although I tried to do things like this, without success.). For now I’ve got the job done. However, when I get a bit of time I will investigate this solution more completely. I’ll let you know how I get on with this when I have a moment. For now I just want to say thank you for the response.
Hi @tgb417
,
If you are able to decompress all your .zip files content, then if you create a dataset (Dataset - Internal - Files from Folder) you should be able to select which files you want to use to create the dataset, see image below.
Hope this helps,
Thanks for your insight. I see you posted over on this thread about something similar. Which I responded to prior to seeing you posted here.
In my case, the Zip files had multiple files in them only some of these were actual data. So I could not just create a SFTP data object that would itself dig into the .zip file and pull out the data. I ended up with a flow like this. The download step got multiple zip files /*.zip. The python step used the .zipfile python library to root around in the zip file. I did not take this further the project I was doing, is complete for now.