SFTP Site with .Zip files (with more than just data in the .zip file)

tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

I'm receiving data from an external partner.

They have set up an SFTP file server for me to get the data.

They .zip the .tsv files that I'm expecting.

However, they also add other documents to the .zip file that are not the data I need for my process: basically, a data dictionary for the data they are providing.

From this page:

https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html

I understand how to use an SFTP connector, and it is working OK.

From this page:

https://doc.dataiku.com/dss/latest/connecting/connections.html

I understand that DSS is able to open .zip files and get at the content.

However, I'm not clear whether I can control which file or files in the .zip file will be treated as data.

If that will not work, are there other workarounds to automate downloading the .zip file, extracting the needed data file, and getting the data into a DSS flow?

Love to hear folks' thoughts. I'll try to report back if I end up creating a solution for this.

--Tom

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @tgb417,

    Exclusion rules for files within a zip file are not available.

    As you suggest, using a Python recipe with the SFTP managed folder should do the trick. Unzip the files locally or in memory, including or excluding files as needed.

    Here is a very basic example using BytesIO, so no temp files are needed on disk. This example assumes there is a single data file (".csv") that you want to open.

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    from zipfile import ZipFile
    from io import BytesIO
    import fnmatch

    mf = dataiku.Folder("NP2gVabt")
    mf_info = mf.get_info()
    paths = mf.list_paths_in_partition()

    # hardcoding file name for testing; you can use paths to determine and loop through all files in the managed folder
    zipped_file = "my-zip.zip"

    with mf.get_download_stream(zipped_file) as stream:
        zipObj = ZipFile(BytesIO(stream.read()))

    # list files in zip
    file_list = zipObj.namelist()
    # use fnmatch.filter to get the csv file
    csv_file = fnmatch.filter(file_list, "*.csv")
    # get the csv data
    data = zipObj.open(*csv_file)
    # read into dataframe
    df = pd.read_csv(data)
    # write to dataset
    py_recipe_output = dataiku.Dataset("my_dataset")
    py_recipe_output.write_with_schema(df)
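The example above unpacks the list of matches with `zipObj.open(*csv_file)`, which only works when exactly one member matches the pattern. A minimal sketch of a more defensive selection step follows, using only the standard library. The helper name `pick_single_member` and the sample file names are hypothetical, and the archive is built in memory purely to stand in for the partner's .zip file.

```python
import fnmatch
import io
import zipfile
from typing import Tuple


def pick_single_member(zip_bytes: bytes, pattern: str) -> Tuple[str, bytes]:
    """Return (name, content) of the one archive member matching pattern.

    Hypothetical helper: raises ValueError when zero or several members match,
    instead of failing later with a less obvious error.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        matches = fnmatch.filter(archive.namelist(), pattern)
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one member matching {pattern!r}, found {matches}"
            )
        return matches[0], archive.read(matches[0])


# Build a small archive in memory (assumed sample content, not real partner data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.tsv", "id\tvalue\n1\ta\n")
    archive.writestr("data_dictionary.docx", "not data")

name, content = pick_single_member(buffer.getvalue(), "*.tsv")
```

In a recipe, the bytes would presumably come from `stream.read()` inside the `mf.get_download_stream(...)` block, and `content` could then be wrapped in `BytesIO` and passed to `pd.read_csv(..., sep="\t")` for tab-separated files.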

    Screenshot 2021-10-27 at 11.54.23.png

    Let me know if you have any questions. Hope this helps!

  • tgb417

    @AlexT

    Over the weekend I had figured out much of this. However, I'm dropping the file into the working directory of either the Python recipe or the Jupyter notebook; I had not figured out how to do the BytesIO bit (although I tried to do things like this, without success). For now I've got the job done. When I get a bit of time, I will investigate this solution more completely and let you know how I get on. For now, I just want to say thank you for the response.
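For comparison with the in-memory approach, the on-disk route described above can be sketched as follows: only the members that match a pattern are extracted, and a temporary directory keeps the working directory clean. This is an illustration under assumptions, not the code actually used in the project; `extract_matching_to_dir`, `delivery.zip`, and the member names are hypothetical.

```python
import fnmatch
import tempfile
import zipfile
from pathlib import Path


def extract_matching_to_dir(zip_path, pattern, target_dir):
    """Extract only the members matching pattern; return the extracted paths."""
    with zipfile.ZipFile(zip_path) as archive:
        wanted = fnmatch.filter(archive.namelist(), pattern)
        # ZipFile.extract returns the path of the file it created.
        return [Path(archive.extract(member, path=str(target_dir))) for member in wanted]


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for a downloaded archive (assumed sample content).
    zip_path = Path(workdir) / "delivery.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("data.tsv", "id\tvalue\n1\ta\n")
        archive.writestr("readme.pdf", "not data")

    out_dir = Path(workdir) / "out"
    out_dir.mkdir()
    extracted = extract_matching_to_dir(zip_path, "*.tsv", out_dir)
    extracted_names = [p.name for p in extracted]
    first_line = extracted[0].read_text().splitlines()[0]
```

Everything under the temporary directory is deleted when the `with` block exits, so the data has to be read or copied before that point.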

  • azamora
    azamora Partner, Registered Posts: 9 Partner

    Hi @tgb417,

    If you are able to decompress the contents of all your .zip files, you can then create a dataset (Dataset - Internal - Files from Folder), where you should be able to select which files to use to build the dataset; see image below.

    Hope this helps,

  • tgb417

    @azamora

    Thanks for your insight. I see you posted over on this thread about something similar, which I responded to before seeing that you had posted here.

    In my case, the zip files had multiple files in them, and only some of these were actual data. So I could not just create an SFTP data object that would itself dig into the .zip file and pull out the data. I ended up with a flow like this: the download step got multiple zip files (/*.zip), and the Python step used the zipfile Python library to root around in each zip file. I did not take this further; the project I was doing is complete for now.

    Dealing with Zip Files.png
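The Python step of a flow like this, looping over several downloaded zip files and reading only the .tsv members, might look roughly like the sketch below. It is not the code from the project: `iter_tsv_rows` and `make_zip` are hypothetical helpers, the sample archives are built in memory, and UTF-8 encoding is an assumption about the partner's files.

```python
import csv
import fnmatch
import io
import zipfile


def iter_tsv_rows(zip_payloads, pattern="*.tsv"):
    """Yield (zip_name, member_name, row) for every row of every matching member."""
    for zip_name, payload in sorted(zip_payloads.items()):
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            for member in fnmatch.filter(archive.namelist(), pattern):
                with archive.open(member) as raw:
                    # Decode the binary member stream as text (encoding assumed).
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    for row in csv.reader(text, delimiter="\t"):
                        yield zip_name, member, row


def make_zip(files):
    """Build a zip archive in memory from a {member_name: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member_name, text in files.items():
            archive.writestr(member_name, text)
    return buffer.getvalue()


# Two sample archives; the first also carries a non-data document.
payloads = {
    "batch_1.zip": make_zip({"a.tsv": "id\tvalue\n1\tx\n", "dictionary.txt": "notes"}),
    "batch_2.zip": make_zip({"b.tsv": "id\tvalue\n2\ty\n"}),
}
rows = list(iter_tsv_rows(payloads))
```

Inside DSS, `payloads` would presumably be filled by looping over `mf.list_paths_in_partition()` and reading each `mf.get_download_stream(path)`, as in the earlier answer.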
