SFTP Site with .Zip files (with more than just data in the .zip file)

tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

I'm receiving data from an external partner.

They have setup an SFTP file server for me to get the data.

They .zip the .tsv files that I'm expecting.

However, they also add other documents in the .zip file that are not the data I need for my process. Basicly a data dictionary for the data they are providing.

From this page:

https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html

I understand how to use an sftp connector. And it is working OK.

From this page:

https://doc.dataiku.com/dss/latest/connecting/connections.html

I understand that DSS is able to open .zip files and get at the content.

However, I'm not clear if I can control which file/files in the .zip file will be treated as data.

If that will not work. Are there other workarounds to automate the downloading of the .zip file extracting the needed data file? And getting the data into a DSS Data flow.

Love to hear folks' thoughts. I'll try to report back if I end up creating a solution for this.

--Tom

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
    edited July 17

    Hi @tgb417
    ,

    Exclusion rules for files within a zip file are not available.

    As you suggest using a python recipe with the SFTP managed folder should do the trick. Unzip them locally or in memory, including or excluding files as needed.

    Here is a very basic example, using ByteIO, so no temp files are needed on disk. This example assumes you have a single data file ".csv" you want to open.

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    from zipfile import ZipFile
    from io import BytesIO
    import fnmatch
    
    
    mf = dataiku.Folder("NP2gVabt")
    mf_info = mf.get_info()
    paths = mf.list_paths_in_partition()
    
    #hardcoding file name for testing you can use paths to determine and loop trough all files in the managed folder
    
    zipped_file = "my-zip.zip"
        
    with mf.get_download_stream(zipped_file) as stream:
        zipObj = ZipFile(BytesIO(stream.read()))
        
        #list files in zip
        file_list = zipObj.namelist()
        
        #use fnmatch.filter to get the csv file
        csv_file = fnmatch.filter(file_list, "*.csv")
        
        #get the csv data
        data = zipObj.open(*csv_file)
    
    #read into dataframe
    df = pd.read_csv(data)
    
    #write to dataset
    py_recipe_output = dataiku.Dataset("my_dataset")
    py_recipe_output.write_with_schema(df)

    Screenshot 2021-10-27 at 11.54.23.png

    Let me know if you have any questions. Hope this help!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @AlexT

    Over the weekend I had figured out much of this. But I’m dropping the file to the working directory, of either the python recipient or the Jupiter notebook, I had not figured out how to do the BytesIO bit. (Although I tried to do things like this, without success.). For now I’ve got the job done. However, when I get a bit of time I will investigate this solution more completely. I’ll let you know how I get on with this when I have a moment. For now I just want to say thank you for the response.

  • azamora
    azamora Partner, Registered Posts: 9 Partner

    Hi @tgb417
    ,

    If you are able to decompress all your .zip files content, then if you create a dataset (Dataset - Internal - Files from Folder) you should be able to select which files you want to use to create the dataset, see image below.

    Hope this helps,

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @azamora

    Thanks for your insight. I see you posted over on this thread about something similar. Which I responded to prior to seeing you posted here.

    In my case, the Zip files had multiple files in them only some of these were actual data. So I could not just create a SFTP data object that would itself dig into the .zip file and pull out the data. I ended up with a flow like this. The download step got multiple zip files /*.zip. The python step used the .zipfile python library to root around in the zip file. I did not take this further the project I was doing, is complete for now.

    Dealing with Zip Files.png

Setup Info
    Tags
      Help me…