How to automatically uncompress files in a download recipe

azamora

Hi everyone,

I am using a Download recipe to connect to an SFTP site and download some .zip files.

Is there any way DSS can uncompress the .zip files automatically? At the moment I have to go to each file manually and click decompress.

Thanks,


Operating system used: Linux CentOS


Answers

  • tgb417

    @azamora,

    I'm not sure I've done exactly what you are trying to do. However, while experimenting with another project, I came across this documentation saying that an SFTP connection can open remote zipped files as if they were a data source. It looks like version 8.0 has this feature as well.

    https://doc.dataiku.com/dss/8.0/connecting/connections.html

    You might try the dataset menu in the flow -> SFTP

    Choose the connection that you set up previously under Administrations -> New Data Connections -> SFTP (You have to scroll down to get to this.)

    Once you have been able to open the remote file as a data source you can then use the visual sync recipe to move the data into the more local(ish) data source you will be doing your analysis and modeling with.

    Let us know how you are getting on with this. Maybe someone else can also lend some further clarity.

  • azamora

    Thanks @tgb417,

    It worked really well. I am able to download the .zip files (step 1), uncompress them (step 2), and create datasets (step 3) from the different files within the uncompressed folder.

    The last part of the puzzle is how to automate it.

    I am able to create a scenario that downloads the .zip files and builds the datasets (steps 1 and 3), but I don't see a way to uncompress the .zip files automatically.

    Any guidance will be highly appreciated.

  • tgb417

    @azamora,

    I'm not sure I'm following exactly what you have done that is working.

    What method are you using within Dataiku DSS to "download the .zip files" and create a dataset? Are you using a Network dataset over SFTP, or a Download recipe into a managed folder?

    If you are using the Network dataset, I think that step should handle the unzipping for you. If you are downloading to a managed folder, have you written some Python or R code to unzip the files, or are you using some other method?

    Then how are you creating the "different files"?

    If you have that laid out in your flow, you can always run a build on the last node in your sequence. The scenario builder has a Build step type, which you can use to automate building your dataset from the flow. Here is a bit of documentation on this point: https://doc.dataiku.com/dss/latest/scenarios/steps.html

    Here are also some training materials on scenarios and scenario steps: https://academy.dataiku.com/automation-course-1/668968

    Hope this is helpful. If I'm on the wrong path here, please share a bit more about the flow you have created. Let us all know how you are getting on.

  • azamora

    Hi Tom,

    First thanks for taking the time to help me!

    Each zip file contains 15 TSV files, and I need to create one dataset for each TSV file.

    I tried both the Download recipe and the Network dataset.

    The Download recipe works great; the only missing part is unzipping the files automatically (I can do it manually).

    The Network dataset does unzip the files and create a dataset, but it is a single merged dataset of all 15 TSV files within the zip file, which is not what I need.

    I guess I will stick with the Download recipe and write some Python code to unzip the files.

    Thanks a lot !
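
The unzip step mentioned above could look roughly like the following. This is only a minimal sketch using Python's standard `zipfile` module, not the code azamora actually wrote. The function name `unzip_all` and both directory arguments are hypothetical; in a DSS Python recipe they would presumably point at the folder filled by the Download recipe and at an output folder.

```python
import zipfile
from pathlib import Path


def unzip_all(source_dir, target_dir):
    """Extract every .zip in source_dir into target_dir/<archive name>/.

    Returns a dict mapping each archive file name to the list of
    file members it contained. Both directories are assumptions.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    extracted = {}
    for archive in sorted(source.glob("*.zip")):
        # One sub-folder per archive keeps files from different zips apart.
        out_dir = target / archive.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
            # Record file members only, skipping directory entries.
            extracted[archive.name] = [
                info.filename for info in zf.infolist() if not info.is_dir()
            ]
    return extracted
```

Because the function only touches the filesystem, it can be run from a recipe and then triggered by a scenario Build step like any other part of the flow.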

  • tgb417

    I'm not at a computer with DSS at the moment, but here are a couple more thoughts.

    If I remember correctly, there may be a way to add a column that records which source file each row came from. That might allow you to untangle the automatically appended files.

    Depending on the layout of the multiple files: if they all have the same layout, having them in the same dataset to start with might be advantageous and save you a step.

    Finally, if the files are small enough to be duplicated in your flow, a quick Python script might automate the unzip step.

    Others, please jump in here if you have other ideas.
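
To illustrate the "untangle" idea: if the merged dataset did carry a column naming the originating file, splitting it back apart is a simple group-by. The sketch below assumes such a column exists and is called `source_file`; that name, and the function `split_by_source`, are hypothetical and not something the thread confirms DSS produces.

```python
from collections import defaultdict


def split_by_source(rows, source_column="source_file"):
    """Group merged rows back into one list of rows per originating file.

    rows is an iterable of dicts; source_column is an assumed column
    holding the name of the file each row came from.
    """
    groups = defaultdict(list)
    for row in rows:
        # Drop the source column so each group keeps only the data columns.
        data = {key: value for key, value in row.items() if key != source_column}
        groups[row[source_column]].append(data)
    return dict(groups)
```

For example, `split_by_source(merged_rows)["orders.tsv"]` would give only the rows that came from a file named `orders.tsv`.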

  • chrishnet997

    Hi @azamora,

    I have the same problem. How did you unzip them? Did you do it with Python or through DSS?

    I wrote Python code that loops over the files in the zip and reads them, but I'm not sure how to add them to the input dataset.

    Thank you in advance
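
For the loop described above, one possible shape is to read each TSV member of the archive into its own table, keyed by file name, so that every table can later be written to a separate dataset. This is a sketch under assumptions: the function name `read_tsv_members` is made up, the files are assumed to be UTF-8 with a header row, and only the standard library is used.

```python
import csv
import io
import zipfile


def read_tsv_members(zip_path):
    """Return {member name: list of row dicts} for each .tsv file in the archive.

    zip_path may be a path or a binary file-like object. UTF-8 encoding
    and a header row in every file are assumptions.
    """
    tables = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".tsv"):
                continue  # Skip anything that is not a TSV file.
            with zf.open(name) as raw:
                # Wrap the binary member stream so csv can read it as text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                reader = csv.DictReader(text, delimiter="\t")
                tables[name] = list(reader)
    return tables
```

Inside a DSS Python recipe, each entry of the returned dict could then be converted to a DataFrame and written to its own output dataset, assuming those output datasets have already been declared on the recipe.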

  • azamora

    Hi @chrishnet997,

    Yes, I ended up with a Python recipe. I put the files in a shared folder in HDFS, and from there I run a scenario to create a recipe from the files.

  • CoreyS

    Thank you for sharing your solution with us, @azamora!
