How to automatically uncompress files in a download recipe

azamora
Level 2

Hi everyone,

I am using a Download recipe to connect to an SFTP site and download some .zip files.

Is there any way DSS can uncompress the .zip files automatically? Right now I have to go to each file manually and click Decompress.

Thanks,


Operating system used: Linux CentOS

tgb417

@azamora ,

I'm not sure I've done exactly the scenario you're describing. However, while experimenting with another project, I did come across this documentation saying that SFTP connections can open remote zipped files as if they were a data source. It looks like version 8.0 has this feature as well.

https://doc.dataiku.com/dss/8.0/connecting/connections.html

You might try the dataset menu in the flow -> SFTP 

Choose the connection that you set up previously under Administration -> New Data Connections -> SFTP (you have to scroll down to get to it).

Once you can open the remote file as a data source, you can use the visual Sync recipe to move the data into the more local(ish) data source you'll use for your analysis and modeling.

Let us know how you are getting on with this.  Maybe someone else can also lend some further clarity.

--Tom
azamora
Level 2
Author

Thanks @tgb417 ,

It worked really well. I am able to download the .zip files (step 1), uncompress them (step 2), and create datasets (step 3) from the different files within the uncompressed folder.

The last part of the puzzle is how can I automate it?

I am able to create a scenario that downloads the .zip files and builds the datasets (steps 1 and 3), but I don't see a way to uncompress the .zip files automatically.

Any guidance would be highly appreciated.

tgb417

@azamora ,

I'm not sure I'm following exactly what you have working.

What method are you using within Dataiku DSS to download the zip file and create a dataset? Are you using a network dataset that uses SFTP, or a Download recipe to a managed folder?

If you are using the network dataset, I think that step should handle the unzipping for you. If you are downloading to a managed folder, have you written Python or R code to unzip the file? Or are you using some other method to unzip it?

Then, how are you creating the "different files"?

If you have that laid out in your flow, you can always run a build on the last node in your sequence. The scenario builder has a Build step type, which you can use to automate building your dataset from the flow. Here is a little bit from the documentation on this point: https://doc.dataiku.com/dss/latest/scenarios/steps.html

Here are also some training materials on scenarios and scenario steps: https://academy.dataiku.com/automation-course-1/668968

Hope this is helpful. If I'm on the wrong path here, please share a bit more about the flow you've created. Let us all know how you're getting on.

--Tom
azamora
Level 2
Author

Hi Tom,

First thanks for taking the time to help me!

Each zip file contains 15 tsv files, and I need to create one dataset for each tsv file.

I tried both the Download recipe and the Network dataset.

The Download recipe works great; the only missing part is unzipping the files automatically (I can do it manually).

The network dataset does unzip the files and creates a dataset, but it is a merge of all 15 tsv files within the zip file, which is not what I need.

I guess I will stick with the Download recipe and write some Python code to unzip the files.
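For what it's worth, that unzip step can be a few lines of standard-library Python. This is only a sketch with plain filesystem paths; in a DSS Python recipe you would point `src_dir` and `dest_dir` at your managed folders' local paths (that wiring is not shown here, and the folder names below are placeholders):

```python
import zipfile
from pathlib import Path

def unzip_all(src_dir, dest_dir):
    """Extract every .zip in src_dir into a subfolder of dest_dir
    named after the archive, and return the extracted file paths."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zpath in sorted(src.glob("*.zip")):
        target = dest / zpath.stem  # e.g. batch1.zip -> dest/batch1/
        with zipfile.ZipFile(zpath) as zf:
            zf.extractall(target)
            extracted.extend(target / name for name in zf.namelist())
    return extracted
```

Each archive lands in its own subfolder named after the zip, so the 15 tsv files from each batch stay separated for the dataset-per-file step.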

Thanks a lot!

tgb417

I'm not at a computer with DSS at the moment.

A couple more thoughts:

However, if I remember correctly, there may be a way to add a column indicating which source file each row came from. That might let you untangle the automatically appended files.

Depending on the layout of the multiple files: if they have the same layout, I'm wondering whether having them in the same dataset from the start might be advantageous, saving you a step.

Finally, if the files are small enough to duplicate in your flow, a quick Python script could automate the unzip step.

Others, please jump in here if you have other ideas.

--Tom
chrishnet997
Level 2

Hi @azamora ,

I also have the same problem. How did you unzip them? Did you do it with Python or through DSS?

I wrote some Python code to read the files in the zip through a loop, but I'm not sure how to add them to the input dataset.
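In case it helps, here is one way to do that loop with only the standard library: parse each .tsv member straight out of the archive into rows in memory, one table per member, without extracting to disk. Pushing each table into a DSS dataset afterwards would go through the dataiku API, which is not shown here; the function name is just a placeholder.

```python
import csv
import io
import zipfile

def read_tsv_members(zip_path):
    """Return {member_name: list_of_rows} for every .tsv inside the
    archive, parsed in memory without extracting to disk."""
    tables = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".tsv"):
                continue  # skip non-tsv members
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8")
                tables[name] = list(csv.reader(text, delimiter="\t"))
    return tables
```

The returned dict gives you one table per tsv file, which matches the one-dataset-per-file layout discussed above.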


Thank you in advance 

azamora
Level 2
Author

Hi @chrishnet997 

Yes, I ended up using a Python recipe. I put the unzipped files in a shared folder in HDFS, and from there I run a scenario to create the datasets from the files.

CoreyS
Dataiker Alumni

Thank you for sharing your solution with us @azamora

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark it as 'Accepted Solution' to help others like you!