how to open .rar compressed files

Solved!
AdrienBB
Level 1

Hi,

I am looking for the most Dataiku-friendly and straightforward method to load CSV files compressed as .rar archives from an SFTP server into DSS as an SQL table.

Currently, we can't open .rar files with the SFTP connector:

  • Is there an alternative other than a shell or Python script?
  • If not, which one is recommended?

PS: To complicate things further, the data are not comma-separated but '|*|'-separated, and the files sit in 3 subfolders inside each .rar archive.

Thank you.


Operating system used: Ubuntu 18


3 Replies
Jurre
Level 5

Hi and welcome @AdrienBB ,

Although it's on your "preferred not" list, 'unrar' would be my personal first choice. The unrar package needs to be installed. It can handle subfolders easily, but it will take a shell (or Python) step; the procedure itself is not complicated, and if you need help with that, let me know. If installing packages is not an option, I would do it outside of Dataiku, with output to a folder Dataiku can read from, for example with 7-Zip.
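For reference, a minimal sketch of that shell/Python step, assuming the unrar binary is installed and on the PATH (folder paths and the helper names are illustrative, not a Dataiku API):

```python
import subprocess
from pathlib import Path

def build_unrar_cmd(archive: str, dest: str) -> list:
    # "x" extracts with full paths, so the subfolders inside each
    # archive are preserved; "-o+" overwrites existing files without prompting.
    return ["unrar", "x", "-o+", archive, str(Path(dest)) + "/"]

def extract_all(src_dir: str, dest_dir: str) -> None:
    # Extract every .rar archive found in src_dir into dest_dir.
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(src_dir).glob("*.rar")):
        subprocess.run(build_unrar_cmd(str(archive), dest_dir), check=True)
```

Inside DSS this would typically run from a Python recipe, with src_dir/dest_dir pointing at managed-folder paths.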

Reading the extracted files as CSV can be done directly, because you can specify the separator used. I have not tested that particular separator, though! If Dataiku cannot handle it directly, a prepare step to split out the values and then load them into a database might be an option.
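If the multi-character separator cannot be set directly, the split can also be done in plain Python before loading. A small sketch, with the header row pulled out explicitly (the sample data is made up):

```python
def parse_multisep(text: str, sep: str = "|*|"):
    """Split delimiter-separated text on a multi-character separator.

    Returns (header, rows). Commas inside fields (e.g. in addresses)
    are untouched because we only split on the separator itself.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, *rows = (ln.split(sep) for ln in lines)
    return header, rows

sample = "id|*|name|*|address\n1|*|Ada|*|12 Rue X, Paris\n"
header, rows = parse_multisep(sample)
# header == ['id', 'name', 'address']
# rows == [['1', 'Ada', '12 Rue X, Paris']]
```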

I'm aware of your requirements, but currently no alternative comes to mind, so I'm as curious as you whether someone comes around with a better idea!

Cheers!

AdrienBB
Level 1
Author

Thank you for your answer.

As suggested, we installed unrar and used a Python recipe to execute shell commands and prepare our dataset with Dask.
As Dataiku (v10) can't handle separators longer than one character, we first tried your solution, but we didn't manage to easily extract the first line as column headers.
So we decided to use the sed command to replace '|*|' with '|' directly in each file (commas are used in addresses, so they couldn't serve as the separator).

Retained solution as of today:
1/ We download the files from the SFTP server to a Dataiku managed folder.

2/ Files are unzipped and modified in place in the Dataiku managed folder using a Python script (unrar the files -> move each archive to a "treated" folder -> run sed to replace the separators).

3/ We then open this folder as a dataset of valid CSV files.

4/ The files are 100 GB each, so we partitioned this dataset using the filename as the partition key.
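For anyone who can't rely on sed, the separator rewrite in step 2/ can be reproduced in pure Python. A minimal sketch, streaming line by line so large files never need to fit in memory (filenames and paths are illustrative):

```python
from pathlib import Path

def rewrite_separators(path: Path, old: str = "|*|", new: str = "|") -> None:
    # Stream line by line, write to a temp file, then atomically
    # replace the original, similar to `sed -i 's/|\*|/|/g' file`.
    tmp = path.with_name(path.name + ".tmp")
    with path.open("r", encoding="utf-8") as src, \
         tmp.open("w", encoding="utf-8") as dst:
        for line in src:
            # Only the '|*|' separator is rewritten; commas inside
            # address fields are left alone.
            dst.write(line.replace(old, new))
    tmp.replace(path)
```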

Cheers !

Jurre
Level 5

Thanks for posting back your solution @AdrienBB; sed is great for these kinds of operations indeed!

Happy prepping!
