Enabling parquet format in Dataiku DSS

Solved!
Ankur30
Level 3

Hi

Currently, when we write to the Dataiku file system, only the CSV and Avro formats are available.

How can I enable the Parquet format in Dataiku DSS running on Linux on an EC2 instance?

I need the steps for that. Also, we don't have any HDFS connection set up.

Regards,

Ankur.

1 Solution
AlexT
Dataiker

Hi Ankur,

To support Parquet files on a non-Hadoop install, you will need to install the Hadoop integration with the standalone libraries. Please review https://doc.dataiku.com/dss/latest/connecting/formats/parquet.html#applicability to see the restrictions related to Parquet.

The steps to install:

https://doc.dataiku.com/dss/latest/containers/setup-k8s.html#optional-setup-spark

Download the standalone libs (if you are on a different version, change the version in the URL) from: https://downloads.dataiku.com/public/studio/9.0.5/dataiku-dss-hadoop-standalone-libs-generic-hadoop3...

./bin/dssadmin install-hadoop-integration -standaloneArchive /PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz
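
As a rough sketch, the install step above could be wrapped in a small script like the one below. The `dssadmin install-hadoop-integration -standaloneArchive` command is the one quoted above; stopping and restarting DSS around it with `./bin/dss stop` and `./bin/dss start` is an assumption about the usual procedure, so check the linked documentation for your version. `DSS_DATADIR`, `STANDALONE_ARCHIVE`, `DRY_RUN`, `run` and `install_parquet_support` are illustrative names, not DSS settings, and both paths are placeholders. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: with DRY_RUN=1 (the default) commands are printed, not executed.
# Both paths below are placeholders; replace them with your own values.
DSS_DATADIR="${DSS_DATADIR:-/PATH/TO/DATA_DIR}"
STANDALONE_ARCHIVE="${STANDALONE_ARCHIVE:-/PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_parquet_support() {
    # Assumption: DSS is stopped before the integration is installed.
    run "$DSS_DATADIR/bin/dss" stop &&
    # Command from the answer above, using the downloaded standalone archive.
    run "$DSS_DATADIR/bin/dssadmin" install-hadoop-integration \
        -standaloneArchive "$STANDALONE_ARCHIVE" &&
    # Assumption: DSS is restarted afterwards so the new libraries are loaded.
    run "$DSS_DATADIR/bin/dss" start
}

install_parquet_support
```

Run it once as-is to review the printed commands, then set `DRY_RUN=0` (as the DSS user) to execute them for real.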

 

Let me know if you have any issues. 

 

3 Replies
Ankur30
Level 3
Author

Thanks @AlexT for the prompt response. I will follow the steps you mentioned and accept this as the solution once I am able to configure the Parquet format.

Thank you.

 

Regards,

Ankur.

somepunter
Level 3

Thanks for this.

@Ankur30, what did you use as the storage option? S3?

The documentation mentions:

Parquet datasets can be stored on the following cloud storage and hadoop connections: HDFS, S3, GCS, Azure Blob storage

but @AlexT, I'm curious whether it can be written to local / network filesystems.

 
