Enabling the Parquet format in Dataiku DSS
Hi
Currently, when we write into the Dataiku filesystem we only get the CSV and Avro formats.
How can I enable the Parquet format in Dataiku DSS running on a Linux platform on an EC2 instance?
I need the steps for that. Also, we don't have any HDFS connection set up.
Regards,
Ankur.
Best Answer
-
Alexandru (Dataiker)
Hi Ankur,
To support Parquet files on a non-Hadoop install, you will need to install the Hadoop integration with the standalone libraries. Please review https://doc.dataiku.com/dss/latest/connecting/formats/parquet.html#applicability to see the restrictions related to Parquet.
The steps to install:
https://doc.dataiku.com/dss/latest/containers/setup-k8s.html#optional-setup-spark
Download the standalone libs (if you are on a different version, change the version in the URL) from: https://downloads.dataiku.com/public/studio/9.0.5/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-9.0.5.tar.gz
./bin/dssadmin install-hadoop-integration -standaloneArchive /PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz
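For reference, here is a minimal sketch of those steps run on the DSS host, assuming DSS 9.0.5 and a data directory at /home/dataiku/dss (both the version in the URL and the paths are assumptions; adjust them to your install):

# Download the Hadoop 3 standalone libraries (change 9.0.5 if you run another DSS version)
cd /tmp
wget https://downloads.dataiku.com/public/studio/9.0.5/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-9.0.5.tar.gz

# Stop DSS before installing the integration (DATA_DIR below is an assumed location)
DATA_DIR=/home/dataiku/dss
"$DATA_DIR/bin/dss" stop

# Install the Hadoop integration from the standalone archive
"$DATA_DIR/bin/dssadmin" install-hadoop-integration \
    -standaloneArchive /tmp/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-9.0.5.tar.gz

# Restart DSS
"$DATA_DIR/bin/dss" start

After the restart, Parquet should appear as a format option for the connection types listed on the applicability page linked above.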
Let me know if you have any issues.
Answers
-
Thanks for this,
@Ankur30
What did you use as the storage option? S3? The documentation mentions:
"Parquet datasets can be stored on the following cloud storage and Hadoop connections: HDFS, S3, GCS, Azure Blob Storage"
but I'm curious whether it can be written to local / network filesystems.