Added on November 18, 2021 5:12AM
Hi
Currently, when we write to the Dataiku filesystem, we can only use the CSV and Avro formats.
How can I enable the Parquet format in Dataiku DSS running on a Linux EC2 instance?
I need the steps for that. We also don't have any HDFS connection set up.
Regards,
Ankur.
Hi Ankur,
To support Parquet files on a non-Hadoop install, you will need to install the Hadoop integration with the standalone libraries for Parquet to work. Please review https://doc.dataiku.com/dss/latest/connecting/formats/parquet.html#applicability to see the restrictions related to Parquet.
The steps to install:
https://doc.dataiku.com/dss/latest/containers/setup-k8s.html#optional-setup-spark
Download the standalone libs from (if you are on a different version, change the version in the URL): https://downloads.dataiku.com/public/studio/9.0.5/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-9.0.5.tar.gz
./bin/dssadmin install-hadoop-integration -standaloneArchive /PATH/TO/dataiku-dss-hadoop3-standalone-libs-generic...tar.gz
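Put together, the steps above might look something like this on the DSS server (a sketch, not a definitive procedure: the data directory path is an assumption, and the 9.0.5 / hadoop3 archive name comes from the download URL above; adjust both to your install):

```shell
# Run as the DSS service user.
# Assumption: DSS_DATADIR is your DSS data directory; change to match your install.
DSS_DATADIR=/data/dataiku/dss_data

# 1. Download the standalone Hadoop libraries matching your DSS version.
wget https://downloads.dataiku.com/public/studio/9.0.5/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-9.0.5.tar.gz

# 2. Stop DSS, install the Hadoop integration from the archive, restart.
"$DSS_DATADIR"/bin/dss stop
"$DSS_DATADIR"/bin/dssadmin install-hadoop-integration \
    -standaloneArchive "$PWD"/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-9.0.5.tar.gz
"$DSS_DATADIR"/bin/dss start
```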
Let me know if you have any issues.
Thanks for this,
@Ankur30
what did you use as storage option? S3?
the documentation mentions:
Parquet datasets can be stored on the following cloud storage and Hadoop connections: HDFS, S3, GCS, Azure Blob storage
but I'm curious whether it can be written to local / network filesystems.