Unable to process Parquet File types in Dataiku Platform
Hi Team,
We are unable to process the Parquet file type when integrating with Hadoop source systems.
Thank you
Answers
Hi,
It's as the message says: DSS won't let you read Parquet data directly in an HTTP dataset. You need to download the Parquet files to an HDFS-friendly location first:
- create a managed folder (+ New dataset > Folder) on a HDFS or S3 or Azure or GCS connection
- use a Download recipe to download the parquet files to that folder
- make a new HDFS or S3 or Azure or GCS dataset (same type as the folder) pointing to the location of the folder
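Once the Download recipe has run, you can sanity-check that the files actually landed where the new dataset will point. A minimal sketch, assuming an HDFS connection; the folder path is hypothetical, so substitute the path shown in your managed folder's settings:

```shell
# List the contents of the managed folder on HDFS
# (/user/dataiku/downloaded_parquet is a hypothetical example path --
# copy the real path from the folder's settings in DSS)
hdfs dfs -ls /user/dataiku/downloaded_parquet
```

You should see one or more `.parquet` files; if the listing is empty, re-check the Download recipe's output folder before creating the dataset.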
Note that:
1) you'll need your DSS to have a working Hadoop integration in place, because Parquet is a file format with strong ties to Hadoop. If you have a Hadoop distribution installed on the machine, run `bin/dssadmin install-hadoop-integration` from the DSS data directory; if you don't have a Hadoop distribution, grab the standalone Hadoop jar package from our download site and install it with `bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it` (a DSS restart is needed afterwards)
2) if using S3/Azure/GCS, you'll need to set the "HDFS interface" field on the DSS connection to something other than "None". If using Azure, it must be "ABFS"
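For reference, the two install variants from note 1 can be sketched as shell commands run from the DSS data directory (the standalone archive path is the placeholder from above, not a real location):

```shell
# Option A: a Hadoop distribution is already installed on this machine
./bin/dssadmin install-hadoop-integration

# Option B: no local Hadoop distribution -- point at the standalone
# Hadoop archive downloaded from the Dataiku download site
./bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it

# In both cases, restart DSS so the integration is picked up
./bin/dss restart
```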