Unable to process Parquet File types in Dataiku Platform

Balaparamesh
Level 1

Hi Team,

We are unable to process the Parquet file type when integrating with Hadoop source systems.

Thank you

fchataigner2
Dataiker

Hi,

As the message says, DSS won't let you read Parquet data directly in an HTTP dataset. You need to download the Parquet files to an HDFS-friendly location first:

- create a managed folder (+ New dataset > Folder) on an HDFS, S3, Azure or GCS connection

- use a Download recipe to download the Parquet files to that folder

- make a new HDFS, S3, Azure or GCS dataset (same type as the folder) pointing to the location of the folder
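
If the new dataset still fails to read, it can help to confirm that what the Download recipe fetched really is Parquet and not, say, an HTML error page. Below is a minimal sketch in plain Python, not a DSS API: it relies on the fact that a standard (unencrypted) Parquet file starts and ends with the 4-byte magic `PAR1`. The function name `looks_like_parquet` is made up for illustration, and it assumes the file is reachable as a local path; for a folder on HDFS or cloud storage you would have to read the bytes through that storage's own tooling instead.

```python
import os

# Standard Parquet files begin and end with these 4 bytes.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file at `path` starts and ends with the Parquet magic.

    Hypothetical helper for illustration only. It checks the magic bytes,
    nothing more: it does not validate the footer or the schema.
    """
    size = os.path.getsize(path)
    # Smallest conceivable file: leading magic + 4-byte footer length + trailing magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file that fails this check is not worth pointing a dataset at; a file that passes may still be unreadable for other reasons (for example a missing Hadoop integration, see the notes below).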

 

Note that:

1) Your DSS needs a working Hadoop integration, because Parquet is a file format with strong ties to Hadoop. If you have a Hadoop distribution installed on the machine, run `bin/dssadmin install-hadoop-integration` from the DSS data dir. If you don't have a Hadoop distribution, grab the Hadoop standalone jar package from our download site and install it with `bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it` (a DSS restart is needed afterwards).

2) If using S3/Azure/GCS, you'll need to set the "HDFS interface" field on the DSS connection to something other than "None". If using Azure, it must be ABFS.
