
Unable to process Parquet File types in Dataiku Platform

Balaparamesh
Level 1

Hi Team,

We are unable to process the Parquet file type when integrating with Hadoop source systems.

Thank you

fchataigner2
Dataiker

Hi,

It's as the error message says: DSS won't let you read Parquet data directly in an HTTP dataset. You need to download the Parquet files to an HDFS-friendly location first:

- create a managed folder (+ New dataset > Folder) on an HDFS, S3, Azure, or GCS connection

- use a Download recipe to download the Parquet files to that folder

- make a new HDFS, S3, Azure, or GCS dataset (same type as the folder) pointing to the location of the folder

 

Note that:

1) You'll need your DSS instance to have a working Hadoop integration in place, because Parquet is a file format with strong ties to Hadoop. If you have a Hadoop distribution installed on the machine, run `bin/dssadmin install-hadoop-integration` from the DSS data dir; if you don't, grab the Hadoop standalone jar package from our download site and install it with `bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it` (a DSS restart is needed afterwards)
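Put together, the two install variants look like this (the data-dir path is an illustrative placeholder; run the commands from your actual DSS data directory):

```shell
# Variant A: a Hadoop distribution is already installed on the machine.
cd /path/to/DATA_DIR
./bin/dssadmin install-hadoop-integration

# Variant B: no Hadoop distribution; use the standalone Hadoop jar
# package downloaded from the Dataiku download site.
./bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it

# Restart DSS afterwards so the integration is picked up.
./bin/dss restart
```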

2) If using S3/Azure/GCS, you'll need to set the "HDFS interface" field on the DSS connection to something other than "None". For Azure, it must be ABFS
