Unable to process Parquet File types in Dataiku Platform

Balaparamesh
Level 1

Hi Team,

We are unable to process the Parquet file type when integrating with Hadoop source systems.

Thank you

1 Reply
fchataigner2
Dataiker

Hi,

It's as the message says: DSS won't let you read Parquet data directly from an HTTP dataset. You need to download the Parquet files to an HDFS-friendly location first:

- create a managed folder (+ New dataset > Folder) on an HDFS, S3, Azure, or GCS connection

- use a Download recipe to download the Parquet files into that folder

- create a new HDFS, S3, Azure, or GCS dataset (same connection type as the folder) pointing to the folder's location

 

Note that:

1) you'll need your DSS to have a working Hadoop integration in place, because Parquet is a file format with strong ties to Hadoop. If you have a Hadoop distribution installed on the machine, run `bin/dssadmin install-hadoop-integration` from the DSS data dir; if you don't have a Hadoop distribution, grab the Hadoop standalone jar package from our download site and install it with `bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it` (a DSS restart is needed afterwards)

2) if using S3/Azure/GCS, you'll need to set the "HDFS interface" field on the DSS connection to something other than "None". If using Azure, it must be ABFS
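The two install paths from point 1 can be sketched as a shell session; this is a sketch, and the data-dir and archive paths are placeholders for your own locations:

```shell
# Run from your DSS data directory (placeholder path).
cd /path/to/DATA_DIR

# If a Hadoop distribution is already installed on the machine:
./bin/dssadmin install-hadoop-integration

# Otherwise, with the standalone Hadoop libs archive downloaded from the site:
# ./bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/archive.tar.gz

# Restart DSS so the integration is picked up
./bin/dss restart
```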
