Unable to process Parquet File types in Dataiku Platform

Balaparamesh Partner, Registered Posts: 1 Partner

Hi Team,

We are unable to process the Parquet file type when we integrate with Hadoop source systems.

Thank you


Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    It's as the message says: DSS won't let you read Parquet data directly in an HTTP dataset. You need to download the Parquet files to an HDFS-friendly location first:

    - create a managed folder (+ New dataset > Folder) on an HDFS, S3, Azure or GCS connection

    - use a Download recipe to download the parquet files to that folder

    - make a new HDFS, S3, Azure or GCS dataset (same type as the folder) pointing to the location of the folder

    Note that:

    1) Your DSS needs a working Hadoop integration, because Parquet is a file format with strong ties to Hadoop. If you have a Hadoop distribution installed on the machine, run the `bin/dssadmin install-hadoop-integration` command from the DSS data dir. If you don't have a Hadoop distribution, grab the Hadoop standalone jar package from our download site and install it with `bin/dssadmin install-hadoop-integration -standaloneArchive /path/to/where/you/downloaded/it`. A DSS restart is needed afterwards.

    2) If using S3/Azure/GCS, you'll need to set the "HDFS interface" field on the DSS connection to something other than "None". If using Azure, it must be ABFS.
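
For note 1, here is a rough shell sketch of the two install paths. It only prints the commands unless `RUN=1` is set. The variables `DSS_DATADIR`, `STANDALONE_ARCHIVE`, `HAS_HADOOP` and `RUN` are placeholders made up for this example, not DSS settings. Detecting a Hadoop distribution by looking for a `hadoop` binary on the PATH is an assumption, and so is using `bin/dss restart` for the restart step and restarting in both cases.

```shell
#!/bin/sh
# Sketch only: every variable below is a placeholder for illustration.
DSS_DATADIR="${DSS_DATADIR:-/path/to/dss/datadir}"
STANDALONE_ARCHIVE="${STANDALONE_ARCHIVE:-/path/to/where/you/downloaded/it}"

# Assumption: a `hadoop` binary on the PATH means a distribution is installed.
if command -v hadoop >/dev/null 2>&1; then
    HAS_HADOOP="${HAS_HADOOP:-yes}"
else
    HAS_HADOOP="${HAS_HADOOP:-no}"
fi

# Print the command by default; run it from the DSS data dir when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        (cd "$DSS_DATADIR" && "$@")
    else
        echo "would run in $DSS_DATADIR: $*"
    fi
}

# Pick the variant from note 1: plain install when Hadoop is present,
# -standaloneArchive with the downloaded package otherwise.
install_hadoop_integration() {
    if [ "$HAS_HADOOP" = "yes" ]; then
        run bin/dssadmin install-hadoop-integration
    else
        run bin/dssadmin install-hadoop-integration -standaloneArchive "$STANDALONE_ARCHIVE"
    fi
}

install_hadoop_integration
# Assumption: restart DSS with bin/dss from the same data dir.
run bin/dss restart
```

Run it once as-is to check the printed commands, then set `RUN=1` along with the real data dir and archive path.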
