Conversion to Parquet fails in Hadoop HDFS
$ hadoop version
Hadoop 3.1.2
Source code repository https://github.com/apache/hadoop.git -r 1019dde65bcf12e05ef48ac71e84550d589e5d9a
Compiled by sunilg on 2019-01-29T01:39Z
Compiled with protoc 2.5.0
From source with checksum 64b8bdd4ca6e77cce75a93eb09ab2a9
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.2.jar
I receive this error shortly after the recipe starts:
parquet/io/api/RecordConsumer, caused by: ClassNotFoundException: parquet.io.api.RecordConsumer
It looks like Java can't find RecordConsumer.class or the jar file that contains it. Any ideas how to fix this?
---SOLVED---
1. Locate your env-hadoop.sh in DATA_DIR/bin
2. sudo nano env-hadoop.sh
3. Find the line "export DKU_HADOOP_CP="
4. At the end of that line (within the quotes), add:
:$DKUINSTALLDIR/lib/ivy/parquet-run/*
5. Restart DSS
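The steps above amount to appending the parquet-run jars to the DKU_HADOOP_CP classpath in env-hadoop.sh. A minimal sketch of that edit, demonstrated on a throwaway copy of the file (the DKU_HADOOP_CP value shown here is a made-up placeholder; in a real install you would edit DATA_DIR/bin/env-hadoop.sh and should back it up first):

```shell
# Demonstrate the classpath edit on a throwaway copy of env-hadoop.sh.
# The existing DKU_HADOOP_CP value below is a placeholder, not a real install's.
tmp=$(mktemp -d)
cat > "$tmp/env-hadoop.sh" <<'EOF'
export DKU_HADOOP_CP="/usr/local/hadoop/share/hadoop/common/*"
EOF

# Step 4 above: append the parquet-run jars inside the closing quote.
sed -i 's|^\(export DKU_HADOOP_CP=".*\)"$|\1:$DKUINSTALLDIR/lib/ivy/parquet-run/*"|' \
    "$tmp/env-hadoop.sh"

cat "$tmp/env-hadoop.sh"
# → export DKU_HADOOP_CP="/usr/local/hadoop/share/hadoop/common/*:$DKUINSTALLDIR/lib/ivy/parquet-run/*"
```

$DKUINSTALLDIR is deliberately left unexpanded inside the quotes so DSS resolves it at startup; restart DSS afterwards for the new classpath to take effect.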
Best Answer
-
Hi,
Dataiku does not support "home made" Hadoop distributions.
You may have some success by editing the "bin/env-hadoop.sh" file, locating the "DKU_HIVE_CP" line, and adding at the end (within the quotes):
:$DKUINSTALLDIR/lib/ivy/parquet-run/*
Then restart DSS.
Answers
-
Thanks for the answer, however I can't find the "DKU_HIVE_CP" line you mention. You can find my hadoop-env.sh here:
https://paste.ubuntu.com/p/jgcSMTGbSd/ -
I figured out later on that you're talking about the file in DATA_DIR/bin. Ignore my question. Thanks for the help.
-
Tested it and it works. Thanks. Btw, I added it to DKU_HADOOP_CP, not DKU_HIVE_CP.
-
Hi Clement, I was looking for the /lib/ivy/parquet-run/ folder under the DKU install dir, but I cannot find it. Do we have to install other dependencies?
Best regards.
EDIT:
Never mind, I was looking in the data dir, not the install dir.
For future reference, quoting the docs: A Dataiku DSS installation spans two folders:
- The installation directory, which contains the code of Dataiku DSS. This is the directory where the Dataiku DSS tarball is unzipped (denoted as "INSTALL_DIR") → this is where /lib/ivy/parquet-run/ is located
- The data directory (which will later be named “DATA_DIR”).
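The confusion above is easy to hit, since both directories have similar layouts. A tiny sketch mirroring the two-directory split described in the docs quote (the directory names here are made up for illustration):

```shell
# Mirror the INSTALL_DIR / DATA_DIR split in a temp tree (paths hypothetical).
root=$(mktemp -d)
mkdir -p "$root/install/lib/ivy/parquet-run" "$root/data/bin"

# The parquet-run jars live under the *install* dir...
[ -d "$root/install/lib/ivy/parquet-run" ] && echo "found under INSTALL_DIR"
# ...while the data dir only holds things like bin/env-hadoop.sh.
[ -d "$root/data/lib/ivy/parquet-run" ] || echo "not under DATA_DIR"
```

So: look for parquet-run under INSTALL_DIR/lib/ivy/, and edit env-hadoop.sh under DATA_DIR/bin/.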