I have set up an HDFS connection to access a Google Cloud Storage bucket that contains Parquet files.
After adding GoogleHadoopFileSystem to the Hadoop configuration, I can access the bucket and the files.
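For context, the wiring is roughly the standard GCS connector setup in core-site.xml (property names from the connector's documentation; the key file path is a placeholder):

fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
fs.AbstractFileSystem.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
google.cloud.auth.service.account.enable = true
google.cloud.auth.service.account.json.keyfile = /path/to/keyfile.json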
However, when I create a new dataset and select a Parquet file (including a standard sample found at https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet), I get this error:
Oops: an unexpected error occurred
parquet/hadoop/ParquetInputFormat
Please see our options for getting help
HTTP code: 500, type: java.lang.NoClassDefFoundError
The complete error returned by the server is:
{"errorType":"java.lang.NoClassDefFoundError","message":"parquet/hadoop/ParquetInputFormat","detailedMessage":"parquet/hadoop/ParquetInputFormat","detailedMessageHTML":"\u003cspan\u003e\u003cspan class\u003d\"err-msg\"\u003eparquet/hadoop/ParquetInputFormat\u003c/span\u003e\u003c/span\u003e","stackTraceStr":"java.lang.NoClassDefFoundError: parquet/hadoop/ParquetInputFormat\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:106)\n\tat java.base/java.security.AccessController.doPrivileged(Native Method)\n\tat java.base/javax.security.auth.Subject.doAs(Subject.java:423)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)\n\tat com.dataiku.dip.util.HadoopUtils.fixedUpDoAs(HadoopUtils.java:36)\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run(ParquetFormatExtractor.java:106)\n\tat com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords(FileFormatDatasetTestHandler.java:462)\n\tat com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats(FileFormatDatasetTestHandler.java:175)\n\tat com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:364)\n\tat com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:327)\n\tat com.dataiku.dip.futures.SimpleFutureThread.execute(SimpleFutureThread.java:36)\n\tat com.dataiku.dip.futures.FutureThreadBase.run(FutureThreadBase.java:88)\n","stackTrace":[{"file":"ParquetFormatExtractor.java","line":114,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run"},{"file":"ParquetFormatExtractor.java","line":106,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run"},{"file":"AccessController.java","line":-2,"function":"java.security.AccessController.doPrivileged"},{"file":"Subject.java","line":423,"function":"javax.security.auth.Subject.doAs"},{"file":"UserGroupInformation.java","line":1893,"function":"org.apache.hadoop.security.UserGroupInformation.doAs"},{"file":"HadoopUtils.java","line":36,"function":"com.dataiku.dip.util.HadoopUtils.fixedUpDoAs"},{"file":"ParquetFormatExtractor.java","line":106,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run"},{"file":"FileFormatDatasetTestHandler.java","line":462,"function":"com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords"},{"file":"FileFormatDatasetTestHandler.java","line":175,"function":"com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats"},{"file":"DatasetsTestController.java","line":364,"function":"com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute"},{"file":"DatasetsTestController.java","line":327,"function":"com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute"},{"file":"SimpleFutureThread.java","line":36,"function":"com.dataiku.dip.futures.SimpleFutureThread.execute"},{"file":"FutureThreadBase.java","line":88,"function":"com.dataiku.dip.futures.FutureThreadBase.run"}]}
Using DSS 8.0.2 with hadoop-2.10.0 and spark-2.4.5-bin-without-hadoop.
Hi @phildav. What "engine" are you using to create the new dataset? Wherever the computation is taking place, it apparently doesn't have the libraries needed to read Parquet files installed.
I see a very similar problem. Looking at run/backend.log, I can see that the failure comes from not finding a class that should be provided by DSS itself:
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - [2020/11/23-18:03:45.186] [FT-TestAndDetectFormatFutureThread-ZAfFeflE-392] [WARN] [dku.futures] - Future thread failed
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - java.lang.NoClassDefFoundError: com/dataiku/dip/input/formats/parquet/DSSParquetInputFormat
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:106)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at java.security.AccessController.doPrivileged(Native Method)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at javax.security.auth.Subject.doAs(Subject.java:422)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.util.HadoopUtils.fixedUpDoAs(HadoopUtils.java:36)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run(ParquetFormatExtractor.java:106)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords(FileFormatDatasetTestHandler.java:462)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats(FileFormatDatasetTestHandler.java:175)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:364)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:327)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.futures.SimpleFutureThread.execute(SimpleFutureThread.java:36)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.futures.FutureThreadBase.run(FutureThreadBase.java:88)
The strange thing is that the missing class is in the same JAR file as the class making the call (com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)):
dataiku-dss-8.0.2/dist/dataiku-dip.jar
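A quick way to confirm the class is indeed bundled there (standard unzip, adjust the install path to yours):

unzip -l dataiku-dss-8.0.2/dist/dataiku-dip.jar | grep DSSParquetInputFormat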
After some more investigation, we figured out that the problem in our case had to do with using a custom installation of Hadoop (unsupported by DSS). This function from the DSS Hadoop integration script, which adds the required JARs for Parquet support in HDFS, is where the magic happens:
import os
import re

def get_extra_parquet_jars(hadoop_version, hive_jars):
    """
    Gets the list of JARs to add to the DKU_HIVE_CP for support of Parquet 1.6
    for distributions that don't have it anymore.
    Note: this function also has a side effect: it detects the EMR flavor.
    """
    add_dss_parquet_jars = False
    # HDP 2.5+ does not provide parquet 1.5/1.6
    if hadoop_version.is_hdp3() or (hadoop_version.is_hdp() and re.search("^2\\.[56].*$", hadoop_version.hdp_version) is not None):
        print("HDP 2.5+ detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    if hadoop_version.is_cdh6():
        print("CDH 6+ detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    for jar in hive_jars:
        # Nor Hive 2.x on EMR 5.x.
        if re.search("hive\\-exec\\-2\\.[0-9]\\.[0-9]\\-amzn\\-", jar) is not None:
            print("EMR 5.x detected, adding parquet 1.6 to Hive classpath")
            add_dss_parquet_jars = True
            hv.flavor = "emr"  # 'hv' is defined elsewhere in the installer script
        # Nor Hive 2.x on MapR 5.2 / MEP3.
        if re.search("hive\\-exec\\-2\\.[0-9]\\.[0-9]\\-mapr\\-", jar) is not None:
            print("Hive2 on MapR detected, adding parquet 1.6 to Hive classpath")
            add_dss_parquet_jars = True
    # Nor Google Cloud Dataproc
    if hadoop_version.is_dataproc():
        print("Google Cloud Dataproc detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    if add_dss_parquet_jars:
        parquet_run_folder = "%s/lib/ivy/parquet-run/" % os.environ["DKUINSTALLDIR"]
        parquet_jars = []
        for f in os.listdir(parquet_run_folder):
            if f.endswith(".jar"):
                parquet_jars.append("%s%s" % (parquet_run_folder, f))
        return parquet_jars
    else:
        return []
As you can see, it adds the Parquet 1.6 JARs for Hortonworks, Cloudera, EMR, MapR, Google Cloud Dataproc, etc. (when the corresponding conditions are met), but in our case we have a "generic" Hadoop distribution, so none of the conditions matched and the JARs were never added.
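A quick sanity check for whether your own install added them is to look for the parquet-run folder in the environment files generated under the data directory (file names can vary by DSS version):

grep parquet-run $DSS_DATA_DIR/bin/env-*.sh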
Going back to the original issue: with Google Cloud Hadoop, those JARs should have been installed, so I believe there was a problem in the Hadoop integration installation process (either in the script or in the user's configuration/environment).
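To rule out the integration step itself, it may be worth re-running it and watching the output for the "adding parquet 1.6" messages printed by the function above (this is the documented DSS command, run from the data directory):

cd $DSS_DATA_DIR
./bin/dssadmin install-hadoop-integration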
In any case, the workaround we used was simply to modify $DSS_DATA_DIR/bin/env-site.sh, adding the following:
export DKU_HADOOP_CP=$DKU_HADOOP_CP:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/adal4j-1.6.4.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/jdom-1.1.3.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/parquet-hadoop-bundle-1.6.0.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/parquet-pig-1.6.0.jar
This ensures that the Parquet JARs distributed with DSS are part of the backend's classpath. A more dynamic approach (which adapts to different Dataiku versions and to whatever JARs are in that directory) could be the following:
export DKU_HADOOP_CP=$DKU_HADOOP_CP:$(ls $DKUINSTALLDIR/lib/ivy/parquet-run/*.jar | xargs echo | sed "s/ /:/g")
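Either way, after editing env-site.sh, restart DSS so the backend picks up the new classpath:

$DSS_DATA_DIR/bin/dss restart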