NoClassDefFoundError when reading a parquet file
I have set up an HDFS connection to access a Google Cloud Storage bucket that holds parquet files.
After adding GoogleHadoopFileSystem to the Hadoop configuration, I can access the bucket and its files.
However, when I create a new dataset and select a parquet file (including the standard sample at https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet), I get this error:
Oops: an unexpected error occurred
parquet/hadoop/ParquetInputFormat
Please see our options for getting help
HTTP code: 500, type: java.lang.NoClassDefFoundError
The complete error returned by the server is:
{"errorType":"java.lang.NoClassDefFoundError","message":"parquet/hadoop/ParquetInputFormat","detailedMessage":"parquet/hadoop/ParquetInputFormat","detailedMessageHTML":"\u003cspan\u003e\u003cspan class\u003d\"err-msg\"\u003eparquet/hadoop/ParquetInputFormat\u003c/span\u003e\u003c/span\u003e","stackTraceStr":"java.lang.NoClassDefFoundError: parquet/hadoop/ParquetInputFormat\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:106)\n\tat java.base/java.security.AccessController.doPrivileged(Native Method)\n\tat java.base/javax.security.auth.Subject.doAs(Subject.java:423)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)\n\tat com.dataiku.dip.util.HadoopUtils.fixedUpDoAs(HadoopUtils.java:36)\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run(ParquetFormatExtractor.java:106)\n\tat com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords(FileFormatDatasetTestHandler.java:462)\n\tat com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats(FileFormatDatasetTestHandler.java:175)\n\tat com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:364)\n\tat com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:327)\n\tat com.dataiku.dip.futures.SimpleFutureThread.execute(SimpleFutureThread.java:36)\n\tat com.dataiku.dip.futures.FutureThreadBase.run(FutureThreadBase.java:88)\n","stackTrace":[{"file":"ParquetFormatExtractor.java","line":114,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run"},{"file":"ParquetFormatExtractor.java","line":106,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run"},{"file":"AccessController.java","line":-2,"function":"java.security.AccessController.doPrivileged"},{"file":"Subject.java","line":423,"function":"javax.security.auth.Subject.doAs"},{"file":"UserGroupInformation.java","line":1893,"function":"org.apache.hadoop.security.UserGroupInformation.doAs"},{"file":"HadoopUtils.java","line":36,"function":"com.dataiku.dip.util.HadoopUtils.fixedUpDoAs"},{"file":"ParquetFormatExtractor.java","line":106,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run"},{"file":"FileFormatDatasetTestHandler.java","line":462,"function":"com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords"},{"file":"FileFormatDatasetTestHandler.java","line":175,"function":"com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats"},{"file":"DatasetsTestController.java","line":364,"function":"com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute"},{"file":"DatasetsTestController.java","line":327,"function":"com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute"},{"file":"SimpleFutureThread.java","line":36,"function":"com.dataiku.dip.futures.SimpleFutureThread.execute"},{"file":"FutureThreadBase.java","line":88,"function":"com.dataiku.dip.futures.FutureThreadBase.run"}]}
Using DSS 8.0.2 with hadoop-2.10.0 and spark-2.4.5-bin-without-hadoop.
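For reference, the connector wiring I mean is the standard one; a minimal sketch (the property names are the usual gcs-connector ones, while the jar name and destination path are placeholders to adapt to your install):
# Sketch of the usual gcs-connector setup (jar name and paths are placeholders):
# 1. Put the connector jar on the Hadoop classpath:
cp gcs-connector-hadoop2-latest.jar "$HADOOP_HOME/share/hadoop/common/lib/"
# 2. Declare the filesystem implementations in core-site.xml:
#      fs.gs.impl                    = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
#      fs.AbstractFileSystem.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
# 3. Check access from the command line:
hadoop fs -ls gs://my-bucket/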
Answers
-
Ignacio_Toledo
Hi @phildav. What "engine" are you using to create the new dataset? Wherever the computation takes place, it apparently does not have the libraries needed to read parquet files installed.
-
I am seeing a very similar problem. Looking at run/backend.log, I can see that the class that cannot be found is one that should be provided by DSS itself:
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - [2020/11/23-18:03:45.186] [FT-TestAndDetectFormatFutureThread-ZAfFeflE-392] [WARN] [dku.futures] - Future thread failed
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - java.lang.NoClassDefFoundError: com/dataiku/dip/input/formats/parquet/DSSParquetInputFormat
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:106)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at java.security.AccessController.doPrivileged(Native Method)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at javax.security.auth.Subject.doAs(Subject.java:422)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.util.HadoopUtils.fixedUpDoAs(HadoopUtils.java:36)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run(ParquetFormatExtractor.java:106)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords(FileFormatDatasetTestHandler.java:462)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats(FileFormatDatasetTestHandler.java:175)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:364)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:327)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.futures.SimpleFutureThread.execute(SimpleFutureThread.java:36)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.futures.FutureThreadBase.run(FutureThreadBase.java:88)
The strange thing is that the missing class is in the same JAR file as the class making the call (com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)):
dataiku-dss-8.0.2/dist/dataiku-dip.jar
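One way to double-check both sides of this (a sketch; the paths assume a default DSS 8.0.2 install under /home/dataiku, adjust as needed). The DSS-side class lives in dataiku-dip.jar, while the parquet/hadoop/ParquetInputFormat class the HTTP error complains about ships in the bundled parquet-run jars:
# Is the DSS-side class really present in the DSS jar?
unzip -l /home/dataiku/dataiku-dss-8.0.2/dist/dataiku-dip.jar | grep DSSParquetInputFormat

# Which of the bundled parquet-run jars provides the class the JVM cannot find?
for j in /home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/*.jar; do
    unzip -l "$j" | grep -q "parquet/hadoop/ParquetInputFormat" && echo "$j"
done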
-
After some more investigation, we figured out that the problem in our case had to do with the use of a custom installation of Hadoop (unsupported by DSS). In the DSS installer, this script does the magic of including the required jars for Parquet support in HDFS:
import os
import re

def get_extra_parquet_jars(hadoop_version, hive_jars):
    """
    Gets the list of JARs to add to the DKU_HIVE_CP for support of Parquet 1.6
    for distributions that don't have it anymore.
    Note: this function also has a side effect: it detects the EMR flavor.
    """
    add_dss_parquet_jars = False
    # HDP 2.5+ does not provide parquet 1.5/1.6
    if hadoop_version.is_hdp3() or (hadoop_version.is_hdp() and re.search("^2\\.[56].*$", hadoop_version.hdp_version) is not None):
        print("HDP 2.5+ detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    if hadoop_version.is_cdh6():
        print("CDH 6+ detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    for jar in hive_jars:
        # Nor does Hive 2.x on EMR 5.x
        if re.search("hive\\-exec\\-2\\.[0-9]\\.[0-9]\\-amzn\\-", jar) is not None:
            print("EMR 5.x detected, adding parquet 1.6 to Hive classpath")
            add_dss_parquet_jars = True
            hv.flavor = "emr"  # 'hv' is defined elsewhere in the installer script
        # Nor does Hive 2.x on MapR 5.2 / MEP 3
        if re.search("hive\\-exec\\-2\\.[0-9]\\.[0-9]\\-mapr\\-", jar) is not None:
            print("Hive2 on MapR detected, adding parquet 1.6 to Hive classpath")
            add_dss_parquet_jars = True
    # Nor does Google Cloud Dataproc
    if hadoop_version.is_dataproc():
        print("Google Cloud Dataproc detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    if add_dss_parquet_jars:
        parquet_run_folder = "%s/lib/ivy/parquet-run/" % os.environ["DKUINSTALLDIR"]
        parquet_jars = []
        for f in os.listdir(parquet_run_folder):
            if f.endswith(".jar"):
                parquet_jars.append("%s%s" % (parquet_run_folder, f))
        return parquet_jars
    else:
        return []
As you can see, it adds the parquet 1.6 jars for Hortonworks, Cloudera, EMR, MapR, Google Cloud Dataproc, etc. (when the corresponding conditions are met), but in our case we have a "generic" Hadoop version, so they were skipped.
Going back to the original issue with Google Cloud Hadoop: those jars should have been installed, so I believe something went wrong during the Hadoop integration installation (either in the script or in the user's configuration/environment).
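A quick way to see whether the integration step added them (a sketch; I am assuming the integration writes the generated classpath into $DSS_DATA_DIR/bin/env-hadoop.sh, which is what our install produced):
# If the integration worked, parquet-run jars should show up in the generated env file:
grep -o "[^:]*parquet-run[^:]*" $DSS_DATA_DIR/bin/env-hadoop.sh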
In any case, the workaround we used was simply to modify $DSS_DATA_DIR/bin/env-site.sh as follows:
export DKU_HADOOP_CP=$DKU_HADOOP_CP:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/adal4j-1.6.4.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/jdom-1.1.3.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/parquet-hadoop-bundle-1.6.0.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/parquet-pig-1.6.0.jar
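After editing env-site.sh, the backend has to be restarted to pick up the new classpath; a minimal check, assuming the standard DATA_DIR layout and the dss control script:
# Sanity-check the variable expansion before restarting:
source $DSS_DATA_DIR/bin/env-site.sh && echo "$DKU_HADOOP_CP" | tr ':' '\n' | grep parquet
# Then restart DSS so the backend re-reads it:
$DSS_DATA_DIR/bin/dss restart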
This ensures that the parquet jars shipped with DSS end up on the backend's classpath. A more dynamic variant (which adapts to other DSS versions and to whatever jars are in that directory) could be the following:
export DKU_HADOOP_CP=$DKU_HADOOP_CP:$(ls $DKUINSTALLDIR/lib/ivy/parquet-run/*.jar | xargs echo | sed "s/ /:/g")
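An equivalent one-liner that avoids the xargs/sed round-trip (assuming a coreutils paste is available):
export DKU_HADOOP_CP=$DKU_HADOOP_CP:$(ls $DKUINSTALLDIR/lib/ivy/parquet-run/*.jar | paste -sd: -)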