I have set up an HDFS connection to access a Google Cloud Storage bucket that contains Parquet files.
After adding GoogleHadoopFileSystem to the Hadoop configuration, I can access the bucket and the files.
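For context, the wiring is roughly the standard GCS connector setup in core-site.xml (property names from the connector's documentation; the key file path is a placeholder):

fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
fs.AbstractFileSystem.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
google.cloud.auth.service.account.enable = true
google.cloud.auth.service.account.json.keyfile = /path/to/keyfile.json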
However, when I create a new dataset and select a Parquet file (including a standard sample found at https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet), I get this error:
Oops: an unexpected error occurred
parquet/hadoop/ParquetInputFormat
Please see our options for getting help
HTTP code: 500, type: java.lang.NoClassDefFoundError
The complete error returned by the server is:
{"errorType":"java.lang.NoClassDefFoundError","message":"parquet/hadoop/ParquetInputFormat","detailedMessage":"parquet/hadoop/ParquetInputFormat","detailedMessageHTML":"\u003cspan\u003e\u003cspan class\u003d\"err-msg\"\u003eparquet/hadoop/ParquetInputFormat\u003c/span\u003e\u003c/span\u003e","stackTraceStr":"java.lang.NoClassDefFoundError: parquet/hadoop/ParquetInputFormat\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:106)\n\tat java.base/java.security.AccessController.doPrivileged(Native Method)\n\tat java.base/javax.security.auth.Subject.doAs(Subject.java:423)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)\n\tat com.dataiku.dip.util.HadoopUtils.fixedUpDoAs(HadoopUtils.java:36)\n\tat com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run(ParquetFormatExtractor.java:106)\n\tat com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords(FileFormatDatasetTestHandler.java:462)\n\tat com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats(FileFormatDatasetTestHandler.java:175)\n\tat com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:364)\n\tat com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:327)\n\tat com.dataiku.dip.futures.SimpleFutureThread.execute(SimpleFutureThread.java:36)\n\tat com.dataiku.dip.futures.FutureThreadBase.run(FutureThreadBase.java:88)\n","stackTrace":[{"file":"ParquetFormatExtractor.java","line":114,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run"},{"file":"ParquetFormatExtractor.java","line":106,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run"},{"file":"AccessController.java","line":-2,"function":"java.security.AccessController.doPrivileged"},{"file":"Subject.java","line":423,"function":"javax.security.auth.Subject.doAs"},{"file":"UserGroupInformation.java","line":1893,"function":"org.apache.hadoop.security.UserGroupInformation.doAs"},{"file":"HadoopUtils.java","line":36,"function":"com.dataiku.dip.util.HadoopUtils.fixedUpDoAs"},{"file":"ParquetFormatExtractor.java","line":106,"function":"com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run"},{"file":"FileFormatDatasetTestHandler.java","line":462,"function":"com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords"},{"file":"FileFormatDatasetTestHandler.java","line":175,"function":"com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats"},{"file":"DatasetsTestController.java","line":364,"function":"com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute"},{"file":"DatasetsTestController.java","line":327,"function":"com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute"},{"file":"SimpleFutureThread.java","line":36,"function":"com.dataiku.dip.futures.SimpleFutureThread.execute"},{"file":"FutureThreadBase.java","line":88,"function":"com.dataiku.dip.futures.FutureThreadBase.run"}]}
Using DSS 8.0.2 with hadoop-2.10.0 and spark-2.4.5-bin-without-hadoop.
Hi @phildav. What "engine" are you using to create the new dataset? Wherever the computation is taking place, it apparently doesn't have the libraries needed to read Parquet files installed.
I see a very similar problem. Looking at run/backend.log, I can see that the failure comes from not finding a class that should be provided by DSS itself:
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - [2020/11/23-18:03:45.186] [FT-TestAndDetectFormatFutureThread-ZAfFeflE-392] [WARN] [dku.futures] - Future thread failed
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - java.lang.NoClassDefFoundError: com/dataiku/dip/input/formats/parquet/DSSParquetInputFormat
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:106)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at java.security.AccessController.doPrivileged(Native Method)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at javax.security.auth.Subject.doAs(Subject.java:422)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.util.HadoopUtils.fixedUpDoAs(HadoopUtils.java:36)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor.run(ParquetFormatExtractor.java:106)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.gatherSampleRecords(FileFormatDatasetTestHandler.java:462)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.datasets.fs.FileFormatDatasetTestHandler.detectFormats(FileFormatDatasetTestHandler.java:175)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:364)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.server.datasets.DatasetsTestController$TestAndDetectFormatFutureThread.compute(DatasetsTestController.java:327)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.futures.SimpleFutureThread.execute(SimpleFutureThread.java:36)
[2020/11/23-18:03:45.187] [KNL-FEK-cQ06LqL9-err-45813] [INFO] [dku.utils] - at com.dataiku.dip.futures.FutureThreadBase.run(FutureThreadBase.java:88)
The strange thing is that the missing class is in the same JAR file as the class making the call (com.dataiku.dip.input.formats.parquet.ParquetFormatExtractor$1.run(ParquetFormatExtractor.java:114)):
dataiku-dss-8.0.2/dist/dataiku-dip.jar
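A quick way to confirm the class is indeed bundled there (standard unzip, adjust the install path to yours):

unzip -l dataiku-dss-8.0.2/dist/dataiku-dip.jar | grep DSSParquetInputFormat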
After some more investigation, we figured out that the problem in our case had to do with using a custom installation of Hadoop (unsupported by DSS). This function from the DSS Hadoop integration script, which adds the required JARs for Parquet support in HDFS, is where the magic happens:
import os
import re

def get_extra_parquet_jars(hadoop_version, hive_jars):
    """
    Gets the list of JARs to add to the DKU_HIVE_CP for support of Parquet 1.6
    for distributions that don't have it anymore.
    Note: this function also has a side effect: it detects the EMR flavor.
    """
    add_dss_parquet_jars = False
    # HDP 2.5+ does not provide parquet 1.5/1.6
    if hadoop_version.is_hdp3() or (hadoop_version.is_hdp() and re.search("^2\\.[56].*$", hadoop_version.hdp_version) is not None):
        print("HDP 2.5+ detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    if hadoop_version.is_cdh6():
        print("CDH 6+ detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    for jar in hive_jars:
        # Nor Hive 2.x on EMR 5.x.
        if re.search("hive\\-exec\\-2\\.[0-9]\\.[0-9]\\-amzn\\-", jar) is not None:
            print("EMR 5.x detected, adding parquet 1.6 to Hive classpath")
            add_dss_parquet_jars = True
            hv.flavor = "emr"  # 'hv' is defined elsewhere in the installer script
        # Nor Hive 2.x on MapR 5.2 / MEP3.
        if re.search("hive\\-exec\\-2\\.[0-9]\\.[0-9]\\-mapr\\-", jar) is not None:
            print("Hive2 on MapR detected, adding parquet 1.6 to Hive classpath")
            add_dss_parquet_jars = True
    # Nor Google Cloud Dataproc
    if hadoop_version.is_dataproc():
        print("Google Cloud Dataproc detected, adding parquet 1.6 to Hive classpath")
        add_dss_parquet_jars = True
    if add_dss_parquet_jars:
        parquet_run_folder = "%s/lib/ivy/parquet-run/" % os.environ["DKUINSTALLDIR"]
        parquet_jars = []
        for f in os.listdir(parquet_run_folder):
            if f.endswith(".jar"):
                parquet_jars.append("%s%s" % (parquet_run_folder, f))
        return parquet_jars
    else:
        return []
As you can see, it adds the Parquet 1.6 JARs for Hortonworks, Cloudera, EMR, MapR, Google Cloud Dataproc, etc. (when the corresponding conditions are met), but in our case we have a "generic" Hadoop distribution, so none of the conditions matched and the JARs were never added.
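A quick sanity check for whether your own install added them is to look for the parquet-run folder in the environment files generated under the data directory (file names can vary by DSS version):

grep parquet-run $DSS_DATA_DIR/bin/env-*.sh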
Going back to the original issue: with Google Cloud Hadoop, those JARs should have been installed, so I believe there was a problem in the Hadoop integration installation process (either in the script or in the user's configuration/environment).
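To rule out the integration step itself, it may be worth re-running it and watching the output for the "adding parquet 1.6" messages printed by the function above (this is the documented DSS command, run from the data directory):

cd $DSS_DATA_DIR
./bin/dssadmin install-hadoop-integration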
In any case, the workaround we used was simply to modify $DSS_DATA_DIR/bin/env-site.sh, adding the following:
export DKU_HADOOP_CP=$DKU_HADOOP_CP:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/adal4j-1.6.4.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/jdom-1.1.3.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/parquet-hadoop-bundle-1.6.0.jar:/home/dataiku/dataiku-dss-8.0.2/lib/ivy/parquet-run/parquet-pig-1.6.0.jar
This ensures that the Parquet JARs distributed with DSS are part of the backend's classpath. A more dynamic approach (which adapts to different Dataiku versions and to whatever JARs are in that directory) could be the following:
export DKU_HADOOP_CP=$DKU_HADOOP_CP:$(ls $DKUINSTALLDIR/lib/ivy/parquet-run/*.jar | xargs echo | sed "s/ /:/g")
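Either way, after editing env-site.sh, restart DSS so the backend picks up the new classpath:

$DSS_DATA_DIR/bin/dss restart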