Failed to create file in HDFS
Hi,
I have set up DSS with Hadoop, but I am having permission issues when storing datasets in HDFS.
I used the article below for the setup, but I need help with the steps in the HDFS section about creating a writable home directory in HDFS; I suspect that is the cause of the errors.
https://doc.dataiku.com/dss/latest/hadoop/installation.html#hdfs
You may also need to setup a writable HDFS home directory for DSS (typically “/user/dataiku”) if you plan to store DSS datasets in HDFS.
The error from the DSS recipe build is below:
Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0/_temporary/attempt_dss_0000_r_000000_0 (exists=false, cwd=file:/home/dataiku/dss/designer/data/run)

[10:41:30] [ERROR] [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0
[10:41:30] [INFO] [com.dataiku.dip.input.formats.parquet.DSSRowWriteSupport] - Output Parquet MessageType : message hive_schema { optional int96 month; optional int64 count; }
[10:41:30] [INFO] [dku.flow.activity] - Run thread failed for activity compute_sales_prepared_gpby_by_month_NP
java.io.IOException: Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0/_temporary/attempt_dss_0000_r_000000_0 (exists=false, cwd=file:/home/dataiku/dss/designer/data/run)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1052)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1032)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:921)
    at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
    at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
    at com.dataiku.dip.input.formats.parquet.ParquetOutputWriter$1.run(ParquetOutputWriter.java:98)
    at com.dataiku.dip.input.formats.parquet.ParquetOutputWriter$1.run(ParquetOutputWriter.java:81)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at com.dataiku.dip.input.formats.parquet.ParquetOutputWriter.init(ParquetOutputWriter.java:81)
    at com.dataiku.dip.dataflow.exec.stream.ToDatasetStreamer.init(ToDatasetStreamer.java:125)
    at com.dataiku.dip.dataflow.exec.stream.ToDatasetStreamer.getAsProcessor(ToDatasetStreamer.java:108)
    at com.dataiku.dip.dataflow.exec.stream.ToDatasetStreamer.getAsOutput(ToDatasetStreamer.java:112)
    at com.dataiku.dip.recipes.code.sql.SQLQueryRecipeUtils.buildProcessorOutput(SQLQueryRecipeUtils.java:283)
    at com.dataiku.dip.recipes.code.sql.AbstractSQLQueryRecipeRunner.runRegularToDataset(AbstractSQLQueryRecipeRunner.java:178)
    at com.dataiku.dip.dataflow.exec.sql.SQLQueryRecipeRunner.runRegular(SQLQueryRecipeRunner.java:223)
    at com.dataiku.dip.dataflow.exec.sql.SQLQueryRecipeRunner.run(SQLQueryRecipeRunner.java:164)
    at com.dataiku.dip.dataflow.exec.MultiEngineRecipeRunner.run(MultiEngineRecipeRunner.java:203)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:380)
[10:41:30] [INFO] [dku.flow.activity] running compute_sales_prepared_gpby_by_month_NP - activity is finished
[10:41:30] [ERROR] [dku.flow.activity] running compute_sales_prepared_gpby_by_month_NP - Activity failed
java.io.IOException: Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0/_temporary/attempt_dss_0000_r_000000_0 (exists=false, cwd=file:/home/dataiku/dss/designer/data/run)
    (same stack trace as above)
Thanks in advance, folks.
Best Answer
Hi ismail,
To make sure the "dataiku" user has a writable home directory in HDFS, you may want to involve your Hadoop admins. For example, you could run the "hdfs dfs" commands as the appropriate Hadoop user to create the home directory for your dataiku user, something like:
hdfs dfs -mkdir /user/dataiku
hdfs dfs -chown dataiku:dataiku /user/dataiku
su - dataiku
hdfs dfs -ls
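On most non-Kerberized clusters, the first two commands must be run as the HDFS superuser; the account name "hdfs" below is an assumption, so adjust it to whatever your cluster uses. A minimal sketch, including a quick write test to confirm the new directory is actually writable:

sudo -u hdfs hdfs dfs -mkdir -p /user/dataiku
sudo -u hdfs hdfs dfs -chown dataiku:dataiku /user/dataiku

# Then, as the dataiku user, verify the directory exists and accepts writes
su - dataiku
hdfs dfs -ls /user
hdfs dfs -touchz /user/dataiku/_write_test
hdfs dfs -rm /user/dataiku/_write_test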
You can also find many examples of how to do this online, such as:
https://blog.dbi-services.com/create-an-hdfs-users-home-directory/
http://www.hadooplessons.info/2017/12/creating-home-directory-for-user-in-hdfs-hdpca.html
However, as mentioned previously, this is more of a Hadoop exercise, so you may want to involve your Hadoop admins if you have any.
Best,
Andrew
Answers
Hi,
The URI in the log excerpt you pasted says `file:/user/...`, which means that the Hadoop setup on the DSS instance, or more generally on the machine, is incomplete or incorrect. That URI implies the Hadoop property fs.defaultFS is still at its default value (`file:///`), which in turn implies that however Hadoop was set up, the Hadoop settings aren't available to DSS.
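For reference, on a correctly configured cluster fs.defaultFS points at the NameNode in core-site.xml. A sketch of what that property looks like; the hostname and port below are placeholders, not values from your cluster:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>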
You need to check that you can run `hadoop` on the command line and inspect HDFS with commands like `hadoop fs -ls`. If your Hadoop installation is a local one that you set up manually (read: it doesn't come from a Hadoop vendor), there's a good chance that you don't have a HADOOP_CONF_DIR environment variable set; in that case, define it in the `./bin/env-site.sh` of your DSS instance, make it point to wherever the Hadoop conf files are, and restart DSS.
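As a sketch, assuming your Hadoop configuration files live in /etc/hadoop/conf (adjust to wherever yours actually are), the checks and the env-site.sh change would look like:

# Verify the Hadoop CLI works and actually sees the cluster
hadoop fs -ls /
hdfs getconf -confKey fs.defaultFS    # should print hdfs://..., not file:///

# In the DSS data directory's ./bin/env-site.sh, point DSS at the Hadoop conf
export HADOOP_CONF_DIR=/etc/hadoop/conf

After that, restart DSS so it re-reads the environment and picks up the cluster settings.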