Failed to create file in HDFS

ismail Registered Posts: 4 ✭✭✭✭

Hi,

I have set up DSS with Hadoop, but I am having permission issues when storing datasets in HDFS.

I used the article below for the setup, but I need help with the step in the HDFS section that says to create a writable home directory in HDFS. That is my suspicion for the cause of the errors.

https://doc.dataiku.com/dss/latest/hadoop/installation.html#hdfs

> You may also need to set up a writable HDFS home directory for DSS (typically `/user/dataiku`) if you plan to store DSS datasets in HDFS.
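For reference, the directory-creation step from that section boils down to something like this (run as an HDFS superuser, often the `hdfs` user; `dataiku` is assumed here to be the DSS service account and may differ on your system):

```
# Create the DSS home directory in HDFS and hand ownership to the DSS user
hdfs dfs -mkdir -p /user/dataiku
hdfs dfs -chown dataiku /user/dataiku
```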

The error from the DSS recipe build is below:

Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0/_temporary/attempt_dss_0000_r_000000_0 (exists=false, cwd=file:/home/dataiku/dss/designer/data/run)

[10:41:30] [ERROR] [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0
[10:41:30] [INFO] [com.dataiku.dip.input.formats.parquet.DSSRowWriteSupport] - Output Parquet MessageType : 
message hive_schema {
  optional int96 month;
  optional int64 count;
}

[10:41:30] [INFO] [dku.flow.activity] - Run thread failed for activity compute_sales_prepared_gpby_by_month_NP
java.io.IOException: Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0/_temporary/attempt_dss_0000_r_000000_0 (exists=false, cwd=file:/home/dataiku/dss/designer/data/run)
   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1052)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1032)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:921)
   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
   at com.dataiku.dip.input.formats.parquet.ParquetOutputWriter$1.run(ParquetOutputWriter.java:98)
   at com.dataiku.dip.input.formats.parquet.ParquetOutputWriter$1.run(ParquetOutputWriter.java:81)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
   at com.dataiku.dip.input.formats.parquet.ParquetOutputWriter.init(ParquetOutputWriter.java:81)
   at com.dataiku.dip.dataflow.exec.stream.ToDatasetStreamer.init(ToDatasetStreamer.java:125)
   at com.dataiku.dip.dataflow.exec.stream.ToDatasetStreamer.getAsProcessor(ToDatasetStreamer.java:108)
   at com.dataiku.dip.dataflow.exec.stream.ToDatasetStreamer.getAsOutput(ToDatasetStreamer.java:112)
   at com.dataiku.dip.recipes.code.sql.SQLQueryRecipeUtils.buildProcessorOutput(SQLQueryRecipeUtils.java:283)
   at com.dataiku.dip.recipes.code.sql.AbstractSQLQueryRecipeRunner.runRegularToDataset(AbstractSQLQueryRecipeRunner.java:178)
   at com.dataiku.dip.dataflow.exec.sql.SQLQueryRecipeRunner.runRegular(SQLQueryRecipeRunner.java:223)
   at com.dataiku.dip.dataflow.exec.sql.SQLQueryRecipeRunner.run(SQLQueryRecipeRunner.java:164)
   at com.dataiku.dip.dataflow.exec.MultiEngineRecipeRunner.run(MultiEngineRecipeRunner.java:203)
   at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:380)
[10:41:30] [INFO] [dku.flow.activity] running compute_sales_prepared_gpby_by_month_NP - activity is finished
[10:41:30] [ERROR] [dku.flow.activity] running compute_sales_prepared_gpby_by_month_NP - Activity failed
java.io.IOException: Mkdirs failed to create file:/user/dataiku/dss_managed_datasets/ISMAIL_PROJ5/sales_prepared_gpby_by_month/_temporary/0/_temporary/attempt_dss_0000_r_000000_0 (exists=false, cwd=file:/home/dataiku/dss/designer/data/run)

Thanks in advance, folks.

Answers

  • fchataigner2 Dataiker Posts: 355

    Hi,

    The URI in the log excerpt you pasted says `file:/user/...`, which means the Hadoop setup on the DSS instance (or, more generally, on the machine) is incomplete or incorrect. That URI implies the Hadoop property `fs.defaultFS` is still at its default value (`file:///`), which in turn implies that however Hadoop was set up, its settings are not visible to DSS.
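    A quick way to confirm this, assuming the Hadoop CLI is on your PATH:

    ```
    # Print the effective default filesystem as the Hadoop client sees it.
    # On a correctly configured machine this prints an hdfs:// URI;
    # if it prints file:///, the client is not picking up the cluster config.
    hdfs getconf -confKey fs.defaultFS
    ```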

    You need to check that you can run `hadoop` on the command line and inspect HDFS with commands like `hadoop fs -ls`. If your Hadoop installation is a local one that you set up manually (read: it doesn't come from a Hadoop vendor), there's a good chance that you don't have a HADOOP_CONF_DIR environment variable set. In that case, define it in the `./bin/env-site.sh` of your DSS instance, make it point to wherever the Hadoop conf files are, and restart DSS.
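    For example, a minimal sketch (the `/etc/hadoop/conf` path is a placeholder for wherever your Hadoop conf files live, and DATA_DIR stands for your DSS data directory):

    ```
    # Add to DATA_DIR/bin/env-site.sh so DSS can find the Hadoop configuration
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Restart DSS so the new environment takes effect
    DATA_DIR/bin/dss restart
    ```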
