How to ignore EmptyFileException

Solved!
gjoseph
Level 2

I have a connection to a blob storage and I would like to build a dataset from xlsx files. Some of them are empty (0-byte files), but there's no option to ignore these in the GUI...

5 Replies
Turribeach

How are you loading your files? Where do you actually get the exception?

gjoseph
Level 2
Author

Thank you for your response @Turribeach.

The Excel files are loaded from an Azure blob storage. I created a dataset in Dataiku by explicitly selecting them, then ran a recipe to store the amalgamated dataset:

[Screenshots: Screenshot 2023-10-02 115453.png, Screenshot 2023-10-02 114955.png, Screenshot 2023-10-02 115028.png]

The error I'm receiving is this:

Oops: an unexpected error occurred

Failed to open Excel file, caused by: EmptyFileException: The supplied file was empty (zero bytes long)

Please see our options for getting help

HTTP code: , type: java.io.IOException

java.io.IOException: Failed to open Excel file
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
    at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
    at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
    at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
    at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
    ... 7 more

[09:13:25] [INFO] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - activity is finished
[09:13:25] [ERROR] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - Activity failed
java.io.IOException: Failed to open Excel file
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
    at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
    at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
    at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
    at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)

Turribeach

Thanks, it's a lot easier to understand and reproduce the error now. I can confirm that I get the same error with empty Excel files. I also don't see any way of filtering out the empty files or avoiding the exception in a visual, no-code way.

Personally I think this is a bug for two reasons:

  1. The "File for Test and preview" is clever enough to look for non-empty files, so why wouldn't the loading process do the same? It seems silly to do it for the tester and not the loader.
  2. Why would the loader attempt to load an empty file? There is obviously nothing to load.

You should raise it with Dataiku Support, but I suspect that even if it is accepted as a bug it will not get much priority for fixing, since this is a problem of your own making (i.e. why do you have empty files in the blob storage?). And even if it were fixed, you probably would not be able to upgrade to the latest version that quickly.

So I think you have 2 options:

1) Fix your writer process so that it doesn't leave 0-byte files.

2) Pre-process the blob storage and remove any 0-byte files before you attempt to load them. This will certainly require some Python code, so it is not going to be possible with a visual recipe.
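
For option 2), here is a minimal sketch of what that Python pre-processing step could look like. It is only an illustration, not a tested Dataiku recipe: the function names `find_empty_blobs` and `remove_empty_blobs`, the `prefix`/`suffix` parameters and the `dry_run` flag are all made up for this example. It assumes you pass in something that behaves like the `ContainerClient` from the `azure-storage-blob` package, i.e. it has `list_blobs(name_starts_with=...)` yielding objects with `.name` and `.size`, and `delete_blob(name)`.

```python
def find_empty_blobs(container, prefix="", suffix=".xlsx"):
    """Return the names of zero-byte blobs under `prefix` ending with `suffix`.

    `container` is assumed to behave like azure.storage.blob.ContainerClient.
    """
    return [
        blob.name
        for blob in container.list_blobs(name_starts_with=prefix or None)
        if blob.size == 0 and blob.name.lower().endswith(suffix)
    ]


def remove_empty_blobs(container, prefix="", suffix=".xlsx", dry_run=True):
    """Delete zero-byte blobs and return their names.

    With dry_run=True (the default) nothing is deleted, so the list can be
    reviewed first.
    """
    empty = find_empty_blobs(container, prefix, suffix)
    if not dry_run:
        for name in empty:
            container.delete_blob(name)
    return empty


# Hypothetical usage (requires `pip install azure-storage-blob`; the
# connection string, container name and prefix are placeholders):
#
#   from azure.storage.blob import ContainerClient
#   container = ContainerClient.from_connection_string(CONN_STR, "my-container")
#   print(remove_empty_blobs(container, prefix="incoming/", dry_run=True))
```

Running it with `dry_run=True` first is deliberate: it only lists what would be deleted, which seems safer when the files belong to someone else's sync process. Such a step would have to run before the dataset is built, for example from a Python step that runs ahead of the build.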

Which option are you able to go for?

gjoseph
Level 2
Author

Went for another option. Asked the person responsible for the file sync to only deal with files over 0 bytes.

I agree with your points though; the reader should avoid empty files indeed.

Turribeach

That's option 1) for me. 😉
