How to ignore EmptyFileException
I have a connection to a blob storage and I would like to build a dataset from xlsx files. Some of them are empty (0-byte files), but there's no option to ignore these in the GUI.
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,049 Neuron
Thanks, it's a lot easier to understand and reproduce the error now. I can confirm that I get the same error with empty Excel files. I also don't see any way of filtering out the empty files or avoiding the exception in a visual, no-code way.
Personally I think this is a bug for two reasons:
- The File for Test and preview is clever enough to look for non-empty files, so why wouldn't the loader process do the same? It seems silly to do it for the tester and not the loader.
- Why would the loader attempt to load an empty file? There's obviously nothing to load.
You should raise it with Dataiku Support, but I suspect that even if it is accepted as a bug it will not get much priority for fixing, since this is a problem of your own making (i.e. why do you have empty files in the blob storage?). And even if it were fixed, you would probably not be able to upgrade to the latest version that quickly.
So I think you have 2 options:
1) Fix your writer process so that it doesn't leave 0-byte files.
2) Pre-process the blob storage and remove any 0-byte files before you attempt to load them. This will certainly require some Python code, so it's not going to be possible with a visual recipe.
Which option are you able to go for?
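For option 2, here is a rough sketch of what the pre-processing step could look like. It is not Dataiku-specific: it assumes the files are reachable as a local directory (for example a synced or mounted copy of the container), and the function names `split_by_size`, `list_local_xlsx` and `remove_empty_xlsx` are made up for illustration. If you list the blobs through a storage client or a managed folder instead, the same `split_by_size` filter should work on whatever (path, size) pairs that listing gives you.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_by_size(entries: Iterable[Tuple[str, int]]) -> Tuple[List[str], List[str]]:
    """Split (path, size_in_bytes) pairs into (non_empty, empty) path lists."""
    non_empty: List[str] = []
    empty: List[str] = []
    for path, size in entries:
        # A 0-byte file is what triggers EmptyFileException in the Excel reader.
        (empty if size == 0 else non_empty).append(path)
    return non_empty, empty


def list_local_xlsx(folder: str) -> List[Tuple[str, int]]:
    """List .xlsx files under a local folder (recursively) with their sizes.

    Assumption: the storage is visible as a local directory.
    """
    return [(str(p), p.stat().st_size) for p in sorted(Path(folder).rglob("*.xlsx"))]


def remove_empty_xlsx(folder: str) -> List[str]:
    """Delete the 0-byte .xlsx files under `folder` and return the deleted paths."""
    _, empty = split_by_size(list_local_xlsx(folder))
    for path in empty:
        Path(path).unlink()
    return empty
```

You would run `remove_empty_xlsx(...)` on the folder before building the dataset. If deleting is too aggressive, use only the first list returned by `split_by_size` to decide which files to include.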
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,049 Neuron
How are you loading your files? Where do you actually get the exception?
-
Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭
Thank you for your response @Turribeach.
The files are loaded as follows: Azure blob storage with Excel files > create a dataset in Dataiku by explicitly selecting them > run any recipe to store the amalgamated dataset.
The error I'm receiving is this:
Oops: an unexpected error occurred
Failed to open Excel file, caused by: EmptyFileException: The supplied file was empty (zero bytes long)
Please see our options for getting help
HTTP code: , type: java.io.IOException
java.io.IOException: Failed to open Excel file
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
    at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
    at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
    at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
    at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
    ... 7 more
[09:13:25] [INFO] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - activity is finished
[09:13:25] [ERROR] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - Activity failed
java.io.IOException: Failed to open Excel file
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
    at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
    at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
    at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
    at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
    at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
    at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
    at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
-
Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭
Went for another option. Asked the person responsible for the file sync to only deal with files over 0 bytes.
I agree with your points though; the reader should avoid empty files indeed.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,049 Neuron
That's option 1) for me.