How to ignore EmptyFileException

Georghios
Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭

I have a connection to a blob storage and I would like to build a dataset from xlsx files. Some of them are empty (0-byte files), but there's no option to ignore these in the GUI...

Best Answer

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron
    Answer ✓

    Thanks, it's a lot easier to understand and reproduce the error now. I can confirm that I get the same error with empty Excel files. I also don't see any way to filter out the empty files or avoid the exception in a visual, no-code way.

    Personally I think this is a bug for two reasons:

    1. The "File for Test and preview" setting is clever enough to look for non-empty files, so why wouldn't the loader process do the same? It seems silly to do it for the tester and not for the loader.
    2. Why would the loader attempt to load an empty file? There is obviously nothing to load.

    You should raise it with Dataiku Support, but I suspect that even if it is accepted as a bug it will not get much priority for fixing, since this is a problem of your own making (i.e. why do you have empty files in the blob storage?). And even if it were fixed, you would probably not be able to upgrade to the latest version that quickly.

    So I think you have 2 options:

    1) Fix your writer process so that it doesn't leave 0-byte files.

    2) Pre-process the blob storage and remove any 0-byte files before you attempt to load them. This will certainly require some Python code, so it's not going to be possible with a visual recipe.

    Which option are you able to go for?
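
For option 2, a minimal sketch of the clean-up step might look like the one below. It is not an official Dataiku recipe. It assumes the blob container is exposed to DSS as a managed folder and that the folder handle offers `list_paths_in_partition()`, `get_path_details()` (returning a dict with a `size` key) and `delete_path()`, as `dataiku.Folder` handles are expected to. The function name `remove_empty_files` and the folder name in the usage comment are made up for illustration.

```python
def remove_empty_files(folder, suffix=".xlsx", dry_run=True):
    """Return the paths of zero-byte files in `folder`.

    `folder` is any object with list_paths_in_partition(),
    get_path_details(path) and delete_path(path); this sketch assumes
    a Dataiku managed folder handle behaves that way.
    Files are only deleted when dry_run is False.
    """
    empty_paths = []
    for path in folder.list_paths_in_partition():
        # Only consider files with the requested extension (case-insensitive).
        if not path.lower().endswith(suffix):
            continue
        details = folder.get_path_details(path)
        # Treat a file as empty only when a size of exactly 0 is reported;
        # a missing size is left alone rather than guessed at.
        if details.get("size") == 0:
            empty_paths.append(path)
            if not dry_run:
                folder.delete_path(path)
    return empty_paths


# Possible usage inside a DSS Python recipe or notebook
# ("raw_excel_files" is a hypothetical managed folder name):
#
#   import dataiku
#   folder = dataiku.Folder("raw_excel_files")
#   print(remove_empty_files(folder, dry_run=True))   # inspect first
#   remove_empty_files(folder, dry_run=False)         # then delete
```

Running it with `dry_run=True` first lists what would be removed, which is worth checking before deleting anything from shared storage. If deleting is not acceptable, the returned list could instead be used to decide which files to copy to a clean folder.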

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    How are you loading your files? Where do you actually get the exception?

  • Georghios
    Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭
    edited July 17

    Thank you for your response, @Turribeach.

    The Excel files are stored in Azure Blob Storage. I created a dataset in Dataiku by explicitly selecting them, then ran a recipe (any recipe will do) to store the amalgamated dataset:

    [Screenshots: Screenshot 2023-10-02 115453.png, Screenshot 2023-10-02 114955.png, Screenshot 2023-10-02 115028.png]

    The error I'm receiving is this:

    Oops: an unexpected error occurred

    Failed to open Excel file, caused by: EmptyFileException: The supplied file was empty (zero bytes long)

    Please see our options for getting help

    HTTP code: , type: java.io.IOException

    java.io.IOException: Failed to open Excel file
        at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
        at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
        at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
        at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
        at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
        at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
        at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
        at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
    Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
        at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
        at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
        at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
        at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)
        ... 7 more
    [09:13:25] [INFO] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - activity is finished
    [09:13:25] [ERROR] [dku.flow.activity] running compute_newar_dataset_stacked_prep_NP - Activity failed
    java.io.IOException: Failed to open Excel file
        at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:422)
        at com.dataiku.dip.formats.excel.ExcelFormatExtractor.doExtractStream(ExcelFormatExtractor.java:349)
        at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:154)
        at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
        at com.dataiku.dip.datasets.AbstractSingleThreadPusher.pushSplits(AbstractSingleThreadPusher.java:184)
        at com.dataiku.dip.datasets.UniversalSingleThreadPusher.push(UniversalSingleThreadPusher.java:224)
        at com.dataiku.dip.dataflow.exec.stream.SingleThreadFSLikeDatasetRunnable.run(SingleThreadFSLikeDatasetRunnable.java:71)
        at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:378)
    Caused by: org.apache.poi.EmptyFileException: The supplied file was empty (zero bytes long)
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:111)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:206)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:143)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:201)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:140)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:277)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
        at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:123)
        at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:90)
        at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:307)
        at com.dataiku.dip.formats.excel.ExcelFormatExtractor.readWorkbook(ExcelFormatExtractor.java:416)

  • Georghios
    Georghios Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭

    I went for another option: I asked the person responsible for the file sync to only deal with files over 0 bytes.

    I agree with your points though; the reader should avoid empty files indeed.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    That's option 1) for me.
