S3 dataset skip zero size files

yuriy-medvedev Registered Posts: 1
edited July 16 in Using Dataiku

Hi there, I have an issue with a dataset stored in S3: some files are 0 bytes in size, and the sync job exits with an error.

alb.9a8582074f9ea6ff_20230427T0035Z_34.214.228.143_2ur2kmxf.log.gz
[14:29:08] [ERROR] [dku.pipeline] - Parallel stream worker failed
java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.pipeline] - done running
[14:29:08] [INFO] [dku.flow.stream] - Parallel streamer done
[14:29:08] [INFO] [dku.flow.activity] - Run thread failed for activity compute_aws-logs-us_copy_NP
java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.flow.activity] - Run thread done for activity compute_aws-logs-us_copy_NP
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - activity is finished
[14:29:08] [ERROR] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Activity failed
java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Executing default post-activity lifecycle hook
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Done post-activity tasks

Maybe someone has an idea how to skip these files?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17

    Hi @yuriy-medvedev,

    One possible approach is to create a managed folder pointing to the path of this dataset,
    then either delete the zero-byte files or move all non-zero-byte files to another managed folder.

    Here is an example that deletes all zero-byte files. If you don't want to modify the original folder's contents, you could first use a Merge Folder recipe and then perform the clean-up on the merged copy. Then use the newly created path to create your dataset, for example with Files in Folder.

    # -*- coding: utf-8 -*-
    import dataiku

    # Handle on the managed folder that points at the dataset's S3 path
    handle = dataiku.Folder("JUGh5UiU")

    # List the entries at the root of the folder (sub-folders are not traversed)
    path_details = handle.get_path_details()
    for child in path_details['children']:
        # Only delete files, not sub-directories, whose size is 0 bytes
        if not child.get('directory', False) and child['size'] == 0:
            print("will delete: " + child['fullPath'])
            handle.delete_path(child['fullPath'])
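
    For the other option mentioned above (moving the non-zero-byte files to another managed folder instead of deleting anything), a minimal sketch could look like the following. The helper name copy_non_empty_files and the folder IDs in the usage comment are hypothetical; the sketch assumes the dataiku.Folder methods list_paths_in_partition, get_path_details, get_download_stream and upload_stream behave as described in the Dataiku Python API documentation, and it has not been run against the original poster's bucket.

```python
def copy_non_empty_files(src, dst):
    """Copy every file larger than 0 bytes from folder `src` to folder `dst`.

    `src` and `dst` are expected to be dataiku.Folder handles (assumption).
    Returns two lists: the paths that were copied and the paths that were skipped.
    """
    copied, skipped = [], []
    for path in src.list_paths_in_partition():
        details = src.get_path_details(path)
        if details.get("size", 0) == 0:
            # Zero-byte file: leave it in the source folder and move on
            skipped.append(path)
            continue
        # Stream the file across without loading it into a DataFrame
        with src.get_download_stream(path) as stream:
            dst.upload_stream(path, stream)
        copied.append(path)
    return copied, skipped


# Usage inside a DSS Python recipe (the folder IDs are placeholders):
#   import dataiku
#   copied, skipped = copy_non_empty_files(
#       dataiku.Folder("SOURCE_FOLDER_ID"), dataiku.Folder("TARGET_FOLDER_ID"))
#   print("copied %d files, skipped %d empty files" % (len(copied), len(skipped)))
```

    The source folder is left untouched, so the dataset can then be built on the target folder and the empty files never reach the gzip decoder.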


    Hope that helps!
