S3 dataset: skip zero-size files
yuriy-medvedev
Registered Posts: 1 ✭
Hi there, I have an issue with a dataset stored in S3: some files are 0 bytes in size, and the sync job exits with this error:
alb.9a8582074f9ea6ff_20230427T0035Z_34.214.228.143_2ur2kmxf.log.gz

[14:29:08] [ERROR] [dku.pipeline] - Parallel stream worker failed
java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.pipeline] - done running
[14:29:08] [INFO] [dku.flow.stream] - Parallel streamer done
[14:29:08] [INFO] [dku.flow.activity] - Run thread failed for activity compute_aws-logs-us_copy_NP
java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.flow.activity] - Run thread done for activity compute_aws-logs-us_copy_NP
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - activity is finished
[14:29:08] [ERROR] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Activity failed
java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
    at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
    at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Executing default post-activity lifecycle hook
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Done post-activity tasks
Does anyone have an idea how to skip these files?
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @yuriy-medvedev
One possible approach is to create a managed folder pointing to the path of this dataset,
then either delete the zero-byte files or move all non-zero-byte files to another managed folder.
The example below deletes all zero-byte files. If you don't want to modify the original folder content, you can first use a Merge Folder recipe and then perform the clean-up on the copy. Then use this newly created path to create your dataset, for example with Files in Folder.

# -*- coding: utf-8 -*-
import dataiku

# Read recipe inputs: the managed folder that points at the S3 path
handle = dataiku.Folder("JUGh5UiU")
path_details = handle.get_path_details()

# Delete every zero-byte file listed in the folder
for child in path_details['children']:
    if child['size'] == 0:
        print("will delete: " + child['fullPath'])
        handle.delete_path(child['fullPath'])
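If you prefer the "move" variant instead of deleting in place, here is a minimal sketch. It assumes a second managed folder (given the hypothetical ID "clean_logs" here) has already been created in the Flow to receive the copies, and, like the example above, it only looks at the top-level children of the source folder.

import dataiku

# Source folder pointing at the S3 path with the raw logs (same ID as the example above)
src = dataiku.Folder("JUGh5UiU")
# Hypothetical destination managed folder that will hold only non-empty files
dst = dataiku.Folder("clean_logs")

for child in src.get_path_details()['children']:
    if child['size'] > 0:
        # Stream each non-empty file into the destination folder, keeping the same path
        with src.get_download_stream(child['fullPath']) as stream:
            dst.upload_stream(child['fullPath'], stream)

You can then point a Files in Folder dataset at the clean folder, as suggested above.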
Hope that helps!