Hi there, I have an issue with a dataset stored in S3. Some files are 0 bytes in size, and the sync job exits with an error on this file:
alb.9a8582074f9ea6ff_20230427T0035Z_34.214.228.143_2ur2kmxf.log.gz
[14:29:08] [ERROR] [dku.pipeline] - Parallel stream worker failed
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.pipeline] - done running
[14:29:08] [INFO] [dku.flow.stream] - Parallel streamer done
[14:29:08] [INFO] [dku.flow.activity] - Run thread failed for activity compute_aws-logs-us_copy_NP
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.flow.activity] - Run thread done for activity compute_aws-logs-us_copy_NP
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - activity is finished
[14:29:08] [ERROR] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Activity failed
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at com.dataiku.dip.input.stream.DecodedInputStreamFactory.addDecoding(DecodedInputStreamFactory.java:19)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.extractSimple(ArchiveCapableFormatExtractor.java:142)
at com.dataiku.dip.input.formats.ArchiveCapableFormatExtractor.run(ArchiveCapableFormatExtractor.java:59)
at com.dataiku.dip.dataflow.exec.stream.ParallelStreamSlaveRunnable.run(ParallelStreamSlaveRunnable.java:61)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Executing default post-activity lifecycle hook
[14:29:08] [INFO] [dku.flow.activity] running compute_aws-logs-us_copy_NP - Done post-activity tasks
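For context, the failure itself is expected with empty objects: a zero-byte .gz file has no gzip header to parse, so any gzip reader gives up immediately. A quick Python illustration of the same behavior (just to show the cause, not part of the job):

import gzip

# An empty byte string stands in for a zero-byte .gz file: there is no
# gzip header to read, which is the same condition that surfaces above
# as java.io.EOFException in GZIPInputStream.readHeader.
try:
    gzip.decompress(b"")
except EOFError as e:
    print("EOFError:", e)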
Maybe someone has an idea how to skip these files?
Hi @yuriy-medvedev,
One possible approach is to create a managed folder pointing at the path of this dataset, then either delete the zero-byte files or move all the non-zero-byte files to another managed folder.
Here is an example that deletes all zero-byte files. If you don't want to modify the original folder content, you could first use a Merge Folder recipe to copy it, perform the clean-up on the copy, and then build your dataset from that new path using Files in Folder, for example.
# -*- coding: utf-8 -*-
import dataiku

# Managed folder pointing at the dataset's S3 path
handle = dataiku.Folder("JUGh5UiU")

# List the folder contents and delete every zero-byte file
path_details = handle.get_path_details()
for child in path_details['children']:
    if child['size'] == 0:
        print("will delete: " + child['fullPath'])
        handle.delete_path(child['fullPath'])
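If you would rather not delete anything from the original folder, a variant of the same script can instead copy only the non-empty files into a second managed folder and build the dataset from there. A minimal sketch, assuming you have created a target managed folder ("output_folder_id" below is a placeholder for its actual ID):

import dataiku

src = dataiku.Folder("JUGh5UiU")
dst = dataiku.Folder("output_folder_id")  # placeholder -- replace with your target folder's ID

# Stream every non-empty file from the source folder into the target one
for child in src.get_path_details()['children']:
    if child['size'] > 0:
        with src.get_download_stream(child['fullPath']) as stream:
            dst.upload_stream(child['fullPath'], stream)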
Hope that helps!