Hello everyone!
I am facing an issue with jobs that sometimes get stuck. I found the error below in the logs, which I guess is related to the Spark configuration, but I can't figure out which setting causes the problem. Any suggestions would be appreciated.
[2021/11/04-22:07:57.368] [null-err-83] [INFO] [dku.utils] - [2021/11/04-22:07:57.367] [rpc-server-4-1] [ERROR] [org.apache.spark.network.server.TransportRequestHandler] - Error sending result StreamResponse[streamId=/jars/aws-java-sdk-bundle-1.11.375.jar,byteCount=98732349,body=FileSegmentManagedBuffer[file=/opt/dataiku/lib/java/aws-java-sdk-bundle-1.11.375.jar,offset=0,length=98732349]] to /10.176.247.149:24459; closing connection
[2021/11/04-22:07:57.374] [null-err-83] [INFO] [dku.utils] - java.io.IOException: Connection reset by peer
[2021/11/04-22:07:57.376] [null-err-83] [INFO] [dku.utils] - at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
[2021/11/04-22:07:57.378] [null-err-83] [INFO] [dku.utils] - at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
[2021/11/04-22:07:57.381] [null-err-83] [INFO] [dku.utils] - at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
[2021/11/04-22:07:57.383] [null-err-83] [INFO] [dku.utils] - at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
[2021/11/04-22:07:57.385] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:130)
[2021/11/04-22:07:57.388] [null-err-83] [INFO] [dku.utils] - at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
[2021/11/04-22:07:57.390] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:362)
[2021/11/04-22:07:57.392] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:235)
[2021/11/04-22:07:57.394] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:209)
[2021/11/04-22:07:57.396] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:400)
[2021/11/04-22:07:57.398] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:930)
[2021/11/04-22:07:57.400] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:361)
[2021/11/04-22:07:57.402] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:708)
[2021/11/04-22:07:57.404] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
[2021/11/04-22:07:57.406] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
[2021/11/04-22:07:57.408] [null-err-83] [INFO] [dku.utils] - at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
[2021/11/04-22:07:57.410] [null-err-83] [INFO] [dku.utils] - at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
[2021/11/04-22:07:57.412] [null-err-83] [INFO] [dku.utils] - at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[2021/11/04-22:07:57.414] [null-err-83] [INFO] [dku.utils] - at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2021/11/04-22:07:57.416] [null-err-83] [INFO] [dku.utils] - at java.lang.Thread.run(Thread.java:748)
[2021/11/04-22:07:57.713] [null-err-83] [INFO] [dku.utils] - [2021/11/04-22:07:57.713] [dispatcher-CoarseGrainedScheduler] [ERROR] [org.apache.spark.scheduler.TaskSchedulerImpl] - Lost executor 6 on 10.234.72.180: Unable to create executor due to Connection reset by peer
I will be glad for any help.
Thanks!
Hi,
Please open a support ticket (https://doc.dataiku.com/dss/latest/troubleshooting/obtaining-support.html), attaching a job diagnosis of the failing job (https://doc.dataiku.com/dss/latest/troubleshooting/problems/job-fails.html#getting-a-job-diagnosis).
Thanks,
I was getting the same error even though I tried many things. My job used to get stuck with this error after running for a very long time. I tried a few workarounds that helped me resolve it. I still see the same error, but at least my job now runs fine.
One possible reason is that the executors kill themselves, thinking they have lost the connection to the master. I added the configurations below to the spark-defaults.conf file:
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I increased the network timeout and the executor heartbeat interval.
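If editing spark-defaults.conf globally is not desirable, the same two properties can also be passed per job via spark-submit's --conf flag. A sketch only: your_job.py is a placeholder, and in DSS you would instead add these key/value pairs in the job's Spark configuration rather than launching spark-submit yourself:

```shell
spark-submit \
  --conf "spark.network.timeout=10000000" \
  --conf "spark.executor.heartbeatInterval=10000000" \
  your_job.py
```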
For the particular step that used to get stuck, I simply cached the dataframe used for processing in that step.
Note: these are workarounds. I still see the same error in the logs, but my job no longer gets terminated.