Job error: Unable to create executor due to Connection reset by peer

karyan

Hello everyone!

I am facing an issue with jobs that sometimes get stuck. I found this error in the logs, which I guess is related to the Spark configuration, but I can't figure out which setting causes the problem. Any suggestions would be appreciated.

 

[2021/11/04-22:07:57.368] [null-err-83] [INFO] [dku.utils]  - [2021/11/04-22:07:57.367] [rpc-server-4-1] [ERROR] [org.apache.spark.network.server.TransportRequestHandler]  - Error sending result StreamResponse[streamId=/jars/aws-java-sdk-bundle-1.11.375.jar,byteCount=98732349,body=FileSegmentManagedBuffer[file=/opt/dataiku/lib/java/aws-java-sdk-bundle-1.11.375.jar,offset=0,length=98732349]] to /10.176.247.149:24459; closing connection
[2021/11/04-22:07:57.374] [null-err-83] [INFO] [dku.utils]  - java.io.IOException: Connection reset by peer
[2021/11/04-22:07:57.376] [null-err-83] [INFO] [dku.utils]  - 	at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
[2021/11/04-22:07:57.378] [null-err-83] [INFO] [dku.utils]  - 	at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
[2021/11/04-22:07:57.381] [null-err-83] [INFO] [dku.utils]  - 	at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
[2021/11/04-22:07:57.383] [null-err-83] [INFO] [dku.utils]  - 	at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
[2021/11/04-22:07:57.385] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:130)
[2021/11/04-22:07:57.388] [null-err-83] [INFO] [dku.utils]  - 	at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
[2021/11/04-22:07:57.390] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:362)
[2021/11/04-22:07:57.392] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:235)
[2021/11/04-22:07:57.394] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:209)
[2021/11/04-22:07:57.396] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:400)
[2021/11/04-22:07:57.398] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:930)
[2021/11/04-22:07:57.400] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:361)
[2021/11/04-22:07:57.402] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:708)
[2021/11/04-22:07:57.404] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
[2021/11/04-22:07:57.406] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
[2021/11/04-22:07:57.408] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
[2021/11/04-22:07:57.410] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
[2021/11/04-22:07:57.412] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[2021/11/04-22:07:57.414] [null-err-83] [INFO] [dku.utils]  - 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2021/11/04-22:07:57.416] [null-err-83] [INFO] [dku.utils]  - 	at java.lang.Thread.run(Thread.java:748)
[2021/11/04-22:07:57.713] [null-err-83] [INFO] [dku.utils]  - [2021/11/04-22:07:57.713] [dispatcher-CoarseGrainedScheduler] [ERROR] [org.apache.spark.scheduler.TaskSchedulerImpl]  - Lost executor 6 on 10.234.72.180: Unable to create executor due to Connection reset by peer

 

I'd be glad for your help.

Thanks!

dshurgatw

I was getting the same error even though I tried many things. My job used to get stuck with this error after running for a very long time. I tried a few workarounds that helped me resolve it. Although I still see the same error in the logs, at least my job now runs fine.

  1. One reason could be that the executors kill themselves thinking they have lost the connection to the master. I added the configurations below to the spark-defaults.conf file (see the first sketch after this list for setting them in code).

    spark.network.timeout 10000000
    spark.executor.heartbeatInterval 10000000

  Basically, I have increased the network timeout and the heartbeat interval.

  2. For the particular step that used to get stuck, I just cached the dataframe used for processing in that step (see the second sketch after this list).

Note: these are workarounds; I still see the same error in the logs, but my job no longer gets terminated.
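
For reference, here is a minimal PySpark sketch of workaround 1, assuming you build the session yourself instead of editing spark-defaults.conf (in Dataiku you would typically put these keys in the job's Spark configuration instead); the app name is just a placeholder:

    from pyspark.sql import SparkSession

    # Raise the network timeout and heartbeat interval so executors do not
    # assume they have lost the connection to the driver. The values mirror
    # the spark-defaults.conf lines above; note that the Spark docs recommend
    # keeping spark.executor.heartbeatInterval significantly smaller than
    # spark.network.timeout.
    spark = (
        SparkSession.builder
        .appName("timeout-workaround")  # hypothetical app name
        .config("spark.network.timeout", "10000000")
        .config("spark.executor.heartbeatInterval", "10000000")
        .getOrCreate()
    )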
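And a sketch of workaround 2, caching and materializing the dataframe that feeds the step that used to get stuck; the input path, column name, and processing step are hypothetical:

    # Cache the dataframe so the expensive upstream read/computation happens
    # once, instead of being recomputed while the connection is flaky.
    df = spark.read.parquet("/path/to/input")  # hypothetical input
    df = df.cache()
    df.count()  # force evaluation so the cache is actually populated

    # Placeholder for the real processing step that used to get stuck.
    result = df.groupBy("key").count()
    result.show()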
