Job Failed exactly after 360h 0s

Solved!
pratikgujral
Level 1

Hi Dataiku Team,

I was running LLM inference on a static dataset in a Python recipe. I had estimated that my Dataiku job would complete in approximately 700 hours. However, it failed after exactly 360 hours and 0 seconds. Here is the error message from the logs:

2023-10-17T09:59:36.017: Unexpected ERROR waiting for job to complete
java.net.SocketTimeoutException: Read timed out
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at com.dataiku.dss.shadelib.org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:161)
	at com.dataiku.dss.shadelib.org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:82)
	at com.dataiku.dss.shadelib.org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:276)
	at com.dataiku.dss.shadelib.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
	at com.dataiku.dss.shadelib.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
	at com.dataiku.dss.shadelib.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
	at com.dataiku.dss.shadelib.org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:294)
	at com.dataiku.dss.shadelib.org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:257)
	at com.dataiku.dss.shadelib.org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:207)
	at com.dataiku.dss.shadelib.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
	at com.dataiku.dss.shadelib.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
	at com.dataiku.dss.shadelib.org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:679)
	at com.dataiku.dss.shadelib.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:481)
	at com.dataiku.dss.shadelib.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
	at com.dataiku.dss.shadelib.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at com.dataiku.dss.shadelib.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
	at com.dataiku.dss.shadelib.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.dataiku.dip.dataflow.kernel.master.JobExecutionKernelHandle.executeCommand(JobExecutionKernelHandle.java:334)
	at com.dataiku.dip.dataflow.kernel.master.JobExecutionKernelHandle.executeCommand(JobExecutionKernelHandle.java:312)
	at com.dataiku.dip.dataflow.kernel.master.BuildService$CombinedExecWaitThread.run(BuildService.java:561)

Now, since my Python recipe was reading data from a Dataiku-managed dataset stored on the DSS server filesystem itself, there is no external dependency that could have caused the "Read timed out" error. For additional context, my Python recipe writes the output generated by the code in chunks to another managed dataset. Hence, I can verify that the output written during the 360 hours of runtime is valid and was produced without errors.
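
To make the chunked read/write setup concrete, here is a minimal sketch of what such a recipe can look like, using the standard dataiku Python API; the dataset names, the "prompt" column, and the run_inference() helper are hypothetical placeholders, not the actual code from this job:

import dataiku

def run_inference(prompt):
    # Hypothetical stand-in for the real per-row LLM call
    return "response for: " + str(prompt)

input_ds = dataiku.Dataset("my_input")    # hypothetical dataset names
output_ds = dataiku.Dataset("my_output")

writer = None
try:
    # Stream the input in chunks so partial output is committed as the job runs
    for chunk in input_ds.iter_dataframes(chunksize=1000):
        chunk["llm_response"] = chunk["prompt"].apply(run_inference)
        if writer is None:
            # Set the output schema from the first processed chunk
            output_ds.write_schema_from_dataframe(chunk)
            writer = output_ds.get_writer()
        writer.write_dataframe(chunk)
finally:
    if writer is not None:
        writer.close()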

What could be the root cause of the problem, and how can I fix it? I am also curious to know why the job failed at exactly 360 hours, 0 minutes, 0 seconds. I need to be able to run long-running jobs.

Dataiku version: 11.3.2

Operating System: Ubuntu 20.04 (focal)

Dataiku Edition: Free Edition (single user; used in an academic institution)

PS: I have shared only the error message as it appeared in the job log, and not the entire job log as it is ~180 MB in size.

1 Solution
AlexT
Dataiker

Hi @pratikgujral ,

There is an internal DSS timeout for very long-running jobs: they fail after 15 days, which is exactly 360 hours. This limit is not currently configurable, but we will take note of this feature request.

You can try to split your job into several parts, or use partitioning or multi-threading to get the job to complete faster.

Thanks
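
To illustrate the multi-threading suggestion, here is a minimal sketch that fans each chunk's per-row LLM calls out over a thread pool. It assumes the inference call is I/O-bound and thread-safe and that the output schema is already set (for example, by the earlier partial run); the dataset names, the "prompt" column, and the call_llm() helper are hypothetical placeholders:

from concurrent.futures import ThreadPoolExecutor

import dataiku

def call_llm(prompt):
    # Hypothetical stand-in for the real, I/O-bound inference call
    return "response for: " + str(prompt)

input_ds = dataiku.Dataset("my_input")    # hypothetical dataset names
output_ds = dataiku.Dataset("my_output")

with ThreadPoolExecutor(max_workers=8) as pool:
    with output_ds.get_writer() as writer:
        for chunk in input_ds.iter_dataframes(chunksize=1000):
            # pool.map returns results in input order, so rows stay aligned
            chunk["llm_response"] = list(pool.map(call_llm, chunk["prompt"]))
            writer.write_dataframe(chunk)

Splitting the flow with partitioning would achieve a similar effect by turning the single ~700-hour job into many shorter per-partition jobs, each well under the 15-day limit.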
