BUG: Python multiprocessing leads to hanging jobs in partitioned data flows
Hi,
I am using Dataiku DSS 6.0.3 with partitioned datasets. When I run a Python recipe between two partitioned datasets with 5 partitions building in parallel, using a ProcessPoolExecutor with 2 or more processes (my machine has 16 cores), the jobs never finish. They stall, and the log endlessly repeats [dku.job.slave] - Sending status update followed by "Status update sent".
This seems to be a bug.
Answers
-
Hi,
Please note that when you run a Python recipe, DSS simply starts your code and lets Python run it; DSS does not alter it in any way. If your Python code does not progress, you would need to investigate what your Python code is doing, for example by grabbing the Python stack traces.
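One simple way to grab those stacks, assuming the recipe runs on a Unix host with Python 3, is the standard-library faulthandler module. This is only a sketch of that approach, not something DSS does for you:

    import faulthandler
    import signal

    # Dump the stack traces of all threads to stderr (which ends up in the
    # job log) whenever the recipe's Python process receives SIGUSR1.
    faulthandler.register(signal.SIGUSR1)

While the job hangs, find the PID of the recipe's Python process on the DSS host, run kill -USR1 <pid>, and look for the tracebacks in the job log to see where the code is stuck.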
DSS does not have any specific handling for multi-processed code. However, a case where DSS might interfere with such a job is if you pass a "DatasetWriter" object (i.e. the result of dataset.get_writer) from one process to another. This is because this DatasetWriter has a finalizer and duplicating it will lead to unexpected things happening. You should in that case make sure that you perform all computations in subprocesses, then close the pool, then write the outputs.
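As a rough sketch of that pattern (the dataset names and the process_chunk function below are placeholders, not the code from your recipe): all heavy work happens in the pool, the pool is fully shut down, and only then is the DatasetWriter created, in the parent process only.

    import dataiku
    import pandas as pd
    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(chunk):
        # Pure, CPU-bound work on a plain pandas DataFrame; no DSS handles
        # (Dataset, DatasetWriter, ...) are touched inside the subprocess.
        return chunk

    # Read the input in the parent process only.
    df = dataiku.Dataset("my_input").get_dataframe()
    chunks = [df.iloc[i::4] for i in range(4)]

    # Do all computations in subprocesses, passing only plain objects.
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))
    # The pool is closed here, before any writer exists.

    # Only now open the writer (assumes the output schema is already set).
    with dataiku.Dataset("my_output").get_writer() as writer:
        writer.write_dataframe(pd.concat(results))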
If this does not help, could you attach a diagnosis of your job (from the job page, click Actions > Download job diagnosis)?
-
Hi Clement,
If I run the code on 16 cores in a Jupyter notebook, the partition (that hangs otherwise) finishes in 20 seconds! I also modified my Python code to not use multiprocessing, and it then no longer hangs when executed in the flow.
So this clearly shows the incompatibility of DSS execution with multiprocessing.
Attached you will find output.log and log.log, whose file extensions I had to change in order to upload them to this discussion.