I would have expected read and write operations to be purely IO-bound and to require essentially no CPU when run in separate threads within the same process. The code below suggests otherwise:
from multiprocessing.pool import ThreadPool
from dataiku import Dataset

POOL = ThreadPool(2)

# Start a Dataset read in a background thread ("my_dataset" is a placeholder).
reader = POOL.apply_async(lambda: Dataset("my_dataset").get_dataframe())

# This CPU-bound loop takes roughly twice as long while the Dataiku API is
# reading; a plain pd.read_csv() in another thread doesn't slow it down.
x = 0
for i in range(100000000):
    x += 1
    x -= 1
reader.get()
Hi @somepunter ,
As we responded in the support ticket as well: it is expected that CPU is used to parse the incoming network stream and to encode the outgoing one.
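To illustrate the point that stream parsing itself costs CPU cycles, here is a minimal, self-contained sketch (not Dataiku code) that parses an in-memory CSV "stream" and measures the CPU time consumed, using `time.process_time()` so wall-clock waiting is excluded:

```python
# Illustration only: even "pure IO" like reading a CSV stream spends
# measurable CPU time turning bytes into Python objects.
import csv
import io
import time

# Build an in-memory CSV "stream" (no disk or network involved).
rows = "\n".join("a,b,c,%d" % i for i in range(200_000))
stream = io.StringIO(rows)

start = time.process_time()          # CPU time, not wall-clock
parsed = list(csv.reader(stream))    # parsing consumes CPU cycles
cpu_seconds = time.process_time() - start

print(len(parsed), cpu_seconds)
```

The same principle applies to decoding a network stream: the bytes arrive "for free" from the OS, but turning them into rows and values is CPU work.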
If you have any further questions, you can follow up directly in the support ticket created.
I'd humbly argue that consuming roughly half the CPU for an essentially zero-CPU-cost IO operation like pd.read_csv() or pd.to_csv() isn't really "expected" behaviour amongst pydata wranglers and data scientists. You may expect it as a Dataiku implementation detail, but that's hard to justify to data practitioners.
As an aside, it's hard enough to justify to my end users why the dataiku.Dataset API is 100x slower than pd.to_parquet() when writing the same dataset to the same disk. Now I also have to justify why the Dataiku API consumes half their CPU, unlike native Python IO operations that don't travel via the Dataiku Java server.
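One plausible mechanism behind the observation above (an assumption on my part, not something confirmed in this thread) is GIL contention: work done in pure Python holds the GIL and steals time from a concurrent CPU-bound thread, whereas pd.read_csv's C parser releases the GIL while it works. The slowdown can be measured directly with a sketch like this; all names here are illustrative:

```python
# Sketch: measure how much a background thread slows a CPU-bound loop.
# A pure-Python background worker must hold the GIL to run, so the
# foreground loop's wall-clock time grows; a GIL-releasing C routine
# would not have this effect.
import threading
import time

def cpu_loop(n=5_000_000):
    x = 0
    for _ in range(n):
        x += 1
        x -= 1
    return x

def timed(background=None):
    t = None
    if background is not None:
        t = threading.Thread(target=background)
        t.start()
    start = time.perf_counter()
    cpu_loop()
    elapsed = time.perf_counter() - start
    if t is not None:
        t.join()
    return elapsed

baseline = timed()
contended = timed(background=cpu_loop)  # pure-Python work holds the GIL
print(baseline, contended)  # contended is typically well above baseline
```

If the Dataiku client does its stream parsing in pure Python, this would explain both the CPU consumption and why a concurrent CPU-bound loop roughly doubles in duration, while pd.read_csv leaves it unaffected.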