Why do Python API Dataset Read and Write operations incur CPU resource?

somepunter · ‎06-13-2023

I would have expected read and write operations to be purely IO-bound and to need almost no CPU when run in separate threads within the same process. The code below suggests otherwise.

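import time
from contextlib import contextmanager
from multiprocessing.pool import ThreadPool

from dataiku import Dataset

@contextmanager
def timer():
    # Print the wall-clock time spent inside the with-block.
    start = time.perf_counter()
    yield
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

POOL = ThreadPool(2)
# Start the dataset read in a background thread.
POOL.apply_async(Dataset('mydataset').get_dataframe)

# This loop takes twice as long while the Dataiku API is reading. A regular pd.read_csv doesn't slow it down.
x = 0
with timer():
    for i in range(100000000):
        x += 1
        x -= 1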

AlexT · ‎06-14-2023

Hi @somepunter,
As we also responded in the support ticket:

This is expected: it takes CPU to parse the incoming network stream or to write the outgoing stream.
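To illustrate why parsing in a background thread can slow down an unrelated loop, here is a minimal sketch. It assumes standard CPython with the GIL, and it assumes the parsing happens at least partly in Python-level code; the thread does not confirm how the Dataiku client parses its stream. `parse_stream` and `wait_for_io` are hypothetical stand-ins, not Dataiku functions.

```python
import threading
import time


def busy_loop(n):
    """CPU-bound loop, same shape as the one in the question."""
    x = 0
    for _ in range(n):
        x += 1
        x -= 1
    return x


def parse_stream(stop):
    """Hypothetical stand-in for parsing rows in pure Python (holds the GIL)."""
    line = "1,2.5,abc,2023-06-13"
    parsed = 0
    while not stop.is_set():
        fields = line.split(",")
        _ = (int(fields[0]), float(fields[1]), fields[2], fields[3])
        parsed += 1
    return parsed


def wait_for_io(stop):
    """Hypothetical stand-in for a blocking read (waiting releases the GIL)."""
    stop.wait()


def time_loop_with(background, n=2_000_000):
    """Time busy_loop in the main thread while `background` runs in another thread."""
    stop = threading.Event()
    worker = threading.Thread(target=background, args=(stop,))
    worker.start()
    start = time.perf_counter()
    try:
        busy_loop(n)
    finally:
        elapsed = time.perf_counter() - start
        stop.set()       # tell the background thread to finish
        worker.join()
    return elapsed


if __name__ == "__main__":
    print(f"loop while thread waits on IO: {time_loop_with(wait_for_io):.2f}s")
    print(f"loop while thread parses rows: {time_loop_with(parse_stream):.2f}s")
```

On a GIL build the second timing is typically noticeably larger than the first, because both threads compete for the interpreter. Parsers that do their work in C and release the GIL would behave more like the first case.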
If you have any further questions, you can follow up directly in the support ticket created.

Thanks,

somepunter · ‎06-14-2023

Apologies,

I'd humbly argue that taking up roughly half the CPU for an essentially 0 CPU cost IO operation like pd.to_csv() and pd.read_csv() isn't really "expected" behaviour amongst the pydata wranglers and data scientists. You may expect it as a dataiku implementation detail, but that's hard to justify to data practitioners.

As an aside, it's hard enough to justify to my end users why writing the same dataset to the same disk is 100x slower via the dataiku.Dataset API than via pd.to_parquet(). I now also have to justify to them why the Dataiku API consumes half their CPU, when native Python IO operations that don't travel via the Dataiku Java server do not.
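One way to put numbers on the CPU cost discussed above is to compare wall-clock time with process CPU time around a call. This is only a sketch: `measure` is a hypothetical helper, and `time.process_time()` counts CPU used by this Python process (all of its threads), not by the Dataiku Java server.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process, all threads
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


if __name__ == "__main__":
    # A pure wait: wall time passes, almost no CPU is used.
    _, wall, cpu = measure(time.sleep, 0.2)
    print(f"sleep: wall={wall:.2f}s cpu={cpu:.2f}s")

    # A pure computation: CPU time is close to wall time.
    _, wall, cpu = measure(sum, range(5_000_000))
    print(f"sum:   wall={wall:.2f}s cpu={cpu:.2f}s")
```

Passing `Dataset('mydataset').get_dataframe` or a `pd.read_csv` call as `fn` would show how much of each read is CPU work inside the Python process rather than waiting.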
