Why do Python API Dataset Read and Write operations incur CPU resource?

somepunter
somepunter Registered Posts: 20 ✭✭✭
edited July 16 in Using Dataiku

I would have expected read and write operations to be purely IO-bound and to need no CPU if I ran them in separate threads within the same process. The code below suggests otherwise.

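import time
from contextlib import contextmanager
from multiprocessing.pool import ThreadPool

from dataiku import Dataset

@contextmanager
def timer():
    # timer() was not defined in the original post; this is a minimal stand-in
    # that prints the wall-clock time spent inside the with-block.
    start = time.perf_counter()
    yield
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

POOL = ThreadPool(2)
# Read the dataset in a background thread.
POOL.apply_async(Dataset('mydataset').get_dataframe)

# This loop takes twice as long while the Dataiku API is reading. A regular pd.read_csv doesn't slow it down.
x = 0
with timer():
    for i in range(100000000):
        x += 1
        x -= 1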

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker

    Hi @somepunter,

    As we also responded in the support ticket, it's expected that parsing the incoming network stream or writing the outgoing stream takes CPU.
    If you have any further questions, you can follow up directly in the support ticket created.

    Thanks,
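
The slowdown described in the question can be reproduced without Dataiku. The sketch below assumes CPython with the GIL enabled, and it assumes (this thread does not confirm it) that the client parses the stream in Python code that holds the GIL. The helpers `io_like` and `parse_like` are hypothetical stand-ins, not part of any Dataiku API. A background thread that only waits should leave the main loop's timing roughly unchanged, while one that parses in Python should slow it noticeably.

```python
import threading
import time


def busy_loop(n):
    """Pure-Python CPU-bound loop, like the one in the question."""
    x = 0
    for _ in range(n):
        x += 1
        x -= 1
    return x


def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def run_with_background(background, n):
    """Time busy_loop(n) while `background` runs in another thread."""
    stop = threading.Event()
    worker = threading.Thread(target=background, args=(stop,))
    worker.start()
    try:
        return timed(busy_loop, n)
    finally:
        stop.set()
        worker.join()


def io_like(stop):
    # Hypothetical stand-in for a thread blocked on IO: sleeping releases the GIL.
    while not stop.is_set():
        time.sleep(0.01)


def parse_like(stop):
    # Hypothetical stand-in for parsing a stream in Python bytecode: competes for the GIL.
    while not stop.is_set():
        sum(int(tok) for tok in "1,2,3,4,5,6,7,8".split(","))


N = 2_000_000
baseline = timed(busy_loop, N)
with_io = run_with_background(io_like, N)
with_parse = run_with_background(parse_like, N)
print(f"baseline {baseline:.2f}s, io-like {with_io:.2f}s, parse-like {with_parse:.2f}s")
```

Exact timings depend on the machine and Python version, so no expected output is given here.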

  • somepunter
    somepunter Registered Posts: 20 ✭✭✭

    Apologies,

    I'd humbly argue that taking up roughly half the CPU for an essentially zero-CPU-cost IO operation like pd.to_csv() or pd.read_csv() isn't really "expected" behaviour among pydata wranglers and data scientists. You may expect it as a Dataiku implementation detail, but that's hard to justify to data practitioners.

    As an aside, it's hard enough to justify to my end users why writing a dataset via the dataiku.Dataset API is 100x slower than writing the same dataset to the same disk with pd.to_parquet(). I now also have to justify to them why the Dataiku API consumes half their CPU, when native Python IO operations that don't go through the Dataiku Java server do not.
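
One way to put numbers on the CPU cost discussed in this thread is to compare wall-clock time with process CPU time for a given call. The sketch below is a minimal illustration using only the standard library. The `measure` helper, the in-memory CSV payload, and the two sample workloads are hypothetical stand-ins, and the same helper could wrap a Dataset read or a pandas call in a real environment.

```python
import csv
import io
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call of fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


# Hypothetical in-memory CSV so the example needs no files or Dataiku instance.
payload = "\n".join(",".join(str(i + j) for j in range(10)) for i in range(50_000))


def parse_csv():
    # Parsing text into rows is CPU work, even though it is usually called "IO".
    rows = list(csv.reader(io.StringIO(payload)))
    return len(rows)


def wait_only():
    # Pure waiting, as when blocked on a slow disk or socket: almost no CPU time.
    time.sleep(0.2)


for name, fn in [("parse_csv", parse_csv), ("wait_only", wait_only)]:
    wall, cpu = measure(fn)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

A call whose CPU time is close to its wall time is compute-bound for that duration. One whose CPU time is near zero is mostly waiting.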
