Faster imports from Google BigQuery

ben_p
ben_p Neuron 2020, Registered, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant Posts: 143 ✭✭✭✭✭✭✭

Hi all,

We use Google BigQuery a lot; it's where all our data begins. We are seeing real slowness when reading data from BigQuery into DSS. As a workaround, we pass data to Google Cloud Storage and then import it into DSS, which is much faster.

However, we still need to read from BigQuery to do this initial load into GCS, and it's really slow. Here is an example, reading 1 million rows in batches of 2k:

[Screenshot: BigQuery read progress, 1 million rows in 2k batches]

By contrast, the batch size and import speed from GCS are much better; here are the same 1 million rows coming from GCS:

[Screenshot: GCS import progress for the same 1 million rows]

Is there any way to increase this batch size, or speed up imports from BigQuery?

Ben

Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Answer ✓

    Hi,

    If you use the Sync recipe to perform the data copy from BigQuery to GCS, an optimized "BigQuery to GCS" engine will be used. Are you using a Sync recipe or another kind of recipe? Do you see the "BigQuery to GCS" engine selected? If not, you can click on the engine selection widget in the Sync recipe to get more details about why it was not selected.
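    For reference, this kind of optimized copy relies on BigQuery's native export capability rather than row-by-row reads. Outside DSS, a roughly equivalent bulk export can be sketched with the `bq` command-line tool (the dataset, table, and bucket names below are placeholders, not taken from this thread):

    ```shell
    # Hypothetical names: my_dataset.my_table and gs://my-bucket are placeholders.
    # Run a native BigQuery extract job that writes the table straight to GCS
    # as sharded, compressed CSV files (the "-*" wildcard enables sharding).
    bq extract \
      --destination_format=CSV \
      --compression=GZIP \
      'my_dataset.my_table' \
      'gs://my-bucket/exports/my_table-*.csv.gz'
    ```

    This requires an authenticated Google Cloud SDK and appropriate BigQuery/GCS permissions, so it is a sketch of the mechanism rather than something runnable as-is.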

Answers

  • ben_p

    Hi Clement,

    Thanks for your response - I was using the Prepare recipe, not Sync. However, having tested out Sync, I see the same level of performance:

    [Screenshot: Sync recipe showing similar read performance]

    Looking into the engine I see:

    [Screenshot: engine selection details]

    My BigQuery dataset is built on a SQL query - I assume this is why the engine was not selected? Is there any way around it? Otherwise I will have to create this table in BQ as an additional step, which I'd prefer to avoid.

    Best,
    Ben

  • Clément_Stenac

    The BigQuery dataset in DSS must indeed be in "table" mode. However, "table" mode in DSS can also read views, so the simplest option would be to create a view in BigQuery (this way, no data copy is required).
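    As a sketch of that suggestion (all project, dataset, and column names below are placeholders standing in for the real query), the existing SQL query can be wrapped in a view from the `bq` command line:

    ```shell
    # Hypothetical names for illustration; the SELECT stands in for the real query.
    # A view stores only the query definition, so no data is copied; a DSS
    # BigQuery dataset in "table" mode can then read my_view like a regular table.
    bq mk \
      --use_legacy_sql=false \
      --view 'SELECT order_id, customer_id, order_total
              FROM `my_project.my_dataset.orders`' \
      my_dataset.my_view
    ```

    The same definition can also be created in the BigQuery console with `CREATE OR REPLACE VIEW`; either way, the Sync recipe should then see a plain table-like object and be able to pick the optimized engine.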

  • ben_p

    Thanks @Clément_Stenac, I will switch to a view instead. One final question: are there any speed improvements from using a format other than CSV, such as Avro?

    Ben
