Meet DSS user Ben Powis, Data Science Manager at UK retail company MandM Direct Read More

Faster imports from Google BigQuery

Level 4
Faster imports from Google BigQuery

Hi all,

We use Google BigQuery a lot, it's where all our data begins. We are seeing some real slowness when it comes to reading data from BigQuery into DSS,  as a kind of work around we pass data to Google Cloud Storage and then import this into DSS, which is much faster.

However, we still need to read from BigQuery to do this initial load into GCS, and it's really slow. Here is an example, reading 1 million rows in batches of 2k:

ben_p_0-1587025293289.png

By contrast, the batch size and speed of import from GCS is much quicker, here is the same 1 million rows coming from GCS:

ben_p_1-1587025620580.png

Is there any way to increase this batch size, or speed up imports from BigQuery?

Ben

0 Kudos
4 Replies
Dataiker
Dataiker

Hi,

If you use the Sync recipe to perform the data copy from BigQuery to GCS, an optimized "BigQuery to GCS" engine will be used. Are you using a Sync recipe or another kind of recipe? Do you see the "BigQuery to GCS" engine selected ? If not, you can click on the engine selection widget in the Sync recipe to get more details about why it was not selected.

Level 4
Author

Hi Clement,

Thanks for your response - I was using the prepare recipe, not sync. However, have I test out Sync I see the same level of performance:

ben_p_0-1587029196646.png

Looking into the engine I see:

ben_p_1-1587029252288.png

My BigQuery dataset is built on a SQL query, I assume this relates to this? Is there any way around it? Otherwise I will have to create this table in BQ as an additional step, which is not preferred.

Best,
Ben

 

0 Kudos
Dataiker
Dataiker

The BigQuery dataset in DSS must indeed be in "table" mode, however the "table" mode in DSS can also read views, so the simplest would be to create a view in BigQuery (this way, no data copy is required)

Level 4
Author

Thanks @Clément_Stenac, I will switch to a view instead. One final question, are there any speed improvements from using a format other than .csv - such as AVRO?

Ben

0 Kudos