We are using Dataiku exclusively in Google Cloud Platform (GCP from now on) and mostly with Google Cloud Storage buckets (GCS from now on) and Google BigQuery (BQ from now), Google's Datawarehousing solution. The basic functionality works but we have found several limitations in Dataiku which I like to share with the community in case other people are thinking in using BQ in Dataiku:
- The #1 limitation is the lack of support for Python code recipes (except Python with SQLExecutor2). Considering you can easily connect and manipulate BQ data using pydata_google_auth and pandas_gbq Dataiku should really add this feature. This means that if you want to manipulate your BQ in Python data frames you have to sync the data from BQ to GCS, perform your data manipulation in GCS and then sync it again to BQ. This means adding more steps, adding unnecessary complexity to the flow, more time to compute, unnecessary storage, etc
- Poor support for BigQuery functions: For instance functions that have direct support in BQ like substring are not supported in a formula / prepare recipe ("Cannot use SQL engine: the formula is not fully translatable to SQL"). This means that we need to fall back to SQL recipes which means more steps, adding unnecessary complexity to the flow, more time to compute, unnecessary storage, etc.
- In DSS 8.0.3 Dataiku added support for "cross-project operation". Great we thought! But when we went to test this functionality we found that this feature refers to cross-(GCP)-project operations not cross-(DSS)-connection operations. In other words, starting with DSS 8.0.3 you can now use a single BigQuery connection to read / write across BigQuery tables in different GCP projects; however, the DSS connection of the input and output datasets must still be the same. This is something that should be very easy to fix. If BQ connection B has permission to read table X from connection A why throw an error when trying to sync the data? While I understand that Dataiku wouldn't know if connection B has such permissions over connection A it should just work if the correct permissions are in place. This is a limitation from the old world where SQL Server 1 can't directly connect to SQL Server 2. But when you are in a GCP project there is only 1 BQ infrastructure, irrespective of which identity you use to connect to it. So cross-(DSS)-connection operations should work normally as long as the permissions are in place.
- Another big pitfall of the BQ "cross-project operation" support is the inability to restrict which datasets the connection can use. In GCS you can restrict the folder the connection uses. This means that you can host multiple team projects using a single bucket, a single service account and creating different connections and folders for each team so that they can't see each other's data. There is no equivalent way to do this in a BQ connection. So even if you create different BQ datasets (I mean Google BQ datasets) you can't prevent users from selecting from other BQ datasets (I mean Google BQ datasets) through that connection. While you can limit where users can write you can't prevent them from reading from the other datasets. This means that you have to fall back to having separate GCP service accounts for each group/team you want to keep separate rather than being able to use a single service account.
- If you are using an adhoc SQL Query in your BQ dataset instead of selecting a view or a table you will not be able to compute row counts. This is a silly bug that should hopefully be fixed soon.
- To load data into BQ you first put it on GCS and then sync it to BQ. Sounds easy right? Well it depends... When you Sync from GCS to BQ you need to select the format as "CSV (BigQuery-sync compatible)". Well it turns out this format is not really creating a BigQuery-sync compatible CSV. Below are the things that Dataiku doesn't do in this BigQuery-sync compatible format and will make your Sync recipe fail. Tip: you can solve the first one with the Simplify rename action but for the other two there is no direct fix. These should really be taken care by Dataiku to produce a true/real BigQuery-sync compatible CSV.
- Spaces in column names
- Columns names starting with a number
- Columns names longer than 128 characters
So I hope this helps other people in their use of BigQuery in Dataiku.