Move tables that are in a Data Center (SQL database) to a Google data warehouse
Hello everyone.
I want to move tables that are in a Data Center (SQL database) to a Google data warehouse.
I'm trying to find the most efficient way to do it through Dataiku (if possible).
I have 2 doubts:
1. Is it possible to list all tables (not necessarily DSS datasets) using a python recipe that are in a database through a connection configured from dataiku?
2. Is it better to make the connection through Python instead of the connection made by Dataiku to handle the data in the database?
The goal is to do it as automatically as possible, that's why I want to do it by Python code, maybe using SQLExecutor2.
One aspect to emphasize is that each table is updated daily, so I still want to know if it is advisable to use dataset partitions to only attach new data.
Can you recommend if it is a good idea?
Thank you very much!!!!
Operating system used: Linux
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
@rafael_rosado97
wrote:Hello everyone.
I want to move tables that are in a Data Center (SQL database) to a Google data warehouse.
I'm trying to find the most efficient way to do it through Dataiku (if possible).Sync to Google Storage Bucket and then Sync to Google BigQuery
1. Is it possible to list all tables (not necessarily DSS datasets) using a python recipe that are in a database through a connection configured from dataiku?Yes, use the metadata dictionary tables/views from your database. All databases have this. You can use a Dataiku SQL Notebook (Notebooks => New Notebook => Write your own => SQL) to develop the query. Then convert the notebook to a Sqlk Query Recipe and use the output dataset as your list of tables to migrate
2. Is it better to make the connection through Python instead of the connection made by Dataiku to handle the data in the database?It really depends what you are trying to do. But if you want to do this programmatically then yes you will need to do it via Python. But not everything needs to be Python. For instance the query that gives you the tables to migrate can be a SQL Query dataset which you can use in a Python recipe. Then you will need to programmatically create a recipe to sync it to bucket dataset and then finally another sync recipe to a BigQuery dataset.
-
rafael_rosado97 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 61 Partner
Thank you, @Turribeach
I don't understand why I need to Sync to Google Storage Bucket
Can you explain this part, please?
Rafael Rosado.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
Because that’s the fastest way to load data into Google BigQuery (BQ) and in most cases you will be able to use a Sync recipe to sync the data directly from a bucket to BQ without having to do any other changes. Also you can’t sync directly between Microsoft SQL Server and BQ so you will always need to use a bucket to pivot from one technology to another.
-
rafael_rosado97 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 61 Partner
It is clearer now. Thanks, @Turribeach
I am thinking that making queries using python recipes would not be a good idea if I want to save memory, right? Because tables would be saved temporaly on file system. I just wanted to use Python for automation and to avoid many sync recipies (one for each table).
For that reason is better to use SQL recipes to make queries to leverage the SQL engine, right?
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
You got everything wrong. You DO want to use Sync recipes to a bucket to avoid having to save data in local files or memory. Dataiku will stream the data to the bucket so it doesn’t matter how big the table is, it will just take more time. I will suggest you do a quick test on how this works so you see how simple it is.
SQL table > Sync > Bucket > Sync > GCP table