Run SQL ETL scripts from Git repo
Hi, I have existing ETL scripts (mainly pure SQL) on an existing Git repo.
I need to continue to write the SQL to the Git repo directly (i.e. outside DSS).
I'd like to use DSS to run these scripts – either (a) just the SQL, or (b) Python scripts in combination with SQL.
Current line of thinking: if I import the Git repo containing the scripts, I can run them easily from Scenarios, but there doesn't seem to be an automated way to pull a remote Git repo ("Update from Git") after it receives new commits.
Is there an API to update Git references, or some other way to pull an external Git repo and use the scripts from that repo as input to a Scenario?
Or: what's the best way to do this? Thanks
Best Answer
Alexandru (Dataiker)
Hi,
One way you could do this is to pull the latest SQL code from the Git repo and execute it using SQLExecutor2, all in Python.
Then execute this Python code from your scenario. Here is a sample that pulls a raw SQL file from GitHub using a personal access token.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from dataiku.core.sql import SQLExecutor2
import requests

# Get the needed data to prepare the query,
# for example, load from another table.
# my_output_dataset must point at the target SQL connection.
e = SQLExecutor2(dataset=my_output_dataset)

# This can vary depending on the Git host and whether the repo is private.
# Credentials can be stored in variables; for GitHub tokens see
# https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-token
file_url = "https://raw.githubusercontent.com/user/repo-name/branch/file.sql"
# Use .text (str) rather than .content (bytes) so it can be concatenated below
query_file = requests.get(file_url, auth=('username', 'personal_token')).text

# Construct the query further as needed based on the Git file
sql = query_file + " FROM some_table_name"
query = e.query_to_df(sql)
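If your ETL files contain non-SELECT statements (CREATE, INSERT, UPDATE, etc.), a variation on the same idea is to pass them as pre_queries of a trivial SELECT, since query_to_df() expects the main query to return rows. A minimal sketch, assuming the file holds simple ';'-separated statements and my_output_dataset points at the target SQL connection (the URL, filename, and credentials are placeholders):

import requests
from dataiku.core.sql import SQLExecutor2

executor = SQLExecutor2(dataset=my_output_dataset)

# Pull the raw ETL file from GitHub (placeholder URL and credentials)
etl_sql = requests.get(
    "https://raw.githubusercontent.com/user/repo-name/branch/etl.sql",
    auth=('username', 'personal_token'),
).text

# Naive statement split; assumed adequate for simple scripts without
# ';' inside string literals or procedure bodies
statements = [s.strip() for s in etl_sql.split(';') if s.strip()]

# Run the ETL statements as pre-queries, then a trivial SELECT so
# query_to_df() has a result set to return
executor.query_to_df("SELECT 1 AS done", pre_queries=statements)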
Let me know if this helps.
Answers
Looks great, thanks Alex