Run SQL ETL scripts from Git repo

Solved!
ocean_rhythm
Level 3

Hi, I have existing ETL scripts (mainly pure SQL) on an existing Git repo.

I need to continue writing the SQL directly in the Git repo (i.e. outside DSS).

I'd like to use DSS to run these scripts – either (a) just the SQL, or (b) using Python scripts in combination with SQL.

Current line of thinking: I can easily run the scripts as Scenarios if I import the Git repo containing them, but there doesn't seem to be an automated way to pull from the remote Git repo ("Update from Git") after it has new commits.

Is there an API to update Git references, or some other way to pull an external Git repo and use the scripts from that repo as input to a Scenario?

Or: what's the best way to do this? Thanks

1 Solution
AlexT
Dataiker

Hi,

One way you could do this is to pull the latest SQL code from the Git repo and execute it with SQLExecutor2, all in Python.

Then execute this Python code from your scenario. Here is a sample that pulls a raw SQL file from GitHub using a personal access token.

# -*- coding: utf-8 -*-
import dataiku
from dataiku.core.sql import SQLExecutor2
import requests

# run the query against the connection of an existing dataset in the Flow
e = SQLExecutor2(dataset=dataiku.Dataset("my_output_dataset"))

# The URL and auth scheme vary depending on the Git host and whether the
# repo is private. Credentials can be stored in DSS variables; for GitHub
# personal access tokens, see:
# https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-token

file_url = "https://raw.githubusercontent.com/user/repo-name/branch/file.sql"

# use .text (a str) rather than .content (bytes) so it can be
# concatenated with the string below in Python 3
query_file = requests.get(file_url, auth=('username', 'personal_token')).text

# construct the query further as needed based on the Git file
sql = query_file + " FROM some_table_name"

df = e.query_to_df(sql)
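As an aside, rather than hardcoding the username and token, you could keep them in variables and read them at runtime. A minimal sketch, assuming you have defined project variables named git_user and git_token (those names are just placeholders):

import dataiku
import requests

# "git_user" and "git_token" are assumed variable names -- define them
# under the project's Variables settings first
variables = dataiku.get_custom_variables()

file_url = "https://raw.githubusercontent.com/user/repo-name/branch/file.sql"
response = requests.get(file_url, auth=(variables["git_user"], variables["git_token"]))
response.raise_for_status()  # fail the scenario step early if the fetch failed
query_file = response.text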

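Also note that query_to_df expects a SELECT statement. If the Git-hosted scripts are DML/DDL (INSERT, UPDATE, CREATE TABLE, ...), one common workaround is to pass them as pre-queries around a trivial SELECT and commit afterwards. A sketch, assuming a SQL connection named "my_sql_connection":

from dataiku.core.sql import SQLExecutor2

# "my_sql_connection" is an assumed connection name -- replace with your own
executor = SQLExecutor2(connection="my_sql_connection")

# query_file holds the statement pulled from Git, as above
executor.query_to_df(
    "SELECT 1",
    pre_queries=[query_file],   # the ETL statement(s) to execute
    post_queries=["COMMIT"],    # commit the changes
)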
 

Let me know if this helps.


2 Replies

ocean_rhythm
Level 3
Author

Looks great, thanks Alex 
