Transform the experience of writing SQL

Background

SQL is a great language for querying and managing data but when it comes to good development practices like DRY (Don’t Repeat Yourself), it really falls down. The same values or logic may be repeated many times in any decently long SQL script making maintenance a much more risky endeavor than it should be.

Yet SQL is used extensively by data scientists and analysts as it offers unparalleled speed and power for working with large datasets.

Wouldn’t it be great if modern features available in languages like Python such as variables and functions could be added to SQL?  This would transform the experience of writing SQL. People could then DRY in SQL!

Idea

My idea is for DSS to provide this transformative experience for people who write SQL.

How? By building into SQL recipes the capability to use a templating language like Jinja within the recipe.

Jinja offers a syntax for defining variables and functions. So SQL coders could use this syntax to set and use variables and to create and call functions within their recipes. DSS would then use Jinja to translate the recipe into pure SQL before sending it to the database for execution.

Additionally, Jinja would enable writing SQL that was created from other SQL tables. For example, instead of typing out all of the columns in a table by hand, one could let Jinja fill in those column names automatically (by pulling the column names from that table).

Furthermore, functions could be written to shared across a project and an instance. This would be a great way of sharing common SQL logic that you want everyone on your team or in your company to use.

Another positive aspect of this idea is that it doesn’t seem like it would require a major development effort to accomplish. DSS would be leveraging the open source Jinja package for all the hard work. I’m basing this on my experience of developing a working prototype of using Jinja in DSS in a few hours. Obviously, it’s a lot more involved to add and test functionality across DSS but still seems like the investment would be relatively low especially compared with the return on that investment.

I have added some example use cases in a comment below. I can also share implementation ideas if that would be helpful. 

By doing this, Dataiku would be helping to bring basic software engineering principles to the practices of data science and analytics. Besides the benefit to us current DSS users, it seems like that is a message that could resonate out in the marketplace.

 

12 Comments
Mark_Treveil
Dataiker Alumni

While it's not as powerful as jinja, it's worth remembering that you can use project variables in SQL code: https://doc.dataiku.com/dss/latest/advanced/variables_expansion.html

While it's not as powerful as jinja, it's worth remembering that you can use project variables in SQL code: https://doc.dataiku.com/dss/latest/advanced/variables_expansion.html

CoreyS
Dataiker Alumni
 
Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!
Status changed to: In the Backlog
 

Hi @Mark_Treveil, yes, for sure - project variables in SQL code are super useful and I use them often in all of my work. Not intending to overlook them at all but rather build on this concept of using DSS to write more flexible SQL code. 

Marlan 

Hi @Mark_Treveil, yes, for sure - project variables in SQL code are super useful and I use them often in all of my work. Not intending to overlook them at all but rather build on this concept of using DSS to write more flexible SQL code. 

Marlan 

I thought it might be helpful to add more specific use cases for using Jinja in SQL scripts. Here are six cases that Jinja enhanced SQL would enable.  

Set recipe level variables

It's common to have values or code snippets (e.g., common where criteria) that you want to reuse multiple times in a script. One could certainly use project variables for this. However, for variables that are used in just one recipe it's inconvenient and less clear to use project variables than it would be define variables in the script itself.

Define functions for writing repetitive code

It's fairly common in SQL to write chunks of code that have a common structure. For example, one might have a fairly complicated column expression that is repeated many times. Another example would be joining multiple tables on the same couple of columns. It would be great to be able to write a function that built the column expression for join specification.

For example, "CAST(CASE WHEN CUST_TYPE IS NULL THEN -99 WHEN CUST_TYPE = 'A' THEN 1 ELSE 0 END AS TINYINT) AS CUST_TYPE_A" could be replaced by a function call SET_CUST_TYPE('A'). (Note I'm ignoring Jinja syntax here.) Or one could call CUST_MONTH_JOIN(tableref_a, tableref_b) to generate code for a half dozen joins on cust ID and month between two tables.

Create SQL code from SQL tables

Another use would be incorporate the results of a SQL query to build code in the SQL script. A common application would be building of scripts with lots of columns. Often we are dealing with tables with hundreds of columns. It's a real pain to type out all these columns manually especially if one is also handing null values and perhaps casting to a specific datatype. With this capability, one could automatically generate an expression like COALESCE(TBLREF.COLNAME, CAST(<default value> AS <type>) AS COLNAME for all of the / selected columns in a table.

Share Logic Within and Across Projects

Jinja macros could be used to share commonly used queries and logic both within one project and across multiple projects. A within project ML related example would be keeping separate train and score processes in sync. These processes usually have mostly common logic. Currently all of this logic must be duplicated. One or more Jinja functions could be written to abstract out and share the common logic. Or alternatively one Jinja function with mulitple arguments could serve both the train and score processes.

Across project examples include company specific functionality (e.g., pull eligible members in a standard way) or utilities like pivoting a table, getting columns from a widely used table, and checking that table is unique on keys.

Extend current DSS features

For example, applications as recipes could be more richly parameterized without needing to use Python recipes.

Use SQL with Jinja in lieu of Python

In some situations, SQL with Jinja recipes may be able to be used for a need that previously would have required a Python recipe. This would be helpful for users who don't know Python and may also be more efficient (e.g., if recipe can run in database).

I thought it might be helpful to add more specific use cases for using Jinja in SQL scripts. Here are six cases that Jinja enhanced SQL would enable.  

Set recipe level variables

It's common to have values or code snippets (e.g., common where criteria) that you want to reuse multiple times in a script. One could certainly use project variables for this. However, for variables that are used in just one recipe it's inconvenient and less clear to use project variables than it would be define variables in the script itself.

Define functions for writing repetitive code

It's fairly common in SQL to write chunks of code that have a common structure. For example, one might have a fairly complicated column expression that is repeated many times. Another example would be joining multiple tables on the same couple of columns. It would be great to be able to write a function that built the column expression for join specification.

For example, "CAST(CASE WHEN CUST_TYPE IS NULL THEN -99 WHEN CUST_TYPE = 'A' THEN 1 ELSE 0 END AS TINYINT) AS CUST_TYPE_A" could be replaced by a function call SET_CUST_TYPE('A'). (Note I'm ignoring Jinja syntax here.) Or one could call CUST_MONTH_JOIN(tableref_a, tableref_b) to generate code for a half dozen joins on cust ID and month between two tables.

Create SQL code from SQL tables

Another use would be incorporate the results of a SQL query to build code in the SQL script. A common application would be building of scripts with lots of columns. Often we are dealing with tables with hundreds of columns. It's a real pain to type out all these columns manually especially if one is also handing null values and perhaps casting to a specific datatype. With this capability, one could automatically generate an expression like COALESCE(TBLREF.COLNAME, CAST(<default value> AS <type>) AS COLNAME for all of the / selected columns in a table.

Share Logic Within and Across Projects

Jinja macros could be used to share commonly used queries and logic both within one project and across multiple projects. A within project ML related example would be keeping separate train and score processes in sync. These processes usually have mostly common logic. Currently all of this logic must be duplicated. One or more Jinja functions could be written to abstract out and share the common logic. Or alternatively one Jinja function with mulitple arguments could serve both the train and score processes.

Across project examples include company specific functionality (e.g., pull eligible members in a standard way) or utilities like pivoting a table, getting columns from a widely used table, and checking that table is unique on keys.

Extend current DSS features

For example, applications as recipes could be more richly parameterized without needing to use Python recipes.

Use SQL with Jinja in lieu of Python

In some situations, SQL with Jinja recipes may be able to be used for a need that previously would have required a Python recipe. This would be helpful for users who don't know Python and may also be more efficient (e.g., if recipe can run in database).

Adding a couple of several cases where I could use this functionality 🙂

1) We have developed a feature store that we use to speed up development of machine learning models. The features or columns are stored in multiple tables. It'd be great to provide to our data scientists Jinja macros that would automatically insert the columns and implement the table joins. This would help by not having type in hundreds of columns and not having to deal with the specifics of the joins.

 2) A more specific use is one where I can't get around handling multiple instances of a value by executing common set of code for each instance. In this case, I need to be able to handle up to 5 instances of the value so I'll need to copy the code 5 times. Would be much to use Jinja functionality to write the code once but have it execute 5 times.

3) I need to test a particularly complex SQL recipe. To do this, I am creating test versions of the input tables and then editing the script to use the test input tables. Then I'm running the modified recipe against the test tables and checking results. I'm doing this in a SQL query tool currently (and of course I could do it in a SQL notebook). It'd be great to use Jinja to enable testing of the script in place. I could use local (within recipe) variables to specify the table names (either test or production) and potentially additional logic to enable creating test tables and checking results. Ultimately I could set #test to 1 to run test set up, adjust the script, and then check request. #test = 0 would run production code. 

Marlan

Adding a couple of several cases where I could use this functionality 🙂

1) We have developed a feature store that we use to speed up development of machine learning models. The features or columns are stored in multiple tables. It'd be great to provide to our data scientists Jinja macros that would automatically insert the columns and implement the table joins. This would help by not having type in hundreds of columns and not having to deal with the specifics of the joins.

 2) A more specific use is one where I can't get around handling multiple instances of a value by executing common set of code for each instance. In this case, I need to be able to handle up to 5 instances of the value so I'll need to copy the code 5 times. Would be much to use Jinja functionality to write the code once but have it execute 5 times.

3) I need to test a particularly complex SQL recipe. To do this, I am creating test versions of the input tables and then editing the script to use the test input tables. Then I'm running the modified recipe against the test tables and checking results. I'm doing this in a SQL query tool currently (and of course I could do it in a SQL notebook). It'd be great to use Jinja to enable testing of the script in place. I could use local (within recipe) variables to specify the table names (either test or production) and potentially additional logic to enable creating test tables and checking results. Ultimately I could set #test to 1 to run test set up, adjust the script, and then check request. #test = 0 would run production code. 

Marlan

I just attached a printout of a notebook to the original post which shows how one could implement Jinja SQL in DSS in a couple of screens of Python code. 

This example is fully functional and implements all of the features I described above.

It is then used to show how Jinja SQL could be applied in SQL recipes across a number of use cases (7 examples are included).

Marlan

 

I just attached a printout of a notebook to the original post which shows how one could implement Jinja SQL in DSS in a couple of screens of Python code. 

This example is fully functional and implements all of the features I described above.

It is then used to show how Jinja SQL could be applied in SQL recipes across a number of use cases (7 examples are included).

Marlan

 

CoreyS
Dataiker Alumni

The Dataiku team is reviewing the idea and determining how this will fit into the future of DSS. We can't provide a timeline at this point, but be sure to check back often and subscribe.

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!
Status changed to: Parked

The Dataiku team is reviewing the idea and determining how this will fit into the future of DSS. We can't provide a timeline at this point, but be sure to check back often and subscribe.

ElisaS
Dataiker

Hello, 

This idea has been integrated into our backlog but we can't provide a timeline at this point. 
In the meantime, it's worth noting that you can use a python recipe and SQLExecutor2 :  https://knowledge.dataiku.com/latest/courses/advanced-code/python/use-python-sql.html  and https://doc.dataiku.com/dss/latest/python-api/sql.html

Status changed to: In Backlog

Hello, 

This idea has been integrated into our backlog but we can't provide a timeline at this point. 
In the meantime, it's worth noting that you can use a python recipe and SQLExecutor2 :  https://knowledge.dataiku.com/latest/courses/advanced-code/python/use-python-sql.html  and https://doc.dataiku.com/dss/latest/python-api/sql.html

Good to hear, @ElisaS! Thanks for the update.

Marlan

Good to hear, @ElisaS! Thanks for the update.

Marlan

I think this is achieveable via a Plugin and the sqlexecutor2.We wrote a plugin that we use to insert data into a BigQuery using Python and the BG Python API and it writes the Insert statement dymamically.

I think this is achieveable via a Plugin and the sqlexecutor2.We wrote a plugin that we use to insert data into a BigQuery using Python and the BG Python API and it writes the Insert statement dymamically.