Auto-fill dataset metastore names with variables instead of hardcoded values

Note: this post is just one recommendation following this broader idea.

Writing dataset aliases* in SQL scripts can be tedious, as they can get pretty long. Fortunately, when clicking on an input dataset in a (Spark) SQL recipe, Dataiku fills it in for the user in the script. However, the metastore value is hardcoded rather than written with variables.

1st improvement: it would be great if this path could be filled in using variables. For instance, at our organization, we have to write 3 (!) prefixes before reaching the dataset's name. For both readability and portability reasons, we prefer writing the dataset's alias using variables (see the following screenshot for an illustration).

(Screenshot: 1.png — dataset alias written with variables instead of hardcoded prefixes)
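To make the request concrete, here is a rough sketch (the prefixes and the ${env} / ${schema} variable names are purely illustrative placeholders; only ${projectKey} is an actual built-in Dataiku variable):

-- what the "insert" action currently writes: the full, hardcoded alias
SELECT * FROM prefix1_prefix2_prefix3_my_dataset;

-- what we would like it to write: the same alias, built from variables
SELECT * FROM ${env}_${schema}_${projectKey}_my_dataset;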

2nd improvement: I would love to have an auto-complete feature when trying to reference an input dataset (e.g. when hitting the "tab" key; more generally speaking, auto-completion is cruelly missing in Dataiku SQL scripts).

Implementing these 2 features would save a LOT of time for users preferring SQL to visual recipes.

 

* a dataset alias = metastore key for that dataset (for which the metastore returns the physical path)

cc: @Marlan

7 Comments

In Project settings on the Engines & connections tab, there is a setting to indicate "How to create default code for new Hive, SQL, SparkSQL and Impala recipes". The setting name is "SQL-like recipes init mode".

However, at least when creating SQL recipes, this setting is only partially used.

We certainly find it valuable to make our SQL datasets relocatable as described here: https://doc.dataiku.com/dss/latest/connecting/relocation.html

Since we typically work with SQL Script recipes for greater control, and input and output dataset references are fully resolved when inserted into the code (i.e., hover to the right of the input/output dataset and click "insert" in the recipe window), we have to type in project variables manually (i.e., ${projectKey}). It's certainly doable but seems completely unnecessary, since the setting suggests it should be done for you. Moreover, including the project key variable is no longer the path of least resistance, so use of these relocatable references is no longer consistent across our user base.

In fact, it was done for us through version 5. In version 6, this functionality apparently disappeared.

So, for example, if "SQL-like recipes init mode" is set to "Table reference with variables", then when inserting a reference into a new SQL (or SQL-like) recipe (by clicking the "insert" button to the right of the dataset name), the table reference should include the variable as the prefix (${projectKey}_TABLENAME) rather than the actual resolved project key prefix (MYPROJ_TABLENAME).
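To illustrate with made-up names (project key MYPROJ, table TABLENAME, consistent with the example above):

-- what clicking "insert" currently produces: the resolved project key
SELECT * FROM MYPROJ_TABLENAME;

-- what it should produce when the init mode is "Table reference with variables"
SELECT * FROM ${projectKey}_TABLENAME;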

Interesting, we do this already but obviously we have to do it manually. Where will these variables come from? Will Dataiku create and define them for you? How will this work when you have multiple connections (i.e. how can the variables have different names)?

Ah! Did you implement automatic filling?

Regarding your question: indeed, for the connections I do not have a clear answer. I guess it could be configured at the project level (at least this is what we would like to do in our organization for GDPR compliance, so having a default connection for a project would suit our needs and would actually save us compliance auditing work).

For the node type, I guess it can easily be detected by Dataiku.

For the project key, most of the time the current project should suffice. If the table is not in the current project, Dataiku could fall back to a hardcoded project key.
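A rough sketch of what I have in mind (all table and project names are made up; ${projectKey} is the built-in project key variable):

-- table belonging to the current project: use the variable
SELECT * FROM ${projectKey}_my_dataset;

-- table coming from another project: fall back to a hardcoded project key
SELECT * FROM OTHERPROJ_shared_dataset;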

Hi @tanguy, I submitted a version of your idea a while back. However, your summary is much clearer and more to the point. Needless to say, I'd love to see this implemented. 

@CoreyS, since it appears we can merge ideas now, would you consider merging mine into this one? Not sure what the protocol is for doing this, but I would say that my idea is exactly the same as @tanguy's 1st improvement.

A couple of other comments:

We also used to include both project key and node variables in our table names. However, since then we have been able to create databases corresponding to each node (we call our nodes and corresponding databases dev, test, and prod), which lets the connection select the appropriate database. Each of the three nodes has the same connection name, but the dev one points to the dev database, the test one points to the test database, and the prod one points to the prod database. So tables are segregated without needing a node variable in the table name. This has worked out nicely.

When we did use node variables, we set up node-specific variables on each node. The variables are defined once in the node admin and can be used anywhere a project variable can be used. See the documentation here.
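To illustrate the difference between the two approaches (the ${node} variable name and the table names are made up for the example; ${projectKey} is the built-in project key variable):

-- old approach: node variable baked into the table reference
SELECT * FROM ${node}_${projectKey}_orders;

-- current approach: per-node connections point to the dev/test/prod databases,
-- so the reference only needs the project key
SELECT * FROM ${projectKey}_orders;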

Marlan

 



Hi @Marlan, yes I can confirm that your previous idea, SQL like recipe creation should respect recipe initiation mode setting, has been merged into this idea and the votes from that idea have been added to this one as well.

Great, thanks @CoreyS!

Coming back to:

  • @Turribeach's question: maybe Dataiku could create a variable to capture the connection values set in the "Preferred fallback connection" and "Preferred connection (forced)" fields (see the sketch after this list)?
    (Screenshot: 3.jpg — the "Preferred fallback connection" and "Preferred connection (forced)" settings)

 

  • and to add my contribution to @Marlan's thread on the configuration of "SQL-like recipes init mode" set to "Table reference with variables":
    • unfortunately it does not apply to Spark SQL recipes
    • the connector is not taken into account
    • inside a recipe, when clicking on a table's "insert" link, the alias is not auto-filled with variables
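To tie these points together, here is a hedged sketch of what an inserted (Spark) SQL reference could look like if a connection-level variable existed (${connection} is hypothetical, as are the table and prefix names; ${projectKey} already exists):

-- today: the fully resolved, hardcoded alias is inserted
SELECT * FROM my_connection_prefix_MYPROJ_my_dataset;

-- with variables: connection prefix and project key both come from variables
SELECT * FROM ${connection}_${projectKey}_my_dataset;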

 

 
