Your favorite Dataiku DSS tips & tricks?

LisaB
Dataiker

Hi there,

What are some of your favorite tips & tricks in Dataiku DSS that make your life easier?

Sharing them here would surely be valuable to all the new members joining us, and we may also handpick a few of them for a dedicated newsletter - stay tuned 😉

Thanks in advance for your input!

Lisa, Community Programs Manager at Dataiku

Don't forget to mark as "Accepted Solution" when someone provides the correct answer to your question 😉
11 Replies
Mattsco
Dataiker

Select a few characters in a cell during a Prepare recipe to add a step that splits the column or replaces those strings!

Mattsco
Marlan

Create a dataset metric that counts rows with duplicate keys in a SQL table by adding a SQL Probe (at the bottom of the metric editing page) like the following:


SELECT COUNT(*) AS "Duplicate Key Value Count" FROM (SELECT REC_KEY FROM ${DKU_DATASET_TABLE_NAME} GROUP BY 1 HAVING COUNT(*) > 1) T;

Of course, one can then create a check on the metric that fails (or warns on) a dataset rebuild if the value is > 0.

Marlan

Execute SQL data modification statements from a Python recipe or notebook. You might want to clear a table prior to inserting rows, delete specified rows, create or drop a table, etc.

Here is example code:

from dataiku.core.sql import SQLExecutor2

sqlexecutor = SQLExecutor2(connection='sql_database_connection_name')
main_query = "SELECT COUNT(*) AS REC_CNT FROM YOURTABLE" # or "SELECT 1" if don't want to check results
pre_queries = ["TRUNCATE TABLE YOURTABLE; COMMIT"] # A list is expected but ok if all of SQL is in one list item; a COMMIT seems to be required in the couple of databases I use

result_df = sqlexecutor.query_to_df(main_query, pre_queries=pre_queries)

assert (result_df['REC_CNT'].iloc[0] == 0), "Truncate of table YOURTABLE failed (record count > 0)"
rsingamsetty
Level 3

Hi Marlan

 

Can we run insert statements using SQLExecutor2?

Marlan

Hi @rsingamsetty, you're interested in running SQL INSERT statements? I haven't tried them specifically, but the technique I described should work for any SQL statement, so I'd expect this to work.
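For example, an INSERT could probably be run the same way as a pre-query (untested sketch; the table, column, value, and connection names below are just placeholders):

from dataiku.core.sql import SQLExecutor2

sqlexecutor = SQLExecutor2(connection='sql_database_connection_name')

# Run the INSERT (plus a COMMIT) as a pre-query, then verify with the main query
pre_queries = ["INSERT INTO YOURTABLE (REC_KEY) VALUES (1); COMMIT"]
main_query = "SELECT COUNT(*) AS REC_CNT FROM YOURTABLE"

result_df = sqlexecutor.query_to_df(main_query, pre_queries=pre_queries)
print(result_df['REC_CNT'].iloc[0])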

Marlan

VinceDS
Dataiker

Hard to choose just one, so I'll list a few excellent DSS tricks:

- drag and drop multiple files in upload box to stack at import

- after training models in the Lab, export the code to Jupyter notebook

- use the View button on the bottom left corner of the Flow to cycle between flow views (tags, connections, recipe engines...)

- use the select tool (shift + click) to select multiple recipes or copy a part of your flow

- hit Space in the Flow to cycle between right panel tabs

akgaab
Level 1

If you are dealing with multiple source datasets and later have to merge or stack the ADS built from all of them, then for the sake of convenience make sure the storage types of the corresponding columns are the same across all the datasets.
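A quick way to check this up front could be to compare the dataset schemas with the dataiku Python API (rough, untested sketch; the dataset names are placeholders):

import dataiku

# Map each dataset to {column name: storage type}
dataset_names = ["source_a", "source_b", "source_c"]  # placeholders
schemas = {name: {col["name"]: col["type"] for col in dataiku.Dataset(name).read_schema()}
           for name in dataset_names}

# Report columns whose storage type differs between datasets
all_columns = set().union(*(set(s) for s in schemas.values()))
for column in sorted(all_columns):
    types = {name: schema.get(column) for name, schema in schemas.items()}
    if len(set(types.values())) > 1:
        print("Storage type mismatch for '%s': %s" % (column, types))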

Taylor
Level 3

I really like making R Shiny applications, and I really like developing these apps in RStudio (one of my favorite IDEs of all time). Certain areas of my code base must differ depending on whether the code is executing in DSS or in my local RStudio environment (reading in data is the perfect example). I've found a quick workaround: test whether the `dataiku` library can be loaded from within R. I haven't tried it yet, but I assume the same type of logic would work in Python, perhaps using the importlib standard library (a rough Python sketch follows the R code below).

This has definitely accelerated the R Shiny development process in DSS. I'm not saying copy/paste is a good deployment strategy, but it's much better than copy, paste, then edit all of the local-environment-specific code to fit DSS. 

 

# R:
# Determine if we're in DSS or not - this dictates how/where we read data from
libload_attempt_dataiku <- tryCatch(expr = {library(dataiku)}, error = function(e) e)
in_DSS <- !"packageNotFoundError" %in% class(libload_attempt_dataiku)

if (in_DSS) {
  # DSS Processing ===========================
  df <- dkuReadDataset("<name of my dataset>")
} else {
  # Local Processing =======================
  df <- read.csv("<name of dataset>.csv", stringsAsFactors = F)
}

# ...application code...
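Untested, but I'd guess the Python equivalent looks something like this, using importlib from the standard library to check whether the dataiku package can be found:

# Python: same idea as the R check above (untested sketch)
import importlib.util

# find_spec() returns None when the dataiku package isn't installed,
# which should only happen outside DSS
in_dss = importlib.util.find_spec("dataiku") is not None

if in_dss:
    # DSS Processing
    import dataiku
    df = dataiku.Dataset("<name of my dataset>").get_dataframe()
else:
    # Local Processing
    import pandas as pd
    df = pd.read_csv("<name of dataset>.csv")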

 

aw30
Level 4

We use this snippet in a Python script to convert column names from input datasets to lowercase and underscores:

# regex=False keeps the replacements literal so '(' , ')' and '.' aren't treated as regex patterns
dataset_df.columns = (dataset_df.columns.str.strip().str.lower()
                      .str.replace(' ', '_', regex=False)
                      .str.replace('(', '', regex=False).str.replace(')', '', regex=False)
                      .str.replace('.', '_', regex=False).str.replace('/', '_', regex=False)
                      .str.replace('&', 'n', regex=False).str.replace("'", '', regex=False)
                      .str.replace('\n', '_', regex=False))

VinceDS
Dataiker

You can also do this without code, using a Prepare recipe:

  • use the Column View
  • Select All Columns
  • Click Rename > Simplify and Lowercase


 

tgb417

The first thing that comes to mind: in visual Prepare recipes you can select multiple steps or groups of steps and copy and paste them into other Prepare recipes, possibly in different projects. Then use the little power button icon to turn them all off, and fix them for the new dataset in order before re-enabling them.

--Tom