Your favorite Dataiku DSS tips & tricks?

LisaB
Dataiker

Hi there,

What are some of your favorite tips & tricks in Dataiku DSS that make your life easier?

Sharing them here would surely be valuable to all the new members joining us, and we may also handpick a few of them for a dedicated newsletter - stay tuned 😉

Thanks in advance for your input!

Lisa, Community Programs Manager at Dataiku

Don't forget to mark as "Accepted Solution" when someone provides the correct answer to your question 😉
11 Replies
Mattsco
Dataiker

Select a few characters in a cell during a Prepare recipe to add a step that splits the column or replaces those strings!

Mattsco
Marlan

Create a dataset metric that counts rows with duplicate keys in a SQL table by adding a SQL Probe (at the bottom of the metric editing page) like the following:


SELECT COUNT(*) AS "Duplicate Key Value Count" FROM (SELECT REC_KEY FROM ${DKU_DATASET_TABLE_NAME} GROUP BY 1 HAVING COUNT(*) > 1) T;

Of course, one can then create a check on the metric that fails (or warns on) a dataset rebuild if the value is > 0.

Marlan

Execute SQL data modification statements from a Python recipe or notebook. You might want to clear a table prior to inserting rows, delete specified rows, create or drop a table, etc.

Here is example code:

from dataiku.core.sql import SQLExecutor2

sqlexecutor = SQLExecutor2(connection='sql_database_connection_name')
main_query = "SELECT COUNT(*) AS REC_CNT FROM YOURTABLE" # or "SELECT 1" if don't want to check results
pre_queries = ["TRUNCATE TABLE YOURTABLE; COMMIT"] # A list is expected but ok if all of SQL is in one list item; a COMMIT seems to be required in the couple of databases I use

result_df = sqlexecutor.query_to_df(main_query, pre_queries=pre_queries)

assert (result_df['REC_CNT'].iloc[0] == 0), "Truncate of table YOURTABLE failed (record count > 0)"
rsingamsetty
Level 3

Hi Marlan

 

Can we run insert statements using SQLExecutor2?

Marlan

Hi @rsingamsetty, you're interested in running SQL INSERT statements? I haven't tried them specifically, but the technique I described should work for any SQL statement, so I'd expect this to work.
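For example, an INSERT could probably be run the same way as a pre-query (untested sketch; the table, column, value, and connection names below are just placeholders):

from dataiku.core.sql import SQLExecutor2

sqlexecutor = SQLExecutor2(connection='sql_database_connection_name')

# Run the INSERT (plus a COMMIT) as a pre-query, then verify with the main query
pre_queries = ["INSERT INTO YOURTABLE (REC_KEY) VALUES (1); COMMIT"]
main_query = "SELECT COUNT(*) AS REC_CNT FROM YOURTABLE"

result_df = sqlexecutor.query_to_df(main_query, pre_queries=pre_queries)
print(result_df['REC_CNT'].iloc[0])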

Marlan

VinceDS
Dataiker

Hard to choose just one, so I'll list a few excellent DSS tricks:

- drag and drop multiple files in upload box to stack at import

- after training models in the Lab, export the code to Jupyter notebook

- use the View button on the bottom left corner of the Flow to cycle between flow views (tags, connections, recipe engines...)

- use the select tool (shift + click) to select multiple recipes or copy a part of your flow

- hit Space in the Flow to cycle between right panel tabs

akgaab
Level 1

If you are dealing with multiple source datasets and later have to merge or stack the ADS built from all of them, then for the sake of convenience make sure the storage types of the corresponding columns are the same across all the datasets.
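A quick way to check this up front could be to compare the dataset schemas with the dataiku Python API (rough, untested sketch; the dataset names are placeholders):

import dataiku

# Map each dataset to {column name: storage type}
dataset_names = ["source_a", "source_b", "source_c"]  # placeholders
schemas = {name: {col["name"]: col["type"] for col in dataiku.Dataset(name).read_schema()}
           for name in dataset_names}

# Report columns whose storage type differs between datasets
all_columns = set().union(*(set(s) for s in schemas.values()))
for column in sorted(all_columns):
    types = {name: schema.get(column) for name, schema in schemas.items()}
    if len(set(types.values())) > 1:
        print("Storage type mismatch for '%s': %s" % (column, types))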

Taylor
Level 3

I really like making R Shiny applications, and I really like developing these apps in RStudio (one of my favorite IDEs of all time). Certain areas of my code base must differ depending on whether the code is executing in DSS or in my local RStudio environment (reading in data is the perfect example). I've found a quick workaround: test whether the `dataiku` library can be loaded from within R. I haven't tried it yet, but I assume the same type of logic would work in Python, perhaps using the importlib standard library (a rough Python sketch follows the R code below).

This has definitely accelerated the R Shiny development process in DSS. I'm not saying copy/paste is a good deployment strategy, but it's much better than copy, paste, then edit all of the local-environment-specific code to fit DSS. 

 

# R:
# Determine if we're in DSS or not - this dictates how/where we read data from
libload_attempt_dataiku <- tryCatch(expr = {library(dataiku)}, error = function(e) e)
in_DSS <- !"packageNotFoundError" %in% class(libload_attempt_dataiku)

if (in_DSS) {
  # DSS Processing ===========================
  df <- dkuReadDataset("<name of my dataset>")
} else {
  # Local Processing =======================
  df <- read.csv("<name of dataset>.csv", stringsAsFactors = F)
}

# ...application code...
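Untested, but I'd guess the Python equivalent looks something like this, using importlib from the standard library to check whether the dataiku package can be found:

# Python: same idea as the R check above (untested sketch)
import importlib.util

# find_spec() returns None when the dataiku package isn't installed,
# which should only happen outside DSS
in_dss = importlib.util.find_spec("dataiku") is not None

if in_dss:
    # DSS Processing
    import dataiku
    df = dataiku.Dataset("<name of my dataset>").get_dataframe()
else:
    # Local Processing
    import pandas as pd
    df = pd.read_csv("<name of dataset>.csv")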

 

aw30
Level 4

We use this snippet in a Python script to convert column names from input datasets to lowercase and underscores:

# regex=False keeps the replacements literal so '(' , ')' and '.' aren't treated as regex patterns
dataset_df.columns = (dataset_df.columns.str.strip().str.lower()
                      .str.replace(' ', '_', regex=False)
                      .str.replace('(', '', regex=False).str.replace(')', '', regex=False)
                      .str.replace('.', '_', regex=False).str.replace('/', '_', regex=False)
                      .str.replace('&', 'n', regex=False).str.replace("'", '', regex=False)
                      .str.replace('\n', '_', regex=False))

VinceDS
Dataiker

You can also do this without code, using a Prepare recipe:

  • use the Column View
  • Select All Columns
  • Click Rename > Simplify and Lowercase


 

tgb417

The first thing that comes to mind: in visual Prepare recipes you can select multiple steps or groups of steps and copy and paste them into other Prepare recipes, possibly in different projects. Then use the little power button icon to turn them all off, and fix them for the new dataset in order before re-enabling them.

--Tom