Your favorite Dataiku DSS tips & tricks?
Hi there,
What are some of your favorite tips & tricks in Dataiku DSS that make your life easier?
Sharing them here would surely be valuable to all new members who are joining us, and we may also handpick a few of them for a dedicated newsletter - stay tuned!
Thanks in advance for your input!
Answers
-
Select some characters within a cell in a Prepare recipe to add a step that splits the column or replaces the selected string!
Mattsco
-
Marlan
Create a dataset metric that counts rows with duplicate keys in a SQL table by creating a SQL Probe (at the bottom of the metric editing page) like the following:
SELECT COUNT(*) AS "Duplicate Key Value Count"
FROM (SELECT REC_KEY FROM ${DKU_DATASET_TABLE_NAME} GROUP BY 1 HAVING COUNT(*) > 1) T;
Of course one can then create a check on the metric that fails (or warns) a dataset rebuild if the value is > 0.
-
Marlan
Execute SQL data modification statements from a Python recipe or notebook. You might want to clear a table prior to inserting rows, delete specified rows, create or drop a table, etc.
Here is example code:
from dataiku.core.sql import SQLExecutor2

sqlexecutor = SQLExecutor2(connection='sql_database_connection_name')

main_query = "SELECT COUNT(*) AS REC_CNT FROM YOURTABLE"  # or "SELECT 1" if you don't want to check results

# A list is expected but it's OK if all of the SQL is in one list item;
# a COMMIT seems to be required in the couple of databases I use
pre_queries = ["TRUNCATE TABLE YOURTABLE; COMMIT"]

result_df = sqlexecutor.query_to_df(main_query, pre_queries=pre_queries)
assert (result_df['REC_CNT'].iloc[0] == 0), "Truncate of table YOURTABLE failed (record count > 0)"
-
Hard to choose one so I'll list a few excellent DSS tricks:
- drag and drop multiple files in upload box to stack at import
- after training models in the Lab, export the code to Jupyter notebook
- use the View button on the bottom left corner of the Flow to cycle between flow views (tags, connections, recipe engines...)
- use the select tool (shift + click) to select multiple recipes or copy a part of your flow
- hit Space in the Flow to cycle between right panel tabs
-
If you are dealing with multiple source datasets and later have to merge or stack the ADS built from all of them, make sure the storage types of the corresponding columns are the same across all the datasets - it makes the merge/stack much more convenient.
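If you want to check this programmatically before stacking, here is a minimal, untested sketch using the dataiku Python API (the dataset names "source_a" and "source_b" are placeholders):
import dataiku

# Untested sketch: compare storage types of same-named columns across two source datasets
# ("source_a" and "source_b" are placeholder dataset names)
schema_a = {c["name"]: c["type"] for c in dataiku.Dataset("source_a").read_schema()}
schema_b = {c["name"]: c["type"] for c in dataiku.Dataset("source_b").read_schema()}
for col in sorted(set(schema_a) & set(schema_b)):
    if schema_a[col] != schema_b[col]:
        print("Storage type mismatch on '%s': %s vs %s" % (col, schema_a[col], schema_b[col]))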
-
I really like making R Shiny applications. I also really like developing these apps in RStudio (one of my favorite IDEs of all time). There are certain areas of my code base that must be different if it's executing on DSS or if it's executing in my local RStudio environment (reading in data is the perfect example). I've found a quick workaround for this by testing if I'm able to load the `dataiku` library or not within R. I haven't tried it yet, but I assume the same type of logic would work in python (might require the importlib standard library?).
This has definitely accelerated the R Shiny development process in DSS. I'm not saying copy/paste is a good deployment strategy, but it's much better than copy, paste, then edit all of the local-environment-specific code to fit DSS.
# R:
# Determine if we're in DSS or not - this dictates how/where we read data from
libload_attempt_dataiku <- tryCatch(expr={library(dataiku)}, error = function(e) e)
in_DSS <- ifelse(!"packageNotFoundError" %in% class(libload_attempt_dataiku), yes=TRUE, no=FALSE)
if (in_DSS) {
# DSS Processing ===========================
df <- dkuReadDataset("<name of my dataset>")
} else {
# Local Processing =======================
df <- read.csv("<name of dataset>.csv", stringsAsFactors = F)
}
# ...application code...
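Building on the assumption above that the same logic would work in Python, here is a rough, untested sketch that uses importlib to probe for the dataiku package (dataset and file names are placeholders):
# Python: determine if we're running inside DSS or locally (untested sketch)
import importlib.util

in_dss = importlib.util.find_spec("dataiku") is not None

if in_dss:
    # DSS Processing ===========================
    import dataiku
    df = dataiku.Dataset("<name of my dataset>").get_dataframe()
else:
    # Local Processing =======================
    import pandas as pd
    df = pd.read_csv("<name of dataset>.csv")
-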
Hi Marlan,
Can we run insert statements using SQLExecutor2?
-
Marlan
Hi @rsingamsetty, you are interested in running SQL insert statements? I haven't tried SQL insert statements specifically, but the technique I described should work for any SQL statement, so I'd expect that this would work.
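For illustration, a minimal untested sketch following the same pattern (MY_TABLE, its columns, and the connection name are placeholders):
from dataiku.core.sql import SQLExecutor2

# Untested sketch: run an INSERT as a pre-query, then verify with a count
sqlexecutor = SQLExecutor2(connection='sql_database_connection_name')
pre_queries = ["INSERT INTO MY_TABLE (ID, NAME) VALUES (1, 'test'); COMMIT"]
result_df = sqlexecutor.query_to_df("SELECT COUNT(*) AS REC_CNT FROM MY_TABLE", pre_queries=pre_queries)
print(result_df['REC_CNT'].iloc[0])
Marlan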
-
We use this snippet in a Python script to convert column names from input datasets to lowercase and replace special characters with underscores (regex=False keeps characters like '(' and '.' literal):
dataset_df.columns = (dataset_df.columns.str.strip().str.lower()
                      .str.replace(' ', '_', regex=False).str.replace('(', '', regex=False)
                      .str.replace(')', '', regex=False).str.replace('.', '_', regex=False)
                      .str.replace('/', '_', regex=False).str.replace('&', 'n', regex=False)
                      .str.replace("'", '', regex=False).str.replace('\n', '_', regex=False))
-
You can also do this without code, in a Prepare recipe:
- use the Column View
- Select All Columns
- Click Rename > Simplify and Lowercase
-
tgb417
The first thing that comes to mind is that in visual Prepare recipes you can select multiple steps or groups of steps and copy and paste them into other Prepare recipes, possibly in different projects. Then use the little power button icon to turn them all off, and fix them for the new dataset one at a time, in order.