Visual recipes — Dataiku Community

Full outer join

Thu, 10 Mar 2016 23:27:25 +0000

Is there a way to do a full outer join between two datasets stored in DSS memory (that is, datasets produced by recipes or analyses)?
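For comparison, here is a minimal sketch of what a full outer join produces, written with pandas as it might appear in a Python recipe. The two frames and their column names (`id`, `left_value`, `right_value`) are made up for illustration; in a real recipe they would come from the input datasets.

```python
import pandas as pd

# Hypothetical sample data standing in for two DSS datasets.
left = pd.DataFrame({"id": [1, 2, 3], "left_value": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "right_value": ["x", "y", "z"]})

# how="outer" keeps every key found on either side; indicator=True adds a
# "_merge" column telling whether a row came from the left, the right, or both.
joined = left.merge(right, on="id", how="outer", indicator=True)
print(joined)
```

Rows that exist on only one side get missing values in the other side's columns, which is the behaviour that distinguishes a full outer join from a left or inner join.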

Censored Regression

Nick_Geitner — Wed, 04 Mar 2026 21:12:58 +0000

It is often the case that modelers encounter censored data, or data that falls >x or

Select multiple columns

Axel_ULLERN — Fri, 20 Feb 2026 07:32:46 +0000

Hello,

Is it possible to select multiple columns at once in order to remove them? I tried the remove/keep recipe, which lets me select columns one by one, but I have 100 to select. Since they are contiguous, selecting the first and last column would be easy, as in Excel (or by providing the first and last column names to select everything in between). Does anyone have a way to do this in Dataiku?

(As there is no obvious pattern in the column names, I can't use the pattern option of the remove-columns recipe.)

Many Thanks

Axel

How to stack columns from one dataset

Theo_from_EPSI — Mon, 02 Feb 2026 23:11:36 +0000

Hi,
Here is a simplified schema of a basic dataset structure I need to reshape:

firstname	name	vote	col4	col5	col6	col7	col8	col9	etc..
ARTHAUD	Nathalie	5	ARMAND	Thierry	9	ARNAUD	Bernard	6	etc..
ARTHAUD	Nathalie	7	ARMAND	Thierry	3	ARNAUD	Bernard	8	etc..

The number of columns in this is variable but it will always be a multiple of 3.

The expected output:

firstname	name	vote
ARTHAUD	Nathalie	5
ARTHAUD	Nathalie	7
ARMAND	Thierry	9
ARMAND	Thierry	3
ARNAUD	Bernard	6
ARNAUD	Bernard	8

Feel free to suggest a solution in Python, but I would prefer visual recipes.

Dataiku version used: 14.3.3
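Since the post allows a Python answer, here is a minimal pandas sketch of the reshape. It assumes the columns are ordered so that every consecutive group of three holds one (firstname, name, vote) triplet; the helper name `stack_triplets` is made up for this example.

```python
import pandas as pd

# Sample rows copied from the post (first nine columns only).
wide = pd.DataFrame(
    [
        ["ARTHAUD", "Nathalie", 5, "ARMAND", "Thierry", 9, "ARNAUD", "Bernard", 6],
        ["ARTHAUD", "Nathalie", 7, "ARMAND", "Thierry", 3, "ARNAUD", "Bernard", 8],
    ],
    columns=["firstname", "name", "vote", "col4", "col5", "col6", "col7", "col8", "col9"],
)


def stack_triplets(df, names=("firstname", "name", "vote")):
    """Stack consecutive groups of len(names) columns on top of each other."""
    width = len(names)
    if df.shape[1] % width != 0:
        raise ValueError(f"column count must be a multiple of {width}")
    blocks = []
    for start in range(0, df.shape[1], width):
        # Select the group by position, so the original column names do not matter.
        block = df.iloc[:, start:start + width].copy()
        block.columns = list(names)
        blocks.append(block)
    return pd.concat(blocks, ignore_index=True)


stacked = stack_triplets(wide)
print(stacked)
```

Because the groups are selected by position, the sketch handles any number of columns as long as it is a multiple of three.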

Python script to visual recipes conversion

Amruta — Fri, 23 Jan 2026 15:54:18 +0000

I have an existing Python script that needs to be converted into Dataiku visual recipes. Is there any supported or automated way in Dataiku to generate visual recipes from Python code, or does this need to be done manually?

Does DSS have a recipe for imbalanced sample? Like SMOTE?

Frank — Mon, 12 Aug 2019 09:24:35 +0000

Sync Recipe from Redshift to Oracle RDS

SHughes_BAE — Thu, 03 Jul 2025 18:05:08 +0000

I am trying to replicate a table in Redshift to a table in Oracle RDS using a sync recipe. I am getting the correct number of records created in the target Oracle RDS table, but all of the fields are empty (null).

Operating system used: Linux

Select Columns Outside of Join Recipe

Laurie — Mon, 10 Feb 2025 22:52:28 +0000

I would like to be able to select the columns of data outside of a join recipe. A couple of examples:

1 - Usage of "unmatched rows". The column selection that occurs after the join does not apply to data that isn't joined. In this case I am using both sets of data, so I need the option to select columns from both sets.

2 - Removal of unneeded/unwanted columns after filtering. This is especially important when using sensitive HR data.

This enhances the automation of processing the data vs. adding to it by doing cleanup after the project has completed running. It also allows me to confirm that I have sensitive data removed before sharing the data with others rather than relying on a manual process to remove it.

Push to editable recipe

UserBird — Sat, 22 Apr 2017 15:30:54 +0000

Hello, could you give an example of how to use the "Push to editable" recipe? It seems similar to the Group or Window recipes. What exactly is it used for?

Oops: an unexpected error occurred java.lang.IllegalStateException: Expected a double but was BEGIN

WeiDU_Geodis3306 — Wed, 31 Jul 2024 14:48:26 +0000

Hi,

I am working on the project "Advanced Designer Assessment"

After modifying the Prepare recipe to add the column "qualifies", I get this error message when I open the dataset "Online_Retail_Prepared".

Oops: an unexpected error occurred

java.lang.IllegalStateException: Expected a double but was BEGIN_ARRAY at line 377 column 21 path $.charts[0].def.scatterZoomOptions.scale, caused by: IllegalStateException: Expected a double but was BEGIN_ARRAY at line 377 column 21 path $.charts[0].def.scatterZoomOptions.scale

Would you please help me figure it out?

Thanks in advance.

Option to rearrange output columns in join recipe

Antal — Wed, 30 Oct 2024 08:05:15 +0000

I would like to have the option to rearrange output columns in the join recipe.

Perhaps by making the 'hamburger' icons on the Output panel draggable.

RAG LLM for multiple datasets

Zidan — Tue, 06 Aug 2024 09:50:54 +0000

Greetings,

While working with the embedding recipe, we ran into a limitation: we have two datasets that we want to apply RAG to. How can we apply the knowledge bank to them specifically?

Regards

How can I replace a dataset created from a csv?

Amber_Beasock_Z — Fri, 09 Mar 2018 03:20:23 +0000

I have uploaded a CSV and stored it in the filesystem_folders. I have built several recipes from this dataset. I have now received an updated version of the CSV, but cannot figure out how to upload it and overwrite the original dataset. It seems to require I create a new dataset. If I do create a new dataset, there doesn't seem to be a way to disconnect the current recipe flow from the old dataset and connect it to the new dataset.

Window recipe not producing expected results when using DSS engine

KKhatib — Fri, 14 Jun 2024 17:49:37 +0000

Hi there,

The issue I am having is that the DSS engine is producing a completely different result than when I use the SQL engine. Has anyone faced a similar issue? I would appreciate some insight on this.

Basically, all I want to do is produce a column with the MAX() value inferred from another column. No partitions, no ORDER BYs. Simple enough? At least that's what I thought.

It looks like DSS is Ordering By a hidden index on its own and then creating a Window Frame that takes the current row and all preceding rows. Here is an example table to show you what is supposed to happen and what is in fact happening:

Supposed to happen: (This is what is happening in SQL (In-database) engine)

Salary	Max(Salary)
22000	25000
23000	25000
24000	25000
25000	25000

What is in fact happening: (This is what is happening in DSS Engine)

Salary	Max(Salary)
22000	22000
23000	23000
24000	24000
25000	25000

Operating system used: Windows
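The two result tables can be reproduced outside DSS with a short pandas sketch. It only illustrates the difference between an unbounded window frame and a frame that ends at the current row; it makes no claim about how either engine is implemented, and the column names `max_unbounded` and `max_running` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Salary": [22000, 23000, 24000, 25000]})

# Unbounded frame: every row sees the whole column
# (the pattern reported for the SQL engine in the post).
df["max_unbounded"] = df["Salary"].max()

# Frame limited to the current row and all preceding rows: a running maximum
# (the pattern reported for the DSS engine in the post).
df["max_running"] = df["Salary"].cummax()

print(df)
```

If the second pattern is what appears, the window frame settings of the recipe are a natural place to look, since a frame bounded at the current row yields exactly this running maximum.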

How to output to / update my snowflake table using Dataiku

abalo006 — Tue, 11 Jun 2024 18:14:13 +0000

I have a Snowflake table, I've set up the connection, and everything looks good. Dataiku requires me to create a dataset from that Snowflake table that I can use as my input / output. The issue is that I have that dataset as my output, and when I run my flow I can see my results, but it isn't actually writing to my Snowflake table directly.

After running my flow, my Snowflake table is still empty. I thought the whole point of creating a connection using a data table was to be able to read from / write to that table?

Am I understanding this wrong? Is there any way I can set up my flow so that my results are actually written to my Snowflake table / connection?

Operating system used: windows

How to correctly do time conversions

abalo006 — Mon, 10 Jun 2024 16:23:13 +0000

I have a column that has been parsed and is in UTC. When I try to format the date in Eastern / New York time, I get a new column that is -5 hours, but isn't the current difference -4 hours? I'm sure this has something to do with daylight saving time vs. standard time, but I want to ensure that my formula keeps working when the clocks change by an hour for daylight saving.

Does anybody know how I can get the correct time difference from UTC to ET?

I've attached the steps I'm currently using below

Operating system used: windows
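For reference, a minimal Python sketch of a DST-aware conversion. It uses the region-based zone "America/New_York" rather than a fixed offset, so the offset switches between -05:00 and -04:00 on its own. The helper name `utc_to_new_york` and the two sample timestamps are made up for the example, and `zoneinfo` needs a time zone database on the machine (on Windows this usually means installing the `tzdata` package).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A region-based zone carries the daylight saving rules; a fixed "-5" does not.
NEW_YORK = ZoneInfo("America/New_York")


def utc_to_new_york(ts_utc):
    """Convert a timezone-aware UTC datetime to New York local time."""
    return ts_utc.astimezone(NEW_YORK)


winter = utc_to_new_york(datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc))
summer = utc_to_new_york(datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc))

print(winter.isoformat())  # 2024-01-15T07:00:00-05:00 (standard time)
print(summer.isoformat())  # 2024-06-10T08:00:00-04:00 (daylight saving time)
```

A constant -5 hours in June suggests that a fixed offset or a standard-time-only zone was applied somewhere, although the attached steps would be needed to confirm that.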

Trigger on Dataset Modified for Partitioned Dataset

Satish — Mon, 10 Jun 2024 03:28:09 +0000

Hi Team

I'm reading the data from SharePoint, and the file name has the format Cost Center_06092024.xlsx

Since the file name includes the date, I partitioned the dataset with the pattern /Cost Center_%M%D%Y.xlsx, and in my Prepare recipe I set the option to "Last available" so that the flow ONLY gets the latest file.

I'm trying to create a scenario with a trigger on dataset change. Can you please help me with which option to use here?

Attached is the screenshot for reference

Thanks

Satish

Operating system used: Browser

Bug in Stack Recipe

yashpuranik — Tue, 26 Mar 2024 14:26:19 +0000

Hi All,

I am sharing below a minimal reproducible project that triggered an error in one of our larger workflows involving Stack recipes. We have been seeing these errors for Snowflake tables (they may exist in others) around string length and truncation.

The culprit seems to be that Dataiku automatically recognizes string lengths when Snowflake tables are created with specific queries, but uses the smallest column length to create the output schema for the Stack recipe.

Of course I can get around this by manually defining the "Table Creation SQL", but I would prefer that this be addressed at the product level if possible.

Thanks,

Yash

Scenario Reporters

Satish — Wed, 22 May 2024 20:59:56 +0000

We are currently using scenario reporters to send data to a dataset with the configuration below.

{
"flowname": "${scenarioName}",
"status": "${outcome}",
"summary": "${failedEventsSummary}"
}

The issue is that failedEventsSummary provides too much text. How can we get just the error explaining why the scenario failed?

Operating system used: Browser

Generate Tile Num and Tile Sequence

satishkurra — Tue, 14 May 2024 21:42:47 +0000

Hi team

I'm trying to populate the Tile Num and Tile Sequence Number in the format shown in the attached picture. I have tried the Window recipe with no luck.

Can someone please help with this?

The data is attached; the ask is to generate a tile num for the INS column. I have highlighted the color combinations in the picture.

Operating system used: Browser

Fuzzy Join: When to use Relative to the Left vs Right Tables.

tgb417 — Fri, 10 May 2024 23:02:59 +0000

I'm starting to work with the Fuzzy Joins and having good luck.

However, I'm trying to figure out when I might want to use a Relative Threshold related to the Right or Left Table when doing an overall Left Join to find duplicate records.

I understand that the proportion of items that need to match will differ based on the difference in length between the left and right table data elements.

But my question is: why might one be better than the other when I don't necessarily know the length of the strings in my left and right tables?

My use case is a self join (the table joined to itself as both the left and right table). I've got text strings that can vary from just a few characters to a few thousand characters, so these strings will appear in both the left and right tables at some point.

I think I understand that relative joins are good for me, because if I have two short values in the left and right tables, then only a few substitutions are checked, and for longer data elements more characters are checked before the items are considered to be joined.

But for example if I have a short string and a long string say:

This is a short string. And this is a short string made longer.

Let's say that the relative value is 50%.

Why would I use relative to left vs. relative to right in a deduplication use case?

Operating system used: Mac OS Sonoma 14.4.1
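To make the question concrete, here is a small pure-Python sketch using the two strings from the post. It assumes that a relative threshold means "the maximum allowed edit distance is the threshold multiplied by the length of the reference string", and it uses plain Levenshtein distance; both are assumptions made for illustration and may not match what the Fuzzy Join recipe does internally. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_match(left, right, ratio, relative_to):
    """Assumed rule: distance <= ratio * length of the chosen reference side."""
    reference = left if relative_to == "left" else right
    return levenshtein(left, right) <= ratio * len(reference)


SHORT = "This is a short string."
LONG = "And this is a short string made longer."

print(len(SHORT), len(LONG), levenshtein(SHORT, LONG))
print("relative to left: ", is_match(SHORT, LONG, 0.5, "left"))
print("relative to right:", is_match(SHORT, LONG, 0.5, "right"))
```

Under this assumption the same pair passes or fails depending on which side supplies the reference length: with the short string on the left, the budget relative to the left is too small to absorb the extra text, while the budget relative to the right is large enough.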

Error using Embed recipe in RAG tutorial in Dataiku

VaishnaviRam — Mon, 22 Apr 2024 06:46:06 +0000

Hi,

I am following the RAG tutorial link -> https://knowledge.dataiku.com/latest/ml-analytics/gen-ai/tutorial-question-answering-using-rag-approach.html#

While trying to run the Embed recipe, I get the following error.

Oops: an unexpected error occurred

Error in Python process: : com.dataiku.dip.io.SocketBlockLink$SecretKernelTimeoutException: Subprocess failed to connect, it probably crashed at startup. Check the logs., caused by: SocketException: Socket operation on nonsocket: configureBlocking

HTTP code: , type:

Kindly help me fix this issue. I have attached the logs.

Operating system used: Windows 10

when training a model with a visual recipe, does dataiku fit the model on the entire dataset?

Tanguy — Mon, 12 Dec 2022 16:51:14 +0000

Context:

I have deployed a model to the flow
I want to retrain that model with its associated "train" recipe
I understand that the model's performance is evaluated using a test set or K-folds under a cross-validation strategy

My question: after retraining the model using the "train" recipe, is the resulting new active model fit on the entire dataset (as best practice sometimes suggests)?

I can't find any information on this final fitting strategy in the recipe (see screenshot below) and failed to find such information in dataiku's documentation.

Operating system used: Windows 10

Coalesce function doesn't work properly in prepare recipe

kentnardGaleria — Tue, 22 Aug 2023 10:00:47 +0000

Hi everyone!

I have a question regarding the coalesce function in Dataiku. I wanted to use the coalesce function in a Dataiku Formula, and the preview in the Prepare recipe shows that the function works and returns the value that I want. But after executing the recipe, the resulting column shows a different output from the preview.

I have made sure that the order of the values in the coalesce function is correct and that the empty cells are NULL instead of an empty string. I could not comprehend where the mistake is. The pictures are attached below. Picture 1 shows the preview in prepare recipe and Picture 2 shows the resulting dataset.

Thanks in advance!

DSS visual recipes defaulting to max column length with Redshift tables

veenacalambur — Wed, 08 Jan 2020 18:08:45 +0000

Hi everyone,

When working with Redshift tables in DSS visual recipes, we noticed that the table creation settings sometimes default to setting certain column lengths to the Redshift max (65,000). In many cases this becomes excessive. For example, in the screenshot below the "brand" column has a length of 65k, but most values in the column are fewer than 10 characters long.

We wanted to better understand the logic of column length setting defaults for Redshift and if there is a safe / proper way to modify this.

Feature handling Dummy encoding

stoch — Fri, 09 Feb 2024 00:02:45 +0000

Dataiku's category handling = Dummy encoding with dropping dummy option seems to be using a level with the least exposure/volume as a dummy.

Q1. Is there a way to set this dummy manually instead of using Dataiku's default method? I want to avoid using the category handling = custom preprocessing option.

Q2. Using Variable type = Categorical with the Drop one dummy option on an input variable of double type seems to drop 2 levels. For example, there are only 3 regression coefficients from a variable with 5 levels (I would have expected 4 regression coefficients, since 1 level is used as a dummy). Does anyone know the reason for this?

Many thanks in advance.

set the random state in visual ML models

Tanguy — Fri, 26 Jan 2024 18:40:24 +0000

I have an ongoing project in production that I intend to replace with another project currently in development. As part of this transition, I find myself comparing a dataset that has undergone scoring from a model in each project. Initially, I anticipated the model scores to be identical or, at the very least, very similar. However, I have observed significant differences despite the fact that the underlying data provided to both models is the same.

Consequently, I am seeking a method to standardize the model training between the two projects by setting the random state. I am utilizing a random forest classifier within a visual recipe, and random forests in scikit-learn have a `random_state` attribute.

Is there a recommended approach to achieve this?

Operating system used: Redhat 8

Force substring to integer

yesitsmeoffical — Thu, 04 Jan 2024 19:23:26 +0000

Here is the sample table:

ID	Column A
1	AA2001
2	BB2002

I want to add a Column B, in which the values are forced to be integer.

I know I can do Column B = substring (Column A, -4), and Dataiku will automatically convert the values to integer, but the conversion process is a black box to me, and I don't know what's the conversion criteria/logic and when it might fail.

I thought I could add "numval" in front of the substring to force the conversion but it didn't work and returned blank.

Is there a logic I could apply to achieve this? Basically something like:

pd.to_numeric(df['Column A'].str[-4:], errors='coerce')

Operating system used: win 11
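As a point of comparison, here is a self-contained pandas sketch of the conversion the post is after: take the last four characters (matching `substring(Column A, -4)`) and coerce them to integers. The third row, `CC20X3`, is a made-up value added to show what happens when the conversion fails.

```python
import pandas as pd

# First two rows are from the post; the third is a hypothetical bad value.
df = pd.DataFrame({"ID": [1, 2, 3], "Column A": ["AA2001", "BB2002", "CC20X3"]})

# Take the last four characters of each value.
last_four = df["Column A"].str[-4:]

# errors="coerce" turns anything that is not a valid number into NaN instead
# of raising; the nullable "Int64" dtype keeps integers alongside missing values.
df["Column B"] = pd.to_numeric(last_four, errors="coerce").astype("Int64")

print(df)
```

The failure mode is explicit here: any substring that is not a valid number becomes a missing value rather than silently turning into text.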

Confused on how to use RAG (Retrieval Augmented Generation)

Antal — Thu, 02 Nov 2023 13:14:47 +0000

I'm playing with the new LLM recipes and getting a bit confused with the RAG functionality.

I can use an Embed recipe to create an Embedding dataset / Vector Store.

Then I can set up an LLM to query the resulting object in its settings.

But, how to go from there? How can I ask a question / query to the Embedding object? Clicking on it only gives the option of a Python recipe and there's also nothing like a Visual webapp.

Operating system used: AWS Linux