Safe sums across columns in DSS Formulas

Ashley
Ashley Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 163 Dataiker

People will usually tell you that 1+1=2. In general, you can believe them; however, in some cases in DSS, the following may also be true from time to time : 1+1=11.

🤯 “What in the name of the bird?!

TL;DR

Why might these two numbers be concatenating when they’re not strings? That's because there are 3 things you need to know about operations in the Formulas editor:

  1. Be careful when using + or -: only do it if your data does not contain any null values.
  2. Use the sum() function instead: sum([col_A, col_B, col _C, col_D]) ← use those brackets
  3. Or use a nested IF: if(isnull(col_A),0,col_A)+if(isnull(col_B),0,col_B) ← tedious, but also effective

When do 1+1=11 and 1+1=2 both happen?

I was recently playing with a dataset in which I needed to create a sum across columns, and I found something that astonished me with a formula that went A+B+C+D :

A

int

B

int

C

int

D

int

=A+B+C+D

int

1

2

3

4

10

null

2

3

4

234

1

null

3

4

134

1

2

null

4

34

1

2

3

null

6

"But all my data is stored as integers! What is the logic behind this add/concatenate behavior?"

Why?

🧐 Let’s see what’s happening:

1+2+3+4=10 ← add 1+2 , add 3, add 4

null+2+3+4=234 ← (null+2), then concatenate 3, concatenate 4

1+null+3+4=134 ← (1+null), then concatenate 3, concatenate 4

1+2+null+4=34 ← add 1+2, then (+ null), then concatenate 4

1+2+3+null=6 ← add 1+2 , add 3, (+null) then...

See the pattern of how addition turns into concatenation? It always happens after a null value. Why? The short answer is, "It’s complicated".

Switches from Add → Concatenate

Let's K.I.S.S. the long answer:

Auto-detection for operations occurs by value and not by column.

The detection of the column datatype occurs after all the operations have been completed.

Concatenate always follows null + anything. It doesn’t matter that all of the data in your columns are numbers, or that they’re stored as a number data type, or that logically 1+nothing+3+4 should be 8, not 134.

What happened in the case of the lovely A+B+C+D above is that null values are turned into “”. Whenever the processor sees anything with a + “” it defaults to concatenation and turns the result into a string.

What happens IRL

Voice of the processor

A+B are two numbers...

Two numbers? We add those...

yet as soon as C comes into the picture, let’s assume C is a null value...

I see that C is null.

...it flips a switch...

null is “”

“” is string.

..and the processor will forever concatenate whatever comes after.

+ and string is always concatenate.

Result is string.

Repeat to

Another Example

In the following example, everything works according to how we’d expect it to because G is the result of several, nested operations:

A

int

B

int

C

int

D

int

E=A+B

Int

F=E+C

Int

G=F+D

Int

1

2

3

4

3

6

10

2

3

4

2

5

9

1

3

4

1

4

8

1

2

4

3

3

7

1

2

3

3

6

6

Let’s walk through an example of how the numbers in column E get crunched:

What happens IRL: A+B

Voice of the processor

1+2→ 3

Number and number? Add.

Null +2 → “”+2→ 2

String and number? Concatenate.

1+ Null → 1+“” → 1

One number and string? Concatenate.

Col type = int

Column type is number because column detection happens at the end!

Tips and tricks for safe sums in DSS

  • Your first option is to get rid of null values. Maybe you replace them with 0 or maybe you do something else; it'll depend on your data.
  • There are many cases, however, where you want to keep null values to do other types of analyses (count null, anyone?). If so, do either of these:
    • Use the sum() function instead: sum([A,B,C,D]) ← see those brackets? Use them!
    • Or use a nested IF: if(isnull(A),0,A)+if(isnull(B),0,B) ← tedious, but also effective
Tagged:

Comments

Setup Info
    Tags
      Help me…