Precision on Mathematical formulas on a recipe

Solved!
saraa1
Level 2
Precision on Mathematical formulas on a recipe

Hello all, 

I want to make a mathematical calculation via a formula in a recipe of graphic preparation but I do not know for which reason the expected result is not correct (example below):

Formula : (C/10)-A
=> Expected result : (1038/10) - 90 = 13.8
=> In DSS : 13.7999999

 

does this mean that dss isn't precise on giving results ? I know that we might round the results but why this couldn't be calculated from the beginning on the formula ? 

 

thank you ๐Ÿ™‚

1 Solution
AlexandreL
Dataiker

Hi,

This behaviour is due to the way computers store floating-point numbers. If you try to save the 13.8 value as a 32-bit floating point number, it will be stored as 13.7999999999 as 13.8 can not be represented exactly as a float with variable precision. You can check this article for more details about floating-point numbers: https://en.wikipedia.org/wiki/Single-precision_floating-point_format

If you want to fix your precision to 2 decimals, you can indeed round this value to a fixed precision and you will get exactly 13.80. You can use the round step for this: https://doc.dataiku.com/dss/latest/preparation/processors/round.html#round-numbers

Best regards,

Alex

View solution in original post

8 Replies
AlexandreL
Dataiker

Hi,

This behaviour is due to the way computers store floating-point numbers. If you try to save the 13.8 value as a 32-bit floating point number, it will be stored as 13.7999999999 as 13.8 can not be represented exactly as a float with variable precision. You can check this article for more details about floating-point numbers: https://en.wikipedia.org/wiki/Single-precision_floating-point_format

If you want to fix your precision to 2 decimals, you can indeed round this value to a fixed precision and you will get exactly 13.80. You can use the round step for this: https://doc.dataiku.com/dss/latest/preparation/processors/round.html#round-numbers

Best regards,

Alex

tgb417

@AlexandreL ,

I see that this problem also exists in Python in Jupiter Notebooks. 

However, there are lots of places that use math libraries that get this kind of thing more correct.  I checked two or three just now, Excel, the Calculator is Apple Spotlight, a cheap phone calculator application.  I also notice the R will get the intent of the calculation "correct".

Is there a way to have Python and built-in DSS Visual Recipies do these calculations using mathematical libraries in a more intuitive manner?

In Python I see the use of the Decimal Library as one option.  However, the syntax is a bit complicated.

from decimal import *
getcontext()
Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999,
capitals=1, clamp=0, flags=[], traps=[Overflow, DivisionByZero,
InvalidOperation])

getcontext().prec = 7 # Set a new precision


Decimal(1038/10)-Decimal(90)

13.80000.png

Is this something that needs to be put in as a Product Enhancement idea?  This sort of thing just leads to a loss of trust.  

--Tom
0 Kudos
Ignacio_Toledo

Hi @tgb417 ,

In this particular case there is no problem with "trust" as this is not an error, specially when you look at this from the numerical programming point of view. If you want more background with examples (in python) on why this happens, I recommend you this resource:

https://pythonnumericalmethods.berkeley.edu/notebooks/chapter09.00-Representation-of-Numbers.html

There is a section on floating point numbers and round-off errors. Comparing with Excel, or calculators in Apple or phones is not a good comparison, as those are not usually made to perform complex numerical calculations (although you could add some plugins in Excel, I believe). But is not a matter of Python making errors! It has to do with the way numbers are stored.

For example, do the following in a python notebook:

import numpy as np

print(np.float16(1e4), np.float32(1e4), np.float64(1e4), np.float128(1e4))

print(np.float16(1e5), np.float32(1e5), np.float64(1e5), np.float128(1e5))

print(np.float16(1e50), np.float32(1e50), np.float64(1e50), np.float128(1e50))

print(np.float16(1038 / 10 - 90), np.float32(1038 / 10 - 90), np.float64(1038 / 10 - 90), np.float128(1038 / 10 - 90))

print(np.float16(np.pi), np.float32(np.pi), np.float64(np.pi), np.float128(np.pi))

That last line might be the more relevant, as it shows how each kind of float is capable of doing calculations with increasing support for bigger numbers or higher precision. (By the way, python's default float precision matches numpy's float64)

I'm not an expert on the subtleties behind the scenes happening here, but I just have learnt with time that when doing computational analysis with python or other programming languages, dealing with numbers is not so straight forward!

By the way, you can check all the limitations of a particular float dtype using

np.info

For example:

np.finfo(np.float16)

gives you

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

 

Hope this helps, and doesn't just add confusion!

0 Kudos
tgb417

@Ignacio_Toledo ,

Thanks for your note.  I do think that this will be of some help to others reading this thread.

I did understand the concept of a binary numerical approximate representation of floating-point numbers.  And I do understand now that Python as a language chooses not to "hide" these representational complexities.  Those are valid choices.  And in Python and Python Jupyter Notebooks we will have to live with those choices in DSS.

However, from my point of view, those are just choices on a complicated problem.  For example, the R language chooses to take a different path for these kinds of examples.

I'd like to invite all of us to consider as we try to widen the adoption of data science and numerical literacy to a broader community.  Every time we choose a more technically accurate solution as a default solution that comes up with results that do not make logical sense to beginners.  We end up placing a hurtle in front of our colleagues who don't have the same level of numerical sophistication.

(1038/10) - 90

Python Answer: 13.799999999999997
R Answer 13.8

The less intuitive, but accurate, python answer is harder to trust for the beginner.  When other systems that folks are accustomed to using produce a more intuitive answer and hide some of the binary representational complexities.

Personally, I would prefer a default behavior to have a more R-like answer to the problem listed above.  Rather than the Python answer.  And I'd like to invite the Dataiku staff to consider what choices should be made with the formula language built into visual recipes that are typically designed for beginner users. 

Maybe, I'm being a bit lazy not wanting to have to explain the binary representation of floating-point numbers to my beginning users who want to divide a few numbers.

Those are just my $0.02. Other please feel free to share your perspectives.

--Tom
0 Kudos
Ignacio_Toledo

Hi @tgb417 

I was wondering why would R behave differently if is supposed to use the equivalent of numpy float64, and I find out is that R is just choosing to print the result of

 

1038.0 / 10.0 - 90.0

 

as 13.8, but if you try this

 

n <- 1038.0 / 10.0 - 90.0

n == 13.8

 

you'll see that the issue is the same (hint, the output of the last line is FALSE).

So R, apparently, was doing what you expected (print out a number that hid the binary complexity) but at some point or another, you were going to have to deal with it.

In my opinion, this is part of the experience every data analyst or scientist should have: not necessarily why it happens, but to be careful when using code under certain circumstances.

Cheers!

 

(In fact, this reminds me a little bit of the thread about dealing with dates and times: for humans is complicated, for computers, even more)

tgb417

Yeah, you are right, R does "hide" the issue. And maybe not in a good way.  You are correct that when one gets to comparisons one gets an unexpected answer with no clue that we have to be careful.

However, Excel, where a lot of folks are coming to Data Science from produces a result where you don't have to know about the precision of the binary representation (at least in this case):

Excel Divitions.png

Google Sheet also gets this done in the same more intuitive manner as MS Excel.

Wolfram Alpha is another example that also seems to get this done in a more intuitive manner.

Wolfram Alpha.png

I've also tried to check some other tools in the field.  Knime for example has the same challenge as R it seems to hides the lack of precision when displaying results. But when asked to do a test for equality comes up with False.  So hiding the lack of precision in an inconsistent way leading to results that are hard to explain to beginners.

(UPDATED) MS SQL Server has its own solution to this case.  This approach does show that precision might be a problem with the extra 0s,  By using a string representation of the number we can show the lack of precision. And we come up with the more intuitive true answer on the evaluation of equivalency.  

SQL Server.png

Here is a bit more about SQL server calculations.

https://www.red-gate.com/hub/product-learning/sql-prompt/the-dangers-of-using-float-or-real-datatype...

Thank you all.  I've learned something here. I've not been as careful about these issues as maybe I should be. And I should be more careful in the future.

@saraa1 thank you for asking this great simple question.  It has helped me learn and grow.

--Tom
saraa1
Level 2
Author

Thank you for your answer, Alex ,

regards,

0 Kudos