Precision of mathematical formulas in a recipe

saraa1 Registered Posts: 13 ✭✭✭

Hello all,

I want to perform a mathematical calculation via a formula in a visual preparation recipe, but I do not understand why the result is not what I expect (example below):

Formula: (C/10) - A
=> Expected result: (1038/10) - 90 = 13.8
=> In DSS: 13.7999999
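
Here is a minimal sketch of the same arithmetic in plain Python (standard double-precision floats, outside DSS; the variables c and a just stand in for the recipe columns), showing the behaviour and how explicit rounding recovers the expected value:

c = 1038
a = 90

result = (c / 10) - a
print(result)            # 13.799999999999997 -- 103.8 has no exact binary representation
print(round(result, 1))  # 13.8               -- explicit rounding recovers the expected value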

Does this mean that DSS isn't precise when computing results? I know that we might round the results, but why can't this be calculated correctly by the formula in the first place?

thank you

Best Answer

Answers

  • saraa1 Registered Posts: 13 ✭✭✭

    Thank you for your answer, Alex,

    regards,

  • tgb417 Posts: 1,595 Neuron
    edited July 17

    @AlexandreL,

    I see that this problem also exists in Python in Jupyter notebooks.

    However, there are lots of places that use math libraries that handle this kind of thing more intuitively. I checked two or three just now: Excel, the calculator in Apple Spotlight, and a cheap phone calculator application. I also notice that R gets the intent of the calculation "correct".

    Is there a way to have Python and the built-in DSS Visual Recipes do these calculations using mathematical libraries in a more intuitive manner?

    In Python I see the use of the decimal library as one option. However, the syntax is a bit complicated.

    from decimal import Decimal, getcontext

    # Inspect the current context; by default it is:
    # Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999,
    #         capitals=1, clamp=0, flags=[], traps=[Overflow, DivisionByZero,
    #         InvalidOperation])
    getcontext()

    getcontext().prec = 7  # Set a new precision (7 significant digits)

    # Note: 1038/10 is still evaluated as a binary float before the conversion;
    # Decimal(1038) / Decimal(10) would avoid binary floats entirely.
    Decimal(1038 / 10) - Decimal(90)

    [screenshot: the result displays as Decimal('13.80000')]

    Is this something that needs to be put in as a Product Enhancement idea? This sort of thing just leads to a loss of trust.

  • Ignacio_Toledo Posts: 411 Neuron
    edited July 17

    Hi @tgb417,

    In this particular case there is no problem with "trust", as this is not an error, especially when you look at it from a numerical programming point of view. If you want more background, with examples in Python, on why this happens, I recommend this resource:

    https://pythonnumericalmethods.berkeley.edu/notebooks/chapter09.00-Representation-of-Numbers.html

    There is a section on floating-point numbers and round-off errors. Comparing with Excel, or with the calculators on Apple devices or phones, is not a fair comparison, as those are not usually made to perform complex numerical calculations (although you could add some plugins to Excel, I believe). But it is not a matter of Python making errors! It has to do with the way numbers are stored.
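
    For instance, a minimal sketch (added here just for illustration, not from the original reply) that prints more decimal digits of the intermediate value shows that 103.8 cannot be represented exactly as a binary float:

    # The closest 64-bit float to 103.8 is slightly below it,
    # so subtracting the exactly representable 90 exposes the gap.
    print(f"{1038 / 10:.20f}")       # 103.79999999999999715783
    print(f"{1038 / 10 - 90:.20f}")  # 13.79999999999999715783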

    For example, do the following in a Python notebook:

    import numpy as np

    print(np.float16(1e4), np.float32(1e4), np.float64(1e4), np.float128(1e4))
    print(np.float16(1e5), np.float32(1e5), np.float64(1e5), np.float128(1e5))
    print(np.float16(1e50), np.float32(1e50), np.float64(1e50), np.float128(1e50))
    print(np.float16(1038 / 10 - 90), np.float32(1038 / 10 - 90), np.float64(1038 / 10 - 90), np.float128(1038 / 10 - 90))
    print(np.float16(np.pi), np.float32(np.pi), np.float64(np.pi), np.float128(np.pi))

    The last line might be the most relevant, as it shows how each kind of float is capable of doing calculations with increasing support for bigger numbers or higher precision. (By the way, Python's default float precision matches numpy's float64.)

    I'm not an expert on the subtleties happening behind the scenes here, but I have learnt with time that when doing computational analysis with Python or other programming languages, dealing with numbers is not so straightforward!

    By the way, you can check all the limitations of a particular float dtype using np.finfo.

    For example:

    np.finfo(np.float16)

    gives you

    finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

    Hope this helps, and doesn't just add confusion!

  • tgb417 Posts: 1,595 Neuron
    edited July 17

    @Ignacio_Toledo,

    Thanks for your note. I do think that this will be of some help to others reading this thread.

    I did understand the concept of an approximate binary representation of floating-point numbers. And I do understand now that Python as a language chooses not to "hide" these representational complexities. Those are valid choices, and in Python and Jupyter notebooks in DSS we will have to live with them.

    However, from my point of view, those are just choices on a complicated problem. For example, the R language chooses to take a different path for these kinds of examples.

    I'd like to invite all of us to consider this as we try to widen the adoption of data science and numerical literacy in a broader community. Every time we choose a more technically accurate default that produces results that do not make logical sense to beginners, we end up placing a hurdle in front of colleagues who don't have the same level of numerical sophistication.

    (1038/10) - 90

    Python answer: 13.799999999999997
    R answer: 13.8

    The less intuitive, but accurate, Python answer is harder for a beginner to trust when other systems that folks are accustomed to using produce a more intuitive answer and hide some of the binary representational complexities.

    Personally, I would prefer the default behavior to give a more R-like answer to the problem listed above, rather than the Python answer. And I'd like to invite the Dataiku staff to consider what choices should be made for the formula language built into visual recipes, which are typically designed for beginner users.

    Maybe I'm being a bit lazy, not wanting to have to explain the binary representation of floating-point numbers to my beginning users who want to divide a few numbers.
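
    (For what it's worth, here is a minimal Python sketch of the kind of display rounding I have in mind; the 7 significant digits mirror R's default options(digits = 7), and this is only an illustration, not a proposal for the exact mechanism:)

    value = 1038 / 10 - 90

    print(repr(value))     # 13.799999999999997 -- what Python shows by default
    print(f"{value:.7g}")  # 13.8               -- 7 significant digits, similar to R's default display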

    Those are just my $0.02. Others, please feel free to share your perspectives.

  • Ignacio_Toledo Posts: 411 Neuron
    edited July 17

    Hi @tgb417,

    I was wondering why R would behave differently if it is supposed to use the equivalent of numpy's float64, and what I found is that R is just choosing to print the result of

    1038.0 / 10.0 - 90.0

    as 13.8, but if you try this

    n <- 1038.0 / 10.0 - 90.0
    
    n == 13.8

    you'll see that the issue is the same (hint: the output of the last line is FALSE).

    So R, apparently, was doing what you expected (printing out a number that hides the binary complexity), but at some point or another you were going to have to deal with it.

    In my opinion, this is part of the experience every data analyst or scientist should have: not necessarily knowing why it happens, but being careful when using code under certain circumstances.
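
    As a concrete example of "being careful" (a minimal sketch in Python; R's all.equal() plays the same role), comparing floats with a tolerance instead of exact equality avoids the surprise:

    import math

    n = 1038 / 10 - 90

    print(n == 13.8)              # False -- exact equality is fragile with binary floats
    print(math.isclose(n, 13.8))  # True  -- compare within a relative tolerance instead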

    Cheers!

  • Ignacio_Toledo Posts: 411 Neuron

    (In fact, this reminds me a little bit of the thread about dealing with dates and times: for humans it is complicated; for computers, even more so.)

  • tgb417 Posts: 1,595 Neuron

    Yeah, you are right, R does "hide" the issue, and maybe not in a good way. You are correct that when one gets to comparisons, one gets an unexpected answer with no clue that we have to be careful.

    However, Excel, which is where a lot of folks are coming to data science from, produces a result where you don't have to know about the precision of the binary representation (at least in this case):

    [screenshot: Excel division result showing 13.8]

    Google Sheets also gets this done in the same, more intuitive manner as MS Excel.

    Wolfram Alpha is another example that also seems to get this done in a more intuitive manner.

    [screenshot: Wolfram Alpha result]

    I've also tried to check some other tools in the field. Knime, for example, has the same challenge as R: it seems to hide the lack of precision when displaying results, but when asked to do a test for equality it comes up with False. So it hides the lack of precision in an inconsistent way, leading to results that are hard to explain to beginners.

    (UPDATED) MS SQL Server has its own solution to this case. This approach does show that precision might be a problem with the extra 0s: by using a string representation of the number we can show the lack of precision, and we come up with the more intuitive "true" answer when evaluating equivalency.

    [screenshot: SQL Server result]

    Here is a bit more about SQL Server calculations:

    https://www.red-gate.com/hub/product-learning/sql-prompt/the-dangers-of-using-float-or-real-datatypes

    Thank you all. I've learned something here. I've not been as careful about these issues as maybe I should be. And I should be more careful in the future.

    @saraa1, thank you for asking this great, simple question. It has helped me learn and grow.
