Numpy to weak to calculate a precise mean value

Numpy to weak to calculate a precise mean value - python

This question is very similar to this post - but not exactly
I have some data in a .csv file. The data has precision to the 4th digit (#.####).
Calculating the mean in Excel or SAS gives a result with precision to 5th digit (#.#####) but using numpy gives:
import numpy as np
data = np.recfromcsv(path2file, delimiter=';', names=['measurements'], dtype=np.float64)
rawD = data['measurements']
print np.average(rawD)
gives a number like this
#.#####999999999994
Clearly something is wrong..
using
from math import fsum
print fsum(rawD.ravel())/rawD.size
gives
#.#####
Is there anything in the np.average that I set wrong _______?
BONUS info:
I'm only working with 200 data points in the array
UPDATE
I thought I should make my case more clear.
I have numbers like 4.2730 in my csv (giving a 4 decimal precision - even though the 4th always is zero [not part of the subject so don't mind that])
Calculating an average/mean by numpy gives me this
4.2516499999999994
Which gives a print by
>>>print "%.4f" % np.average(rawD)
4.2516
During the same thing in Excel or SAS gives me this:
4.2517
Which I actually believe as being the true average value because it finds it to be 4.25165.
This code also illustrate it:
answer = 0
for number in rawD:
answer += int(number*1000)
print answer/2
425165
So how do I tell np.average() to calculate this value ___?
I'm a bit surprised that numpy did this to me... I thought that I only needed to worry if I was dealing with 16 digits numbers. Didn't expect a round off on the 4 decimal place would be influenced by this..
I know I could use
fsum(rawD.ravel())/rawD.size
But I also have other things (like std) I want to calculate with the same precision
UPDATE 2
I thought I could make a temp solution by
>>>print "%.4f" % np.float64("%.5f" % np.mean(rawD))
4.2416
Which did not solve the case. Then I tried
>>>print "%.4f" % float("4.24165")
4.2416
AHA! There is a bug in the formatter: Issue 5118
To be honest I don't care if python stores 4.24165 as 4.241649999... It's still a round off error - NO MATTER WHAT.
If the interpeter can figure out how to display the number
>>>print float("4.24165")
4.24165
Then should the formatter as well and deal with that number when rounding..
It still doesn't change the fact that I have a round off problem (now both with the formatter and numpy)
In case you need some numbers to help me out then I have made this modified .csv file:
Download it from here
(I'm aware that this file does not have the number of digits I explained earlier and that the average gives ..9988 at the end instead of ..9994 - it's modified)
Guess my qeustion boils down to how do I get a string output like the one excel gives me if I use =average()
and have it round off correctly if I choose to show only 4 digits
I know that this might seem strange for some.. But I have my reasons for wanting to reproduce the behavior of Excel.
Any help would be appreciated, thank you.

To get exact decimal numbers, you need to use decimal arithmetic instead of binary. Python provides the decimal module for this.
If you want to continue to use numpy for the calculations and simply round the result, you can still do this with decimal. You do it in two steps, rounding to a large number of digits to eliminate the accumulated error, then rounding to the desired precision. The quantize method is used for rounding.
from decimal import Decimal,ROUND_HALF_UP
ten_places = Decimal('0.0000000001')
four_places = Decimal('0.0001')
mean = 4.2516499999999994
print Decimal(mean).quantize(ten_places).quantize(four_places, rounding=ROUND_HALF_UP)
4.2517

The result value of average is a double. When you print out a double, by default all digits are printed. What you see here is the result of limited digital precision, which is not a problem of numpy, but a general computing problem. When you care of the presentation of your float value, use "%.4f" % avg_val. There is also a package for rational numbers, to avoid representing fractions as real numbers, but I guess that's not what you're looking for.
For your second statement, summarizing all the values by hand and then dividing it, I suppose you're using python 2.7 and all your input values are integer. In that way, you would have an integer division, which truncates everything after the dot, resulting in another integer value.

Related

Having issues making math calculations in python, I am receiving the wrong answer in the end

Beginner here. I've been following the 100 days of code course on Udemy and I have been trying to figure out the Tip Calculator project.
https://gyazo.com/285baf25f0c803fc893faa32d23d9fd1
I am receiving the wrong tip per person. For example I make the total bill amount 40.53, percent tip 15 and make it a 3 way split, and it gives me 15.33... and if I do this online on another program it would give me 15.54. Any tips for a beginner?

Your issue is that it is converting those numbers to integers, however, integers don't have decimal places and also always round down, you need to use floats for maximum precision

Without seeing your code, as BeRT2me has requested to see, it sounds like the variable you are assigning your intermediary total to is being set as an integer or is being truncated/rounded down perhaps by the output type of a function call you are making/returning.
Likely, change it to be an appropriate form of decimal type.
e.g. Using your values, 40.53 * 1.15 = 46.6095
46.6095 / 3 = 15.5365 (or 15.54 rounded to 2 decimal places)
46 / 3 = 15.33 (rounded to 2 decimal places)
If you are using that code example of total_final_bill. The type would be implicitly set as an integer since you are adding to integer values.
You are performing an implicit rounding function on the tip_amount and the total_bill by using the int() function. These should not be integers as they are decimal/float values.
So, each time you use int() other than int(tip_percent) and int(split_question) which are actually integer values for your formulas you are rounding the decimals.

How do I get Decimal to reflect value correctly?

I need to perform many math operations on numbers that look like 104.950178 - i.e. having multiple decimal places. It ranges from 4 decimal places to 8 decimal places with an occasional 10 decimal place number.
After much reading on Stack Overflow, I understand that I should not use float but instead use Decimal, because float is base 2 and Decimal is base 10. Having many math operations, float will give a result with error, which was not what I wanted, so I went with Decimal.
Everything was working fine until I encountered:
from decimal import *
Decimal(0.1601).quantize(Decimal('.0001'), rounding=ROUND_FLOOR) #returned Decimal('0.1600')
I need the numbers to be reflected exactly as they are. Can you please advise if there is another number system to use in Python 3, or if there is a way to get Decimal to take the number given as it is without changing its value?
Note: I added the "quantize..." portion of the code due to the comment by Mark to make it a MVCE. I took for granted the output and forgot that every Decimal in this part of my code was processed to truncate at 4 decimal points.
Nevertheless, #jonrsharpe is really sharp. He saw my question and immediately knew (before the "quantize..." edit) where the problem was. Thanks!

Instead of:
a = 0.1601
Decimal(a)
I used:
a = '0.1601'
Decimal(a)
Defined it as a string instead, and Decimal reflected the value exactly. I hope this helps someone.

Arithmetic error: Python is incorrectly dividing variables

I'm getting something that doesn't seem to be making a lot of sense. I was practicing my coding by making a little program that would give me the probability of getting certain cards within a certain timeframe of a card game. In order to calculate the chances, I would need to create a method that would perform division, and report the chances as a fraction and as a decimal. so I designed this:
from fractions import Fraction
def time_odds(card_count,turns,deck_size=60):
chance_of_occurence = float(card_count)/float(deck_size)
opening_hand_odds = 7*chance_of_occurence
turn_odds = (7 + turns)*chance_of_occurence
print ("Chance of it being in the opening hand: %s or %s" %(opening_hand_odds,Fraction(opening_hand_odds)))
print ("Chance of it being acquired by turn %s : %s or %s" %(turns,turn_odds,Fraction(turn_odds) ))
and then I used it like so:
time_odds(3,5)
but for whatever reason I got this as the answer:
"Chance of it being in the opening hand: 0.35000000000000003 or
6305039478318695/18014398509481984"
"Chance of it being acquired by turn 5 : 0.6000000000000001 or
1351079888211149/2251799813685248"
so it's like, almost right, except the decimal is just slightly off, giving like a 0.0000000000003 difference or a 0.000000000000000000001 difference.
Python doesn't do this when I just make it do division like this:
print (7*3/60)
This gives me just 0.35, which is correct. The only difference that I can observe, is that I get the slightly incorrect values when I am dividing with variables rather than just numbers.
I've looked around a little for an answer, and most incorrect division problems have to do with integer division (or I think it can be called floor division) , but I didn't manage to find anything addressing this.
I've had a similar problem with python doing this when I was dividing really big numbers. What's going on?
Why is this so? what can I do to correct it?

The extra digits you're seeing are floating point precision errors. As you do more and more operations with floating point numbers, the errors have a chance of compounding.
The reason you don't see them when you try to replicate the computation by hand is that your replication performs the operations in a different order. If you compute 7 * 3 / 60, the mutiplication happens first (with no error), and the division introduces a small enough error that Python's float type hides it for you in its repr (because 0.35 unambiguously refers to the same float value as the computation). If you do 7 * (3 / 60), the division happens first (introducing error) and then the multiplication increases the size of the error to the point that it can't be hidden (because 0.35000000000000003 is a different float value than 0.35).
To avoid printing out the the extra digits that are probably error, you may want to explicitly specify a precision to use when turning your numbers into strings. For instance, rather than using the %s format code (which calls str on the value), you could use %.3f, which will round off your number after three decimal places.
There's another issue with your Fractions. You're creating the Fraction directly from the floating point value, which already has the error calculated in. That's why you're seeing the fraction print out with a very large numerator and denominator (it's exactly representing the same number as the inaccurate float). If you instead pass integer numerator and denominator values to the Fraction constructor, it will take care of simplifying the fraction for you without any floating point inaccuracy:
print("Chance of it being in the opening hand: %.3f or %s"
% (opening_hand_odds, Fraction(7*card_count, deck_size)))
This should print out the numbers as 0.350 and 7/20. You can of course choose whatever number of decimal places you want.
Completely separate from the floating point errors, the calculation isn't actually getting the probability right. The formula you're using may be a good enough one for doing in your head while playing a game, but it's not completely accurate. If you're using a computer to crunch the numbers for you, you might as well get it right.
The probability of drawing at least one of N specific cards from a deck of size M after D draws is:
1 - (comb(M-N, D) / comb(M, D))
Where comb is the binary coefficient or "combination" function (often spoken as "N choose R" and written "nCr" in mathematics). Python doesn't have an implementation of that function in the standard library, but there are a lot of add on modules you may already have installed that provide one or you can pretty easily write your own. See this earlier question for more specifics.
For your example parameters, the correct odds are '5397/17110' or 0.315.

floats inside tuples changing values when accessed

So I have a list of tuples of two floats each. Each tuple represents a range. I am going through another list of floats which represent values to be fit into the ranges. All of these floats are < 1 but positive, so precision matter. One of my tests to determine if a value fits into a range is failing when it should pass. If I print the value and the range that is causing problems I can tell this much:
curValue = 0.00145000000671
range = (0.0014500000067055225, 0.0020968749796738849)
The conditional that is failing is:
if curValue > range[0] and ... blah :
# do some stuff
From the values given by curValue and range, the test should clearly pass (don't worry about what is in the conditional). Now, if I print explicitly what the value of range[0] is I get:
range[0] = 0.00145000000671
Which would explain why the test is failing. So my question then, is why is the float changing when it is accessed. It has decimal values available up to a certain precision when part of a tuple, and a different precision when accessed. Why would this be? What can I do to ensure my data maintains a consistent amount of precision across my calculations?

The float doesn't change. The built-in numberic types are all immutable. The cause for what you're observing is that:
print range[0] uses str on the float, which (up until very recent versions of Python) printed less digits of a float.
Printing a tuple (be it with repr or str) uses repr on the individual items, which gives a much more accurate representation (again, this isn't true anymore in recent releases which use a better algorithm for both).
As for why the condition doesn't work out the way you expect, it's propably the usual culprit, the limited precision of floats. Try print repr(curVal), repr(range[0]) to see if what Python decided was the closest representation of your float literal possible.

In modern day PC's floats aren't that precise. So even if you enter pi as a constant to 100 decimals, it's only getting a few of them accurate. The same is happening to you. This is because in 32-bit floats you only get 24 bits of mantissa, which limits your precision (and in unexpected ways because it's in base2).
Please note, 0.00145000000671 isn't the exact value as stored by Python. Python only diplays a few decimals of the complete stored float if you use print. If you want to see exactly how python stores the float use repr.
If you want better precision use the decimal module.

It isn't changing per se. Python is doing its best to store the data as a float, but that number is too precise for float, so Python modifies it before it is even accessed (in the very process of storing it). Funny how something so small is such a big pain.
You need to use a arbitrary fixed point module like Simple Python Fixed Point or the decimal module.

Not sure it would work in this case, because I don't know if Python's limiting in the output or in the storage itself, but you could try doing:
if curValue - range[0] > 0 and...

Significant figures in the decimal module

So I've decided to try to solve my physics homework by writing some python scripts to solve problems for me. One problem that I'm running into is that significant figures don't always seem to come out properly. For example this handles significant figures properly:
from decimal import Decimal
>>> Decimal('1.0') + Decimal('2.0')
Decimal("3.0")
But this doesn't:
>>> Decimal('1.00') / Decimal('3.00')
Decimal("0.3333333333333333333333333333")
So two questions:
Am I right that this isn't the expected amount of significant digits, or do I need to brush up on significant digit math?
Is there any way to do this without having to set the decimal precision manually? Granted, I'm sure I can use numpy to do this, but I just want to know if there's a way to do this with the decimal module out of curiosity.

Changing the decimal working precision to 2 digits is not a good idea, unless you absolutely only are going to perform a single operation.
You should always perform calculations at higher precision than the level of significance, and only round the final result. If you perform a long sequence of calculations and round to the number of significant digits at each step, errors will accumulate. The decimal module doesn't know whether any particular operation is one in a long sequence, or the final result, so it assumes that it shouldn't round more than necessary. Ideally it would use infinite precision, but that is too expensive so the Python developers settled for 28 digits.
Once you've arrived at the final result, what you probably want is quantize:
>>> (Decimal('1.00') / Decimal('3.00')).quantize(Decimal("0.001"))
Decimal("0.333")
You have to keep track of significance manually. If you want automatic significance tracking, you should use interval arithmetic. There are some libraries available for Python, including pyinterval and mpmath (which supports arbitrary precision). It is also straightforward to implement interval arithmetic with the decimal library, since it supports directed rounding.
You may also want to read the Decimal Arithmetic FAQ: Is the decimal arithmetic ‘significance’ arithmetic?

Decimals won't throw away decimal places like that. If you really want to limit precision to 2 d.p. then try
decimal.getcontext().prec=2
EDIT: You can alternatively call quantize() every time you multiply or divide (addition and subtraction will preserve the 2 dps).

Just out of curiosity...is it necessary to use the decimal module? Why not floating point with a significant-figures rounding of numbers when you are ready to see them? Or are you trying to keep track of the significant figures of the computation (like when you have to do an error analysis of a result, calculating the computed error as a function of the uncertainties that went into the calculation)? If you want a rounding function that rounds from the left of the number instead of the right, try:
def lround(x,leadingDigits=0):
"""Return x either as 'print' would show it (the default)
or rounded to the specified digit as counted from the leftmost
non-zero digit of the number, e.g. lround(0.00326,2) --> 0.0033
"""
assert leadingDigits>=0
if leadingDigits==0:
return float(str(x)) #just give it back like 'print' would give it
return float('%.*e' % (int(leadingDigits),x)) #give it back as rounded by the %e format
The numbers will look right when you print them or convert them to strings, but if you are working at the prompt and don't explicitly print them they may look a bit strange:
>>> lround(1./3.,2),str(lround(1./3.,2)),str(lround(1./3.,4))
(0.33000000000000002, '0.33', '0.3333')

Decimal defaults to 28 places of precision.
The only way to limit the number of digits it returns is by altering the precision.

What's wrong with floating point?
>>> "%8.2e"% ( 1.0/3.0 )
'3.33e-01'
It was designed for scientific-style calculations with a limited number of significant digits.

If I undertand Decimal correctly, the "precision" is the number of digits after the decimal point in decimal notation.
You seem to want something else: the number of significant digits. That is one more than the number of digits after the decimal point in scientific notation.
I would be interested in learning about a Python module that does significant-digits-aware floating point point computations.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.