Avoiding overflow while handling 64-bit floats in Python

Will my code be able to handle 64-bit float overflow (while maintaining precision) in Python?
My task is to calculate the mean of an input stream of floats.
count = 0
mean = 0.0
input = [1.0, 2.0, ...]
for i in input:
    # re-weight the old mean and fold in the new sample
    mean = mean * (count / (count + 1)) + i / (count + 1)
    count += 1
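CPython's float is a C double (IEEE 754 binary64 on virtually all platforms), so averaging ordinary finite values will not overflow; the practical concern is accumulated rounding error. A minimal sketch comparing the incremental mean above against math.fsum, which uses compensated summation as a higher-accuracy reference (the input values here are just illustrative):

import math
import random

values = [random.uniform(0.0, 1e6) for _ in range(100_000)]

# Incremental update, as in the loop above: re-weight the old mean, fold in the new sample.
mean = 0.0
for count, x in enumerate(values):
    mean = mean * (count / (count + 1)) + x / (count + 1)

reference = math.fsum(values) / len(values)   # compensated summation as a baseline
print(mean, reference, abs(mean - reference))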

Related

Discrepancy between floats in C and Python [closed]

I am making an implementation of a MobileNetV2 in C and comparing it, layer by layer, to a Keras model of the same network to make sure I'm doing things right. I managed to get to a very close approximation of the inference result of the network, but there is an error in the 5th decimal place or so. Looking for the reason for the imprecision I came across something strange.
I am working exclusively with float objects in C, and all of the arrays in Python, including all of the weight arrays and other parameters, are float32.
When exporting my processed image from Python to a .csv file, I passed seven decimal places to the export function: np.savetxt(outfile, twoD_data_slice, fmt='%-1.7e'). This still results in a float, but with certain limitations; namely, that last decimal place does not have full precision. However, one of the numbers I got was "0.98431373". When trying to convert this to C it instead gave me "0.98431377".
I asked a question here about this result and I was told it was a mistake to use only seven decimal places, but this still doesn't explain why Python can handle a number like "0.98431373" as a float32 when in C that gets changed to "0.98431377".
My guess is that Python is using a different 32-bit float than the one I'm using in C, as evidenced by how their float32 can handle a number like "0.98431373" and the float in C cannot. And I think this is what is causing the imprecision of my implementation when compared to the final result in Python. Because if Python can handle numbers like these, then the precision it has while doing calculations for the neural network is higher than in C, or at least different, so the answer should be different as well.
Is the floating point standard different in Python compared to C? And if so, is there a way I can tell Python to use the same format as the one in C?
Update
I changed the way I import files using atof, like so:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

void import_image(float data0[224][224][3]) {
    // open file
    FILE *fptr = fopen("image.csv", "r");
    if (fptr == NULL) {
        perror("fopen()");
        exit(EXIT_FAILURE);
    }

    int c = fgetc(fptr); // fgetc() returns int, so EOF can be detected
    char s[15];          // longest token is "-x.xxxxxxxe-xx" = 14 characters, plus '\0'

    for (int y = 0; y < 224; ++y) {         // lines
        for (int x = 0; x < 224; ++x) {     // columns
            for (int d = 0; d < 3; ++d) {   // depth
                // copy one token into s
                int i;
                for (i = 0; c != '\n' && c != ' ' && c != EOF; ++i) {
                    assert(0 <= i && i < 14); // leave room for the terminator
                    s[i] = (char)c;
                    c = fgetc(fptr);
                }
                s[i] = '\0';
                float f = atof(s);   // convert to float
                data0[y][x][d] = f;  // save in the array
                c = fgetc(fptr);
            }
        }
    }
    fclose(fptr);
}
I also exported the images from Python using seven decimal places and the result seems more accurate. If the float format in both is the same, even the partially significant last digit should match. And indeed, there doesn't seem to be any error in the image I import when compared to the one I exported from Python.
There is still, however, an error in the last digits of the final answer in the system. Python displays this answer with eight significant places. I mimic this with %.8g.
My answer:
'tiger cat', 0.42557633
'tabby, tabby cat', 0.35453162
'Egyptian cat', 0.070309319
'lynx, catamount', 0.0073038512
'remote control, remote', 0.0032443549
Python's Answer:
('n02123159', 'tiger_cat', 0.42557606)
('n02123045', 'tabby', 0.35453174)
('n02124075', 'Egyptian_cat', 0.070309244)
('n02127052', 'lynx', 0.007303906)
('n04074963', 'remote_control', 0.0032443653)
The error seems to start appearing after the first convolutional layer, which is where I start making mathematical operations with these values. There could be an error in my implementation, but assuming there isn't, could this be caused by a difference in the way Python operates with floats compared to C? Is this imprecision expected, or is it likely an error in the code?
A 32-bit floating point number can encode about 2^32 different values.
0.98431373 is not one of them.
Finite floating point values are of the form: some_integer * power-of-two.
The closest choice to 0.98431373 is 0.98431372_64251708984375, which is 16514044 * 2^-24.
Printing 0.98431372_64251708984375 to 8 fractional decimal places is 0.98431373. That may appear to be the 32-bit float value, but its exact value differs by a small amount.
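This is easy to see from Python itself, since Decimal of a float shows the exact stored value (assuming NumPy is available for a float32 scalar):

import numpy as np
from decimal import Decimal

x = np.float32(0.98431373)
print(Decimal(float(x)))   # 0.9843137264251708984375 -- the exact value actually stored
print(f"{x:.8f}")          # 0.98431373 -- the same value rounded back to 8 decimal places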
in C that gets changed to "0.98431377"
0.98431377 is not an expected output of a 32-bit float as the next larger float is 0.98431378_6029815673828125. Certainly OP's conversion code to C results in a 64-bit double with some unposted TBD conversion artifacts.
"the way I import the data to C is I take the mantissa, convert it to float, then take then exponent, convert it to long, and then multiply the mantissa by 10^exponent" is too vague. Best to post code than only a description of code.
Is the floating point standard different in Python compared to C?
They could differ, but likely are the same.
And if so, is there a way I can tell Python to use the same format as the one in C?
Not really. More likely the other way around. I am certain C allows more variations on FP than python.
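As a quick cross-check, CPython's struct module uses IEEE 754 binary32 for the standard-size 'f' format, which is what essentially every C compiler uses for float, so the stored bit pattern can be inspected directly (a sketch, assuming an IEEE platform on the C side):

import struct

# Round 0.98431373 to binary32, exactly as a C float assignment would on an IEEE platform.
packed = struct.pack('<f', 0.98431373)
print(hex(struct.unpack('<I', packed)[0]))   # the raw 32-bit pattern a C float would hold
print(repr(struct.unpack('<f', packed)[0]))  # about 0.9843137264251709, matching the analysis above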

OverflowError: Python int too large to convert to C long when feed data into numpy array

I am trying to feed a large number, after encryption, into a numpy array, but it says the number is too large and it overflows. I checked the code; everything is correct before I feed the numbers into the numpy array, but I get an error at the step of feeding in the data, which is en1[i,j] = pk.raw_encrypt(int(test1[i,j])).
The encrypted number I have got here is 3721469428823308171852407981126958588051758293498563443424608937516905060542577505841168884360804470051297912859925781484960893520445514263696476240974988078627213135445788309778740044751099235295077596597798031854813054409733391824335666742083102231195956761512905043582400348924162387787806382637700241133312260811836700206345239790866810211695141313302624830782897304864254886141901824509845380817669866861095878436032979919703752065248359420455460486031882792946889235009894799954640035281227429200579186478109721444874188901886905515155160376705016979283166216642522595345955323818983998023048631350302980936674. Python3 still claims it to be a int type. The number itself did not get overflow, but the numpy array does not allow it to be filled in.
What property of numpy causes this, and is there any solution to this problem? I have considered using a list instead of a numpy array, but that would be rather hard to implement when it is not a 1-D array. I have attached the full test code below.
import numpy as np
from phe import paillier   # assuming the python-paillier package

test1 = np.array([[1,2,3],[1,2,4]])
test2 = np.array([[4,1,3],[6,1,5]])
en1 = np.copy(test1)
en2 = np.copy(test2)
pk, sk = paillier.generate_paillier_keypair()
en_sum = np.copy(en1)
pl_sum = np.copy(en1)
for i in range(test1.shape[0]):
    for j in range(test2.shape[1]):
        en1[i,j] = pk.raw_encrypt(int(test1[i,j]))
        en2[i,j] = pk.raw_encrypt(int(test2[i,j]))
        en_sum[i,j] = en1[i,j]*en2[i,j]
        pl_sum[i,j] = sk.raw_decrypt(en_sum[i,j])
sum = sk.raw_decrypt(en_sum)
Python integers are stored with arbitrary precision, while numpy integers are stored in standard 32-bit or 64-bit representations depending on your platform.
What this means is that while the maximum representable Python integer is bounded only by your system memory, the maximum representable Numpy integer is bounded by what is representable in 64-bits.
You can see the maximum representable unsigned integer value here:
>>> import numpy as np
>>> np.iinfo(np.uint64).max
18446744073709551615
>>> 2 ** 64 - 1
18446744073709551615
The best approach for your application depends on what you want to do with these extremely large integers, but I'd lean toward avoiding Numpy arrays for integers of this size.
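If the values really must live in an array, one possible workaround (a sketch, with its own trade-offs) is an object-dtype array, which stores references to Python's arbitrary-precision ints instead of casting them down to int64:

import numpy as np

big = 2**2048 + 7                      # far larger than any int64
arr = np.empty((2, 3), dtype=object)   # elements are Python objects, not fixed-width ints
arr[0, 0] = big
print(arr[0, 0] == big)                # True: no truncation, no OverflowError

The cost is that arithmetic on object arrays falls back to Python-level loops, so most of NumPy's speed advantage is lost.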

Python: How to sum two signed int16 arrays into one without overflow

I have several int16 streams in strings and I want to sum them together (without overflow) and return the result as an int16 string. The background is mixing several wave files into one stream.
import numpy

decodeddata1 = numpy.fromstring(data, numpy.int16)
decodeddata2 = numpy.fromstring(data2, numpy.int16)
newdata = decodeddata1 + decodeddata2
return newdata.tostring()
Is there a way doing this with numpy or is there another library?
Processing each single value in python is too slow and results in stutter.
The most important thing is performance, since this code is used in a callback method feeding the audio.
#edit:
test input data:
a = np.int16([20000,20000,-20000,-20000])
b = np.int16([10000,20000,-10000,-20000])
print a + b --> [ 30000 -25536 -30000 25536]
but I want to keep the maximum levels:
[ 30000 40000 -30000 -40000]
The obvious consequence of mixing two signals, each with a dynamic range of -32768 <= x <= 32767, is a resulting signal with a range of -65536 <= x <= 65534, which requires 17 bits to represent.
To avoid clipping, you will need to gain-scale the inputs - the obvious way is to divide the sum (or both of the inputs) by 2.
numpy looks as though it should be quite fast for this - at least faster than Python's built-in variable-size integer type. If the additional arithmetic is a performance concern, you should consider your choice of language.
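A minimal sketch of that gain-scaling approach, assuming halving the sum is an acceptable mix level:

import numpy as np

a = np.int16([20000, 20000, -20000, -20000])
b = np.int16([10000, 20000, -10000, -20000])

# Widen to int32 so the 17-bit intermediate sum cannot wrap around,
# then halve before narrowing back to int16.
mixed = ((a.astype(np.int32) + b.astype(np.int32)) // 2).astype(np.int16)
print(mixed)   # [ 15000  20000 -15000 -20000]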

Extremely low values from NumPy

I am attempting to do a few different operations in Numpy (mean and interp), and with both operations I am getting the result 2.77555756156e-17 at various times, usually when I'm expecting a zero. Even attempting to filter these out with array[array < 0.0] = 0.0 fails to remove the values.
I assume there's some sort of underlying data type or environment error that's causing this. The data should all be float.
Edit: It's been helpfully pointed out that I was only filtering out the values of -2.77555756156e-17 but still seeing positive 2.77555756156e-17. The crux of the question is what might be causing these wacky values to appear when doing simple functions like interpolating values between 0-10 and taking a mean of floats in the same range, and how can I avoid it without having to explicitly filter the arrays after every statement.
You're running into numerical precision, which is a huge topic in numerical computing; when you do any computation with floating point numbers, you run the risk of running into tiny values like the one you've posted here. What's happening is that your calculations are resulting in values that can't quite be expressed with floating-point numbers.
Floating-point numbers are expressed with a fixed amount of information (in Python, this amount defaults to 64 bits). You can read more about how that information is encoded on the very good Floating point Wikipedia page. In short, some calculation that you're performing in the process of computing your mean produces an intermediate value that cannot be precisely expressed.
This isn't a property of numpy (and it's not even really a property of Python); it's really a property of the computer itself. You can see this in normal Python by playing around in the REPL:
>>> repr(3.0)
'3.0'
>>> repr(3.0 + 1e-10)
'3.0000000001'
>>> repr(3.0 + 1e-18)
'3.0'
For the last result, you would expect 3.000000000000000001, but that number can't be expressed in a 64-bit floating point number, so the computer uses the closest approximation, which in this case is just 3.0. If you were trying to average the following list of numbers:
[3., -3., 1e-18]
Depending on the order in which you summed them, you could get 1e-18 / 3., which is the "correct" answer, or zero. You're in a slightly stranger situation; two numbers that you expected to cancel didn't quite cancel out.
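A quick illustration of that order dependence, using the same list:

vals = [3., -3., 1e-18]
print(sum(vals) / 3)             # ~3.3e-19: the large values cancel first, so 1e-18 survives
print(sum(reversed(vals)) / 3)   # 0.0: 1e-18 is absorbed into -3.0 before the cancellation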
This is just a fact of life when you're dealing with floating point mathematics. The common way of working around it is to eschew the equals sign entirely and to only perform "numerically tolerant comparison", which means equality-with-a-bound. So this check:
a == b
Would become this check:
abs(a - b) < TOLERANCE
For some tolerance amount. The tolerance depends on what you know about your inputs and the precision of your computer; if you're using a 64-bit machine, you want this to be at least 1e-10 times the largest amount you'll be working with. For example, if the biggest input you'll be working with is around 100, it's reasonable to use a tolerance of 1e-8.
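In code, with a TOLERANCE chosen along those lines:

TOLERANCE = 1e-8            # reasonable if the inputs are on the order of 100
a = 0.1 + 0.2
b = 0.3
print(a == b)                   # False: exact comparison fails
print(abs(a - b) < TOLERANCE)   # True: numerically tolerant comparison passes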
You can round your values to 15 digits:
a = a.round(15)
Now the array a should show you 0.0 values.
Example:
>>> a = np.array([2.77555756156e-17])
>>> a.round(15)
array([ 0.])
This is most likely the result of floating point arithmetic errors. For instance:
In [3]: 0.1 + 0.2 - 0.3
Out[3]: 5.551115123125783e-17
Not what you would expect? Numpy has a built-in isclose() method that can deal with these things. Also, you can see the machine precision with
eps = np.finfo(np.float).eps
So, perhaps something like this could work too:
a = np.array([[-1e-17, 1.0], [1e-16, 1.0]])
a[np.abs(a) <= eps] = 0.0
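For example:

import numpy as np

print(0.1 + 0.2 - 0.3 == 0.0)            # False
print(np.isclose(0.1 + 0.2 - 0.3, 0.0))  # True, within the default tolerances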

Normalization using Numpy vs hard coded

import numpy as np
import math
def normalize(array):
    mean = sum(array) / len(array)
    deviation = [(float(element) - mean)**2 for element in array]
    std = math.sqrt(sum(deviation) / len(array))
    normalized = [(float(element) - mean)/std for element in array]
    numpy_normalized = (array - np.mean(array)) / np.std(array)
    print normalized
    print numpy_normalized
    print ""

normalize([2, 4, 4, 4, 5, 5, 7, 9])
normalize([1, 2])
normalize(range(5))
Outputs:
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
[-1.5 -0.5 -0.5 -0.5 0. 0. 1. 2. ]
[0.0, 1.414213562373095]
[-1. 1.]
[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
[-1.41421356 -0.70710678 0. 0.70710678 1.41421356]
Can someone explain to me why this code behaves differently in the second example, but similarly in the other two examples?
Did I do anything wrong in the hard coded example? What does NumPy do to end up with [-1, 1]?
As seaotternerd explains, you're using integers. And in Python 2 (unless you from __future__ import division), dividing an integer by an integer gives you an integer.
So, why aren't all three wrong? Well, look at the values. In the first one, the sum is 40 and the len is 8, and 40 / 8 = 5. And in the third one, 10 / 5 = 2. But in the second one, 3 / 2 = 1.5. Which is why only that one gets the wrong answer when you do integer division.
So, why doesn't NumPy also get the second one wrong? NumPy doesn't treat an array of integers as floats, it treats them as integers—print np.array(array).dtype and you'll see int64. However, as the docs for np.mean explain, "float64 intermediate and return values are used for integer inputs". And, although I don't know this for sure, I'd guess they designed it that way specifically to avoid problems like this.
As a side note, if you're interested in taking the mean of floats, there are other problems with just using sum / div. For example, the mean of [1, 2, 1e200, -1e200] really ought to be 0.75, but if you just do sum / div, you're going to get 0. (Why? Well, 1 + 2 + 1e200 == 1e200.) You may want to look at a simple stats library, even if you're not using NumPy, to avoid all these problems. In Python 3 (which would have avoided your problem in the first place), there's one in the stdlib, called statistics; in Python 2, you'll have to go to PyPI.
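A quick check of that behaviour (the dtype shown assumes a typical 64-bit platform):

import numpy as np

arr = np.array([1, 2])
print(arr.dtype)    # int64 on most 64-bit platforms
print(arr.mean())   # 1.5 -- np.mean promotes integer input to float64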
You aren't converting the numbers in the array to floats when calculating the mean. This isn't a problem for your first or third inputs, because they happen to work out neatly (as explained by abarnert), but since the second input does not, and is composed exclusively of ints, you end up calculating the mean as 1 when it should be 1.5. This propagates through, resulting in your discrepancy with the results of using NumPy's functions.
If you replace the line where you calculate the mean with this, which forces Python to use float division:
mean = sum(array) / float(len(array))
you will ultimately get [-1, 1] as a result for the second set of inputs, just like NumPy.
