Why does the numpy.sum and numpy.prod function return int32 when the input is a list of int, and int64 if it's a generator for the same list? What's the best way to coerce them to use int64 when operating on a list?
E.g.
sum([x for x in range(800000)]) == -2122947200
sum((x for x in range(800000))) == 319999600000L
Python 2.7
You are likely using numpy.sum instead of the built-in sum, a side effect of from numy import *. Be advised not to do so, as it will cause no end of confusion. Instead, use something like import numpy as np, and refer to the numpy namespace with the short np prefix.
To answer your question, numpy.sum makes the accumulator type the same as the type of the array. On a 32-bit system, numpy coerces the list into an int32 array, which causes numpy.sum to use a 32-bit accumulator. When given a generator expression, numpy.sum falls back to calling sum, which promotes integers to longs. To force the use of a 64-bit accumulator for array/list input, use the dtype parameter:
>>> np.sum([x for x in range(800000)], dtype=np.int64)
319999600000
Related
Suppose I have my array like this:
from decimal import Decimal
array = [Decimal(np.nan), Decimal(np.nan), Decimal(0.231411)]
I know that if the types are float, I can check if all the values are nan or not
, as:
np.isnan(array).all()
Is there a way for type Decimal?
The solution would be better without iteration.
You could use NumPy's vectorize to avoid iteration.
In [40]: from decimal import Decimal
In [41]: import numpy as np
In [42]: nums = [Decimal(np.nan), Decimal(np.nan), Decimal(0.231411)]
In [43]: nums
Out[43]:
[Decimal('NaN'),
Decimal('NaN'),
Decimal('0.2314110000000000055830895462349872104823589324951171875')]
In [44]: np.all(np.vectorize(lambda x: x.is_nan())(np.asarray(nums)))
Out[44]: False
In [45]: np.all(np.vectorize(lambda x: x.is_nan())(np.asarray(nums[:-1])))
Out[45]: True
In the snippet above nums is a list of instances of class Decimal. Notice that you need to convert that list into a NumPy array.
From my comment above, I realise it’s an iteration. The reason is that np.isnan does not support Decimal as an input type; therefore, I don’t believe this can be done via broadcasting, without converting the datatype - which means a potential precision loss, which is a reason to use a Decimal type.
Additionally, as commented by #juanpa.arrivillaga, as the Decimal objects are in a list, iteration is the only way. Numpy is not necessary in this operation.
One method is:
all([i.is_nan() for i in array])
I have a dataset on which I'm trying to apply some arithmetical method.
The thing is it gives me relatively large numbers, and when I do it with numpy, they're stocked as 0.
The weird thing is, when I compute the numbers appart, they have an int value, they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]:36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have off course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]:True
How can I compute this large numbers using numpy without seeing them turned into zeros?
Note that if we remove the factor 10 at the beggining the problem slitghly changes (but I think it might be a similar reason):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]:311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate the largest numbers of the int64 type in python as been crossed don't they?
As Nils Werner already mentioned, numpy's native ctypes cannot save numbers that large, but python itself can since the int objects use an arbitrary length implementation.
So what you can do is tell numpy not to convert the numbers to ctypes but use the python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.
I want to do binary operations (like add and multiply) between np.float32 and builtin Python int and float and get a np.float32 as the return type. However, it gets automatically up-casted to a np.float64.
Example code:
>>> a = np.float32(5)
>>> a.dtype
dtype('float32')
>>> b = a + 2
>>> b.dtype
dtype('float64')
If I do this with a np.float128, b also becomes a np.float128. This is good, as it thereby preserves precision. However, no up-casting to np.float64 is necessary to preserve precision in my example, but it still occurs. Had I added 2.0 (a Python float (64 bit)) to a instead of 2, the casting would make sense. But even here, I do not want it.
So my question is: How can I alter the casting done when applying a binary operator to a np.float32 and a builtin Python int/float? Alternatively, making single precision the standard in all calculations rather than double, would also count as a solution, as I do not ever need double precision. Other people have asked for this, and it seems that no solution has been found.
I know about numpy arrays and there dtypes. Here I get the wanted behavior, as an array always preserves its dtype. It is however when I do an operation on a single element of an array that I get the unwanted behavior.
I have a vague idea to a solution, involving subclassing np.ndarray (or np.float32) and changing the value of __array_priority__. So far I have not been able to get it working.
Why do I care? I am trying to write an n-body code using Numba. This is why I cannot simply do operations on the array as a whole. Changing all np.float64 to np.float32 makes for a speed up of about a factor of 2, which is important. The np.float64-casting behavior serves to ruin this speed up completely, as all operations on my np.float32 array are done in 64-precision and then downcasted to 32-precision.
I'm not sure about the NumPy behavior, or how exactly you're trying to use Numba, but being explicit about the Numba types might help. For example, if you do something like this:
#jit
def foo(a):
return a[0] + 2;
a = np.array([3.3], dtype='f4')
foo(a)
The float32 value in a[0] is promoted to a float64 before the add operation (if you don't mind diving into llvm IR, you can see this for yourself by running the code using the numba command and using the --dump-llvm or --dump-optimized flag: numba --dump-optimized numba_test.py). However, by specifying the function signature, including the return type as float32:
#jit('f4(f4[:]'))
def foo(a):
return a[0] + 2;
The value in a[0] is not promoted to float64, although the result is cast to a float64 so it can be converted to a Python float object when the function returns to Python land.
If you can allocate an array beforehand to hold the result, you can do something like this:
#jit
def foo():
a = np.arange(1000000, dtype='f4')
result = np.zeros(1000000, dtype='f4')
for i in range(a.size):
result[0] = a[0] + 2
Even though you're doing the looping yourself, the performance of the compiled code should be comparable to a NumPy ufunc, and no casts to float64 should occur (Again, this can be verified by looking at the llvm IR that Numba generates).
Let's consider a list of large integers, for example one given by:
def primesfrom2to(n):
# http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
""" Input n>=6, Returns a array of primes, 2 <= p < n """
sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
sieve[0] = False
for i in xrange(int(n**0.5)/3+1):
if sieve[i]:
k=3*i+1|1
sieve[ ((k*k)/3) ::2*k] = False
sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
return np.r_[2,3,((3*np.nonzero(sieve)[0]+1)|1)]
primesfrom2to(2000000)
I want to calculate the sum of that, and the expected result is 142913828922.
But if I do:
sum(primesfrom2to(2000000))
I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let's me explain.Consider this testing code:
a=primesfrom2to(2000000)
b=[float(i) for i in a]
c=[long(i) for i in a]
sumI=0
sumF=0
sumL=0
m=0
for i,j,k in zip(a,b,c):
m=m+1
sumI=sumI+i
sumF=sumF+j
sumL=sumL+k
print sumI,sumF,sumL
if sumI<0:
print i,m
break
I found out that the first integer overflow is happening at a[i=20444]=225289
If I do:
>>> sum(a[:20043])+225289
-2147310677
But if I do:
>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L
What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?
Look at the types of your results. You are summing a numpy array, which is using numpy datatypes, which can overflow. When you do sum(a[:20043]), you get a numpy object back (some sort of int32 or the like), which overflows when added to another number. When you manually type in the same number, you're creating a Python builtin int, which can auto-promote to long. Numpy arrays cannot autopromote like Python builtin types, because the array type (and its memory layout) have to be fixed when the array is created. This makes operations much faster at the expense of type flexibility.
You may be able to get around the problem by using a different datatype (like np.int64) instead of np.bool. However, it depends how big your numbers are. A simple example:
# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L
# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
Your example works correctly for me on 64-bit Python, so I guess you're using 32-bit Python. If you can use 64-bit types you will be able to get past the limit you found, but as my example shows you will eventually overflow 64-bit ints too if your numbers get super huge.
Is this documented anywhere? Why such a drastic difference?
# Python 3.2
# numpy 1.6.2 using Intel's Math Kernel Library
>>> import numpy as np
>>> x = np.float64(-0.2)
>>> x ** 0.8
__main__:1: RuntimeWarning: invalid value encountered in double_scalars
nan
>>> x = -0.2 # note: `np.float` is same built-in `float`
>>> x ** 0.8
(-0.2232449487530631+0.16219694943147778j)
This is especially confusing since according to this, np.float64 and built-in float are identical except for __repr__.
I can see how the warning from np may be useful in some cases (especially since it can be disabled or enabled in np.seterr); but the problem is that the return value is nan rather than the complex value provided by the built-in. Therefore, this breaks code when you start using numpy for some of the calculations, and don't convert its return values to built-in float explicitly.
numpy.float may or may not be float, but complex numbers are not float at all:
In [1]: type((-0.2)**0.8)
Out[1]: builtins.complex
So there's no float result of the operation, hence nan.
If you don't want to do an explicit conversion to float (which is recommended), do the numpy calculation in complex numbers:
In [3]: np.complex(-0.2)**0.8
Out[3]: (-0.2232449487530631+0.16219694943147778j)
The behaviour of returning a complex number from a float operation is certainly not usual, and was only introduced with Python 3 (like the float division of integers with the / operator). In Python 2.7 you get the following:
In [1]: (-0.2)**0.8
ValueError: negative number cannot be raised to a fractional power
On a scalar, if instead of np.float64 you use np.float, you'll get the same float type as Python uses. (And you'll either get the above error in 2.7 or the complex number in 3.x.)
For arrays, all the numpy operators return the same type of array, and most ufuncs do not support casting from float > complex (e.g., check np.<ufunc>.type).
If what you want is a consistent operation on scalars, use np.float
If you are interested in array operations, you'll have to cast the array as complex: x = x.astype('complex')