Is this documented anywhere? Why such a drastic difference?
# Python 3.2
# numpy 1.6.2 using Intel's Math Kernel Library
>>> import numpy as np
>>> x = np.float64(-0.2)
>>> x ** 0.8
__main__:1: RuntimeWarning: invalid value encountered in double_scalars
nan
>>> x = -0.2  # note: np.float is the same as the built-in float
>>> x ** 0.8
(-0.2232449487530631+0.16219694943147778j)
This is especially confusing since according to this, np.float64 and built-in float are identical except for __repr__.
I can see how the warning from numpy may be useful in some cases (especially since it can be disabled or enabled via np.seterr); but the problem is that the return value is nan rather than the complex value produced by the built-in float. This breaks code when you start using numpy for some of the calculations and don't explicitly convert its return values to built-in float.
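For what it's worth, np.float64 actually subclasses the built-in float, so mixing the two is easy to do by accident; a minimal sketch (Python 3):
import numpy as np

print(issubclass(np.float64, float))  # True: np.float64 is a float subclass

x = np.float64(-0.2)
print(float(x) ** 0.8)  # converting back to plain float restores the complex result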
numpy.float may or may not be float, but complex numbers are not float at all:
In [1]: type((-0.2)**0.8)
Out[1]: builtins.complex
So there's no float result of the operation, hence nan.
If you don't want to do an explicit conversion to float (which is recommended), do the numpy calculation in complex numbers:
In [3]: np.complex(-0.2)**0.8
Out[3]: (-0.2232449487530631+0.16219694943147778j)
The behaviour of returning a complex number from a float operation is certainly not usual, and was only introduced with Python 3 (like the float division of integers with the / operator). In Python 2.7 you get the following:
In [1]: (-0.2)**0.8
ValueError: negative number cannot be raised to a fractional power
On a scalar, if instead of np.float64 you use np.float, you'll get the same float type as Python uses. (And you'll either get the above error in 2.7 or the complex number in 3.x.)
For arrays, all the numpy operators return the same type of array, and most ufuncs do not support casting from float to complex (check np.<ufunc>.types).
If what you want is a consistent operation on scalars, use np.float.
If you are interested in array operations, you'll have to cast the array to complex: x = x.astype('complex')
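For example, a minimal sketch of the array case (assuming a float64 input array):
import numpy as np

x = np.array([-0.2, 0.5])

# Elementwise power on a float array: negative bases give nan
# (with a RuntimeWarning), matching the scalar np.float64 behaviour.
print(x ** 0.8)                    # [  nan  0.574...]

# Casting to complex first yields the complex result instead.
print(x.astype('complex') ** 0.8)  # [-0.223+0.162j  0.574+0.j]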
I am using numba to calculate MSE. The inputs are images which are read as numpy arrays of uint8. Each element is 0-255.
When calculating the squared difference between two images, the python function returns (expectedly) a uint8 result, but the same function when using numba returns int64.
@numba.jit(nopython=True)
def test1(var_a: np.ndarray, var_b: np.ndarray) -> float:
    return var_a - var_b

@numba.jit(nopython=True)
def test2(var_a: np.ndarray, var_b: np.ndarray) -> float:
    return (var_a - var_b) ** 2

def test3(var_a: np.ndarray, var_b: np.ndarray) -> float:
    return (var_a - var_b) ** 2
a = np.array([2, 2]).astype(np.uint8).reshape(2, 1)
b = np.array([255, 255]).astype(np.uint8).reshape(2, 1)
test1(a, b)  # output: array([[3], [3]], dtype=uint8)
test2(a, b)  # output: array([[64009], [64009]], dtype=int64)
test3(a, b)  # output: array([[9], [9]], dtype=uint8)
What's unclear to me is why the python-only code preserves the data-type while the numba code adjusts the returned type to int64.
For my purpose, the numba result is ideal, but I don't understand why. I'm trying to avoid needing to .astype(int) all of my images, since this will eat a lot of RAM, when I'm only interested that the result of the subtraction be int (i.e., not unsigned).
So, why does numba "fix" the datatype in test2()?
Numba is a JIT compiler that first uses static type inference to deduce the types of the variables, and then compiles the function before it can be called. This means all literals, such as integers, are typed before anything runs. Numba chooses to type integer literals as int64 on 64-bit machines (and int32 on 32-bit machines) so as to avoid overflows.

This means var_a - var_b on its own is evaluated as an array of uint8, as test1 shows, but (var_a - var_b) ** 2 is like var_tmp ** np.int64(2) where var_tmp is of type uint8[:]. In this case, the Numba type inference system needs to do a type promotion, as in any statically typed language (e.g., C/C++). Like most languages, Numba chooses a relatively safe promotion and casts the array to int64, because int64 includes all the possible values of uint8. Since the whole expression is then computed in int64, the subtraction in test2 is carried out in int64 as well, which is why you get 64009 (that is, (-253) ** 2) rather than 9. In practice, the type promotion can be quite unsafe in pathological cases: for example, when you mix uint64 values with int64 ones, the result can be a float64 with a large range but more limited precision, and no warning is raised. If you use (var_a - var_b) ** np.uint8(2), then the output type is the one you expect (i.e., uint8) because there is no type promotion.
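As a check, here is a minimal sketch of that fix (assuming numba is installed; the shapes match the question's example):
import numba
import numpy as np

@numba.jit(nopython=True)
def squared_diff(var_a, var_b):
    # A uint8-typed exponent avoids the int64 promotion described above,
    # so the whole expression stays uint8 (with wraparound, as in test1/test3).
    return (var_a - var_b) ** np.uint8(2)

a = np.array([[2], [2]], dtype=np.uint8)
b = np.array([[255], [255]], dtype=np.uint8)
print(squared_diff(a, b))  # [[9] [9]], dtype uint8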
Numpy, by contrast, uses dynamic type inference. Moreover, integers have a variable length in Python, so their type has to be determined by Numpy at runtime (not by CPython, which only defines the generic variable-sized int type). Numpy can thus adapt the type of integer literals based on their runtime value. For example, (np.uint8(7) * 1_000_000).dtype is int32 on my machine, while (np.uint8(7) * 100_000_000_000).dtype is int64 (because the right-most integer literal is too big for a 32-bit integer). This is something Numba cannot do because of JIT compilation [1]. Thus, the semantics differ slightly between Numba and Numpy, though the type promotion rules should be the same (so that Numba results stay as close to Numpy's as possible).
A good practice is to explicitly type arrays so as to avoid sneaky overflows, in both Numpy and Numba. Casting integers to specific types is generally not needed, but it is also good practice when the types are small and performance matters (e.g., integer arithmetic with intended overflow, as in hash computations).
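Along those lines, a minimal sketch of the explicit-typing approach for the MSE use case (int16 is the smallest signed type that holds every difference of two uint8 values, which keeps the memory overhead low):
import numpy as np

a = np.array([2, 2], dtype=np.uint8)
b = np.array([255, 255], dtype=np.uint8)

# Widen one operand before subtracting; numpy promotes the other, so the
# subtraction is signed and cannot wrap around.
diff = a.astype(np.int16) - b    # int16 holds any difference of two uint8s
sq = diff.astype(np.int32) ** 2  # widen again: 253**2 would overflow int16
print(sq)                        # [64009 64009]
print(sq.mean())                 # 64009.0 -- the MSE, with no uint8 wraparound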
Note you can call your_function.inspect_types() to get additional information about the type inference (though the output is not easy to read).
[1] In fact, Numba could type integer literals based on their value, but not variables. The problem is that it would be very surprising for users to get different output types (and different behaviour due to possible overflows) when they change a literal into a runtime variable.
Inserting a nan in Python into a complex numpy array gives some (for me) unexpected behavior:
a = np.array([5 + 6 * 1j])
print(a)
# [5.+6.j]
a[0] = np.nan
print(a)
# [nan+0.j]
I expected Python to write nan+nanj. For analyses it often might not matter, since np.isnan is True for a complex number with nan in either its real or imaginary part. However, I did not know about this behavior, and when plotting the real and imaginary parts of my array it gave me the impression I had information in the imaginary part (when there is none). A workaround is to write a[0] = np.nan + np.nan * 1j. Can somebody explain the reason for this behavior to me?
The issue here is that when you create an array with complex values:
a = np.array([5+6*1j])
You've created an array of dtype complex:
a.dtype
# dtype('complex128')
So by assigning a value which only has a real part, it will be converted to a complex value, and you will thus be inserting a number with an imaginary component equal to 0j, so:
np.complex(np.nan)
# (nan+0j)
Which explains the behaviour:
a[0] = np.array([np.nan])
print(a)
# [nan+0.j]
It probably has to do with numpy's representation of nan:
NumPy uses the IEEE Standard for Binary Floating-Point for Arithmetic (IEEE 754). This means that Not a Number is not equivalent to infinity.
Essentially, np.nan is a float. By setting a[0] = np.nan you are assigning a "real" float (without changing the dtype of the array, which remains complex), so the imaginary part remains untouched as 0j.
That also explains why you can change the imaginary part by doing np.nan * 0j.
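Putting the pieces together, a minimal sketch of the behaviour and the workaround:
import numpy as np

a = np.array([5 + 6j])

a[0] = np.nan                   # a real float: only the real part becomes nan
print(a)                        # [nan+0.j]

a[0] = complex(np.nan, np.nan)  # set both parts explicitly
print(a)                        # [nan+nanj]

print(np.isnan(a[0]))           # True in both cases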
I want to do binary operations (like add and multiply) between np.float32 and builtin Python int and float and get a np.float32 as the return type. However, it gets automatically up-casted to a np.float64.
Example code:
>>> a = np.float32(5)
>>> a.dtype
dtype('float32')
>>> b = a + 2
>>> b.dtype
dtype('float64')
If I do this with a np.float128, b also becomes a np.float128. This is good, as it thereby preserves precision. However, no up-casting to np.float64 is necessary to preserve precision in my example, but it still occurs. Had I added 2.0 (a 64-bit Python float) to a instead of 2, the casting would make sense. But even then, I do not want it.
So my question is: How can I alter the casting done when applying a binary operator to a np.float32 and a builtin Python int/float? Alternatively, making single precision the standard in all calculations rather than double, would also count as a solution, as I do not ever need double precision. Other people have asked for this, and it seems that no solution has been found.
I know about numpy arrays and their dtypes. Here I get the wanted behavior, as an array always preserves its dtype. It is, however, when I do an operation on a single element of an array that I get the unwanted behavior.
I have a vague idea to a solution, involving subclassing np.ndarray (or np.float32) and changing the value of __array_priority__. So far I have not been able to get it working.
Why do I care? I am trying to write an n-body code using Numba. This is why I cannot simply do operations on the array as a whole. Changing all np.float64 to np.float32 gives a speed-up of about a factor of 2, which is important. The np.float64-casting behavior ruins this speed-up completely, as all operations on my np.float32 array are done in 64-bit precision and then downcast to 32-bit precision.
I'm not sure about the NumPy behavior, or how exactly you're trying to use Numba, but being explicit about the Numba types might help. For example, if you do something like this:
@jit
def foo(a):
    return a[0] + 2

a = np.array([3.3], dtype='f4')
foo(a)
The float32 value in a[0] is promoted to a float64 before the add operation (if you don't mind diving into LLVM IR, you can see this for yourself by running the code with the numba command and the --dump-llvm or --dump-optimized flag: numba --dump-optimized numba_test.py). However, by specifying the function signature, including the return type, as float32:
@jit('f4(f4[:])')
def foo(a):
    return a[0] + 2
The value in a[0] is not promoted to float64, although the result is cast to a float64 so it can be converted to a Python float object when the function returns to Python land.
If you can allocate an array beforehand to hold the result, you can do something like this:
@jit
def foo():
    a = np.arange(1000000, dtype='f4')
    result = np.zeros(1000000, dtype='f4')
    for i in range(a.size):
        result[i] = a[i] + 2
    return result
Even though you're doing the looping yourself, the performance of the compiled code should be comparable to a NumPy ufunc, and no casts to float64 should occur (Again, this can be verified by looking at the llvm IR that Numba generates).
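For instance, here is a minimal sketch combining both ideas, the explicit float32 signature and a preallocated float32 output (the signature string is standard numba syntax, but treat the details as an assumption about your numba version):
import numpy as np
from numba import jit

@jit('f4[:](f4[:])', nopython=True)
def add_two(a):
    out = np.empty_like(a)             # same dtype as a, i.e. float32
    for i in range(a.size):
        out[i] = a[i] + np.float32(2)  # keep the literal in float32 too
    return out

x = np.arange(5, dtype='f4')
print(add_two(x).dtype)  # float32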
Why do numpy.sum and numpy.prod return int32 when the input is a list of ints, and int64 if it's a generator for the same list? What's the best way to coerce them to use int64 when operating on a list?
E.g.
sum([x for x in range(800000)]) == -2122947200
sum((x for x in range(800000))) == 319999600000L
Python 2.7
You are likely using numpy.sum instead of the built-in sum, a side effect of from numpy import *. Be advised not to do so, as it will cause no end of confusion. Instead, use something like import numpy as np, and refer to the numpy namespace with the short np prefix.
To answer your question, numpy.sum makes the accumulator type the same as the type of the array. On a 32-bit system, numpy coerces the list into an int32 array, which causes numpy.sum to use a 32-bit accumulator. When given a generator expression, numpy.sum falls back to calling sum, which promotes integers to longs. To force the use of a 64-bit accumulator for array/list input, use the dtype parameter:
>>> np.sum([x for x in range(800000)], dtype=np.int64)
319999600000
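For comparison, a minimal sketch (Python 3) of the built-in sum next to a platform-independent np.sum:
import numpy as np

data = list(range(800000))

# The built-in sum uses arbitrary-precision Python ints and never overflows.
print(sum(data))                     # 319999600000

# np.sum accumulates in the array's dtype; forcing int64 makes the result
# independent of the platform's default integer width.
print(np.sum(data, dtype=np.int64))  # 319999600000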
I have an numpy array of floats in Python.
When I print the array, the first value is:
[7.14519700e+04, ....
If, however, I print out just the first value on its own, the print out reads:
71451.9699799
Obviously these numbers should be identical, so I just wondered, is the array just showing me a rounded version of the element? The second number here has 12 significant figures, and the first only has 9.
I guess I just wonder why these numbers are different?
It's just in the printing, not in the storage. The only confusion might occur because the first example uses numpy's print precision settings, while the second uses Python's general print settings.
You can adjust numpy's print precision with
numpy.set_printoptions(precision=20)
print myarray
(adjust precision to your needs), or select the number of significant figures with a standard Python formatted print:
print ('%.20f' % myarray[0])
The internal representation of the number is always the same.
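A minimal sketch (Python 3) showing that only the display changes, not the stored value:
import numpy as np

a = np.array([71451.9699799])
print(a[0] == 71451.9699799)  # True: the stored value is unchanged

np.set_printoptions(precision=4)
print(a)                      # something like [71451.97] -- display only

np.set_printoptions(precision=12)
print(a)                      # [71451.9699799]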
The types in a numpy array are well defined. You can get how they are stored by inspecting the numpy.dtype property of an array.
For example:
import numpy
a = numpy.zeros(10)
print a.dtype
will show float64, that is a 64-bit floating point number.
You can specify the type of the array explicitly using the commonly accepted dtype argument, or convert an existing array to another type afterwards:
a = numpy.zeros(10, dtype='float32')  # an array of 32-bit floats
b = numpy.longdouble(a)               # create a long-double array from a
Regarding the printing, this is just a formatting issue. You can twiddle how numpy prints an array using numpy.set_printoptions:
>>> a = numpy.random.randn(4) # for interest, randn annoyingly doesn't support the dtype arg
>>> print a
[ 0.12584756 0.73540009 -0.17108244 -0.96818512]
>>> numpy.set_printoptions(precision=3)
>>> print a
[ 0.126 0.735 -0.171 -0.968]