In numpy, why does subtraction of integers sometimes produce floating point numbers?
>>> x = np.int64(2) - np.uint64(1)
>>> x
1.0
>>> x.dtype
dtype('float64')
This seems to only occur when using multiple different integer types (e.g. signed and unsigned), and when no larger integer type is available.
This is a conscious design decision by the numpy authors. When deciding on the resulting type, only the types of the operands are considered, not their actual values. And for the operation you perform, there is a risk of having a result outside the valid range, e.g. if you subtract a very large uint64 number, the result would not fit in an int64. The safe selection is thus to convert to float64, which certainly will fit the result (possibly with reduced precision, though).
Compare with x = np.int32(2) - np.uint32(1). The result of mixing int32 and uint32 can always be safely represented as an int64, so that type is chosen. The same holds for x = np.int64(2) - np.uint32(1), which also yields an int64.
The alternative would be to follow e.g. the C rules, which would cast everything to uint64. But that could, of course, lead to very strange results due to over/underflow.
If you want to know ahead of time what type you will end up with, look into np.result_type(), np.can_cast(), and np.promote_types(). Reading about this in the docs might also help you understand the issue a bit better.
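For instance, a quick way to check the promotion rules without performing the operation (outputs shown are what a typical NumPy installation gives):
import numpy as np

# Ask NumPy what dtype a mixed-type operation would produce
print(np.result_type(np.int64, np.uint64))    # float64 -- no integer type holds both ranges
print(np.result_type(np.int32, np.uint32))    # int64   -- a larger integer type exists
print(np.promote_types(np.int64, np.uint64))  # float64
print(np.can_cast(np.uint64, np.int64))       # False -- uint64 values may not all fit in int64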
I'm no expert on numpy; however, I suspect that since float64 is the smallest data type that can cover both the domain of int64 and that of uint64, the subtraction converts both operands to float64 so that the operation always succeeds (possibly with reduced precision).
For example, with int8 and uint8: a result like 0 - 255 = -255 cannot fit in an int8, whose range is only -128 to 127. Similarly, we can't use a uint8, since we obviously need the sign in this case. For 8-bit operands a larger integer type (int16) exists to hold the result, but for int64/uint64 there is none, so we settle on a float/double as it can fit both directions fine (if not always exactly).
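A small illustration of that difference, assuming a typical NumPy installation:
import numpy as np

# With 8-bit operands a wider integer type (int16) exists, so the result stays integral
x = np.int8(0) - np.uint8(255)
print(x, x.dtype)   # -255 int16

# With 64-bit operands there is no wider integer type, so NumPy falls back to float64
y = np.int64(2) - np.uint64(1)
print(y, y.dtype)   # 1.0 float64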
I am using numba to calculate MSE. The inputs are images which are read as numpy arrays of uint8. Each element is 0-255.
When calculating the squared difference between two images, the Python function returns (expectedly) a uint8 result, but the same function using numba returns int64.
import numpy as np
import numba

@numba.jit(nopython=True)
def test1(var_a: np.ndarray, var_b: np.ndarray) -> float:
    return var_a - var_b

@numba.jit(nopython=True)
def test2(var_a: np.ndarray, var_b: np.ndarray) -> float:
    return (var_a - var_b) ** 2

def test3(var_a: np.ndarray, var_b: np.ndarray) -> float:
    return (var_a - var_b) ** 2

a = np.array([2, 2]).astype(np.uint8).reshape(2, 1)
b = np.array([255, 255]).astype(np.uint8).reshape(2, 1)

test1(a, b)  # output: array([[3, 3]], dtype=uint8)
test2(a, b)  # output: array([[64009, 64009]], dtype=int64)
test3(a, b)  # output: array([[9, 9]], dtype=uint8)
What's unclear to me is why the python-only code preserves the data-type while the numba-code adjusts the returned type to int64?
For my purpose, the numba result is ideal, but I don't understand why. I'm trying to avoid needing to .astype(int) all of my images, since this will eat a lot of RAM, when I'm only interested that the result of the subtraction be int (i.e., not unsigned).
So, why does numba "fix" the datatype in test2()?
Numba is a JIT compiler that first uses static type inference to deduce the types of the variables and then compiles the function before it can be called. This means all literals, like integers, are typed before anything runs. Numba chooses to set the type of integer literals to int64 so as to avoid overflows on 64-bit machines (and int32 on 32-bit machines). This means var_a - var_b is evaluated as an array of uint8 as expected, and (var_a - var_b) ** 2 is like var_tmp ** np.int64(2) where var_tmp is of type uint8[:]. In this case, the Numba type inference system needs to do a type promotion, like in any statically typed language (e.g. C/C++). Like most languages, Numba chooses to do a relatively safe type promotion by casting the array to int64, because int64 includes all the possible values of uint8. In practice, the type promotion can be quite unsafe in pathological cases: for example, when you mix uint64 values with int64 ones, the result can be a float64 with a large range but more limited precision, and no warning is raised. If you use (var_a - var_b) ** np.uint8(2), then the output type is the one you expect (i.e. uint8) because there is no type promotion.
Numpy uses dynamic type inference. Moreover, integers have a variable length in Python, so their type has to be determined by Numpy at runtime (not by CPython, which only defines the generic variable-sized int type). Numpy can thus adapt the type of integer literals based on their runtime value. For example, (np.uint8(7) * 1_000_000).dtype is int32 on my machine, while (np.uint8(7) * 100_000_000_000).dtype is int64 (because the type of the right-most integer literal is set to int64, since it is too big for a 32-bit integer). This is something Numba cannot do because of JIT compilation [1]. Thus, the semantics are a bit different between Numba and Numpy. The type promotion rules should be the same, though (so as to get results as close to Numpy as possible with Numba).
A good practice is to explicitly type arrays so as to avoid sneaky overflows, in both Numpy and Numba. Casting integers to specific types is generally not needed, but it is also a good practice when the types are small and performance matters (e.g. integer arithmetic with intended overflow, like for hash computations).
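For example, a minimal sketch in plain NumPy (the arrays here just mirror the ones from the question), assuming the goal is a signed squared difference without converting whole images to 64-bit:
import numpy as np

a = np.array([[2], [2]], dtype=np.uint8)
b = np.array([[255], [255]], dtype=np.uint8)

# Cast one operand to a small signed type so the subtraction happens in int16,
# which is wide enough for any difference of two uint8 values (-255..255)
diff = a.astype(np.int16) - b
sq = diff.astype(np.int32) ** 2    # squares fit in int32 (max is 255**2 = 65025)
print(sq.dtype, sq.ravel())        # int32 [64009 64009]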
Note you can call your_function.inspect_types() so as to get additional information about the type inference (though the output is not easy to read).
[1] In fact, Numba could type integer literals based on their value, but not variables. The thing is, it would be very unexpected for users to get different output types (and different behaviour due to possible overflows) when they change literals into runtime variables.
I'm getting surprising behavior trying to convert a microsecond-precision date string to an integer:
n = 20181231235959383171
int_ = np.int(n) # Works
int64_ = np.int64(n) # "OverflowError: int too big to convert"
Any idea why?
Edit - Thank you all, this is informative, however please see my actual problem:
Dataframe column won't convert from integer string to an actual integer
An np.int can be arbitrarily large, like a python integer.
An np.int64 can only range from -2^63 to 2^63 - 1. Your number happens to fall outside this range.
When used as dtype, np.int is equivalent to np.int_ (architecture-dependent size), which is probably np.int64. So np.array([n], dtype=np.int) will fail. Outside dtype, np.int behaves as Python int. Numpy is basically helping you calculate as much stuff in C-land as possible in order to speed up the calculations and conserve memory; but (AFAIK) integers larger than 64 bits do not exist in standard C (though the new GCC does support them on some architectures). So you are stuck using either Python integers, slow but of unlimited size, or C integers, fast but not big enough for this.
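A quick check confirms the value is out of range for a 64-bit signed integer (a small sketch, nothing specific to the original data):
import numpy as np

n = 20181231235959383171
print(np.iinfo(np.int64).max)      # 9223372036854775807
print(n > np.iinfo(np.int64).max)  # True -- hence the OverflowError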
There are two obvious ways to stuff a large integer into a numpy array:
You can use the Python type, signified by dtype=object: np.array([n], dtype=object) will work, but you are getting no speedup or memory benefits from numpy.
You can split the microsecond time into second time (n // 1000000) and second fractions (n % 1000000), as two separate columns.
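A minimal sketch of both options (the divmod split and the variable names are just illustrative):
import numpy as np

n = 20181231235959383171

# Option 1: keep Python integers in an object array -- works, but without
# NumPy's usual speed and memory benefits
as_object = np.array([n], dtype=object)

# Option 2: split into whole seconds and microsecond fraction, each of which fits in int64
seconds, micros = divmod(n, 1_000_000)
sec_col = np.array([seconds], dtype=np.int64)
usec_col = np.array([micros], dtype=np.int64)
print(sec_col, usec_col)   # [20181231235959] [383171]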
Inserting a Python nan into a complex numpy array gives some (to me) unexpected behavior:
import numpy as np

a = np.array([5 + 6*1j])
print(a)
# [5.+6.j]
a[0] = np.nan
print(a)
# [nan+0.j]
I expected Python to write nan+nanj. For analyses it often might not matter, since np.isnan is True for any complex number with nan in the real and/or imaginary part. However, I did not know this behavior, and when plotting the real and imaginary parts of my array it gave me the impression I had information in the imaginary part (when in fact there is none). A workaround is to write a[0] = np.nan + np.nan*1j. Can somebody explain the reason for this behavior to me?
The issue here is that when you create an array with complex values:
a = np.array([5+6*1j])
You've created an array of dtype complex:
a.dtype
# dtype('complex128')
So by assigning a value which only contains a real part, it will be converted to a complex value, and you will thus be inserting a number with an imaginary component equal to 0j, so:
np.complex(np.nan)
# (nan+0j)
Which explains the behaviour:
a[0] = np.array([np.nan])
print(a)
# [nan+0.j]
It probably has to do with numpy's representation of nan:
NumPy uses the IEEE Standard for Binary Floating-Point for Arithmetic
(IEEE 754). This means that Not a Number is not equivalent to
infinity.
Essentially np.nan is a float. By setting a[0] = np.nan you are setting its value to a "real" float (but not changing the dtype of the array, which remains complex), so the imaginary part remains untouched as 0j.
That also explains why you can change the imaginary part by assigning np.nan * 0j: the multiplication produces a complex value, whose imaginary part is nan as well.
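A quick demonstration of the two assignments (a minimal sketch):
import numpy as np

a = np.array([5 + 6j])

a[0] = np.nan                # a plain float: only the real part is overwritten
print(a)                     # [nan+0.j]

a[0] = np.nan + np.nan * 1j  # explicitly complex: both parts become nan
print(a)                     # [nan+nanj]

print(np.isnan(a[0]))        # True -- isnan is True if either part is nan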
Situation
I need to read data from a file using fits from astropy.io, which uses numpy.
Some of the values I get when reading are very small negative float32 numbers, when there actually shouldn't be any negative values in the data (because of the data's characteristics).
Questions
Can it be that those numbers were very small float64, that when read and casted to float32 became negative? If yes, how small do they have to be?
Is there a way to rewind the process, i.e., to get the original positive very small float64 value?
Can it be that those numbers were very small float64, that when read and casted to float32 became negative? If yes, how small do they have to be?
No - if the original float64 value was smaller than the smallest representable float32 number then it would simply be equal to zero after casting:
tiny = np.finfo(np.float64).tiny # smallest representable float64 value
print(tiny)
# 2.22507385851e-308
print(tiny == 0)
# False
print(np.float32(tiny))
# 0.0
print(np.float32(tiny) == 0)
# True
Casting from one signed representation to another always preserves the sign bit.
Is there a way to rewind the process, i.e., to get the original positive very small float64 value?
No - casting down from 64 to 32 bit means you are effectively throwing away half of the information in the original representation, and once it's gone there's no magic way to recover it.
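For instance, a round trip through float32 shows the loss (a small sketch):
import numpy as np

x = np.float64(0.1)
y = np.float64(np.float32(x))   # down-cast to float32, then back up to float64
print(x == y)                   # False -- the low-order bits were discarded
print(y)                        # 0.10000000149011612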
A much more plausible explanation for the negative values is that they result from rounding errors on calculations performed on the data before it was stored.
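As a hedged illustration of how rounding alone can turn an exact zero into a tiny negative number:
import numpy as np

# The exact result is 0, but 0.1 and 0.2 are not exactly representable in binary
# floating point, so the computed value comes out slightly negative.
x = np.float64(0.3) - (np.float64(0.1) + np.float64(0.2))
print(x)              # -5.551115123125783e-17
print(np.float32(x))  # still a tiny negative number after casting to float32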
I need to use a module that does some math on integers; however, my inputs are floats.
What I want to achieve is to convert a generic float value into a corresponding integer value and lose as little data as possible.
For example:
val : 1.28827339907e-08
result : 128827339906934
Which is achieved after multiplying by 1e22.
Unfortunately the range of values can change, so I cannot always multiply them by the same constant. Any ideas?
ADDED
To put it another way, I have a matrix of values < 1, let's say from 1.323224e-8 to 3.457782e-6.
I want to convert them all into integers and lose as little data as possible.
The answers that suggest multiplying by a power of ten cause unnecessary rounding.
Multiplication by a power of the base used in the floating-point representation has no error in IEEE 754 arithmetic (the most common floating-point implementation) as long as there is no overflow or underflow.
Thus, for binary floating-point, you may be able to achieve your goal by multiplying the floating-point number by a power of two and rounding the result to the nearest integer. The multiplication will have no error. The rounding to integer may have an error up to .5, obviously.
You might select a power of two that is as large as possible without causing any of your numbers to exceed the bounds of the integer type you are using.
The most common conversion of floating-point to integer truncates, so that 3.75 becomes 3. I am not sure about Python semantics. To round instead of truncating, you might use a function such as round before converting to integer.
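A minimal sketch of that approach in NumPy (the example values come from the question; the exponent 70 is just an illustrative choice, you would pick the largest power of two that keeps all scaled values within your integer type):
import numpy as np

vals = np.array([1.323224e-8, 3.457782e-6])   # example values from the question

# Multiplying by a power of two is exact for binary floats (no rounding error,
# barring overflow/underflow); only the final rounding to integer loses information.
scale = 2.0 ** 70
scaled = np.rint(vals * scale).astype(np.int64)
print(scaled)

# Dividing by the same power of two maps the integers back to (approximately)
# the original values; only the rounding step introduced any error.
print(scaled / scale)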
If you want to preserve the values for operations on matrices I would choose some value to multiply them all by.
For Example:
1.23423
2.32423
4.2324534
Multiply them all by 10000000 and you get
12342300
23242300
42324534
You can perform your multiplications, additions etc. with your matrices. Once you have performed all your calculations, you can convert them back to floats by dividing them all by the appropriate value, depending on the operation you performed.
Mathematically it makes sense because
(scalar multiplication)
M1' = M1 * 10000000
M2' = M2 * 10000000
Result = M1'.M2'
Result = (M1 * 10000000).(M2 * 10000000)
Result = (10000000 * 10000000) * (M1.M2)
So in the case of multiplication you would divide your result by 10000000 * 10000000.
If it's addition / subtraction then you simply divide by 10000000.
You can either choose the value to multiply by through your knowledge of what decimals you expect to find or by scanning the floats and generating the value yourself at runtime.
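A small sketch of that workflow (the matrices and the scale factor of 10000000 here are just hypothetical examples):
import numpy as np

scale = 10_000_000

m1 = np.array([[1.23423, 2.32423], [4.2324534, 1.0]])
m2 = np.array([[2.0, 0.5], [1.5, 3.0]])

# Work on integer-scaled copies of the matrices
m1_i = np.rint(m1 * scale).astype(np.int64)
m2_i = np.rint(m2 * scale).astype(np.int64)

# The matrix product of the scaled matrices picks up a factor of scale * scale ...
prod_i = m1_i @ m2_i

# ... so divide it back out to recover the floating-point result
prod = prod_i / (scale * scale)
print(prod)
print(m1 @ m2)   # for comparison: essentially the same, up to the initial rounding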
Hope that helps.
EDIT: If you are worried about going over the maximum capacity of integers, then you would be happy to know that Python automatically (and silently) converts integers to longs when it notices overflow is going to occur (this is Python 2 behaviour; in Python 3 the int type is arbitrary-precision from the start). You can see for yourself in a Python 2 console:
>>> i = 3423
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'long'>
If you are still worried about overflow, you can always choose a lower constant, with a compromise of slightly less accuracy (since you will be losing some digits towards the end of the decimal expansion).
Also, the method proposed by Eric Postpischil seems to make sense, but I have not tried it out myself. I gave you a solution from a more mathematical perspective, which also seems to be more "pythonic".
Perhaps consider counting the number of places after the decimal for each value to determine the value (x) of your exponent (1ex). Roughly something like what's addressed here. Cheers!
Here's one solution:
def to_int(val):
    return int(repr(val).replace('.', '').split('e')[0])
Usage:
>>> to_int(1.28827339907e-08)
128827339907