np.int64 is a smaller container than np.int....? - python

I'm getting surprising behavior trying to convert a microsecond string date to an integer:
n = 20181231235959383171
int_ = np.int(n) # Works
int64_ = np.int64(n) # "OverflowError: int too big to convert"
Any idea why?
Edit - Thank you all, this is informative, however please see my actual problem:
Dataframe column won't convert from integer string to an actual integer

An np.int can be arbitrarily large, like a Python integer.
An np.int64 can only range from -2^63 to 2^63 - 1. Your number happens to fall outside this range.

When used as a dtype, np.int is equivalent to np.int_ (architecture-dependent size), which is probably np.int64 on your machine. So np.array([n], dtype=np.int) will fail. Outside of a dtype, np.int behaves as the plain Python int. Numpy is basically helping you do as much of your calculation as possible in C-land in order to speed things up and conserve memory; but (AFAIK) integers wider than 64 bits do not exist in standard C (though recent GCC versions do support a 128-bit type on some architectures). So you are stuck using either Python integers, slow but of unlimited size, or C integers, fast but not big enough for this.
There are two obvious ways to stuff a large integer into a numpy array:
You can use the Python type, signified by dtype=object: np.array([n], dtype=object) will work, but you are getting no speedup or memory benefits from numpy.
You can split the microsecond timestamp into whole seconds (n // 1000000) and microsecond fractions (n % 1000000), stored as two separate columns. Both options are sketched below.
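A minimal sketch of both options (the variable names are mine; note that n even exceeds the uint64 maximum of about 1.8e19):
import numpy as np

n = 20181231235959383171   # > 2**64 - 1, so no fixed-width integer fits

# Option 1: object dtype -- the array holds references to Python ints,
# so nothing overflows, but numpy's speed/memory benefits are lost.
arr = np.array([n], dtype=object)

# Option 2: split into whole seconds and microsecond fractions,
# both of which fit comfortably in int64.
seconds = np.array([n // 10**6], dtype=np.int64)   # 20181231235959
micros = np.array([n % 10**6], dtype=np.int64)     # 383171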

Related

Why does numpy integer subtraction produce a float64?

In numpy, why does subtraction of integers sometimes produce floating point numbers?
>>> x = np.int64(2) - np.uint64(1)
>>> x
1.0
>>> x.dtype
dtype('float64')
This seems to only occur when using multiple different integer types (e.g. signed and unsigned), and when no larger integer type is available.
This is a conscious design decision by the numpy authors. When deciding on the resulting type, only the types of the operands are considered, not their actual values. And for the operation you perform, there is a risk of having a result outside the valid range, e.g. if you subtract a very large uint64 number, the result would not fit in an int64. The safe selection is thus to convert to float64, which certainly will fit the result (possibly with reduced precision, though).
Compare with x = np.int32(2) - np.uint32(1): any int32/uint32 result can always be safely represented as an int64, therefore that type is chosen. The same is true for x = np.int64(2) - np.uint32(1), which also yields an int64.
The alternative would be to follow e.g. the C rules, which would cast everything to uint64. But that could, of course, lead to very strange results with over/underflows.
If you want to know ahead of time what type you will end up with, look into np.result_type(), np.can_cast(), and np.promote_types(). Reading about this in the docs might also help you understand the issue a bit better.
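For example, the promotion rules described above can be checked directly:
import numpy as np

# int64 mixed with uint64: no wider integer exists, so float64 is chosen.
print(np.result_type(np.int64, np.uint64))    # float64
print(np.promote_types(np.int64, np.uint64))  # float64

# int32 mixed with uint32: int64 can hold any result, so it is chosen.
print(np.result_type(np.int32, np.uint32))    # int64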
I'm no expert on numpy; however, I suspect that since float64 is the smallest data type that can fit the full range of both int64 and uint64, the subtraction converts both operands to float64 so that the operation always succeeds.
For example, with int8 and uint8: a result like 0 - 255 = -255 cannot fit in an int8, whose range is only -128 to 127, and we obviously can't use uint8 either, since we may need the sign. A wider integer (int16) covers both ranges here, but for int64 and uint64 no wider integer exists, so we settle on a float/double, which can fit both directions fine.
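The same reasoning is easy to observe on the small types, where a wider integer does exist (a quick sketch):
import numpy as np

# int8/uint8: int16 covers the full range of both, so numpy widens to it.
print((np.int8(1) - np.uint8(2)).dtype)     # int16

# int64/uint64: no wider integer exists, so numpy falls back to float64.
print((np.int64(1) - np.uint64(2)).dtype)   # float64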

Numpy matrix exponentiation gives negative value

I wanted to use NumPy in a Fibonacci question because of its efficiency in matrix multiplication. You know that there is a method for finding Fibonacci numbers with the matrix [[1, 1], [1, 0]].
I wrote some very simple code, but as n increases, the matrix starts to give negative numbers.
import numpy
def fib(n):
    return (numpy.matrix("1 1; 1 0")**n).item(1)
print fib(90)
# Gives -1581614984
What could be the reason for this?
Note: linalg.matrix_power also gives negative values.
Note2: I tried numbers from 0 to 100. It starts to give negative values after 47. Is it a large-integer issue because NumPy is coded in C? If so, how could I solve this?
Edit: Using a regular Python list matrix with linalg.matrix_power also gave negative results. Also let me add that not all results after 47 are negative; it occurs seemingly at random.
Edit2: I tried the method @AlbertoGarcia-Raboso suggested. It resolved the negative number problem, but another issue occurred. It gives the answer as -5.168070885485832e+19 whereas I need -51680708854858323072L. So I tried using int(): it converted the result to a long, but now the answer seems incorrect because of a loss of precision.
The reason you see negative values appearing is that NumPy has defaulted to the np.int32 dtype for your matrix.
The maximum positive integer this dtype can represent is 2^31 - 1, which is 2147483647. Unfortunately, this is less than the 47th Fibonacci number, 2971215073. The resulting overflow causes the negative number to appear:
>>> np.int32(2971215073)
-1323752223
Using a bigger integer type (like np.int64) would fix this, but only temporarily: you'd still run into problems if you kept on asking for larger and larger Fibonacci numbers.
The only sure fix is to use an unlimited-size integer type, such as Python's int type. To do this, modify your matrix to be of np.object type:
def fib_2(n):
    return (np.matrix("1 1; 1 0", dtype=np.object)**n).item(1)
The np.object type allows a matrix or array to hold any mix of native Python types. Essentially, instead of holding machine types, the matrix is now behaving like a Python list and simply consists of pointers to integer objects in memory. Python integers will be used in the calculation of the Fibonacci numbers now and overflow is not an issue.
>>> fib_2(300)
222232244629420445529739893461909967206666939096499764990979600
This flexibility comes at the cost of decreased performance: NumPy's speed originates from direct storage of integer/float types which can be manipulated by your hardware.
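The same fix carries over to np.linalg.matrix_power, which the question also mentions. A sketch (fib_3 is my name for it, and it assumes a NumPy recent enough to matrix-multiply object arrays; an np.int64 matrix would only postpone the problem, since F(93) already exceeds 2^63 - 1):
import numpy as np

def fib_3(n):
    # Object dtype: the entries are Python ints, so the repeated matrix
    # multiplications never overflow -- they just get slower.
    m = np.array([[1, 1], [1, 0]], dtype=object)
    return np.linalg.matrix_power(m, n)[0, 1]

print(fib_3(90))   # 2880067194370816120, the correct 90th Fibonacci number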

Stocking large numbers into numpy array

I have a dataset to which I'm trying to apply some arithmetical method.
The thing is that it gives me relatively large numbers, and when I do it with numpy, they're stored as 0.
The weird thing is that when I compute the numbers apart, they have an int value; they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]: 36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have of course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]: True
How can I compute these large numbers using numpy without seeing them turned into zeros?
Note that if we remove the factor 10 at the beginning, the problem changes slightly (but I think it might be for a similar reason):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]: 311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate that the maximum of the int64 type in Python has been crossed, don't they?
As Nils Werner already mentioned, numpy's native C types cannot store numbers that large, but Python itself can, since its int objects use an arbitrary-length implementation.
So what you can do is tell numpy not to convert the numbers to C types but to use the Python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.
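For reference, the exact bounds of the fixed-width types are easy to query with np.iinfo:
import numpy as np

print(np.iinfo(np.int64).max)    # 9223372036854775807  (about 9.2e18)
print(np.iinfo(np.uint64).max)   # 18446744073709551615 (about 1.8e19)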

summing over a list of int overflow(?) python

Let's consider a list of large integers, for example one given by:
def primesfrom2to(n):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    """ Input n>=6, Returns a array of primes, 2 <= p < n """
    sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
    sieve[0] = False
    for i in xrange(int(n**0.5)/3+1):
        if sieve[i]:
            k = 3*i+1|1
            sieve[((k*k)/3)::2*k] = False
            sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
    return np.r_[2, 3, ((3*np.nonzero(sieve)[0]+1)|1)]
primesfrom2to(2000000)
I want to calculate the sum of that, and the expected result is 142913828922.
But if I do:
sum(primesfrom2to(2000000))
I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let me explain. Consider this testing code:
a = primesfrom2to(2000000)
b = [float(i) for i in a]
c = [long(i) for i in a]
sumI = 0
sumF = 0
sumL = 0
m = 0
for i, j, k in zip(a, b, c):
    m = m + 1
    sumI = sumI + i
    sumF = sumF + j
    sumL = sumL + k
    print sumI, sumF, sumL
    if sumI < 0:
        print i, m
        break
I found out that the first integer overflow happens at m = 20044, i.e. a[20043] = 225289.
If I do:
>>> sum(a[:20043])+225289
-2147310677
But if I do:
>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L
What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?
Look at the types of your results. You are summing a numpy array, which uses numpy datatypes, and those can overflow. When you do sum(a[:20043]), you get a numpy scalar back (an int32 or the like), which overflows when added to another number. When you manually type in the same number, you're creating a Python builtin int, which can auto-promote to long. Numpy arrays cannot auto-promote like Python builtin types, because the array type (and its memory layout) has to be fixed when the array is created. This makes operations much faster, at the expense of type flexibility.
You may be able to get around the problem by giving the returned array a wider datatype (like np.int64) instead of the architecture-dependent integer it defaults to. However, it depends how big your numbers are. A simple example:
# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L
# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
Your example works correctly for me on 64-bit Python, so I guess you're using 32-bit Python. If you can use 64-bit types you will be able to get past the limit you found, but as my example shows you will eventually overflow 64-bit ints too if your numbers get super huge.
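If the final sum fits in 64 bits (as it does here), you can also force a wider accumulator instead of changing the array itself. A small sketch, reusing the question's primesfrom2to:
import numpy as np

a = primesfrom2to(2000000)

# ndarray.sum accepts a dtype for the accumulator, so the 32-bit elements
# are summed in 64 bits and the total never overflows.
print(a.sum(dtype=np.int64))       # 142913828922

# Or give up numpy's speed and sum with Python's unlimited-size ints:
print(sum(int(p) for p in a))      # 142913828922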

How to convert a generic float value into a corresponding integer?

I need to use a module that does some math on integers; however, my input is in floats.
What I want to achieve is to convert a generic float value into a corresponding integer value while losing as little data as possible.
For example:
val : 1.28827339907e-08
result : 128827339906934
This is achieved after multiplying by 1e22.
Unfortunately the range of values can change, so I cannot always multiply them by the same constant. Any ideas?
ADDED
To put it in other words, I have a matrix of values < 1, let's say from 1.323224e-8 to 3.457782e-6.
I want to convert them all into integers and lose as little data as possible.
The answers that suggest multiplying by a power of ten cause unnecessary rounding.
Multiplication by a power of the base used in the floating-point representation has no error in IEEE 754 arithmetic (the most common floating-point implementation) as long as there is no overflow or underflow.
Thus, for binary floating-point, you may be able to achieve your goal by multiplying the floating-point number by a power of two and rounding the result to the nearest integer. The multiplication will have no error. The rounding to integer may have an error up to .5, obviously.
You might select a power of two that is as large as possible without causing any of your numbers to exceed the bounds of the integer type you are using.
The most common conversion of floating-point to integer truncates, so that 3.75 becomes 3. I am not sure about Python semantics. To round instead of truncating, you might use a function such as round before converting to integer.
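A small sketch of this idea (the use of math.frexp and the 62-bit headroom are my own illustration, not part of the answer):
import math

vals = [1.323224e-8, 3.457782e-6]   # the range mentioned in the question

# math.frexp(x) returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1;
# pick k from the largest value so every scaled value stays below 2**62.
k = 62 - math.frexp(max(vals))[1]

# The multiplication by 2.0**k is exact in binary floating point;
# only the final rounding to integer can lose anything (at most 0.5).
ints = [int(round(v * 2.0**k)) for v in vals]
print(ints)

# Dividing by the same power of two recovers the floats; in this case
# there is no loss at all.
print([i / 2.0**k for i in ints])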
If you want to preserve the values for operations on matrices, I would choose some value to multiply them all by.
For example:
1.23423
2.32423
4.2324534
Multiply them all by 10000000 and you get
12342300
23242300
42324534
You can perform your multiplications, additions, etc. with your matrices. Once you have performed all your calculations, you can convert them back to floats by dividing them all by the appropriate value, depending on the operation you performed.
Mathematically it makes sense because
(Scalar multiplication)
M1' = M1 * 10000000
M2' = M2 * 10000000
Result = M1' . M2'
       = (M1 x 10000000) . (M2 x 10000000)
       = (10000000 x 10000000) x (M1 . M2)
So in the case of multiplication you would divide your result by 10000000 x 10000000.
If it's addition / subtraction, then you simply divide by 10000000.
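A tiny worked example of that bookkeeping, in plain Python with scalar values from above standing in for matrix entries (SCALE and the rounding are my own illustration):
SCALE = 10000000

a, b = 1.23423, 2.32423
ia, ib = int(round(a * SCALE)), int(round(b * SCALE))   # 12342300, 23242300

# Addition introduced one factor of SCALE, so divide once.
print((ia + ib) / float(SCALE))        # 3.55846

# Multiplication introduced two factors, so divide by SCALE squared.
print((ia * ib) / float(SCALE)**2)     # 2.8686... == a * b (up to rounding)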
You can either choose the value to multiply by through your knowledge of what decimals you expect to find or by scanning the floats and generating the value yourself at runtime.
Hope that helps.
EDIT: If you are worried about going over the maximum capacity of integers, then you would be happy to know that Python 2 automatically (and silently) converts ints to longs when it notices overflow is about to occur (in Python 3 there is only the arbitrary-precision int). You can see for yourself in a Python console:
>>> i = 3423
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'int'>
>>> i *= 100000
>>> type(i)
<type 'long'>
If you are still worried about overflow, you can always choose a lower constant, with a compromise of slightly less accuracy (since you will be losing some digits toward the end of the decimal expansion).
Also, the method proposed by Eric Postpischil seems to make sense, but I have not tried it out myself. I gave you a solution from a more mathematical perspective, which also seems more "pythonic".
Perhaps consider counting the number of places after the decimal point for each value to determine your exponent x (i.e. a scale factor of 10^x). Roughly something like what's addressed here. Cheers!
Here's one solution:
def to_int(val):
    return int(repr(val).replace('.', '').split('e')[0])
Usage:
>>> to_int(1.28827339907e-08)
128827339907
