I was trying to use numpy.divmod with very large integers and I noticed a strange behaviour. At around 2**63 ~ 1e19 (which I assumed was the limit for the usual memory representation of int in Python 3.5+), this happens:
from numpy import divmod
test = 10**6
for i in range(15, 25):
    x = 10**i
    print(i, divmod(x, test))
15 (1000000000, 0)
16 (10000000000, 0)
17 (100000000000, 0)
18 (1000000000000, 0)
19 (10000000000000.0, 0.0)
20 ((100000000000000, 0), None)
21 ((1000000000000000, 0), None)
22 ((10000000000000000, 0), None)
23 ((100000000000000000, 0), None)
24 ((1000000000000000000, 0), None)
Somehow, the quotient and remainder work fine up to 2**63; then something different happens.
My guess is that the int representation is "vectorized" (like BigInt in Scala, i.e. a little-endian Seq of Long). But then I'd expect, as the result of divmod(array, test), a pair of arrays: the array of quotients and the array of remainders.
I have no clue about this behaviour. It does not happen with the built-in divmod (everything works as expected).
Why does this happen? Does it have something to do with int internal representation?
Details: numpy version 1.13.1, python 3.6
The problem is that np.divmod will convert the arguments to arrays, and what happens then is easy to see:
>>> np.array(10**19)
array(10000000000000000000, dtype=uint64)
>>> np.array(10**20)
array(100000000000000000000, dtype=object)
You will get an object array for 10**i with i > 19; in the other cases it will be a "real NumPy array".
And, indeed, it seems like object arrays behave strangely with np.divmod:
>>> np.divmod(np.array(10**5, dtype=object), 10) # smaller value but object array
((10000, 0), None)
I guess in this case the normal Python built-in divmod calculates the first returned element and the remaining slot is filled with None, because np.divmod delegated to Python's function.
Note that object arrays often behave differently than native dtype arrays. They are a lot slower and often delegate to Python functions (which is often the reason for different results).
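For completeness, a minimal sketch of the dtype switchover described above, as seen with NumPy 1.13 on a typical 64-bit build (newer NumPy versions and other platforms may behave differently), together with the safe fallback of using the built-in divmod on plain Python ints:

import numpy as np

# the dtype picked for a Python int grows with the value and finally
# falls back to object once it no longer fits in 64 bits
for i in (18, 19, 20):
    print(i, np.asarray(10**i).dtype)   # e.g. int64, uint64, object

# the built-in divmod on Python ints has no size limit and returns
# the expected (quotient, remainder) pair
print(divmod(10**20, 10**6))            # (100000000000000, 0)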
Related
I'm getting surprising behavior trying to convert a microsecond string date to an integer:
n = 20181231235959383171
int_ = np.int(n) # Works
int64_ = np.int64(n) # "OverflowError: int too big to convert"
Any idea why?
Edit - Thank you all, this is informative; however, please see my actual problem:
Dataframe column won't convert from integer string to an actual integer
An np.int can be arbitrarily large, like a python integer.
An np.int64 can only range from -2**63 to 2**63 - 1. Your number happens to fall outside this range.
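As a quick check, np.iinfo reports the exact bounds of a fixed-width dtype; a small sketch (the value n is the one from the question):

import numpy as np

print(np.iinfo(np.int64).min)        # -9223372036854775808, i.e. -2**63
print(np.iinfo(np.int64).max)        #  9223372036854775807, i.e.  2**63 - 1

n = 20181231235959383171
print(n > np.iinfo(np.uint64).max)   # True: too big even for unsigned 64-bit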
When used as dtype, np.int is equivalent to np.int_ (architecture-dependent size), which is probably np.int64. So np.array([n], dtype=np.int) will fail. Outside dtype, np.int behaves as Python int. Numpy is basically helping you calculate as much stuff in C-land as possible in order to speed up the calculations and conserve memory; but (AFAIK) integers larger than 64 bits do not exist in standard C (though the new GCC does support them on some architectures). So you are stuck using either Python integers, slow but of unlimited size, or C integers, fast but not big enough for this.
There are two obvious ways to stuff a large integer into a numpy array (a short sketch of both follows the list):
You can use the Python type, signified by dtype=object: np.array([n], dtype=object) will work, but you are getting no speedup or memory benefits from numpy.
You can split the microsecond time into second time (n // 1000000) and second fractions (n % 1000000), as two separate columns.
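A minimal sketch of both options (n is the microsecond value from the question; the seconds/micros split is only illustrative):

import numpy as np

n = 20181231235959383171

# Option 1: object dtype keeps the Python int, at the cost of numpy's
# speed and memory benefits
arr_obj = np.array([n], dtype=object)
print(arr_obj)                          # [20181231235959383171]

# Option 2: split into whole seconds and microsecond fractions,
# both of which fit comfortably in int64
seconds = np.array([n // 10**6], dtype=np.int64)
micros = np.array([n % 10**6], dtype=np.int64)
print(seconds, micros)                  # [20181231235959] [383171]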
I have a dataset to which I'm trying to apply some arithmetical method.
The thing is it gives me relatively large numbers, and when I do it with numpy, they're stored as 0.
The weird thing is that when I compute the numbers apart, they have an int value; they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]:36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have of course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]:True
How can I compute these large numbers using numpy without having them turned into zeros?
Note that if we remove the factor of 10 at the beginning, the problem changes slightly (but I think it might be for a similar reason):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]:311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate that the maximum of the int64 type in Python has been exceeded, don't they?
As Nils Werner already mentioned, numpy's native C types cannot hold numbers that large, but Python itself can, since its int objects use an arbitrary-length implementation.
So what you can do is tell numpy not to convert the numbers to C types but to use the Python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.
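You can verify this with exact Python integers; the 150**18 value below matches the object-dtype output shown earlier:

import numpy as np

print(150**18)                            # 1477891880035400390625000000000000000000
print(np.iinfo(np.int64).max)             # 9223372036854775807
print(150**18 > np.iinfo(np.int64).max)   # True: already the first element overflows int64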
Python can indeed left-shift a bit by a large integer amount:
1L << 100
# 1267650600228229401496703205376L
But NumPy seemingly has a problem:
a = np.array([1,2,100])
output = np.left_shift(1L,a)
print output
# array([ 2, 4, 68719476736])
Is there a way to overcome this using NumPy's left_shift operation? Individually accessing the array elements gives the same incorrect result:
1L << a[2]
# 68719476736
Python long values aren't the same type as the integers held in a. Specifically, Python long values are not limited to 32 bits or 64 bits but instead can take up an arbitrary amount of memory.
On the other hand, NumPy will create a as an array of int32 or int64 integer values. When you left-shift this array, you get back an array of the same datatype. This isn't enough memory to hold 1 << 100 and the result of the left-shift overflows the memory it's been allocated in the array, producing the incorrect result.
To hold integers that large, you'll have to specify a to have the object datatype. For example:
>>> np.left_shift(1, a.astype(object))
array([2, 4, 1267650600228229401496703205376L], dtype=object)
object arrays can hold a mix of different types, including the unlimited-size Python long/integer values. However, a lot of the performance benefits of homogeneous datatype NumPy arrays will be lost when using object arrays.
Given the array a you create, the elements of output are going to be integers. Unlike Python itself, integers in a numpy ndarray are of fixed size, with a defined maximum and minimum value.
numpy is arriving at the value given (2 ** 36) by reducing the shift modulo the length of the integer, 64 bits (100 mod 64 == 36).
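A quick check of that wrap-around, in plain Python:

print(100 % 64)   # 36
print(1 << 36)    # 68719476736, exactly the value numpy returned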
This is an issue with 64-bit Python too, except that the integer at which this becomes a problem is larger. If you run a[2] = 1L << a[2] you will get this traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
So, as the traceback says, the Python int is too large to convert to a C long; you need to change the array type (which is actually a C-level type) to a Python object:
>>> a = np.array([1,2,100],dtype='O')
>>> a[2]=1L << a[2]
>>> a
array([1, 2, 1267650600228229401496703205376L], dtype=object)
Let's consider a list of large integers, for example one given by:
import numpy as np

def primesfrom2to(n):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    """ Input n>=6, Returns an array of primes, 2 <= p < n """
    sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
    sieve[0] = False
    for i in xrange(int(n**0.5)/3+1):
        if sieve[i]:
            k = 3*i+1|1
            sieve[((k*k)/3)::2*k] = False
            sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
    return np.r_[2, 3, ((3*np.nonzero(sieve)[0]+1)|1)]
primesfrom2to(2000000)
I want to calculate the sum of that, and the expected result is 142913828922.
But if I do:
sum(primesfrom2to(2000000))
I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let me explain. Consider this testing code:
a = primesfrom2to(2000000)
b = [float(i) for i in a]
c = [long(i) for i in a]
sumI = 0
sumF = 0
sumL = 0
m = 0
for i, j, k in zip(a, b, c):
    m = m + 1
    sumI = sumI + i
    sumF = sumF + j
    sumL = sumL + k
    print sumI, sumF, sumL
    if sumI < 0:
        print i, m
        break
I found out that the first integer overflow is happening at a[i=20444]=225289
If I do:
>>> sum(a[:20043])+225289
-2147310677
But if I do:
>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L
What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?
Look at the types of your results. You are summing a numpy array, which is using numpy datatypes, which can overflow. When you do sum(a[:20043]), you get a numpy object back (some sort of int32 or the like), which overflows when added to another number. When you manually type in the same number, you're creating a Python builtin int, which can auto-promote to long. Numpy arrays cannot autopromote like Python builtin types, because the array type (and its memory layout) have to be fixed when the array is created. This makes operations much faster at the expense of type flexibility.
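A small sketch reproducing the number from the question with explicit int32 scalars (newer NumPy versions may also emit an overflow RuntimeWarning here):

import numpy as np

print(np.int32(2147431330) + np.int32(225289))   # -2147310677, as in the question
print(2147431330 + 225289)                       # 2147656619 with Python ints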
You may be able to get around the problem by using a different datatype (like np.int64) instead of np.bool. However, it depends how big your numbers are. A simple example:
# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L
# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
Your example works correctly for me on 64-bit Python, so I guess you're using 32-bit Python. If you can use 64-bit types you will be able to get past the limit you found, but as my example shows you will eventually overflow 64-bit ints too if your numbers get super huge.
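For reference, two possible workarounds for the prime sum itself, a sketch assuming a = primesfrom2to(2000000) from the question; both give the expected 142913828922:

import numpy as np

a = primesfrom2to(2000000)
print(a.sum(dtype=np.int64))     # 142913828922: accumulate in 64 bits
print(sum(int(p) for p in a))    # 142913828922: fall back to Python ints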
Can anyone explain the following? I'm using Python 2.5
Consider 1*3*5*7*9*11 ... *49. If you type all that from within IPython(x,y) interactive console, you'll get 58435841445947272053455474390625L, which is correct. (why odd numbers: just the way I did it originally)
NumPy's multiply.reduce() or prod() should yield the same result for the equivalent range. And it does, up to a certain point. Here, it is already wrong:
: k = range(1, 50, 2)
: multiply.reduce(k)
: -108792223
Using prod(k) will also generate -108792223 as the result. Other incorrect results start to appear for equivalent ranges of length 12 (that is, k = range(1,24,2)).
I'm not sure why. Can anyone help?
This is because numpy.multiply.reduce() converts the range list to an array of type numpy.int32, and the reduce operation overflows what can be stored in 32 bits at some point:
>>> type(numpy.multiply.reduce(range(1, 50, 2)))
<type 'numpy.int32'>
As Mike Graham says, you can use the dtype parameter to use Python integers instead of the default:
>>> res = numpy.multiply.reduce(range(1, 50, 2), dtype=object)
>>> res
58435841445947272053455474390625L
>>> type(res)
<type 'long'>
But using numpy to work with Python objects is pointless in this case; the best solution is KennyTM's:
>>> import functools, operator
>>> functools.reduce(operator.mul, range(1, 50, 2))
58435841445947272053455474390625L
The CPU doesn't multiply arbitrarily large numbers; it only performs specific operations defined on particular ranges of numbers represented in base 2, as 0-1 bits.
Python's '*' handles large integers perfectly, using a proper representation and special code that goes beyond the CPU's or FPU's multiply instructions.
This is actually unusual as languages go.
In most other languages, a number is usually represented as a fixed array of bits. For example, in C or SQL you could choose to have an 8-bit integer that can represent 0 to 255, or -128 to +127, or you could choose a 16-bit integer that can represent up to 2^16 - 1, which is 65535. When only a limited range of numbers can be represented, going past the limit with an operation like * or + can have an undesirable effect, like getting a negative number. You may have encountered such a problem when using an external library which is probably native C and not Python.
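To make the fixed-width behaviour concrete, a tiny sketch with numpy's 8-bit integers (the ranges quoted above); recent NumPy versions may emit an overflow RuntimeWarning for these:

import numpy as np

print(np.int8(127) + np.int8(1))       # -128: signed 8-bit wrap-around
print(np.uint8(255) + np.uint8(1))     # 0: unsigned 8-bit wrap-around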