Left shifting 1 by large numbers using numpy array - python

Python can indeed left-shift a bit by a large number of places:
1L << 100
# 1267650600228229401496703205376L
But NumPy seemingly has a problem:
import numpy as np

a = np.array([1,2,100])
output = np.left_shift(1L,a)
print output
# array([ 2, 4, 68719476736])
Is there a way to overcome this using NumPy's left_shift operation? Individually accessing the array elements gives the same incorrect result:
1L << a[2]
# 68719476736

Python long values aren't the same type as the integers held in a. Specifically, Python long values are not limited to 32 bits or 64 bits but instead can take up an arbitrary amount of memory.
On the other hand, NumPy will create a as an array of int32 or int64 integer values. When you left-shift this array, you get back an array of the same datatype. Neither type has enough bits to hold 1 << 100, so the left-shift overflows the fixed-width storage allocated for it in the array, producing the incorrect result.
To hold integers that large, you'll have to specify a to have the object datatype. For example:
>>> np.left_shift(1, a.astype(object))
array([2, 4, 1267650600228229401496703205376L], dtype=object)
object arrays can hold a mix of different types, including the unlimited-size Python long/integer values. However, a lot of the performance benefits of homogeneous datatype NumPy arrays will be lost when using object arrays.
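A rough way to see that cost (illustrative only; exact numbers depend on the machine) is to time the same vectorized operation on a native int64 array and on its object-dtype copy:
import numpy as np
from timeit import timeit

native = np.arange(1000000, dtype=np.int64)
boxed = native.astype(object)

print(timeit(lambda: native * 2, number=100))  # one tight C loop over machine integers
print(timeit(lambda: boxed * 2, number=100))   # a Python-level multiply for every element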

Given the array a you create, the elements of output are going to be integers. Unlike Python itself, integers in a numpy ndarray are of fixed size, with a defined maximum and minimum value.
numpy arrives at the value shown (2 ** 36) by reducing the shift amount modulo the bit width of the integer type, 64 bits (100 mod 64 == 36).
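A quick sanity check of that arithmetic in plain Python:
>>> 100 % 64
36
>>> 1 << 36   # the same value numpy returned for the 100-place shift
68719476736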

This is an issue on 64-bit Python too; the value at which it becomes a problem is just larger. If you run a[2] = 1L << a[2] you will get this traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
As the traceback says, the Python int is too large to convert to a C long, so you need to change the array's dtype (which is backed by a C type) to Python object:
>>> a = np.array([1,2,100],dtype='O')
>>> a[2]=1L << a[2]
>>> a
array([1, 2, 1267650600228229401496703205376L], dtype=object)

Related

OverflowError: Python int too large to convert to C long when feed data into numpy array

I am trying to feed large numbers, produced by encryption, into a numpy array, but it overflows. I checked the code and everything is correct before I feed the numbers into the numpy array; the error occurs at the step that feeds in the data, which is en1[i,j] = pk.raw_encrypt(int(test1[i,j])).
The encrypted number I have got here is 3721469428823308171852407981126958588051758293498563443424608937516905060542577505841168884360804470051297912859925781484960893520445514263696476240974988078627213135445788309778740044751099235295077596597798031854813054409733391824335666742083102231195956761512905043582400348924162387787806382637700241133312260811836700206345239790866810211695141313302624830782897304864254886141901824509845380817669866861095878436032979919703752065248359420455460486031882792946889235009894799954640035281227429200579186478109721444874188901886905515155160376705016979283166216642522595345955323818983998023048631350302980936674. Python3 still claims it to be a int type. The number itself did not get overflow, but the numpy array does not allow it to be filled in.
What property of numpy causes this, and is there any solution to this problem? I have considered using a list instead of a numpy array, but that would be rather hard to implement when the array is not 1-D. I have attached the full test code below.
import numpy as np
from phe import paillier  # assuming the python-paillier package

test1 = np.array([[1,2,3],[1,2,4]])
test2 = np.array([[4,1,3],[6,1,5]])
en1 = np.copy(test1)
en2 = np.copy(test2)
pk, sk = paillier.generate_paillier_keypair()
en_sum = np.copy(en1)
pl_sum = np.copy(en1)
for i in range(test1.shape[0]):
    for j in range(test2.shape[1]):
        en1[i,j] = pk.raw_encrypt(int(test1[i,j]))
        en2[i,j] = pk.raw_encrypt(int(test2[i,j]))
        en_sum[i,j] = en1[i,j]*en2[i,j]
        pl_sum[i,j] = sk.raw_decrypt(en_sum[i,j])
sum = sk.raw_decrypt(en_sum)
Python integers are stored with arbitrary precision, while numpy integers are stored in standard 32-bit or 64-bit representations depending on your platform.
What this means is that while the maximum representable Python integer is bounded only by your system memory, the maximum representable Numpy integer is bounded by what is representable in 64-bits.
You can see the maximum representable unsigned integer value here:
>>> import numpy as np
>>> np.iinfo(np.uint64).max
18446744073709551615
>>> 2 ** 64 - 1
18446744073709551615
The best approach for your application depends on what you want to do with these extremely large integers, but I'd lean toward avoiding Numpy arrays for integers of this size.
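If you do want to keep the 2-D array interface, one sketch of a workaround (reusing pk and test1 from the question and assuming the same paillier API) is to give the ciphertext arrays the object dtype, so every slot holds an ordinary arbitrary-precision Python int:
import numpy as np

# Sketch: with dtype=object each element is a plain Python int of unlimited size.
en1 = np.empty(test1.shape, dtype=object)
for i in range(test1.shape[0]):
    for j in range(test1.shape[1]):
        en1[i, j] = pk.raw_encrypt(int(test1[i, j]))
Element-wise operations such as en1 * en2 still work on object arrays, but they fall back to Python-level arithmetic, so you keep the indexing convenience without the usual numpy speed.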

np.int64 is a smaller container than np.int....?

I'm getting surprising behavior trying to convert a microsecond string date to an integer:
n = 20181231235959383171
int_ = np.int(n) # Works
int64_ = np.int64(n) # "OverflowError: int too big to convert"
Any idea why?
Edit - Thank you all, this is informative, however please see my actual problem:
Dataframe column won't convert from integer string to an actual integer
An np.int can be arbitrarily large, like a Python integer.
An np.int64 can only range from -2**63 to 2**63 - 1. Your number happens to fall outside this range.
When used as dtype, np.int is equivalent to np.int_ (architecture-dependent size), which is probably np.int64. So np.array([n], dtype=np.int) will fail. Outside dtype, np.int behaves as Python int. Numpy is basically helping you calculate as much stuff in C-land as possible in order to speed up the calculations and conserve memory; but (AFAIK) integers larger than 64 bits do not exist in standard C (though the new GCC does support them on some architectures). So you are stuck using either Python integers, slow but of unlimited size, or C integers, fast but not big enough for this.
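You can check the np.int behaviour described above directly (in numpy versions that still ship np.int; the alias was later deprecated and removed):
>>> import numpy as np
>>> np.int is int                          # outside dtype, np.int is just Python's int
True
>>> np.dtype(np.int) == np.dtype(np.int_)  # as a dtype, it maps to the C-sized default integer
True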
There are two obvious ways to stuff a large integer into a numpy array (both sketched below):
You can use the Python type, signified by dtype=object: np.array([n], dtype=object) will work, but you are getting no speedup or memory benefits from numpy.
You can split the microsecond time into second time (n // 1000000) and second fractions (n % 1000000), as two separate columns.
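A minimal sketch of both options, using the n from the question (the variable names here are just for illustration):
import numpy as np

n = 20181231235959383171

# Option 1: object dtype keeps the full value, but the elements are plain Python ints.
as_object = np.array([n], dtype=object)

# Option 2: split into whole seconds and the microsecond remainder; both fit in int64.
seconds = np.array([n // 1000000], dtype=np.int64)
micros = np.array([n % 1000000], dtype=np.int64)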

Python numpy.divmod and integer representation

I was trying to use numpy.divmod with very large integers and I noticed a strange behaviour. At around 2**63 ~ 1e19 (the point at which a value no longer fits in a 64-bit integer), this happens:
from numpy import divmod
test = 10**6
for i in range(15,25):
    x = 10**i
    print(i, divmod(x, test))
15 (1000000000, 0)
16 (10000000000, 0)
17 (100000000000, 0)
18 (1000000000000, 0)
19 (10000000000000.0, 0.0)
20 ((100000000000000, 0), None)
21 ((1000000000000000, 0), None)
22 ((10000000000000000, 0), None)
23 ((100000000000000000, 0), None)
24 ((1000000000000000000, 0), None)
Somehow, the quotient and remainder work fine up to 2**63; after that, something different happens.
My guess is that the int representation is "vectorized" (i.e. as BigInt in Scala, as a little endian Seq of Long). But then, I'd expect, as a result of divmod(array, test), a pair of arrays: the array of quotients and the array of remainders.
I have no clue about this feature. It does not happen with the built-in divmod (everything works as expected)
Why does this happen? Does it have something to do with int internal representation?
Details: numpy version 1.13.1, python 3.6
The problem is that np.divmod will convert the arguments to arrays, and what happens then is easy to see:
>>> np.array(10**19)
array(10000000000000000000, dtype=uint64)
>>> np.array(10**20)
array(100000000000000000000, dtype=object)
You will get an object array for 10**i with i > 19, in the other cases it will be a "real NumPy array".
And, indeed, it seems like object arrays behave strangely with np.divmod:
>>> np.divmod(np.array(10**5, dtype=object), 10) # smaller value but object array
((10000, 0), None)
I guess that in this case NumPy delegates to Python's built-in divmod, which calculates the first returned element, and the remaining output slot is filled with None.
Note that object arrays often behave differently than native dtype arrays. They are a lot slower and often delegate to Python functions (which is often the reason for different results).
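If you actually need divmod on integers this large, a small workaround sketch is to bypass np.divmod and apply Python's built-in divmod, which handles arbitrary-precision ints, element by element:
import numpy as np

big = np.array([10**19, 10**20, 10**24], dtype=object)

# Built-in divmod on each Python int, then repack the results.
pairs = [divmod(int(v), 10**6) for v in big]
quotients = np.array([q for q, _ in pairs], dtype=object)
remainders = np.array([r for _, r in pairs], dtype=object)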

Python- Size of int, float etc from sys.getsizeof()

I am trying to compare sizes of data types in Python with sys.getsizeof(). However, for integers and floats it returns the same value, 24 (not the customary 4 or 8 bytes). Also, the size of an array created with array.array() with 4 integer elements is reported as 72 (not 96), and with 4 float elements as 88 (not 96). What is going on?
import array, sys
arr1 = array.array('d', [1,2,3,4])
arr2 = array.array('i', [1,2,3,4])
print sys.getsizeof(arr1[1]), sys.getsizeof(arr2[1]) # 24, 24
print sys.getsizeof(arr1), sys.getsizeof(arr2) # 88, 72
The function sys.getsizeof() returns the amount of space the Python object takes. Not the amount of space you would need to represent the data in that object in the memory of the underlying system.
Python objects have overhead to cover reference counting (for garbage collection) and other implementation-related stuff. In addition, an array is not a naive sequence of floats or ints; the data structure has a fair amount of stuff under the hood that keeps track of datatype, number of elements and so on. That's where the 'd' or 'i' lives, for example.
To get the answers I think you are expecting, try
print (arr1.itemsize * len(arr1))
print (arr2.itemsize * len(arr2))
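With the arrays from the question, that prints the raw buffer sizes. A small sketch of the comparison (the size of a C int for typecode 'i' is platform-dependent, commonly 4 bytes):
import array, sys

arr1 = array.array('d', [1, 2, 3, 4])
arr2 = array.array('i', [1, 2, 3, 4])

print(arr1.itemsize * len(arr1))   # 32: four 8-byte C doubles
print(arr2.itemsize * len(arr2))   # commonly 16: four 4-byte C ints
print(sys.getsizeof(arr1) - arr1.itemsize * len(arr1))   # the rest is the array object's own overhead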

Why is numpy array's .tolist() creating long doubles?

I have some math operations that produce a numpy array of results with about 8 significant figures. When I use tolist() on my array y_axis, it creates what I assume are 32-bit numbers.
However, I wonder if this is just garbage. I assume it is garbage, but it seems intelligent enough to change the last number so that rounding makes sense.
print "y_axis:",y_axis
y_axis = y_axis.tolist()
print "y_axis:",y_axis
y_axis: [-0.99636686 0.08357361 -0.01638707]
y_axis: [-0.9963668578012771, 0.08357361233570479, -0.01638706796138937]
So my question is: if this is not garbage, does using tolist actually help in accuracy for my calculations, or is Python always using the entire number, but just not displaying it?
When you call print y_axis on a numpy array, you are getting a truncated version of the numbers that numpy is actually storing internally. The way in which it is truncated depends on how numpy's printing options are set.
>>> arr = np.array([22/7., 1/13.]) # init array
>>> arr # np.array default printing
array([ 3.14285714, 0.07692308])
>>> arr[0] # element (float64 scalar) default printing
3.1428571428571428
>>> np.set_printoptions(precision=24) # increase np.array print "precision"
>>> arr # np.array high-"precision" print
array([ 3.142857142857142793701541, 0.076923076923076927347012])
>>> float.hex(arr[0]) # actual underlying representation
'0x1.9249249249249p+1'
The reason it looks like you're "gaining accuracy" when you print out the .tolist()ed form of y_axis is that by default, more digits are printed when you call print on a list than when you call print on a numpy array.
In actuality, the numbers stored internally by either a list or a numpy array should be identical (and should correspond to the last line above, generated with float.hex(arr[0])), since numpy uses numpy.float64 by default, and python float objects are also 64 bits by default.
My understanding is that numpy is not showing you the full precision to make the matrices lay out consistently. The list shouldn't have any more precision than its numpy.array counterpart:
>>> v = -0.9963668578012771
>>> a = numpy.array([v])
>>> a
array([-0.99636686])
>>> a.tolist()
[-0.9963668578012771]
>>> a[0] == v
True
>>> a.tolist()[0] == v
True
