Why is numpy array's .tolist() creating long doubles? - python

I have some math operations that produce a numpy array of results with about 8 significant figures. When I use tolist() on my array y_axis, it creates what I assume are 32-bit numbers.
However, I wonder if this is just garbage. I assume it is garbage, but it seems intelligent enough to change the last number so that rounding makes sense.
print "y_axis:",y_axis
y_axis = y_axis.tolist()
print "y_axis:",y_axis
y_axis: [-0.99636686 0.08357361 -0.01638707]
y_axis: [-0.9963668578012771, 0.08357361233570479, -0.01638706796138937]
So my question is: if this is not garbage, does using tolist actually help in accuracy for my calculations, or is Python always using the entire number, but just not displaying it?

When you call print y_axis on a numpy array, you are getting a truncated version of the numbers that numpy is actually storing internally. The way in which it is truncated depends on how numpy's printing options are set.
>>> arr = np.array([22/7, 1/13]) # init array
>>> arr # np.array default printing
array([ 3.14285714, 0.07692308])
>>> arr[0] # int default printing
3.1428571428571428
>>> np.set_printoptions(precision=24) # increase np.array print "precision"
>>> arr # np.array high-"precision" print
array([ 3.142857142857142793701541, 0.076923076923076927347012])
>>> float.hex(arr[0]) # actual underlying representation
'0x1.9249249249249p+1'
The reason it looks like you're "gaining accuracy" when you print out the .tolist()ed form of y_axis is that by default, more digits are printed when you call print on a list than when you call print on a numpy array.
In actuality, the numbers stored internally by either a list or a numpy array should be identical (and should correspond to the last line above, generated with float.hex(arr[0])), since numpy uses numpy.float64 by default, and python float objects are also 64 bits by default.

My understanding is that numpy is not showing you the full precision to make the matrices lay out consistently. The list shouldn't have any more precision than its numpy.array counterpart:
>>> v = -0.9963668578012771
>>> a = numpy.array([v])
>>> a
array([-0.99636686])
>>> a.tolist()
[-0.9963668578012771]
>>> a[0] == v
True
>>> a.tolist()[0] == v
True

Related

how do I check decimal.is_nan() for all values in array?

Suppose I have my array like this:
from decimal import Decimal
array = [Decimal(np.nan), Decimal(np.nan), Decimal(0.231411)]
I know that if the types are float, I can check if all the values are nan or not
, as:
np.isnan(array).all()
Is there a way for type Decimal?
The solution would be better without iteration.
You could use NumPy's vectorize to avoid iteration.
In [40]: from decimal import Decimal
In [41]: import numpy as np
In [42]: nums = [Decimal(np.nan), Decimal(np.nan), Decimal(0.231411)]
In [43]: nums
Out[43]:
[Decimal('NaN'),
Decimal('NaN'),
Decimal('0.2314110000000000055830895462349872104823589324951171875')]
In [44]: np.all(np.vectorize(lambda x: x.is_nan())(np.asarray(nums)))
Out[44]: False
In [45]: np.all(np.vectorize(lambda x: x.is_nan())(np.asarray(nums[:-1])))
Out[45]: True
In the snippet above nums is a list of instances of class Decimal. Notice that you need to convert that list into a NumPy array.
From my comment above, I realise it’s an iteration. The reason is that np.isnan does not support Decimal as an input type; therefore, I don’t believe this can be done via broadcasting, without converting the datatype - which means a potential precision loss, which is a reason to use a Decimal type.
Additionally, as commented by #juanpa.arrivillaga, as the Decimal objects are in a list, iteration is the only way. Numpy is not necessary in this operation.
One method is:
all([i.is_nan() for i in array])

Python nan with complex arrays

Inserting a nan in Python into a complex numpy array gives some (for me) unexpected behavior:
a = np.array([5+6*1j])
print a
array([5.+6.j])
a[0] = np.nan
print a
array([nan+0.j])
I expected Python to write nan+nanj. For analyses it often might not matter, since np.isnan of any complex with either real and/or imaginary parts is True. However, I did not know the behavior and when plotting the real and imaginary parts of my array it gave me the impression I had info on the imaginary (however there is none). A workaround is to write a[0] = np.nan + np.nan*1j. Can somebody explain the reason for this behavior to me?
The issue here is that when you create an array with complex values:
a = np.array([5+6*1j])
You've created an array of dtype complex:
a.dtype
# dtype('complex128')
So by adding a value which only contains real part, it will be converted to a complex value, and you will thus be inserting a number with a complex component equal to 0j, so:
np.complex(np.nan)
# (nan+0j)
Which explains the behaviour:
a[0] = np.array([np.nan])
print(a)
# [nan+0.j]
It probably hast to do with numpy representation of nan:
NumPy uses the IEEE Standard for Binary Floating-Point for Arithmetic
(IEEE 754). This means that Not a Number is not equivalent to
infinity.
Essentially np.nan is a float. By setting x[0] = np.nan you are setting its value to a "real" float (but not changing the dtype of the array, which remains complex), so the imaginary part remains untouched as 0j.
That also explains why you can change the imaginary part by doing np.nan * 0j

Stocking large numbers into numpy array

I have a dataset on which I'm trying to apply some arithmetical method.
The thing is it gives me relatively large numbers, and when I do it with numpy, they're stocked as 0.
The weird thing is, when I compute the numbers appart, they have an int value, they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]:36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have off course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]:True
How can I compute this large numbers using numpy without seeing them turned into zeros?
Note that if we remove the factor 10 at the beggining the problem slitghly changes (but I think it might be a similar reason):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]:311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate the largest numbers of the int64 type in python as been crossed don't they?
As Nils Werner already mentioned, numpy's native ctypes cannot save numbers that large, but python itself can since the int objects use an arbitrary length implementation.
So what you can do is tell numpy not to convert the numbers to ctypes but use the python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.

summing over a list of int overflow(?) python

Let's consider a list of large integers, for example one given by:
def primesfrom2to(n):
# http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
""" Input n>=6, Returns a array of primes, 2 <= p < n """
sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
sieve[0] = False
for i in xrange(int(n**0.5)/3+1):
if sieve[i]:
k=3*i+1|1
sieve[ ((k*k)/3) ::2*k] = False
sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
return np.r_[2,3,((3*np.nonzero(sieve)[0]+1)|1)]
primesfrom2to(2000000)
I want to calculate the sum of that, and the expected result is 142913828922.
But if I do:
sum(primesfrom2to(2000000))
I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let's me explain.Consider this testing code:
a=primesfrom2to(2000000)
b=[float(i) for i in a]
c=[long(i) for i in a]
sumI=0
sumF=0
sumL=0
m=0
for i,j,k in zip(a,b,c):
m=m+1
sumI=sumI+i
sumF=sumF+j
sumL=sumL+k
print sumI,sumF,sumL
if sumI<0:
print i,m
break
I found out that the first integer overflow is happening at a[i=20444]=225289
If I do:
>>> sum(a[:20043])+225289
-2147310677
But if I do:
>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L
What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?
Look at the types of your results. You are summing a numpy array, which is using numpy datatypes, which can overflow. When you do sum(a[:20043]), you get a numpy object back (some sort of int32 or the like), which overflows when added to another number. When you manually type in the same number, you're creating a Python builtin int, which can auto-promote to long. Numpy arrays cannot autopromote like Python builtin types, because the array type (and its memory layout) have to be fixed when the array is created. This makes operations much faster at the expense of type flexibility.
You may be able to get around the problem by using a different datatype (like np.int64) instead of np.bool. However, it depends how big your numbers are. A simple example:
# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L
# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
Your example works correctly for me on 64-bit Python, so I guess you're using 32-bit Python. If you can use 64-bit types you will be able to get past the limit you found, but as my example shows you will eventually overflow 64-bit ints too if your numbers get super huge.

are numpy array elements rounded automatically?

I have an numpy array of floats in Python.
When I print the array, the first value is:
[7.14519700e+04, ....
If, however, I print out just the first value on it's own, the print out reads:
71451.9699799
Obviously these numbers should be identical, so I just wondered, is the array just showing me a rounded version of the element? The second number here has 12 significant figures, and the first only has 9.
I guess I just wonder why these numbers are different?
It's just in the printing, not in the storage. The only confusion might occur because the first example uses numpy's print precision settings, the second example general python's print settings.
You can adjust the numpy precision and print by
numpy.set_printoptions(precision=20)
print myarray`
(adjust precision to your needs), or select the number of significant figures in standard python formatted print:
print ('%.20f' % myarray[0])
The internal representation of the number is always the same.
The types in a numpy array are well defined. You can get how they are stored by inspecting the numpy.dtype property of an array.
For example:
import numpy
a = numpy.zeros(10)
print a.dtype
will show float64, that is a 64-bit floating point number.
You can specify the type of the array explicitly using either the commonly accepted dtype argument, or the dtype type object (that is, the thing that makes the dtype).
a = numpy.zeros(10, dtype='complex32') # a 32-bit floating point
b = numpy.longdouble(a) # create a long-double array from a
Regarding the printing, this is just a formatting issue. You can twiddle how numpy prints an array using numpy.set_printoptions:
>>> a = numpy.random.randn(3) # for interest, randn annoyingly doesn't support the dtype arg
>>> print a
[ 0.12584756 0.73540009 -0.17108244 -0.96818512]
>>> numpy.set_printoptions(precision=3)
>>> print a
[ 0.126 0.735 -0.171 -0.968]

Categories