ndarray to structured_array and float to int - python

The problem I encounter is that using ndarray.view(np.dtype) to get a structured array from a plain ndarray seems to garble the float-to-int conversion. An example shows it best:
In [12]: B
Out[12]:
array([[ 1.00000000e+00, 1.00000000e+00, 0.00000000e+00,
0.00000000e+00, 4.43600000e+01, 0.00000000e+00],
[ 1.00000000e+00, 2.00000000e+00, 7.10000000e+00,
1.10000000e+00, 4.43600000e+01, 1.32110000e+02],
[ 1.00000000e+00, 3.00000000e+00, 9.70000000e+00,
2.10000000e+00, 4.43600000e+01, 2.04660000e+02],
...,
[ 1.28900000e+03, 1.28700000e+03, 0.00000000e+00,
9.99999000e+05, 4.75600000e+01, 3.55374000e+03],
[ 1.28900000e+03, 1.28800000e+03, 1.29000000e+01,
5.40000000e+00, 4.19200000e+01, 2.08400000e+02],
[ 1.28900000e+03, 1.28900000e+03, 0.00000000e+00,
0.00000000e+00, 4.19200000e+01, 0.00000000e+00]])
In [14]: B.view(A.dtype)
Out[14]:
array([(4607182418800017408, 4607182418800017408, 0.0, 0.0, 44.36, 0.0),
(4607182418800017408, 4611686018427387904, 7.1, 1.1, 44.36, 132.11),
(4607182418800017408, 4613937818241073152, 9.7, 2.1, 44.36, 204.66),
...,
(4653383897399164928, 4653375101306142720, 0.0, 999999.0, 47.56, 3553.74),
(4653383897399164928, 4653379499352653824, 12.9, 5.4, 41.92, 208.4),
(4653383897399164928, 4653383897399164928, 0.0, 0.0, 41.92, 0.0)],
dtype=[('i', '<i8'), ('j', '<i8'), ('tnvtc', '<f8'), ('tvtc', '<f8'), ('tf', '<f8'), ('tvps', '<f8')])
The 'i' and 'j' columns should be true integers. Here are two further checks I have done; the problem seems to come from ndarray.view(np.int):
In [21]: B[:,:2]
Out[21]:
array([[ 1.00000000e+00, 1.00000000e+00],
[ 1.00000000e+00, 2.00000000e+00],
[ 1.00000000e+00, 3.00000000e+00],
...,
[ 1.28900000e+03, 1.28700000e+03],
[ 1.28900000e+03, 1.28800000e+03],
[ 1.28900000e+03, 1.28900000e+03]])
In [22]: B[:,:2].view(np.int)
Out[22]:
array([[4607182418800017408, 4607182418800017408],
[4607182418800017408, 4611686018427387904],
[4607182418800017408, 4613937818241073152],
...,
[4653383897399164928, 4653375101306142720],
[4653383897399164928, 4653379499352653824],
[4653383897399164928, 4653383897399164928]])
In [23]: B[:,:2].astype(np.int)
Out[23]:
array([[ 1, 1],
[ 1, 2],
[ 1, 3],
...,
[1289, 1287],
[1289, 1288],
[1289, 1289]])
What am I doing wrong? Can't I change the type because of the way numpy lays out memory? Is there another way to do this? (fromarrays was complaining about a shape mismatch.)

This is the difference between somearray.view(new_dtype) and somearray.astype(new_dtype).
What you're seeing is exactly the expected behavior, and it's very deliberate, but it's surprising the first time you come across it.
A view with a different dtype interprets the underlying memory buffer of the array as the given dtype. No copies are made. It's very powerful, but you have to understand what you're doing.
A key thing to remember is that calling view never alters the underlying memory buffer, just the way that it's viewed by numpy (e.g. dtype, shape, strides). Therefore, view deliberately avoids altering the data to the new type and instead just interprets the "old bits" as the new dtype.
For example:
In [1]: import numpy as np
In [2]: x = np.arange(10)
In [3]: x
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [4]: x.dtype
Out[4]: dtype('int64')
In [5]: x.view(np.int32)
Out[5]: array([0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 0],
dtype=int32)
In [6]: x.view(np.float64)
Out[6]:
array([ 0.00000000e+000, 4.94065646e-324, 9.88131292e-324,
1.48219694e-323, 1.97626258e-323, 2.47032823e-323,
2.96439388e-323, 3.45845952e-323, 3.95252517e-323,
4.44659081e-323])
If you want to make a copy of the array with a new dtype, use astype instead:
In [7]: x
Out[7]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [8]: x.astype(np.int32)
Out[8]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
In [9]: x.astype(float)
Out[9]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
However, using astype with structured arrays will probably surprise you. Structured arrays treat each element of the input as a C-like struct, so if you call astype, you'll run into several surprises.
Basically, you want the columns to have different dtypes. In that case, don't put them in the same array. Numpy arrays are expected to be homogeneous. Structured arrays are handy in certain cases, but they're probably not what you want if you're looking for something to handle separate columns of data. Just use each column as its own array.
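A minimal sketch of the per-column approach, using a three-row stand-in for the asker's B (the names i, j, and tnvtc mirror the asker's field names):

```python
import numpy as np

# Small stand-in for the asker's 2-D float array B: the first two
# columns hold integer-valued IDs, the rest are genuine floats.
B = np.array([[1.0,    1.0,    0.0, 0.0, 44.36, 0.0],
              [1.0,    2.0,    7.1, 1.1, 44.36, 132.11],
              [1289.0, 1289.0, 0.0, 0.0, 41.92, 0.0]])

# astype converts the values; view would reinterpret the raw bytes.
i = B[:, 0].astype(np.int64)
j = B[:, 1].astype(np.int64)
tnvtc = B[:, 2]          # already float64, no conversion needed

print(i.tolist())        # [1, 1, 1289]
print(j.tolist())        # [1, 2, 1289]
```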
Better yet, if you're working with tabular data, you'll probably find it's easier to use pandas than to use numpy arrays directly. pandas is oriented towards tabular data (where columns are expected to have different types), while numpy is oriented towards homogeneous arrays.
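A minimal pandas sketch under the same assumption about the asker's column names; per-column astype calls give a genuinely mixed-type table:

```python
import numpy as np
import pandas as pd

B = np.array([[1.0, 1.0, 0.0, 0.0, 44.36, 0.0],
              [1.0, 2.0, 7.1, 1.1, 44.36, 132.11]])

# Column names taken from the asker's structured dtype.
df = pd.DataFrame(B, columns=['i', 'j', 'tnvtc', 'tvtc', 'tf', 'tvps'])
df['i'] = df['i'].astype('int64')   # value-converting cast, not a view
df['j'] = df['j'].astype('int64')

print(df.dtypes)   # 'i' and 'j' are int64, the remaining columns stay float64
```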

Actually, fromarrays works, but it doesn't explain this weird behavior.
Here is the solution I've found:
np.core.records.fromarrays(B.T, dtype=A.dtype)

The only solution which worked for me in a similar situation:
np.array([tuple(row) for row in B], dtype=A.dtype)
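Both workarounds can be sketched end to end on a small stand-in for B and A.dtype (np.rec.fromarrays is the same function as np.core.records.fromarrays):

```python
import numpy as np

B = np.array([[1.0,    1.0,    44.36],
              [1289.0, 1287.0, 47.56]])

# Stand-in for the asker's A.dtype.
dt = np.dtype([('i', '<i8'), ('j', '<i8'), ('tf', '<f8')])

# Option 1: fromarrays converts column by column (note the transpose).
rec = np.rec.fromarrays(B.T, dtype=dt)

# Option 2: build from row tuples; each field is cast individually.
arr = np.array([tuple(row) for row in B], dtype=dt)

print(rec['i'].tolist())   # [1, 1289]
print(arr['j'].tolist())   # [1, 1287]
```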

Related

How to convert Zero to Nan in the array?

I used temp[temp == 0] = np.nan, but I got this error:
IndexError: 2-dimensional boolean indexing is not supported.
I'd use where, to avoid having to drop down to numpy:
In [35]: d
Out[35]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[0, 1, 2],
[3, 4, 5]])
Dimensions without coordinates: dim_0, dim_1
In [36]: d.where(d != 0)
Out[36]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[nan, 1., 2.],
[ 3., 4., 5.]])
Dimensions without coordinates: dim_0, dim_1
and it will automatically cast to floats if necessary.

int array with Missing values numpy

Numpy int arrays can't store missing values.
>>> import numpy as np
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> myArray = np.arange(10)
>>> myArray.dtype
dtype('int32')
>>> myArray[0] = None
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
>>> myArray.astype( dtype = 'float')
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> myFloatArray = myArray.astype( dtype = 'float')
>>> myFloatArray[0] = None
>>> myFloatArray
array([ nan, 1., 2., 3., 4., 5., 6., 7., 8., 9.])
Pandas warns about this in the docs, under Caveats and Gotchas, Support for int NA. Wes McKinney also reiterates the point in this Stack Overflow question.
I need to be able to store missing values in an int array. I'm INSERTing rows into my database which I've set up to accept only ints of varying sizes.
My current work around is to store the array as an object, which can hold both ints and None-types as elements.
>>> myArray.astype( dtype = 'object')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
>>> myObjectArray = myArray.astype( dtype = 'object')
>>> myObjectArray[0] = None
>>> myObjectArray
array([None, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
This seems to be memory intensive and slow for large data-sets. I was wondering if anyone has a better solution while the numpy development is underway.
I found a very quick way to convert all the missing values in my dataframe into None types: the .where method.
mydata = mydata.where( pd.notnull( mydata ), None )
It is much less memory intensive than what I was doing before.
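A self-contained sketch of that .where trick (the data here is made up; casting to object first guarantees the None values are kept rather than turned back into NaN):

```python
import numpy as np
import pandas as pd

# A float column with a NaN that should become a Python None before an
# SQL INSERT (NaN is not a valid NULL for most database drivers).
mydata = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

# where keeps values where the condition holds and substitutes None
# elsewhere; the object dtype can hold both numbers and None.
cleaned = mydata.astype(object).where(pd.notnull(mydata), None)
print(cleaned['a'].tolist())
```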

How to make a numpy array of form ([1.], [2.], [3.]...) from list?

I am trying to make a numpy array of the form ([1.], [2.], ...) from a list [1, 2, 3] so I can use it as an input for sklearn's linear_model.
This command
np.array(test_list)
produces this kind of array:
array([1, 2, 3, 4])
whereas I want
array([[1.], [2.], [3.], [4.]])
You could just reshape:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.reshape(arr.size, 1).astype(float))
Which would give you:
[[ 1.]
[ 2.]
[ 3.]
[ 4.]
[ 5.]]
+1 to mgilson's answer. Here's another way:
arr = np.array([np.array([float(i)]) for i in test_list])
You can insert a new axis and transpose:
>>> arr = np.array([1, 2, 3, 4], dtype=float)
>>> arr[None, ...].T
array([[1.],
[2.],
[3.],
[4.]])
As with most things numpy, there's probably a better way, but this works alright :-).
Or, as pointed out in the comments, you can just insert an axis at the right place:
>>> arr[..., None]
array([[ 1.],
[ 2.],
[ 3.],
[ 4.]])
Note that you could write None as np.newaxis if you find that to be more semantically correct.
You could also use NumPy's atleast_2d and transpose:
In [270]: np.atleast_2d([1, 2, 3, 4, 5]).T.astype(float)
Out[270]:
array([[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ 5.]])

Python dealing with repeated eigenvalues

Consider A a real symmetric matrix and
import scipy.linalg
(s,u)=scipy.linalg.eigh(A)
If A has repeated eigenvalues then the columns of u are not necessarily orthonormal. What is the most efficient way to obtain a basis of orthonormal eigenvectors in python?
Use eigh() instead of eig(), since eigh() is specially designed to deal with complex Hermitian and real symmetric matrices.
In [1]: import numpy as np
...: sigma = np.array([[ 7, -3, 2],
...: [ -3, -1, 6],
...: [ 2, 6, 4]])
In [2]: val, vec = np.linalg.eig(sigma)
In [3]: val
Out[3]: array([ 8., -6., 8.])
In [4]: vec
Out[4]:
array([[ 0.96362411, -0.26726124, -0.06680865],
[-0.22237479, -0.80178373, 0.56878282],
[ 0.14824986, 0.53452248, 0.81976991]])
Observe that vec.T @ vec is not an identity matrix:
In [5]: vec.T @ vec
Out[5]:
array([[ 1.00000000e+00, 0.00000000e+00, -6.93306162e-02],
[ 0.00000000e+00, 1.00000000e+00, 1.11022302e-16],
[-6.93306162e-02, 1.11022302e-16, 1.00000000e+00]])
Now let's use eigh():
In [6]: val, vec = np.linalg.eigh(sigma)
In [7]: val
Out[7]: array([-6., 8., 8.])
In [8]: vec
Out[8]:
array([[-0.26726124, 0.87287156, -0.40824829],
[-0.80178373, 0.03357198, 0.59667058],
[ 0.53452248, 0.48679376, 0.69088172]])
Observe that vec.T @ vec is an identity matrix:
In [9]: vec.T @ vec
Out[9]:
array([[1.00000000e+00, 5.78786501e-17, 5.56940743e-17],
[5.78786501e-17, 1.00000000e+00, 4.61864879e-17],
[5.56940743e-17, 4.61864879e-17, 1.00000000e+00]])
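The claims above can be checked programmatically (recomputing eigh on the same sigma):

```python
import numpy as np

sigma = np.array([[ 7, -3,  2],
                  [-3, -1,  6],
                  [ 2,  6,  4]])

val, vec = np.linalg.eigh(sigma)

# eigh returns an orthonormal eigenvector basis even though the
# eigenvalue 8 is repeated, so V^T V is the identity up to rounding.
print(np.allclose(vec.T @ vec, np.eye(3)))    # True

# The columns are still genuine eigenvectors: sigma @ v_k = val_k * v_k.
print(np.allclose(sigma @ vec, vec * val))    # True
```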

string representation of a numpy array with commas separating its elements

I have a numpy array, for example:
points = np.array([[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0., 0., 0., 0. ]])
if I print it or turn it into a string with str() I get:
print(points)
[[-468.927 -11.299 76.271 -536.723]
[-429.379 -694.915 -214.689 745.763]
[ 0. 0. 0. 0. ]]
I need to turn it into a string that prints with separating commas while keeping the 2D array structure, that is:
[[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0., 0., 0., 0. ]]
Does anybody know an easy way of turning a numpy array to that form of string?
I know that .tolist() adds the commas but the result loses the 2D structure.
Try using repr
>>> import numpy as np
>>> points = np.array([[-468.927, -11.299, 76.271, -536.723],
... [-429.379, -694.915, -214.689, 745.763],
... [ 0., 0., 0., 0. ]])
>>> print(repr(points))
array([[-468.927, -11.299, 76.271, -536.723],
[-429.379, -694.915, -214.689, 745.763],
[ 0. , 0. , 0. , 0. ]])
If you plan on printing large numpy arrays, set np.set_printoptions(threshold=sys.maxsize) first (after import sys; older numpy accepted threshold=np.nan, but newer versions reject non-numeric thresholds). Without it, the array representation is truncated after about 1000 entries (by default).
>>> arr = np.arange(1001)
>>> print(repr(arr))
array([ 0, 1, 2, ..., 998, 999, 1000])
Of course, if you have arrays that large, this starts to become less useful; you should probably analyze the data some way other than just looking at it, and there are better ways of persisting a numpy array than saving its repr to a file...
As of numpy 1.11, there is numpy.array2string:
In [279]: a = np.reshape(np.arange(25, dtype='int8'), (5, 5))
In [280]: print(np.array2string(a, separator=', '))
[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]]
Comparing with repr from @mgilson's answer (which shows "array()" and the dtype):
In [281]: print(repr(a))
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]], dtype=int8)
P.S. You still need the np.set_printoptions threshold setting for large arrays.
The function you are looking for is np.set_string_function.
What this function does is let you override the default __str__ or __repr__ functions for the numpy objects. If you set the repr flag to True, the __repr__ function will be overridden with your custom function. Likewise, if you set repr=False, the __str__ function will be overridden. Since print calls the __str__ function of the object, we need to set repr=False.
For example:
np.set_string_function(lambda x: repr(x), repr=False)
x = np.arange(5)
print(x)
will print the output
array([0, 1, 2, 3, 4])
A more aesthetically pleasing version is
np.set_string_function(lambda x: repr(x).replace('(', '').replace(')', '').replace('array', '').replace(" ", ' ') , repr=False)
print(np.eye(3))
which gives
[[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]]
Hope this answers your question.
Another way to do it, which is particularly helpful when an object doesn't have a __repr__() method, is to employ Python's pprint module (which has various formatting options). Here is what that looks like, by example:
>>> import numpy as np
>>> import pprint
>>>
>>> A = np.zeros(10, dtype=np.int64)
>>>
>>> print(A)
[0 0 0 0 0 0 0 0 0 0]
>>>
>>> pprint.pprint(A)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
