Reinterpret data in numpy ndarray - python

I have a numpy array with dtype=uint8 and shape=(N,4), and I want to efficiently reinterpret the 4 bytes along axis=1 as dtype=int32 to get a result with shape=(N,), but nothing I've tried works. The equivalent in C would be bluntly casting the array's pointer.
The initial array is created like this from a pandas dataframe:
tmp=df[['data_1','data_2','data_3','data_4']].values.astype('uint8')
This works, but it is not vectorized:
tmp1 = np.empty((tmp.shape[0],), dtype=np.int32)
for i in range(tmp.shape[0]):
    tmp2 = tmp[i].copy()
    tmp1[i] = tmp2.view('<i4')[0]
And this, which I understand to be the efficient way to do it, doesn't work:
tmp1=tmp.view('<i4')
Giving the error:
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
But the size should be correct as far as I understand.
edit: added the reinterpreted explanation

Assuming you actually want the output shape to be (N*4,) (not (N,) as you wrote initially), you can just flatten it and then cast it to your desired type:
tmp1 = tmp.flatten().astype('int32', copy=False)
EDIT:
If you actually want the same underlying data to be interpreted as a different type and get a (N,) array out, the view method is in fact the way to go. This for example works for me:
import numpy as np
N = 5
a = np.arange(N*4, dtype='uint8').reshape((N,4))
a.view('int32')[:,0]
That view is then array([ 50462976, 117835012, 185207048, 252579084, 319951120], dtype=int32).
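One possible reason the plain view fails in the question (an assumption, since the question does not show the array's memory layout) is that the array coming out of pandas is not C-contiguous, so the 4 bytes of a row are not adjacent in memory. A minimal sketch that forces C order before reinterpreting; the Fortran-ordered tmp below is only a stand-in for the DataFrame-derived array:

```python
import numpy as np

N = 5
# Stand-in for the array from the question; Fortran order is an assumption
# about what df[...].values.astype('uint8') may return.
tmp = np.asfortranarray(np.arange(N * 4, dtype=np.uint8).reshape(N, 4))

# Copy to C order (only if needed) so each row's 4 bytes sit next to each other,
# then reinterpret them as little-endian int32 and drop the length-1 axis.
packed = np.ascontiguousarray(tmp).view('<i4').reshape(-1)

print(packed.shape)  # (5,)
print(packed)
```

np.ascontiguousarray returns the input unchanged when it is already C-contiguous, so the extra copy is only paid when the layout requires it.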

Related

Zero-dimensional numpy.ndarray : only element is a 2D array : how to access it?

I have imported a Matlab *.mat file using scipy.io and am trying to extract the 2D data from it. There are several arrays inside, and I got stuck at the last step of getting them out.
The data looks like the image below. When I try to index it, I get: IndexError: too many indices for array
From searching, it looks like a single-valued tuple whose only element is my array. In principle this should be indexable, but it doesn't work. type(data) returns <class 'numpy.ndarray'>
So the question is: how do I get my 2D array out of this data structure?
data[0] # Doesn't work.
A search on loadmat should yield many SO questions that will help you pick apart this result. loadmat has to translate MATLAB objects into Python/numpy approximations.
data = io.loadmat(filename)
should produce a dictionary with some header keys (such as __header__, __version__ and __globals__) and various data keys. Use list(data.keys()) to identify them.
x = data['x']
should match the x variable in the MATLAB workspace. It could be a 2D, order-F array, corresponding to a MATLAB matrix.
It could be an (n,m) object dtype array, corresponding to a MATLAB cell.
It could be a structured array, where the field names correspond to the fields of a MATLAB struct.
In your case it looks like you have a 0d object dtype array. The shape is (), an empty tuple (1d has (n,) shape, 2d has (n,m) shape, etc). You can pull the element out of a () array with:
y[()]
y.item()
The [()] looks odd, but it's logical. For a 1d array y[1] can be written as y[(1,)]. For 2d, y[1,2] and y[(1,2)] are the same. The indexing tuple should match the number of dimensions. Hence a () can index a () shape array.
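To make the 0d case concrete, here is a small sketch that builds a zero-dimensional object array by hand and pulls its single element out both ways. The names inner and holder are hypothetical; with loadmat the 0d array would come from the file instead.

```python
import numpy as np

inner = np.arange(6).reshape(2, 3)      # the 2D array we want to recover

# Build a 0d object array whose only element is `inner`.
holder = np.empty(1, dtype=object)
holder[0] = inner
y = holder.reshape(())

print(y.shape)                          # () : no dimensions to index into

try:
    y[0]                                # an integer index needs at least one axis
except IndexError as exc:
    print("IndexError:", exc)

via_tuple = y[()]                       # empty tuple matches the empty shape
via_item = y.item()                     # same element, via the method
print(via_tuple.shape)
```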
After some voodoo coding I found a funny way to solve this.
The initial data is a zero-dimensional array whose only element is the 2D array. Apparently the way to get this element out is:
z = data.item()[()][0]
print(z)
With that I got my 2D array.

numpy array concatenation error: 0-d arrays can't be concatenated

I am trying to concatenate two numpy arrays, but I got this error. Could someone give me a clue about what it actually means?
import numpy as np
allValues = np.arange(-1, 1, 0.5)
tmp = np.concatenate(allValues, np.array([30], float))
Then I got
ValueError: 0-d arrays can't be concatenated
If I do
tmp = np.concatenate(allValues, np.array([50], float))
There is no error message, but the tmp variable does not reflect the concatenation either.
You need to put the arrays you want to concatenate into a sequence (usually a tuple or list) in the argument.
tmp = np.concatenate((allValues, np.array([30], float)))
tmp = np.concatenate([allValues, np.array([30], float)])
Check the documentation for np.concatenate. Note that the first argument is a sequence (e.g. list, tuple) of arrays. It does not take them as separate arguments.
As far as I know, this API is shared by all of numpy's concatenation functions: concatenate, hstack, vstack, dstack, and column_stack all take a single main argument that should be some sequence of arrays.
The reason you are getting that particular error is that arrays are sequences as well. But this means that concatenate is interpreting allValues as a sequence of arrays to concatenate. However, each element of allValues is a float rather than an array, and is therefore being interpreted as a zero-dimensional array. As the error says, these "arrays" cannot be concatenated.
The second argument is taken as the second (optional) argument of concatenate, which is the axis to concatenate on. This only works because there is a single element in the second argument, which can be cast as an integer and therefore is a valid value. If you had put an array with more elements in the second argument, you would have gotten a different error:
a = np.array([1, 2])
b = np.array([3, 4])
np.concatenate(a, b)
# TypeError: only length-1 arrays can be converted to Python scalars
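As a quick check of the explanation above, this sketch passes the arrays as one sequence and, for contrast, as two separate arguments. The exact exception type and message for the wrong call can differ between numpy versions, so both ValueError and TypeError are caught.

```python
import numpy as np

all_values = np.arange(-1, 1, 0.5)      # 4 elements
extra = np.array([30], float)

# Correct: a single sequence (tuple or list) of arrays.
joined = np.concatenate((all_values, extra))
print(joined.shape)                     # (5,)

# Wrong: the second array is taken as the axis argument.
try:
    np.concatenate(all_values, extra)
except (ValueError, TypeError) as exc:
    print(type(exc).__name__, exc)
```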
Also make sure you are concatenating two numpy arrays. I was concatenating a Python list with a numpy array, and it gave me the same error:
ValueError: 0-d arrays can't be concatenated
It took me some time to figure this out, since all the answers on Stack Overflow assumed that you had two numpy arrays.
A pretty silly but easily overlooked mistake, so I'm posting it in case it helps someone.
You can convert an existing Python list with np.asarray, or create numpy arrays directly, if it helps.
Another way to get this error is to have two numpy objects of different... types?
I get this error when I try np.concatenate([A,B])
and ValueError: all the input arrays must have same number of dimensions when I run np.concatenate([B,A])
Just as @mithunpaul mentioned, my types are off: A is an array of shape 44279x204 and B is a <44279x12 sparse matrix of type '<class 'numpy.float64'>' with 88558 stored elements in Compressed Sparse Row format>.
So that's why the error is happening. I don't know how to solve it yet, though.
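The post leaves the dense/sparse case unsolved. One approach that is not from the thread, offered here only as a sketch, is to bring both operands to the same kind of object first: either convert the dense array to sparse and use scipy.sparse.hstack, or densify the sparse matrix and use np.hstack. The small A and B below are stand-ins for the 44279-row objects.

```python
import numpy as np
from scipy import sparse

A = np.ones((3, 2))                     # dense array (stand-in)
B = sparse.csr_matrix(np.eye(3))        # CSR sparse matrix (stand-in)

# Option 1: stay sparse. Convert A explicitly, then stack column-wise.
combined_sparse = sparse.hstack([sparse.csr_matrix(A), B]).tocsr()

# Option 2: densify B. Only sensible if the result fits in memory.
combined_dense = np.hstack([A, B.toarray()])

print(combined_sparse.shape, combined_dense.shape)
```

Which option fits depends on how sparse the data is and on what the downstream code accepts.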

numpy arrays dimension mismatch

I am using numpy and pandas to attempt to concatenate a number of heterogeneous values into a single array.
np.concatenate((tmp, id, freqs))
Here are the exact values:
tmp = np.array([u'DNMT3A', u'p.M880V', u'chr2', 25457249], dtype=object)
freqs = np.array([0.022831050228310501], dtype=object)
id = "id_23728"
The dimensions of tmp, 17232, and freqs are as follows:
[in] tmp.shape
[out] (4,)
[in] np.array(17232).shape
[out] ()
[in] freqs.shape
[out] (1,)
I have also tried casting them all as numpy arrays, to no avail.
Note that the variable freqs will frequently have more than one value.
However, with both the np.concatenate and np.append functions I get the following error:
*** ValueError: all the input arrays must have same number of dimensions
These all have the same number of columns (0), so why can't I concatenate them with either of the numpy methods described above?
All I'm looking to obtain is [(tmp), 17232, (freqs)] in a single one-dimensional array, which is to be appended onto the end of a pandas dataframe.
Thanks.
Update
It appears I can concatenate the two existing arrays:
np.concatenate([tmp, freqs],axis=0)
array([u'DNMT3A', u'p.M880V', u'chr2', 25457249, 0.022831050228310501], dtype=object)
However, the integer, even when cast to an array, cannot be used in concatenate.
np.concatenate([tmp, np.array(17571)],axis=0)
*** ValueError: all the input arrays must have same number of dimensions
What does work, however, is nesting append and concatenate:
np.concatenate((np.append(tmp, 17571), freqs))
array([u'DNMT3A', u'p.M880V', u'chr2', 25457249, 17571, 0.022831050228310501], dtype=object)
Although this is kind of messy. Does anyone have a better solution for concatenating a number of heterogeneous arrays?
The problem is that id, and later the integer np.array(17571), do not behave as 1D array_like objects: numpy converts each of them to a zero-dimensional array. See how numpy decides whether an object can be converted automatically to a numpy array.
The solution is to make id array_like, i.e. an element of a list or tuple, so that numpy understands that id belongs to a 1D array_like structure.
It all boils down to
np.concatenate((tmp, (id,), freqs))
or
np.concatenate((tmp, [id], freqs))
To avoid this sort of problem when handling input variables in functions that use numpy, you can use atleast_1d, as pointed out by @askewchan. See the related question/answer about it.
Basically, if you are unsure whether your variable id will be a single str or a list of str in different scenarios, you are better off using
np.concatenate((tmp, np.atleast_1d(id), freqs))
because the two options above will fail if id is already a list/tuple of strings.
EDIT: It may not be obvious why np.array(17571) does not work here. This happens because np.array(17571).shape==(), so it is not iterable, as it has no dimensions.
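A short sketch of the atleast_1d suggestion, using the values from the question. The name record_id replaces id here only to avoid shadowing the Python builtin.

```python
import numpy as np

tmp = np.array(['DNMT3A', 'p.M880V', 'chr2', 25457249], dtype=object)
freqs = np.array([0.022831050228310501], dtype=object)
record_id = "id_23728"                  # renamed from `id` (a builtin)

# atleast_1d turns a scalar into a length-1 array and leaves 1D input alone.
row = np.concatenate((tmp, np.atleast_1d(record_id), freqs))
print(row.shape)                        # (6,)

# The same call handles the bare integer from the question's update.
row_int = np.concatenate((tmp, np.atleast_1d(17571), freqs))
print(row_int.shape)                    # (6,)
```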

strange behaviour of numpy masked array

I have trouble understanding the behaviour of numpy masked arrays.
Here is the snippet that puzzles me, for two reasons:
arr = numpy.ma.array([(1,2),(3,4)],dtype=[("toto","int"),("titi","int")])
arr[0][0] = numpy.ma.masked
When I do this, nothing happens: no mask is applied to element [0][0].
When I change the data to [[1,2],[3,4]] (instead of [(1,2),(3,4)]), I get the following error:
TypeError: expected a readable buffer object
It seems that I completely misunderstood how to set up (and use) a masked array.
Could you tell me what is wrong with this code?
Thanks.
EDIT: without specifying the dtypes, it works as expected.
The purpose of a masked array is to mark some elements of the array as invalid, i.e. masked, so that operations ignore them.
For example, you have an array:
a = np.array([[2, 1000], [3, 1000]])
And you want to ignore any operations with the elements >100. You create a masked array like:
b = np.ma.array(a, mask=(a>100))
You can perform some operations in both arrays to see the differences:
a.sum()
# 2005
b.sum()
# 5
a.prod()
# 6000000
b.prod()
# 6
As you see, the masked items are ignored...
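The asker's edit notes that masking works once the structured dtype is dropped. A minimal sketch of that plain (non-structured) case, where assigning np.ma.masked to one element sets its mask and later reductions skip it:

```python
import numpy as np

arr = np.ma.array([[1, 2], [3, 4]])     # no structured dtype
arr[0, 0] = np.ma.masked                # mask a single element

print(arr.mask)                         # True only at position [0, 0]
print(arr.sum())                        # 2 + 3 + 4 = 9
print(arr.count())                      # 3 unmasked elements
```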

numpy loadtxt single line/row as list

I have a data file with only one line like:
1.2 2.1 3.2
I used loadtxt from numpy version 1.3.0 to load it:
a,b,c = loadtxt("data.dat", usecols=(0,1,2), unpack=True)
The output was a float instead of an array, like
a = 1.2
I expect it would be:
a = array([1.2])
If I read a file with multiple lines, it works.
Simply use numpy's built-in loadtxt parameter ndmin (added in NumPy 1.6.0, so it is not available in 1.3.0).
a,b,c = np.loadtxt('data.dat', ndmin=2, unpack=True)
output
a=[1.2]
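The difference ndmin makes can be checked without a file on disk. This sketch feeds the single line from the question through io.StringIO, which stands in for data.dat:

```python
import io
import numpy as np

one_line = "1.2 2.1 3.2\n"              # stand-in for the contents of data.dat

# Default: a 1D array is unpacked into three numpy scalars (shape ()).
a0, b0, c0 = np.loadtxt(io.StringIO(one_line), unpack=True)
print(np.shape(a0))                     # ()

# ndmin=2 keeps a row axis, so each unpacked column is a 1D array.
a, b, c = np.loadtxt(io.StringIO(one_line), ndmin=2, unpack=True)
print(a.shape)                          # (1,)
```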
What is happening is that when you load the file you obtain a one-dimensional array. When you unpack it, you obtain a set of numbers, i.e. arrays without dimensions. This is because unpacking an array decreases its number of dimensions by one; starting from a one-dimensional array, each piece boils down to a simple number.
If you test the type of a, it is not a Python float but a numpy.float64, which has the properties of an array but an empty tuple as its shape. So it behaves like an array; it just isn't represented as one.
If what you need is a one-dimensional array with just one element, the simplest way is to reshape your array before unpacking it:
# note the reshape call that transforms the shape
a,b,c = loadtxt("text.txt").reshape((-1,1))
This gives you the expected result. We reshaped it into a two-dimensional array, so that when you unpack it, the number of dimensions goes down to one.
EDIT:
If you need it to work normally for multidimensional arrays and to stay one-dimensional when you read a one-dimensional array, I think the best way is to read normally with loadtxt and reshape your arrays in a second step, converting them to one-dimensional arrays if they are pure numbers:
a,b,c = loadtxt("text.txt", unpack=True)
# reshape returns a new array, so rebind the names
a,b,c = [e.reshape(e.shape if e.shape else (-1,)) for e in (a,b,c)]
A simple way without reshape is to explicitly wrap the values in lists:
a,b,c = loadtxt("data.dat", usecols=(0,1,2), unpack=True)
a,b,c = (a,b,c) if a.shape else ([a], [b], [c])
This works faster than the reshape!
