shape of pandas dataframe to 3d array - python

I want to convert pandas dataframe to 3d array, but cannot get the real shape of the 3d array:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][3:]=1
df['a'][:3]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d.shape
(2,)
But, when I set as this, I can get the shape
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][2:]=1
df['a'][:2]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d.shape
(2,2,5)
Is there some thing wrong with the code?
Thanks!

Nothing is wrong with the code. In the first case you simply don't have a 3-d array: an N-d array (here 3-d) must be rectangular, meaning every sub-array along a given axis must have the same shape. In the first case:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][3:]=1
df['a'][:3]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
You have a 1-d array of size 2 (that is what a3d.shape shows you) whose elements are 2-d arrays of shape (1,5) and (3,5):
a3d[0].shape
Out[173]: (1, 5)
a3d[1].shape
Out[174]: (3, 5)
so the two elements along the first dimension of what you call a3d do not have the same shape, and their dimensions can't be treated as further dimensions of this ndarray.
While in the second case,
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][2:]=1
df['a'][:2]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d[0].shape
Out[176]: (2, 5)
a3d[1].shape
Out[177]: (2, 5)
both elements of your first dimension have the same size, so a3d is a 3-d array.
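The check described above can be sketched as follows. This is only an illustration under some assumptions: groups_to_3d is a hypothetical helper, not a pandas or NumPy function, and it uses DataFrame.to_numpy and plain column assignment because as_matrix and chained assignment such as df['a'][3:]=1 are not available in recent pandas releases. Recent NumPy releases also refuse to build an array from unequal sub-arrays unless dtype=object is requested, so the helper reports the mismatch explicitly.

```python
import numpy as np
import pandas as pd

def groups_to_3d(df, key):
    """Hypothetical helper: stack per-group 2-d arrays into one 3-d array.

    Raises ValueError if the groups do not all have the same shape.
    """
    blocks = [group.to_numpy() for _, group in df.groupby(key)]
    shapes = {block.shape for block in blocks}
    if len(shapes) != 1:
        raise ValueError(f"groups have different shapes: {sorted(shapes)}")
    return np.stack(blocks)

df = pd.DataFrame(np.random.rand(4, 5), columns=list("abcde"))

# Two groups of two rows each: a rectangular 3-d array exists.
df["a"] = [2, 2, 1, 1]
a3d = groups_to_3d(df, "a")
print(a3d.shape)  # (2, 2, 5)

# One group of three rows and one of one row: no 3-d array exists.
df["a"] = [2, 2, 2, 1]
try:
    groups_to_3d(df, "a")
    ragged_rejected = False
except ValueError as exc:
    ragged_rejected = True
    print(exc)
```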

Related

Numpy array shape after extraction from Pandas Dataframe

I have a column in a Dataframe where each cell has a (300,) shaped numpy array.
When I extract the values of this column using the .values method, I get a numpy array of shape (N,) where N is the number of rows of the Dataframe. And each element of N is a (300,) array. I would have expected the extracted shape to be (Nx300).
So I would like the shape of the extracted column to be (Nx300). I tried using pd.as_matrix() but this still gets me a numpy array of shape (N,).
Any suggestions?
You can use numpy.concatenate, convert to a list and cast to an array:
a = np.random.randint(10, size=300)
print (a.shape)
(300,)
df = pd.DataFrame({ 'A':[a,a,a]})
arr = np.array(np.concatenate(df.values).tolist())
print (arr.shape)
(3, 300)
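As an alternative sketch (not part of the answer above), np.stack can combine the per-row arrays directly, assuming every cell holds an array of the same length. The column name A and the sizes simply mirror the example above.

```python
import numpy as np
import pandas as pd

a = np.random.randint(10, size=300)
df = pd.DataFrame({"A": [a, a, a]})

# The column itself is a 1-d object array whose elements are arrays.
col = df["A"].to_numpy()
print(col.shape, col.dtype)  # (3,) object

# Stacking the elements along a new first axis gives a regular 2-d array.
arr = np.stack(df["A"].tolist())
print(arr.shape)  # (3, 300)
```

If the cells had different lengths, np.stack would raise an error, since no rectangular (N, 300) array would exist.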

Confusion in size of a numpy array

Python numpy array 'size' confuses me a lot
a = np.array([1,2,3])
a.size = (3, )
------------------------
b = np.array([[2,1,3,5],
[2,2,5,1],
[3,6,99,5]])
b.size = (3,4)
'b' makes sense since it has 3 rows and 4 columns in each
But how is 'a' size = (3, ) ? Shouldn't it be (1,3) since its 1 row and 3 columns?
You should resist the urge to think of numpy arrays as having rows and columns, but instead consider them as having dimensions and shape. This is an important point which differentiates np.array and np.matrix:
x = np.array([1, 2, 3])
print(x.ndim, x.shape) # 1 (3,)
y = np.matrix([1, 2, 3])
print(y.ndim, y.shape) # 2 (1, 3)
An n-D array can only use n integer(s) to represent its shape. Therefore, a 1-D array only uses 1 integer to specify its shape.
In practice, combining calculations between 1-D and 2-D arrays is not a problem for numpy, and it is syntactically clean since the @ matrix-multiplication operator was introduced in Python 3.5. Therefore, there is rarely a need to resort to np.matrix in order to satisfy the urge to see expected row and column counts.
In the rare instances where 2 dimensions are required, you can still use numpy.array with some manipulation:
a = np.array([1, 2, 3])[:, None] # equivalent to np.array([[1], [2], [3]])
print(a.ndim, a.shape) # 2 (3, 1)
b = np.array([[1, 2, 3]]) # equivalent to np.array([1, 2, 3])[:, None].T
print(b.ndim, b.shape) # 2 (1, 3)
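A small sketch of how the @ operator treats 1-D and 2-D operands; the names M and v are made up for illustration:

```python
import numpy as np

M = np.arange(6).reshape(2, 3)   # 2-D, shape (2, 3)
v = np.array([1, 2, 3])          # 1-D, shape (3,)

# A 2-D array times a 1-D array gives a 1-D result.
print(M @ v, (M @ v).shape)      # [ 8 26] (2,)

# Making v an explicit column keeps the result 2-D.
print((M @ v[:, None]).shape)    # (2, 1)

# Two 1-D arrays give a scalar inner product.
print(v @ v)                     # 14
```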
No, a numpy.ndarray with shape (1, 3) would look like:
np.array([[1,2,3]])
Think about how the shape corresponds to indexing:
arr[0, ...] #First row
I still have three more options, namely:
arr[0,0]
arr[0,1]
arr[0,2]
Try doing that with a 1 dimensional array
I think you meant ndarray.shape. In that case, there's no need for confusion. Quoting the documentation from ndarray.shape:
Tuple of array dimensions.
ndarray.shape simply returns a shape tuple.
In [21]: a.shape
Out[21]: (3,)
This simply means that a is a 1D array with 3 entries.
If the shape tuple were (1,3), then a would be a 2D array. To get that you need to use:
In [23]: a = a[np.newaxis, :]
In [24]: a.shape
Out[24]: (1, 3)
Since array b is 2D, the shape tuple has two entries.
In [22]: b.shape
Out[22]: (3, 4)
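To make the distinction concrete, here is a short sketch comparing size, shape and ndim on the two arrays from the question; row and col are illustrative names:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[2, 1, 3, 5],
              [2, 2, 5, 1],
              [3, 6, 99, 5]])

# size is the total number of elements (an int); shape is a tuple.
print(a.size, a.shape, a.ndim)   # 3 (3,) 1
print(b.size, b.shape, b.ndim)   # 12 (3, 4) 2

# Adding an axis turns the 1D array into a 2D row or column.
row = a[np.newaxis, :]
col = a[:, np.newaxis]
print(row.shape, col.shape)      # (1, 3) (3, 1)
```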

What is the difference between an array with shape (N,1) and one with shape (N)? And how to convert between the two?

Python newbie here coming from a MATLAB background.
I have a 1 column array and I want to move that column into the first column of a 3 column array. With a MATLAB background this is what I would do:
import numpy as np
A = np.zeros([150,3]) #three column array
B = np.ones([150,1]) #one column array which needs to replace the first column of A
#MATLAB-style solution:
A[:,0] = B
However this does not work because the "shape" of A is (150,3) and the "shape" of B is (150,1). And apparently the command A[:,0] results in a "shape" of (150).
Now, what is the difference between (150,1) and (150)? Aren't they the same thing: a column vector? And why isn't Python "smart enough" to figure out that I want to put the column vector, B, into the first column of A?
Is there an easy way to convert a 1-column vector with shape (N,1) to a 1-column vector with shape (N)?
I am new to Python and this seems like a really silly thing that MATLAB does much better...
Several things are different. In numpy, arrays may be 0d, 1d or higher. In MATLAB, 2d is the smallest (and at one time was the only) number of dimensions. MATLAB readily expands dimensions at the end because it is Fortran ordered. numpy is by default C ordered, and most readily expands dimensions at the front.
In [1]: A = np.zeros([5,3])
In [2]: A[:,0].shape
Out[2]: (5,)
Simple indexing reduces a dimension, regardless of whether it's A[0,:] or A[:,0]. Contrast that with what happens to a 3d MATLAB matrix, A(1,:,:) vs A(:,:,1).
numpy does broadcasting, adjusting dimensions during operations like sum and assignment. One basic rule is that dimensions may be automatically expanded toward the start if needed:
In [3]: A[:,0] = np.ones(5)
In [4]: A[:,0] = np.ones([1,5])
In [5]: A[:,0] = np.ones([5,1])
...
ValueError: could not broadcast input array from shape (5,1) into shape (5)
It can change (5,) LHS to (1,5), but can't change it to (5,1).
Another broadcasting example, +:
In [6]: A[:,0] + np.ones(5);
In [7]: A[:,0] + np.ones([1,5]);
In [8]: A[:,0] + np.ones([5,1]);
Now the (5,) works with (5,1), but that's because it becomes (1,5), which together with (5,1) produces (5,5) - an outer product broadcasting:
In [9]: (A[:,0] + np.ones([5,1])).shape
Out[9]: (5, 5)
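The rule can be checked without doing any arithmetic. In the sketch below, broadcast_shape is a hypothetical helper built on np.broadcast; it only reports the shape that two operands would broadcast to. The last part repeats the assignment case, where the target shape (5,) is fixed and so a (5,1) value is rejected.

```python
import numpy as np

def broadcast_shape(*shapes):
    """Hypothetical helper: the shape NumPy would broadcast these shapes to."""
    return np.broadcast(*(np.empty(s) for s in shapes)).shape

# (5,) is treated as (1, 5): a new axis is added at the front.
print(broadcast_shape((5,), (1, 5)))   # (1, 5)
print(broadcast_shape((5,), (5, 1)))   # (5, 5)

A = np.zeros((5, 3))
A[:, 0] = np.ones((1, 5))              # accepted: leading size-1 axis is dropped

try:
    A[:, 0] = np.ones((5, 1))          # rejected: (5, 1) cannot fit into (5,)
    column_rejected = False
except ValueError:
    column_rejected = True
print(column_rejected)                 # True
```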
In Octave
>> x = ones(2,3,4);
>> size(x(1,:,:))
ans =
1 3 4
>> size(x(:,:,1))
ans =
2 3
>> size(x(:,1,1) )
ans =
2 1
>> size(x(1,1,:) )
ans =
1 1 4
To do the assignment that you want, you can adjust either side:
Index in a way that preserves the number of dimensions:
In [11]: A[:,[0]].shape
Out[11]: (5, 1)
In [12]: A[:,[0]] = np.ones([5,1])
transpose the (5,1) to (1,5):
In [13]: A[:,0] = np.ones([5,1]).T
flatten/ravel the (5,1) to (5,):
In [14]: A[:,0] = np.ones([5,1]).flat
In [15]: A[:,0] = np.ones([5,1])[:,0]
squeeze and ravel also work.
Some quick tests in Octave indicate that it is more forgiving when it comes to dimension mismatches. But numpy prioritizes consistency. Once the broadcasting rules are understood, the behavior makes sense.
Use squeeze method to eliminate the dimensions of size 1.
A[:,0] = B.squeeze()
Or just create B one-dimensional to begin with:
B = np.ones([150])
The fact that NumPy maintains a distinction between a 1D array and a 2D array with one dimension of size 1 is reasonable, especially when one begins working with n-dimensional arrays.
To answer the question in the title: there is an evident structural difference between an array of shape (3,) such as
[1, 2, 3]
and an array of shape (3, 1) such as
[[1], [2], [3]]
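Converting between the two shapes is a one-liner in either direction. A minimal sketch using the sizes from the question; flat and col are illustrative names:

```python
import numpy as np

A = np.zeros((150, 3))
B = np.ones((150, 1))

# (N, 1) -> (N,): drop the size-1 axis, then the assignment works.
flat = B.ravel()                 # B[:, 0] or B.squeeze(axis=1) give the same shape
print(flat.shape)                # (150,)
A[:, 0] = flat

# (N,) -> (N, 1): add the axis back.
col = flat.reshape(-1, 1)        # same as flat[:, None]
print(col.shape)                 # (150, 1)
```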

How to do dyadics-like operations in numpy

I have two 2-D arrays A and B. I want to get a 3-D array C, whose relation with A and B is:
C_mnl=A_mn*B_ml
How can I do this elegantly in numpy?
numpy.einsum can do that:
a = np.arange(6).reshape(3,2) # a.shape = (3, 2)
b = np.arange(12).reshape(3,4) # b.shape = (3, 4)
c = np.einsum('mn,ml->mnl', a, b) # c.shape = (3, 2, 4)
You can also use broadcasting -
C = A[...,None]*B[:,None,:]
Explanation
A[...,None] adds a new axis as the last axis with None (an equivalent for np.newaxis) pushing all existing dimensions to the front. Thus, this would be same as A[:,:,None].
Similarly with B[:,None,:], it adds a new axis between the existing dimensions.
With the two steps above, the axes of the input arrays are aligned, so elementwise multiplication broadcasts to the desired output of shape (m,n,l).
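A quick sketch confirming that the einsum and broadcasting approaches give the same result, using the small arrays from the einsum answer; c_einsum and c_bcast are illustrative names:

```python
import numpy as np

a = np.arange(6).reshape(3, 2)    # shape (m, n) = (3, 2)
b = np.arange(12).reshape(3, 4)   # shape (m, l) = (3, 4)

c_einsum = np.einsum('mn,ml->mnl', a, b)
c_bcast = a[..., None] * b[:, None, :]   # (3, 2, 1) * (3, 1, 4)

print(c_einsum.shape)                      # (3, 2, 4)
print(np.array_equal(c_einsum, c_bcast))   # True

# Spot check one entry against the definition C_mnl = A_mn * B_ml.
m, n, l = 2, 1, 3
print(c_bcast[m, n, l] == a[m, n] * b[m, l])   # True
```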

How to get these shapes to line up for a numpy matrix

I'm trying to input vectors into a numpy matrix by doing:
eigvec[:,i] = null
However I keep getting the error:
ValueError: could not broadcast input array from shape (20,1) into shape (20)
I've tried using flatten and reshape, but nothing seems to work
The shapes in the error message are a good clue.
In [161]: x = np.zeros((10,10))
In [162]: x[:,1] = np.ones((1,10)) # or x[:,1] = np.ones(10)
In [163]: x[:,1] = np.ones((10,1))
...
ValueError: could not broadcast input array from shape (10,1) into shape (10)
In [166]: x[:,1].shape
Out[166]: (10,)
In [167]: x[:,[1]].shape
Out[167]: (10, 1)
In [168]: x[:,[1]] = np.ones((10,1))
When the shape of the destination matches the shape of the new value, the copy works. It also works in some cases where the new value can be 'broadcasted' to fit. But it does not try more general reshaping. Also note that indexing with a scalar reduces the dimension.
I can guess that
eigvec[:,i] = null.flat
would work (however, null.flatten() should work too). In fact, it looks like NumPy complains because you are assigning a pseudo-1D array (shape (20, 1)) to a 1D array which is considered to be oriented differently (shape (1, 20), if you wish).
Another solution would be:
eigvec[:,i] = null.T
where you properly transpose the "vector" null.
The fundamental point here is that NumPy has "broadcasting" rules for converting between arrays with different numbers of dimensions. In the case of conversions between 2D and 1D, a 1D array of size n is broadcast into a 2D array of shape (1, n) (and not (n, 1)). More generally, missing dimensions are added to the left of the original dimensions.
The observed error message basically said that shapes (20,) and (20, 1) are not compatible: this is because (20,) becomes (1, 20) (and not (20, 1)). In fact, one is a column matrix, while the other is a row matrix.
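The fixes discussed above can be tried side by side. In this sketch eigvec and null are stand-ins with assumed shapes (20, 20) and (20, 1), since the question does not show how they were created:

```python
import numpy as np

eigvec = np.zeros((20, 20))      # assumed shape
null = np.ones((20, 1))          # assumed shape, taken from the error message

try:
    eigvec[:, 0] = null          # (20, 1) into (20,): rejected
    failed = False
except ValueError:
    failed = True
print(failed)                    # True

eigvec[:, 0] = null.ravel()      # flatten the value to (20,)
eigvec[:, 1] = null[:, 0]        # index away the size-1 axis
eigvec[:, 2] = null.T            # (1, 20) fits into (20,)
eigvec[:, [3]] = null            # or keep the target 2-D, shape (20, 1)

print(eigvec[:, :4].sum())       # 80.0
```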
