Numpy array shape after extraction from Pandas Dataframe - python

I have a column in a Dataframe where each cell has a (300,) shaped numpy array.
When I extract the values of this column using the .values attribute, I get a numpy array of shape (N,), where N is the number of rows of the Dataframe, and each element is a (300,) array. I would have expected the extracted shape to be (N, 300).
So I would like the shape of the extracted column to be (N, 300). I tried using pd.as_matrix(), but this still gets me a numpy array of shape (N,).
Any suggestions?

You can use numpy.concatenate, convert to a list and cast to an array:
a = np.random.randint(10, size=300)
print (a.shape)
(300,)
df = pd.DataFrame({ 'A':[a,a,a]})
arr = np.array(np.concatenate(df.values).tolist())
print (arr.shape)
(3, 300)
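As an alternative to the concatenate/tolist round trip, np.stack can usually build the 2-D array directly from the column. This is only a sketch that assumes every cell holds an array of the same length; the column name 'A' and the three-row frame are taken from the example above. (as_matrix was removed in pandas 1.0, so to_numpy() is used here instead.)

```python
import numpy as np
import pandas as pd

a = np.random.randint(10, size=300)
df = pd.DataFrame({"A": [a, a, a]})

# df["A"].to_numpy() is a 1-D object array whose elements are (300,) arrays;
# np.stack joins those elements along a new leading axis.
arr = np.stack(df["A"].to_numpy())
print(arr.shape)  # (3, 300)
```

If the cells have different lengths, np.stack raises a ValueError instead of silently producing an object array.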

Related

Converting array of pandas Dataframes into 3D NumPy array

I have a numpy array of pandas Dataframes which I need to convert into a 3D numpy array of the form (samples, rows, columns) in order to feed into a Keras model for training. I have 46 samples in my dataset and each sample is 1101 rows by 64 columns.
Here is the code for my 1D numpy array of Dataframes:
static_dfs = []
# read in static csvs as pandas df
# static_files is my np array of csv files
for x in range(0, static_files.size):
    df = pd.read_csv(static_files[x], sep='\t', skiprows=skip_rows, header=0)
    # append df to list
    static_dfs.append(df)
# convert list to np array
static_dfs = np.asarray(static_dfs)
Indeed the shape of the array is (46,) [the number of samples].
If I look at one of the Dataframes in the array (static_dfs[0] for instance) the shape is (1101, 64).
I then try to convert this to 3D numpy array:
static_nps = []
for x in range(0, static_dfs.size):
    static_nps.append(static_dfs[x].to_numpy())
# convert to numpy array
static_nps = np.asarray(static_nps)
However it gives me this error:
could not broadcast input array from shape (1101,64) into shape (1101)
for the line of code:
#convert to numpy array
static_nps = np.asarray(static_nps)
Worst part is I had it working before, but a collaborator of mine went through my code and edited it after we found a bug in one of our data files. Now I can't seem to get it back to working like before and am stuck :(
The desired shape of my 3D array would look like (46, 1101, 64). If anyone could solve this you would be a huge help! Thanks
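A broadcast error like the one above often means that at least one Dataframe does not have the same shape as the others, so it may help to check the shapes before stacking. The sketch below uses small synthetic frames in place of the CSV files, and frames_to_3d is a hypothetical helper name, not something from the original code.

```python
import numpy as np
import pandas as pd

def frames_to_3d(frames):
    """Stack equally shaped DataFrames into (samples, rows, columns)."""
    shapes = {df.shape for df in frames}
    if len(shapes) != 1:
        # Report the offending shapes instead of a cryptic broadcast error.
        raise ValueError(f"DataFrames differ in shape: {sorted(shapes)}")
    return np.stack([df.to_numpy() for df in frames])

# Synthetic stand-ins for the frames read from CSV (5 samples of 4 x 3).
frames = [pd.DataFrame(np.zeros((4, 3))) for _ in range(5)]
static_nps = frames_to_3d(frames)
print(static_nps.shape)  # (5, 4, 3)
```

With the real data the expected result would be (46, 1101, 64); if the helper raises, the listed shapes point to the file that differs.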

Difference between a numpy array and a numpy vector

I wanted to know the difference between these two lines of code
X_train = training_dataset.iloc[:, 1].values
X_train = training_dataset.iloc[:, 1:2].values
My guess is that the latter is a 2-D numpy array and the former is a 1-D numpy array. For inputs to a neural network, the latter is the proper form for input data. Is there a specific reason for that?
Please help!
Your guess is right: the first has ndim=1 and the second has ndim=2. Just check by doing this after each assignment:
X_train.ndim
The difference is that the first one has no second dimension at all, while the second one has a second dimension of length 1. If you want to see the difference between the shapes, I suggest reading this: Difference between numpy.array shape (R, 1) and (R,)
The difference is that iloc returns a Series when a single row or column is selected, but a Dataframe when a row or column range is selected (reference).
Although they both refer to column 1, 1 and 1:2 are different types, with 1 representing an int and 1:2 representing a slice.
With,
X_train = training_dataset.iloc[:, 1].values
You specify a single column, so training_dataset.iloc[:, 1] is a Pandas Series and .values is a 1D Numpy array.
Vs.,
X_train = training_dataset.iloc[:, 1:2].values
Although it selects only one column, 1:2 is a slice that represents a column range, so training_dataset.iloc[:, 1:2] is a Pandas Dataframe. Thus, .values is a 2D Numpy array.
Test as follows:
Create training_dataset Dataframe
data = {'Height':[1, 14, 2, 1, 5], 'Width':[15, 25, 2, 20, 27]}
training_dataset = pd.DataFrame(data)
Using .iloc[:, 1]
print(type(training_dataset.iloc[:, 1]))
print(training_dataset.iloc[:, 1].values)
# Result is:
<class 'pandas.core.series.Series'>
# .values returns a 1D Numpy array
[15 25  2 20 27]
Using iloc[:, 1:2]
print(type(training_dataset.iloc[:, 1:2]))
print(training_dataset.iloc[:, 1:2].values)
# Result is:
<class 'pandas.core.frame.DataFrame'>
# .values is a 2D Numpy array (since it comes from a Pandas Dataframe)
[[15]
 [25]
 [ 2]
 [20]
 [27]]
In both cases, type(X_train) is <class 'numpy.ndarray'>; only the number of dimensions differs.
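To make the difference concrete, the shapes can be compared directly. This is a small sketch that reuses the Height/Width frame from the test above; col_1d, col_2d and as_column are illustrative names. It also shows one way to turn the 1D result into the 2D column form that many model APIs expect.

```python
import numpy as np
import pandas as pd

data = {"Height": [1, 14, 2, 1, 5], "Width": [15, 25, 2, 20, 27]}
training_dataset = pd.DataFrame(data)

col_1d = training_dataset.iloc[:, 1].to_numpy()    # int index  -> Series    -> 1D
col_2d = training_dataset.iloc[:, 1:2].to_numpy()  # slice      -> DataFrame -> 2D

print(col_1d.ndim, col_1d.shape)  # 1 (5,)
print(col_2d.ndim, col_2d.shape)  # 2 (5, 1)

# Reshaping the 1D vector gives the same values in (samples, features) layout.
as_column = col_1d.reshape(-1, 1)
print(np.array_equal(as_column, col_2d))  # True
```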

numpy indexing mixed tuple-index and range-index

I have a numpy array a with shape (m,n,3), and I want to index its first two axes with another array idx of shape (100,2), where each row of idx is an (x, y) pair. So what I want is the following:
np.array([a[x,y,:] for x,y in idx])
What's the most efficient way to do this?
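One common approach is NumPy integer-array indexing: pass the two columns of idx as separate index arrays for the first two axes. The sketch below assumes idx holds valid integer indices; the sizes m and n and the random data are made up for illustration. It avoids the Python-level loop, which usually makes it faster, though that is worth measuring on the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 7
a = rng.random((m, n, 3))

# 100 (x, y) pairs with x in [0, m) and y in [0, n)
idx = np.column_stack([rng.integers(0, m, size=100),
                       rng.integers(0, n, size=100)])

# Index axis 0 with the x column and axis 1 with the y column;
# the trailing axis of length 3 is kept whole.
picked = a[idx[:, 0], idx[:, 1]]

# The list comprehension from the question, used as a reference.
reference = np.array([a[x, y, :] for x, y in idx])

print(picked.shape)                       # (100, 3)
print(np.array_equal(picked, reference))  # True
```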

python difference between array(10,1) array(10,)

I'm trying to load MNIST dataset into arrays.
When I use
(X_train, y_train), (X_test, y_test)= mnist.load_data()
I get an array y_test(10000,) but I want it to be in the shape of (10000,1).
What is the difference between array(10000,1) and array(10000,)?
How can I convert the first array to the second array?
Your first Array with shape (10000,) is a 1-Dimensional np.ndarray.
Since the shape attribute of numpy arrays is a tuple, and a tuple of length 1 needs a trailing comma, the shape is (10000,) and not (10000) (which would be an int). So currently your data looks like this:
import numpy as np
a = np.arange(5) # >>> array([0, 1, 2, 3, 4])
print(a.shape) # >>> (5,)
What you want is a 2-dimensional array with a shape of (10000, 1).
Adding a dimension of length 1 doesn't require any additional data; it is basically an "empty" dimension. To add a dimension to an existing array you can use either np.expand_dims() or np.reshape().
Using np.expand_dims:
import numpy as np
b = np.array(np.arange(5)) # >>> array([0, 1, 2, 3, 4])
b = np.expand_dims(b, axis=1) # >>> array([[0],[1],[2],[3],[4]])
The function was specifically made for the purpose of adding empty dimensions to arrays. The axis keyword specifies which position the newly added dimension will occupy.
Using np.reshape:
import numpy as np
a = np.arange(5)
X_test_reshaped = np.reshape(a, (-1, 1)) # >>> array([[0],[1],[2],[3],[4]])
The target shape (-1, 1) specifies what the array should look like after the reshape operation (it is passed positionally here because the keyword's name has changed between NumPy versions). NumPy replaces the -1 with whatever length fits the data, here 5.
Reshape is a more powerful function than expand_dims and can be used in many different ways. You can read more about its other uses in the numpy docs for numpy.reshape().
An array with a shape of (10,1) is a 2D array with 10 rows and a single column.
An array with a shape of (10,) is a 1D array.
To convert (10,1) to (10,), you can simply collapse the column axis. For example, take an array x with x.shape = (10,1). Selecting the only column with x[:, 0] drops that axis, so x[:, 0].shape = (10,). (Note that x[:,] does not do this; it returns the array with its shape unchanged.)
To convert (10,) to (10,1), you can add a dimension by using np.newaxis (after import numpy as np, assuming we are using numpy arrays here). Take an array y, for example, with y.shape = (10,). Using y[:, np.newaxis], you get a new array with the shape (10,1).
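The conversions in both directions can be checked in a few lines. This is just a sketch with a 10-element array; the variable names are illustrative.

```python
import numpy as np

y = np.arange(10)                   # shape (10,)

# (10,) -> (10, 1): three equivalent ways to add a length-1 axis
col_a = y[:, np.newaxis]
col_b = y.reshape(-1, 1)
col_c = np.expand_dims(y, axis=1)
print(col_a.shape, col_b.shape, col_c.shape)  # (10, 1) (10, 1) (10, 1)

# (10, 1) -> (10,): three ways to drop the length-1 axis again
flat_a = col_a[:, 0]
flat_b = col_a.ravel()
flat_c = np.squeeze(col_a, axis=1)
print(flat_a.shape, flat_b.shape, flat_c.shape)  # (10,) (10,) (10,)
```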

shape of pandas dataframe to 3d array

I want to convert pandas dataframe to 3d array, but cannot get the real shape of the 3d array:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][3:]=1
df['a'][:3]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d.shape
(2,)
But when I set it up like this, I do get the expected shape:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][2:]=1
df['a'][:2]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d.shape
(2,2,5)
Is there something wrong with the code?
Thanks!
There is nothing wrong with the code; in the first case, you simply don't have a 3-d array. By the definition of an N-d array (here 3-d), all sub-arrays along a given dimension must have the same size. In the first case:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][3:]=1
df['a'][:3]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
You have a 1-d array of size 2 (that is what a3d.shape shows you), which contains two 2-d arrays of shapes (1,5) and (3,5):
a3d[0].shape
Out[173]: (1, 5)
a3d[1].shape
Out[174]: (3, 5)
so the two elements along the first dimension of what you call a3d do not have the same size, and their rows and columns can't be treated as further dimensions of this ndarray.
While in the second case,
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][2:]=1
df['a'][:2]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d[0].shape
Out[176]: (2, 5)
a3d[1].shape
Out[177]: (2, 5)
both elements of your first dimension have the same size, so a3d is a 3-d array.
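On current library versions the same check can be written without as_matrix (removed in pandas 1.0) and without chained assignment. The sketch below is one possible way to do it; groups_to_arrays is a hypothetical helper name. It collects one 2-d array per group and only stacks them when all shapes agree.

```python
import numpy as np
import pandas as pd

def groups_to_arrays(df, key):
    """Return one 2-d array per group of `key`, in sorted key order."""
    return [group.to_numpy() for _, group in df.groupby(key)]

# Equal group sizes: rows 0-1 get a=2, rows 2-3 get a=1
# (.loc slices on the default index are label-based and inclusive).
df = pd.DataFrame(np.random.rand(4, 5), columns=list("abcde"))
df.loc[:1, "a"] = 2.0
df.loc[2:, "a"] = 1.0
equal_groups = groups_to_arrays(df, "a")
a3d = np.stack(equal_groups)
print(a3d.shape)  # (2, 2, 5)

# Unequal group sizes: rows 0-2 get a=2, row 3 gets a=1.
df2 = pd.DataFrame(np.random.rand(4, 5), columns=list("abcde"))
df2.loc[:2, "a"] = 2.0
df2.loc[3:, "a"] = 1.0
ragged_groups = groups_to_arrays(df2, "a")
print([g.shape for g in ragged_groups])  # [(1, 5), (3, 5)]
# np.stack(ragged_groups) would raise a ValueError here,
# because the groups cannot form a regular 3-d array.
```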
