How does numpy's logical indexing in the plot calls below get the data points from the "data" variable? I understand that the first parameter is the x coordinate and the second parameter is the y coordinate, but I am unsure how they map to the data points in the variable.
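from pylab import plot
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))
# assumed: centroids come from k-means with 2 clusters (not shown in the snippet)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)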
It's all in the shapes:
In [89]: data.shape
Out[89]: (300, 2) # data has 300 rows and 2 columns
In [93]: idx.shape
Out[93]: (300,) # idx is a 1D-array with 300 elements
idx == 0 is a boolean array with the same shape as idx. It is True wherever an element in idx equals 0:
In [97]: (idx==0).shape
Out[97]: (300,)
When you index data with idx==0, you get all rows of data where idx==0 is True:
In [98]: data[idx==0].shape
Out[98]: (178, 2)
When you index using a tuple, data[idx==0, 0], the first axis of data is indexed with the boolean array idx==0, and the second axis of data is indexed with 0:
In [99]: data[idx==0, 0].shape
Out[99]: (178,)
The first axis of data corresponds to rows, and the second axis corresponds to columns. So you get just the first column of data[idx==0]. Since the first column of data holds the x-values, this gives you those x-values in data where idx==0.
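To make the shape reasoning above concrete, here is a minimal sketch with a tiny hand-made array. The values of data and idx below are hypothetical stand-ins (six points, two clusters), not the output of vq:

```python
import numpy as np

# Hypothetical stand-ins for `data` and `idx` (6 points, 2 clusters)
data = np.array([[0.1, 0.2],
                 [0.9, 0.8],
                 [0.2, 0.1],
                 [0.8, 0.9],
                 [0.3, 0.3],
                 [0.7, 0.6]])
idx = np.array([0, 1, 0, 1, 0, 1])

mask = idx == 0        # boolean array, shape (6,)
rows = data[mask]      # rows where mask is True, shape (3, 2)
xs = data[mask, 0]     # first column of those rows, shape (3,)
ys = data[mask, 1]     # second column of those rows, shape (3,)

print(mask.shape, rows.shape, xs.shape, ys.shape)
```

So plot(data[idx==0,0], data[idx==0,1], 'ob') receives xs and ys: the x and y values of exactly those points assigned to cluster 0.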
I wanted to know the difference between these two lines of code
X_train = training_dataset.iloc[:, 1].values
X_train = training_dataset.iloc[:, 1:2].values
My guess is that the latter is a 2-D numpy array and the former is a 1-D numpy array. For inputs in a neural network, the latter is the proper way to pass input data. Is there a specific reason for that?
Please help!
Your guess is right: the first has ndim=1 and the second has ndim=2. You can check this by doing:
X_train.ndim
The difference is that the first one has no defined second dimension: its shape is (R,), while the second has shape (R, 1). If you want to see the difference between the shapes I suggest reading this: Difference between numpy.array shape (R, 1) and (R,)
The difference is that iloc returns a Series when a single row or column is selected, but a DataFrame when a range of rows or columns is selected.
Although they both refer to column 1, 1 and 1:2 are different types, with 1 representing an int and 1:2 representing a slice.
With,
X_train = training_dataset.iloc[:, 1].values
You specify a single column so training_dataset.iloc[:, 1] is a Pandas Series, so .values is a 1D Numpy array
Vs.,
X_train = training_dataset.iloc[:, 1:2].values
Although it selects only one column, 1:2 is a slice that represents a column range, so training_dataset.iloc[:, 1:2] is a Pandas DataFrame. Thus, .values is a 2D Numpy array.
Test as follows:
Create training_dataset Dataframe
import pandas as pd
data = {'Height':[1, 14, 2, 1, 5], 'Width':[15, 25, 2, 20, 27]}
training_dataset = pd.DataFrame(data)
Using .iloc[:, 1]
print(type(training_dataset.iloc[:, 1]))
print(training_dataset.iloc[:, 1].values)
# Result is:
<class 'pandas.core.series.Series'>
# Values returns a 1D Numpy array
[15 25  2 20 27]
Using iloc[:, 1:2]
print(type(training_dataset.iloc[:, 1:2]))
print(training_dataset.iloc[:, 1:2].values)
# Result is:
<class 'pandas.core.frame.DataFrame'>
# Values is a 2D Numpy array (since values of Pandas Dataframe)
[[15]
 [25]
 [ 2]
 [20]
 [27]]
X_train Values Var Type <class 'numpy.ndarray'>
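The same comparison can be made directly on the array shapes. This is a small sketch reusing the example DataFrame above; the names col_1d and col_2d are made up for illustration:

```python
import pandas as pd

training_dataset = pd.DataFrame({'Height': [1, 14, 2, 1, 5],
                                 'Width': [15, 25, 2, 20, 27]})

col_1d = training_dataset.iloc[:, 1].values     # integer position -> Series -> 1D array
col_2d = training_dataset.iloc[:, 1:2].values   # slice -> DataFrame -> 2D array

print(col_1d.shape, col_1d.ndim)   # (5,) 1
print(col_2d.shape, col_2d.ndim)   # (5, 1) 2

# reshape(-1, 1) turns the 1D array into the same single-column layout
same_layout = bool((col_1d.reshape(-1, 1) == col_2d).all())
print(same_layout)
```

Many model APIs expect a (samples, features) layout, which is the likely reason the slice form is used for network inputs.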
I have a Dataframe that corresponds to a 3D centerline (x,y,z). I want to turn the Dataframe into a binary array with shape (272, 512, 512). The z values from the Dataframe range from about 40-160 and they correspond to the first axis of the array. The x and y values correspond to the second and third axes of the array, respectively. Any xyz value not in the Dataframe should correspond to a 0 in the array and any value that is present should correspond to a 1. Any ideas on how to do this considering each plane/slice may have multiple 1's in the array?
I was able to accomplish this if I limited the Dataframe to only have one row per unique z value (one point for each slice) but the real data has multiple rows per unique z value.
Here is what the header of the Dataframe looks like
This is the code that works for downsampled Dataframe (only one row per unique z value):
def dataframe_to_binary_array(df):
    '''
    THIS FUNCTION TAKES IN A DOWNSAMPLED DATAFRAME AND CONVERTS IT TO A 3D
    BINARY ARRAY THAT IS THE SAME SHAPE AS THE ORIGINAL DICOM STACK
    '''
    empty_array = np.zeros([272, 512, 512], dtype='int64')
    z_column = df['Z']
    for z in z_column:
        z_df = df[z_column == z]
        for k in range(0, 272):
            x = z_df['X']
            y = z_df['Y']
            empty_array[z, x, y] = 1
    return empty_array
Here is my attempt at code for the true Dataframe:
def dataframe_to_binary_array_new(df):
    '''
    THIS FUNCTION TAKES IN A DOWNSAMPLED DATAFRAME AND CONVERTS IT TO A 3D
    BINARY ARRAY THAT IS THE SAME SHAPE AS THE ORIGINAL DICOM STACK
    '''
    empty_array = np.zeros([272, 512, 512], dtype='int64')
    z_column = df['Z']
    for i in range(0,272):
        z_df = df[z_column == i]
        for row in z_df:
            x_col = z_df['X'].to_numpy()
            y_col = z_df['Y'].to_numpy()
            for x_element in x_col:
                x = int(x_element)
            for y_element in y_col:
                y = int(y_element)
            empty_array[i,x,y] = 1
    return empty_array
The error message I get is "IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices"
I'd come at this a different way. How about iterating over the rows of the original dataframe? Then use the coordinates from each dataframe row to set the appropriate element in empty_array to 1.
Below is some example code. empty_array is renamed to binary_array. You may need to convert your coordinates from floats to integers to be able to use them as indices in binary_array.
import numpy as np
import pandas as pd
# x, y, z are integers from [0, 10)
n = 10
binary_array = np.zeros([n]*3)
# Builds 10 example coordinates
df = pd.DataFrame(np.random.randint(n, size=(10,3)), columns=list('XYZ'))
for idx, coord in df.iterrows():
    x, y, z = tuple(coord)
    binary_array[x, y, z] = 1
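The row-by-row loop can also be written as a single assignment with integer index arrays. This swaps in NumPy integer-array indexing for the loop; it is a sketch, not the answerer's code. The small volume shape and the coordinates below are hypothetical, and the axis order (z, x, y) follows the question:

```python
import numpy as np
import pandas as pd

# Hypothetical small volume and centerline; the real array is (272, 512, 512)
shape = (8, 16, 16)
df = pd.DataFrame({'X': [3.0, 4.0, 4.0, 5.0],
                   'Y': [7.0, 7.0, 8.0, 8.0],
                   'Z': [2.0, 2.0, 2.0, 3.0]})   # three points share Z == 2

binary_array = np.zeros(shape, dtype=np.uint8)

# Cast to int so the columns are valid index arrays
z = df['Z'].to_numpy().astype(int)
x = df['X'].to_numpy().astype(int)
y = df['Y'].to_numpy().astype(int)

# One assignment sets every (z, x, y) triple, including several per slice
binary_array[z, x, y] = 1
print(binary_array.sum())
```

Because each row contributes its own (z, x, y) triple, multiple points in the same slice need no special handling.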
As a frame challenge, I'd ask you to consider why you're changing it to a 3D array. Your array would have 71 million entries. How does that compare to the size of your dataframe?
You're probably not creating a 3D array just to have a 3D array. You have some things that you want to do with the 3D array. You should consider whether those things are really easier to implement with a 3D array. Presumably, you want an object my_array such that you can do my_array.get_value(x,y,z) and get a 1 if the tuple (x,y,z) corresponds to a row in the original dataframe, and 0 otherwise. But it's rather simple to create a wrapper around the original dataframe that does that. You could also create a set out of the tuples that appear in each dataframe row, and then simply query the set for inclusion.
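The set-based idea could look roughly like this. It is a sketch with made-up coordinates; get_value is a hypothetical helper name taken from the description above, not an existing pandas or numpy method:

```python
import pandas as pd

# Hypothetical integer coordinates
df = pd.DataFrame({'X': [3, 4, 4, 5],
                   'Y': [7, 7, 8, 8],
                   'Z': [2, 2, 2, 3]})

# One tuple per dataframe row
points = set(zip(df['X'], df['Y'], df['Z']))

def get_value(x, y, z):
    """Return 1 if (x, y, z) is a row of df, else 0 (hypothetical helper)."""
    return int((x, y, z) in points)

print(get_value(4, 8, 2), get_value(0, 0, 0))
```

The set stores only the points that exist, so its size grows with the number of dataframe rows rather than with the volume dimensions.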
I'm trying to get some specific values given a dataframe using pandas and numpy.
My process right now is as it follows:
In[1]: df = pd.read_csv(file)
In[2]: a = df[df.columns[1]].values
Right now a has the following shape:
In[3]: a.shape
Out[4]: (8640, 1)
When I filter it out to get the values that match a given condition I don't get the same shape in the axis 1:
In[5]: b = a[a>100]
In[6]: b.shape
Out[7]: (3834,)
Right now I'm reshaping the new arrays every time I filter them, but this makes my code look really messy:
In[8]: (b.reshape(b.size,1)).shape
Out[9]: (3834, 1)
I really need it to have the shape (x, 1) in order to use some other functions, so is there any way of getting that shape every time I filter the values, without having to reshape constantly?
EDIT:
The main reason I need to do this reshaping is that I need to get the minimum value in every row for two arrays with the same number of rows. What I use is np.min and np.concatenate.
For example:
av is the mean of 5 different columns in my dataframe:
av = np.mean(myColumns,axis=1)
Which has shape (8640, )
med is the median for the same columns:
med = np.median(myColumns,axis=1)
And when I try to get the minimum values I get the following error:
np.min(np.concatenate((av,med),axis=1),axis=1)
Traceback (most recent call last):
  File "", line 1, in
    np.min(np.concatenate((av,med),axis=1),axis=1)
AxisError: axis 1 is out of bounds for array of dimension 1
However, if I reshape av and med it works fine:
np.min(np.concatenate((av.reshape(av.size,1),med.reshape(av.size,1)),axis=1),axis=1)
Out[232]: array([0., 0., 0., ..., 0., 0., 0.])
You can use np.take(a, np.where(a>100)[0], axis=0) to keep the same number of dimensions as the original.
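A small sketch of the difference, using a made-up (5, 1) array in place of the CSV column. The last line shows an alternative that is an addition here, not part of the answer above: masking rows with a 1D boolean array.

```python
import numpy as np

# Hypothetical stand-in for the (8640, 1) column
a = np.array([[50], [150], [200], [75], [101]])

flat = a[a > 100]                                  # 2D mask flattens: shape (3,)
kept = np.take(a, np.where(a > 100)[0], axis=0)    # row indices: shape (3, 1)
rows = a[a[:, 0] > 100]                            # 1D row mask: shape (3, 1)

print(flat.shape, kept.shape, rows.shape)
```

np.where(a > 100) returns one index array per axis, so [0] picks the row indices, and taking those rows along axis 0 leaves the second axis intact.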
If you really need this shape, this code gives shape (..., 1) and it's not that ugly:
np.expand_dims(a, 1)
or
a[:, np.newaxis]
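Applied to the av / med case from the question, this might look as follows. It is a sketch with a made-up 3-row input; my_columns stands in for the question's myColumns. The np.minimum line is an alternative added here for comparison, not something the answer proposes:

```python
import numpy as np

# Hypothetical stand-in for the five dataframe columns (three columns here)
my_columns = np.array([[1.0, 2.0, 9.0],
                       [4.0, 4.0, 4.0],
                       [0.0, 5.0, 7.0]])

av = np.mean(my_columns, axis=1)      # shape (3,)
med = np.median(my_columns, axis=1)   # shape (3,)

# Both spellings add a trailing axis of length 1
stacked = np.concatenate((av[:, np.newaxis], np.expand_dims(med, 1)), axis=1)
row_min = np.min(stacked, axis=1)

# Element-wise minimum of the two 1D arrays, with no reshaping at all
direct = np.minimum(av, med)

print(stacked.shape, row_min, direct)
```

If the row-wise minimum of two equally long 1D arrays is all that is needed, np.minimum avoids the reshaping step entirely.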
If your workload is not so heavy that you need numpy for performance, you can stick with pandas objects (DataFrame/Series) and maintain the shape.
For example, take this example df (which, I must add, you should've provided with your question):
df = pd.DataFrame(data=np.random.rand(7,3), columns=['a','b','c'])
df
a b c
0 0.382530 0.748674 0.186446
1 0.142991 0.965972 0.299884
2 0.568910 0.469341 0.896786
3 0.452816 0.021598 0.989637
4 0.884955 0.738519 0.082460
5 0.944797 0.103953 0.287005
6 0.379389 0.593280 0.832720
To create an object with shape (7,1) you can use x = df[['a']], which is a dataframe with one column (compare with x=df['a'], which is a Series with shape (7,)).
Now, if I move on to a numpy array by using y=x.values, I still get the same shape (both x and y have shapes (7,1)).
However, both react differently to boolean indexing: calling y[y>0.3] will return an array with shape (6,), while calling x[x>0.3] will return ... a dataframe with shape (7,1). Let's see:
array:
y[y>0.3]
array([0.38252971, 0.56890993, 0.45281553, 0.88495521, 0.94479716,
0.37938899])
dataframe:
x[x>0.3]
a
0 0.382530
1 NaN
2 0.568910
3 0.452816
4 0.884955
5 0.944797
6 0.379389
So, to get a dataframe with the shape that you want (6,1), you can use
x[x['a']>0.3]
which returns
a
0 0.382530
2 0.568910
3 0.452816
4 0.884955
5 0.944797
6 0.379389
And then, only after doing all of your manipulations, you can call .values at the end to obtain a numpy array with the desired result.
Now, generally speaking, manipulations on arrays are faster than on pandas objects, but working with pandas objects is easier, especially if you have lots of data processing to do.
You might prefer to work with numpy all the way, but the pandas option is worth knowing, and in my opinion is easier and simpler.
Hope this helps!
I want to convert pandas dataframe to 3d array, but cannot get the real shape of the 3d array:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][3:]=1
df['a'][:3]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d.shape
(2,)
But when I set it up like this, I get the expected shape:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][2:]=1
df['a'][:2]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d.shape
(2,2,5)
Is there something wrong with the code?
Thanks!
Nothing is wrong with the code. In the first case, you simply don't have a 3-d array. By the definition of an N-d array (here 3-d), all items along a dimension must have the same size. In the first case:
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][3:]=1
df['a'][:3]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
You have a 1-d array of size 2 (that is what a3d.shape shows you), which contains two 2-d arrays of shape (1,5) and (3,5):
a3d[0].shape
Out[173]: (1, 5)
a3d[1].shape
Out[174]: (3, 5)
so the two elements along the first dimension of what you call a3d do not have the same size, and their axes can't be treated as further dimensions of this ndarray.
While in the second case,
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
df['a'][2:]=1
df['a'][:2]=2
a3d = np.array(list(df.groupby('a').apply(pd.DataFrame.as_matrix)))
a3d[0].shape
Out[176]: (2, 5)
a3d[1].shape
Out[177]: (2, 5)
both elements of your first dimension have the same size, so a3d is a 3-d array.
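The same point can be checked explicitly before stacking. Below is a sketch under a few assumptions: groups_to_array is a hypothetical helper, DataFrame.to_numpy() is used because as_matrix was removed in pandas 1.0, and the group labels are assigned as whole columns instead of with chained indexing:

```python
import numpy as np
import pandas as pd

def groups_to_array(df, key):
    """Stack per-group 2-d arrays into one 3-d array (hypothetical helper)."""
    blocks = [group.to_numpy() for _, group in df.groupby(key)]
    shapes = {block.shape for block in blocks}
    if len(shapes) != 1:
        # Unequal group sizes cannot form a regular 3-d array
        raise ValueError(f"groups have different shapes: {sorted(shapes)}")
    return np.stack(blocks)

df = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))

df['a'] = [2, 2, 1, 1]                   # two groups of two rows each
print(groups_to_array(df, 'a').shape)    # (2, 2, 5)

df['a'] = [2, 2, 2, 1]                   # groups of three rows and one row
try:
    groups_to_array(df, 'a')
except ValueError as err:
    message = str(err)
    print(message)
```

np.stack requires all inputs to share one shape, so the check only turns that requirement into a readable error message.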
I have the following 2-D Numpy arrays:
X # X.shape = (11688, 144)
Y # Y.shape = (2912, 1000)
The first array is populated with atmospheric data, and the second array is populated with random index values from 0 to X.shape[0]-1. I want to index the rows of X with each column of Y to yield a 3-D array result, where result.shape = (2912, 1000, 144), and I want to do this without looping.
My current approach is:
result = X[Y,:]
but this one line of code can take more than 10 seconds to execute depending on the shape of the 0th axis of Y.
Is there a more optimal way to perform this type of indexing in order to speed up its execution?
EDIT: Here's a more complete example of what I'm trying to accomplish.
import numpy as np
X = np.random.rand(11688, 144) # Time-by-longitude array of atmospheric data
t = np.arange(X.shape[0]) # Time vector
# Populate array of randomly drawn time steps
Y = np.zeros((2912, 1000), dtype='i')
for i in range(1000):
    Y[:,i] = np.random.choice(t, 2912)
# Index X with each column of Y
result = X[Y,:]
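For reference, here is a scaled-down sketch of what this indexing produces and how large the full result is. The small shapes are made up so the example runs quickly. np.take is shown only as an equivalent spelling; whether it is any faster on the real data is an assumption that would need to be timed:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 4))                          # small stand-in for (11688, 144)
Y = rng.integers(0, X.shape[0], size=(6, 5))     # small stand-in for (2912, 1000)

result = X[Y, :]                  # shape is Y.shape + (X.shape[1],)
taken = np.take(X, Y, axis=0)     # same values, different spelling

print(result.shape)               # (6, 5, 4)
print(np.array_equal(result, taken))

# Size of the full-scale float64 result, in bytes
full_bytes = 2912 * 1000 * 144 * 8
print(full_bytes / 1e9)           # about 3.35 GB
```

At full scale the result alone holds about 3.35 GB of float64 values, so a large share of the run time is likely spent allocating and filling that memory regardless of how the indexing is written.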