I am wondering whether it is possible to vectorize the following operation in NumPy or TensorFlow. The ultimate goal is to do it in TensorFlow, but NumPy seems easier for illustration here.
The problem is to build a discretized occupancy grid from a set of 2D points (x, y) and to calculate the average of the points that fall in each grid cell.
1) Given a 2D array xy, every row [x, y] is mapped to an index [xid, yid]. This step is done via np.apply_along_axis.
2) In another 3D array grid_sum, given the [xid, yid] calculated in 1), we update grid_sum[xid, yid] += [x, y].
3) In yet another 2D array grid_count, given the [xid, yid] calculated in 1), we update grid_count[xid, yid] += 1.
4) We get the final 3D array grid_mean by dividing grid_sum by grid_count at every [xid, yid].
The problem with vectorizing this operation is that different rows might try to write to the same location in the new array, creating a race condition. How can I handle this?
I have the following minimal example here to help understand this situation.
Edit after comment
This example works fine because I use a for loop. Is it possible to achieve the same without the for loop?
import numpy as np
xy = np.array([[1, 1], [1, 1]], dtype=np.int16)
grid_sum = np.zeros([3, 3, 2])
grid_count = np.zeros([3, 3])
for i in range(xy.shape[0]):
    idx = xy[i]  # simple case, just use array value as index
    grid_sum[idx[0], idx[1], :] += xy[i]
    grid_count[idx[0], idx[1]] += 1
print(grid_sum)
print(grid_count)
# grid_sum result
# [[[0. 0.]
# [0. 0.]
# [0. 0.]]
# [[0. 0.]
# [2. 2.]
# [0. 0.]]
# [[0. 0.]
# [0. 0.]
# [0. 0.]]]
# grid_count result
# [[0. 0. 0.]
# [0. 2. 0.]
# [0. 0. 0.]]
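One possible way to drop the loop in NumPy is the unbuffered `np.add.at`, which accumulates every update even when the same index appears more than once. This is a minimal sketch on the same toy data, not a tuned solution. The `counts` helper and the choice to leave empty cells at a mean of 0 are assumptions made for illustration.

```python
import numpy as np

xy = np.array([[1, 1], [1, 1]], dtype=np.int16)
grid_sum = np.zeros([3, 3, 2])
grid_count = np.zeros([3, 3])

# Simple case from the example: the point values double as grid indices.
xid, yid = xy[:, 0], xy[:, 1]

# Unbuffered in-place add: repeated (xid, yid) pairs are all accumulated,
# unlike grid_sum[xid, yid] += xy, which keeps only one update per cell.
np.add.at(grid_sum, (xid, yid), xy)
np.add.at(grid_count, (xid, yid), 1)

# Divide only where a cell has points; empty cells keep a mean of 0
# (an assumption, NaN would be another reasonable choice).
counts = grid_count[..., None]
grid_mean = np.divide(grid_sum, counts, out=np.zeros_like(grid_sum), where=counts > 0)

print(grid_sum[1, 1], grid_count[1, 1], grid_mean[1, 1])
```

On the TensorFlow side, a scatter-add style op such as `tf.tensor_scatter_nd_add` looks like the closest counterpart, though that is untested here.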
I am working on a problem that requires rolling the elements of a matrix. Below is an example that uses NumPy to roll an array as desired. I want to replicate this for a scipy sparse csr_matrix without converting it to a dense matrix, because my actual use case involves a very large sparse matrix.
The numpy version:
A=np.eye(3,3)
print(np.roll(A,[0,3]))
Outputs:
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]
The desired functionality must do something like this:
A = np.eye(3, 3)
A = sparse.csr_matrix(A)
print(sparse_roll(A, [0, 3]).todense())
Outputs:
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]
where sparse_roll is the function to be implemented.
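A rough sketch of one way `sparse_roll` could look, assuming only the flattened behaviour of `np.roll` (no `axis` argument) is needed, as in the example above. It rolls the stored entries through their flat row-major positions without ever building a dense matrix. Summing the entries of `shift` is an assumption that mirrors what `np.roll` does with a list shift and no axis.

```python
import numpy as np
from scipy import sparse

def sparse_roll(A, shift):
    """Roll a sparse matrix like np.roll(A.toarray(), shift) with axis=None."""
    coo = A.tocoo()
    n_rows, n_cols = coo.shape
    total = int(np.sum(shift))  # assumption: list shifts add up, as in np.roll without axis
    # Flat row-major position of every stored entry, shifted with wrap-around.
    flat = coo.row.astype(np.int64) * n_cols + coo.col
    flat = (flat + total) % (n_rows * n_cols)
    rolled = sparse.coo_matrix(
        (coo.data, (flat // n_cols, flat % n_cols)), shape=coo.shape
    )
    return rolled.tocsr()

A = sparse.csr_matrix(np.eye(3, 3))
print(sparse_roll(A, [0, 3]).toarray())
```

Only the index arrays of the stored entries are touched, so the cost should scale with the number of non-zeros rather than with the full matrix size.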
import numpy as np
board = np.zeros((3,3))
board[0][0] = 1
# This would result in the 0,0th entry being 1.
board[None][0] = 2
# This would result in ALL entries being 2
Of course, I am aware that you should never supply the index as 'None', and I only stumbled upon this by accident.
But I cannot understand why using 'None' as an index would change all the values of a 2D array.
Should it not throw an error instead?
Using something similar on a 1D list threw an error for me.
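When working with NumPy arrays, indexing with None is equivalent to numpy.newaxis: it inserts a new axis of length 1 at that position and keeps all of the array's values, as described in the documentation. So board[None] has shape (1, 3, 3), board[None][0] is a view of the whole original board, and assigning to it changes every entry. You can see the new axis here: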
import numpy as np
board = np.zeros((3,3))
print(board[None])
print(board[:, None])
Output (note that None indexing like this works with NumPy arrays and tensor objects, not with plain Python lists):
[[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]]
[[[0. 0. 0.]]
[[0. 0. 0.]]
[[0. 0. 0.]]]
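To connect this back to the original question, here is a small sketch that checks the shapes and shows that `board[None][0]` refers to the same memory as `board`. The use of `np.shares_memory` is an addition for illustration only.

```python
import numpy as np

board = np.zeros((3, 3))

# None adds a length-1 axis; no data is copied.
print(board[None].shape)     # (1, 3, 3)
print(board[:, None].shape)  # (3, 1, 3)

# Index 0 along the new axis is the entire original 3x3 array (a view).
print(np.shares_memory(board[None][0], board))

# Assigning a scalar to that view broadcasts it over every entry.
board[None][0] = 2
print(board)
```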
How can I fill the elements of the lower triangular part of a matrix, including the diagonal, with values from a column vector?
For example, I have:
m=np.zeros((3,3))
n=np.array([[1],[1],[1],[1],[1],[1]]) #column vector
I want to replace values which have indices of (0,0),(1,0),(1,1),(2,0),(2,1),(2,2) from m with the vector n, so I get:
m=np.array([[1,0,0],[1,1,0],[1,1,1]])
Then I want to apply the same operation to m.T to get as a result:
m=np.array([[1,1,1],[1,1,1],[1,1,1]])
Can someone help me please? n should be a vector with shape(6,1)
I'm not sure if there's going to be a clever numpy-specific way of doing this, but it looks relatively straightforward like this:
import numpy as np
m=np.zeros((3,3))
n=np.array([[1],[1],[1],[1],[1],[1]]) #column vector
indices=[(0,0),(1,0),(1,1),(2,0),(2,1),(2,2)]
for ix, index in enumerate(indices):
    m[index] = n[ix][0]
print(m)
for ix, index in enumerate(indices):
    m.T[index] = n[ix][0]
print(m)
Output of the above is:
[[1. 0. 0.]
[1. 1. 0.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
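For what it's worth, `np.tril_indices` appears to give those same index pairs in the same row-by-row order, so the loops can probably be replaced by two indexed assignments. A minimal sketch follows. The values 1 to 6 in `n` and the `lower` copy are assumptions used only to make the fill order visible.

```python
import numpy as np

m = np.zeros((3, 3))
n = np.array([[1], [2], [3], [4], [5], [6]])  # column vector, shape (6, 1)

# Lower-triangle indices including the diagonal, in row-major order:
# (0,0), (1,0), (1,1), (2,0), (2,1), (2,2)
rows, cols = np.tril_indices(3)

m[rows, cols] = n.ravel()
lower = m.copy()  # snapshot after the first step, kept only for inspection
print(lower)

# m.T is a view of m, so writing through it fills the upper triangle of m.
m.T[rows, cols] = n.ravel()
print(m)
```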
I am trying to wrap my head around 3D arrays (or multi-dimensional arrays in general), but it's blowing my brains a bit. In particular, the way 3D NumPy arrays are printed is counter-intuitive to me. This question is similar, but it is more about the differences between programming languages, and I still do not fully get it. Let me try to explain.
Say I want to create a 3D array with 3 rows (length), 5 columns (width) and a depth of 2. So a 3x5x2 matrix.
I do the following:
import numpy as np
a = np.zeros(30).reshape(3, 5, 2)
To me, a logical way to print this would be like this:
[[[0. 0. 0. 0. 0.] #We can still see three rows from top to bottom
[0. 0. 0. 0. 0.]] #We can still see five columns from left to right
[[0. 0. 0. 0. 0.] #Depth values are shown underneath each other
[0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]]
However, when I print this array it prints like this:
[[[0. 0.] #We can still see three rows from top to bottom,
[0. 0.] #However columns now also appear from top to bottom instead of from left to right
[0. 0.] #Depth values are now shown from left to right
[0. 0.]
[0. 0.]]
[[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]]
[[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]]]
It is not obvious to me why the array would be printed in this way. Maybe it is just me (maybe my spatial reasoning is lacking here), or is there a specific reason why NumPy arrays are printed like this?
Synthesizing the comments into a proper answer:
First, take a look at np.zeros(10).reshape(5, 2). That's 5 rows of 2 columns, not 2 rows of 5 columns. Adding 3 at the front means 3 planes of 5 rows and 2 columns. What you're missing is that your new dimension is at the front, not the end. In mathematics, extra dimensions are usually added at the end (like extending an (x, y) with a z to get (x, y, z)). In computer science, however, array dimensions are typically ordered this way. It reflects the way arrays are typically stored in memory, in row-major order.
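A small sketch may make this concrete. It uses `np.arange` instead of `np.zeros` (an assumption, so that individual elements can be told apart) and shows that the last axis is the one printed left to right. Moving the depth axis with `transpose` gives the layout the question expected.

```python
import numpy as np

a = np.arange(30).reshape(3, 5, 2)

print(a.shape)     # (3, 5, 2): 3 planes, each with 5 rows and 2 columns
print(a[0].shape)  # (5, 2): one plane
print(a[0, 0])     # [0 1]: the last axis runs left to right when printed

# Swap the last two axes to get 3 blocks of 2 rows by 5 columns.
b = a.transpose(0, 2, 1)
print(b.shape)     # (3, 2, 5)
print(b[0])
```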
I have been trying to perform Ordinary Least Squares regression using the scikit-learn library but have hit another rock.
I have used OneHotEncoder to binarize my (independent) dummy/categorical features and I have an array like so:
x = [[ 1. 0. 0. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]
[ 0. 1. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]]
The dependent variables (Y) are stored in a one dimensional array. Everything is wonderful, except now when I come to plot these values I get an error:
# Plot outputs
pl.scatter(x_test, y_test, color='black')
ValueError: x and y must be the same size
When I use numpy.size on X and Y respectively, it is clear that's a reasonable error:
>>> print np.size(x)
5096
>>> print np.size(y)
98
Interestingly, the two sets of data are accepted by the fit method.
My question is how can I transform the output of OneHotEncoder to use in my regression?
If I understand you correctly, you have your input X as an [n x m] matrix and some output Y of [n x 1], where n = number of data points and m = number of features.
Firstly, the linear regression fitting function will not care that X is of dimension [n x m] and Y of [n x 1], as it will simply use a parameter vector theta of dimension [m x 1], i.e.,
Y = X * theta
Unfortunately, as noted by eickenberg, you cannot plot all of the X features against the Y values with a single matplotlib scatter call as you have done, hence the error message about incompatible sizes: it wants to plot n values against n values, not n x m values against n.
To fix your problem, try looking at a single feature at a time:
pl.scatter(x_test[:,0], y_test, color='black')
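The size mismatch can be checked with NumPy alone. This is a sketch on placeholder arrays: the shape (98, 52) is an assumption inferred from the sizes in the question (98 * 52 = 5096), not something the question states.

```python
import numpy as np

# Hypothetical stand-ins with the sizes reported in the question:
# 98 samples and (assumed) 52 one-hot columns.
x_test = np.zeros((98, 52))
y_test = np.zeros(98)

# The full matrix has many more values than y, which is what scatter rejects.
print(np.size(x_test), np.size(y_test))

# A single column has exactly one value per sample, so it lines up with y.
print(x_test[:, 0].shape == y_test.shape)
```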
Assuming you have standardised your data (subtracted the mean and divided by the standard deviation), a quick and dirty way to see the trends would be to plot all of them on a single axes:
import matplotlib.pyplot as plt

fig = plt.figure(0)
ax = fig.add_subplot(111)
n, m = x_test.shape  # n data points, m features
for i in range(m):
    ax.scatter(x_test[:, i], y_test)  # one scatter per feature column
plt.show()
To visualise all at once on independent figures (depending on the number of features) then look at, e.g., subplot2grid routines or another python module like pandas.