I have been trying to perform ordinary least squares (OLS) regression using the scikit-learn library but have hit another roadblock.
I have used OneHotEncoder to binarize my (independent) dummy/categorical features and I have an array like so:
x = [[ 1. 0. 0. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]
[ 0. 1. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]
[ 1. 0. 0. ..., 0. 0. 0.]]
The dependent variables (Y) are stored in a one dimensional array. Everything is wonderful, except now when I come to plot these values I get an error:
# Plot outputs
pl.scatter(x_test, y_test, color='black')
ValueError: x and y must be the same size
When I use numpy.size on x and y respectively, it is clear that's a reasonable error:
>>> print np.size(x)
5096
>>> print np.size(y)
98
Interestingly, the two sets of data are accepted by the fit method.
My question is how can I transform the output of OneHotEncoder to use in my regression?
If I understand you correctly, you have an input matrix X of dimension [m x n] and some output Y of [n x 1], where m = number of features and n = number of data points.
Firstly, the linear regression fitting function will not care that X is of dimension [m x n] and Y of [n x 1] as it will simply use a parameter of dimension [1 x m], i.e.,
Y = theta * X
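For instance, a minimal sketch with random data (the shapes are chosen to match the np.size output above, 98 samples by 52 one-hot columns; LinearRegression is assumed since the question mentions OLS with scikit-learn, which expects X as samples by features):
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.random.rand(98, 52)  # 98 * 52 = 5096 values, as np.size reported
y = np.random.rand(98)
model = LinearRegression().fit(x, y)  # fit accepts these shapes happily
print(model.coef_.shape)  # (52,): one coefficient per feature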
Unfortunately, as noted by eickenberg, you cannot plot all of the X features against the Y values in a single matplotlib scatter call as you have; hence the error message about incompatible sizes: it wants to plot n points against n points, not (n x m) against n.
To fix your problem, try looking at a single feature at a time:
pl.scatter(x_test[:,0], y_test, color='black')
Assuming you have standardised your data (subtracted the mean and divided by the standard deviation), a quick and dirty way to see the trends would be to plot all of them on a single axes:
fig = plt.figure(0)
ax = fig.add_subplot(111)
n, m = x_test.shape  # shape, not size: unpack (rows, columns)
for i in range(m):
    ax.scatter(x_test[:, i], y_test)  # feature i against y
plt.show()
To visualise all the features at once on independent figures (depending on the number of features), look at, e.g., the subplot2grid routines, or another Python module like pandas.
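A minimal sketch of that idea with plt.subplots (assuming the same x_test and y_test as above; the 4-column grid and marker size are arbitrary choices, not from the question):
import matplotlib.pyplot as plt

n, m = x_test.shape
ncols = 4                         # arbitrary grid width
nrows = (m + ncols - 1) // ncols  # enough rows to hold all features
fig, axes = plt.subplots(nrows, ncols, squeeze=False,
                         figsize=(12, 3 * nrows))
for i in range(m):
    axes.flat[i].scatter(x_test[:, i], y_test, s=5)
    axes.flat[i].set_title('feature %d' % i)
plt.tight_layout()
plt.show()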
I am wondering if it's possible to vectorize the following operation in NumPy or TensorFlow. The ultimate goal is to do it in TensorFlow, but it seems NumPy would be easier for illustration here.
The problem is to get a discretized occupancy grid from a set of 2D points (x, y), and to calculate the average of the points that fall in each grid cell.
1) Given the 2D array xy, every row [x, y] is mapped to an index [xid, yid]. This step is done via np.apply_along_axis.
2) In another 3D array grid_sum, given the [xid, yid] we calculated in 1), we update grid_sum[xid, yid] += [x, y].
3) In yet another 2D array grid_count, given the [xid, yid] we calculated in 1), we update grid_count[xid, yid] += 1.
4) We get the final result, the 3D array grid_mean, by dividing grid_sum by grid_count at every [xid, yid].
The problem with vectorizing this operation is that different rows might be trying to write to the same location in the new array, creating a race condition. How can I handle this?
I have the following minimal example here to help understand this situation.
Edit after comment
This example works fine because I use a for loop. Is it possible to achieve the same without the for loop?
import numpy as np
xy = np.array([[1, 1], [1, 1]], dtype=np.int16)
grid_sum = np.zeros([3, 3, 2])
grid_count = np.zeros([3, 3])
for i in range(xy.shape[0]):
    idx = xy[i]  # simple case, just use the array value as the index
    grid_sum[idx[0], idx[1], :] += xy[i]
    grid_count[idx[0], idx[1]] += 1
print(grid_sum)
print(grid_count)
# grid_sum result
# [[[0. 0.]
# [0. 0.]
# [0. 0.]]
# [[0. 0.]
# [2. 2.]
# [0. 0.]]
# [[0. 0.]
# [0. 0.]
# [0. 0.]]]
# grid_count result
# [[0. 0. 0.]
# [0. 2. 0.]
# [0. 0. 0.]]
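One way to achieve this without the for loop in NumPy is np.add.at, which performs unbuffered in-place addition, so rows that map to the same [xid, yid] accumulate instead of overwriting each other (a sketch against the example arrays above; in TensorFlow, tf.tensor_scatter_nd_add offers a similar scatter-add):
import numpy as np

xy = np.array([[1, 1], [1, 1]], dtype=np.int16)
grid_sum = np.zeros([3, 3, 2])
grid_count = np.zeros([3, 3])

# Unbuffered addition: duplicate indices accumulate rather than race.
np.add.at(grid_sum, (xy[:, 0], xy[:, 1]), xy)
np.add.at(grid_count, (xy[:, 0], xy[:, 1]), 1)

# Per-cell mean, guarding against division by zero in empty cells.
grid_mean = np.divide(grid_sum, grid_count[..., None],
                      out=np.zeros_like(grid_sum),
                      where=grid_count[..., None] > 0)
print(grid_sum)
print(grid_count)
This reproduces the printed results above.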
How can I fill the elements of the lower triangular part of a matrix, including the diagonal, with values from a column vector?
For example, I have:
m=np.zeros((3,3))
n=np.array([[1],[1],[1],[1],[1],[1]]) #column vector
I want to replace the values of m at indices (0,0), (1,0), (1,1), (2,0), (2,1), (2,2) with the values from the vector n, so I get:
m=np.array([[1,0,0],[1,1,0],[1,1,1]])
Then I want to apply the same operation to m.T to get as a result:
m=np.array([[1,1,1],[1,1,1],[1,1,1]])
Can someone help me please? n should be a vector with shape (6, 1).
I'm not sure if there's going to be a clever numpy-specific way of doing this, but it looks relatively straightforward like this:
import numpy as np
m=np.zeros((3,3))
n=np.array([[1],[1],[1],[1],[1],[1]]) #column vector
indices=[(0,0),(1,0),(1,1),(2,0),(2,1),(2,2)]
for ix, index in enumerate(indices):
    m[index] = n[ix][0]
print(m)
for ix, index in enumerate(indices):
    m.T[index] = n[ix][0]
print(m)
Output of the above is:
[[1. 0. 0.]
[1. 1. 0.]
[1. 1. 1.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
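For a numpy-specific alternative, np.tril_indices generates exactly the row-by-row lower-triangle positions listed in the question, so the loops can be replaced by fancy indexing (a sketch; assigning through m.T works because the transpose is a view of m):
import numpy as np

m = np.zeros((3, 3))
n = np.array([[1], [1], [1], [1], [1], [1]])  # column vector, shape (6, 1)

# np.tril_indices(3) yields (0,0), (1,0), (1,1), (2,0), (2,1), (2,2).
rows, cols = np.tril_indices(3)
m[rows, cols] = n.ravel()
print(m)

m.T[rows, cols] = n.ravel()  # fills the upper triangle of m via the view
print(m)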
I have got a set of histograms from numpy.histogram:
probas, years = zip(*[np.histogram(r, bins= bin_values) for r in results])
results is an array of shape (9, 10000). The bin values are the years from 2029 to 2066. The probas array has a shape of (9, 37) and the years array (9, 38), so years[:,:-1] has a shape of (9, 37).
I can obtain the cumulative histogram data using:
probas = np.cumsum(probas, axis=1)
I can then normalize it to [0,1]:
probas = np.asarray(probas)
probas = probas/np.max(probas, axis = 0)
I then try and interpolate that cumulative distribution using scipy:
inverse_pdfs = [scipy.interpolate.interp1d(probas[i], years[i,:-1]) for i in range(probas.shape[0])]
When I plot the third cumulative histogram of the data set with plt.plot(), together with the curve reconstructed from inverse_pdfs, using:
i = 2
plt.plot(years[i,:-1], probas[i], color="orange")
probability_range = np.arange(0.,1.01,0.01)
plt.plot([inverse_pdfs[i](p) for p in probability_range], probability_range, color="blue")
I obtain a plot (not reproduced here) in which the match is pretty good for most of the years after 2042, but before that it is very bad.
Any suggestion on how to improve that match, or where the problem comes from, would be very welcome.
For information, the data used to train the interpolator on the third histogram are:
years[2,:-1]: [2029. 2030. 2031. 2032. 2033. 2034. 2035. 2036. 2037. 2038. 2039. 2040.
2041. 2042. 2043. 2044. 2045. 2046. 2047. 2048. 2049. 2050. 2051. 2052.
2053. 2054. 2055. 2056. 2057. 2058. 2059. 2060. 2061. 2062. 2063. 2064.
2065.]
probas[2]:[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.0916 0.2968 0.4888 0.6666 0.8335 0.9683 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. ]
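For what it's worth, one plausible cause (an assumption, not something confirmed by the question): the cumulative curve is flat at 0 before 2042 and flat at 1 from 2048 onwards, so its inverse is multivalued there, and interp1d is handed many repeated x-values. A sketch that keeps only the first occurrence of each probability level, so the interpolation x-axis is strictly increasing (reusing the probas and years arrays from above):
import numpy as np
import scipy.interpolate

i = 2
# np.unique returns the sorted unique values and the index of each first
# occurrence; probas[i] is nondecreasing, so this picks one year per level.
p, idx = np.unique(probas[i], return_index=True)
inverse_pdf = scipy.interpolate.interp1d(p, years[i, :-1][idx])
Whether a flat segment should map to its first or its last year is a modelling choice; np.unique keeps the first occurrence.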
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
print(digits.data)
classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.data[:-1], digits.target[:-1]
x = x.reshape(1,-1)
y = y.reshape(-1,1)
print((x))
classifier.fit(x, y)
###
print('Prediction:', classifier.predict(digits.data[-3]))
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
I have reshaped x and y as well. Still I'm getting an error saying:
Found input variables with inconsistent numbers of samples: [1, 1796]
y is a 1-d array with 1796 elements, whereas x has many more. How does it show 1 for x?
Actually scrap what I suggested below:
This link describes the general dataset API. The attribute data is a 2d array of each image, already flattened:
import sklearn.datasets
digits = sklearn.datasets.load_digits()
digits.data.shape
#: (1797, 64)
This is all you need to provide, no reshaping required. Similarly, the attribute target is a 1d array of labels:
digits.target.shape
#: (1797,)
No reshaping necessary. Just split into training and testing and run with it.
Try printing x.shape and y.shape. I suspect you're going to find something like (1, 1796, ...) and (1796, ...) respectively. When calling fit on a scikit-learn classifier, it expects two iterables whose first dimensions (the number of samples) match.
The clue: why are the reshape arguments the opposite way around?
x = x.reshape(1, -1)
y = y.reshape(-1, 1)
Maybe try:
x = x.reshape(-1, 1)
Completely unrelated to your question, but you're predicting on digits.data[-3] when the only element left out of the training set is digits.data[-1]. Not sure if that was intentional.
Regardless, it could be good to check your classifier over more results using the scikit metrics package. This page has an example of using it over the digits dataset.
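For instance, a minimal sketch of that kind of check (train_test_split and classification_report are standard scikit-learn utilities; the split size and gamma=0.001, the value used in the scikit-learn digits example, are illustrative choices, not from the question):
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
# Hold out a quarter of the samples for testing (arbitrary choice).
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25)

classifier = svm.SVC(gamma=0.001, C=100)
classifier.fit(x_train, y_train)
print(metrics.classification_report(y_test, classifier.predict(x_test)))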
The reshaping will transform each 8x8 image matrix into a 1-dimensional vector, which can be used as a feature vector. You need to reshape the entire X array, not only the training data, since the ones you will use for prediction need to have the same format.
The following code shows how:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.images, digits.target
#only reshape X since its a 8x8 matrix and needs to be flattened
n_samples = len(digits.images)
x = x.reshape((n_samples, -1))
print("before reshape:" + str(digits.images[0]))
print("After reshape" + str(x[0]))
classifier.fit(x[:-2], y[:-2])
###
print('Prediction:', classifier.predict(x[-2:-1]))  # slice keeps the 2-d shape (1, 64) that predict expects
###
plt.imshow(digits.images[-2], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
###
print('Prediction:', classifier.predict(x[-1:]))  # slice keeps the 2-d shape (1, 64)
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
It will output:
before reshape:[[ 0. 0. 5. 13. 9. 1. 0. 0.]
[ 0. 0. 13. 15. 10. 15. 5. 0.]
[ 0. 3. 15. 2. 0. 11. 8. 0.]
[ 0. 4. 12. 0. 0. 8. 8. 0.]
[ 0. 5. 8. 0. 0. 9. 8. 0.]
[ 0. 4. 11. 0. 1. 12. 7. 0.]
[ 0. 2. 14. 5. 10. 12. 0. 0.]
[ 0. 0. 6. 13. 10. 0. 0. 0.]]
After reshape[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5.
0. 0. 3. 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8.
8. 0. 0. 5. 8. 0. 0. 9. 8. 0. 0. 4. 11. 0. 1.
12. 7. 0. 0. 2. 14. 5. 10. 12. 0. 0. 0. 0. 6. 13.
10. 0. 0. 0.]
And a correct prediction for the last 2 images, which weren't used for training; you can, however, decide to make a bigger split between the training and testing sets.
I'm using the robotics toolbox (specifically its robot module) for Python (http://code.google.com/p/robotics-toolbox-python/) and I'm utterly confused about how to interpret the following conversion results (I've done things like this in the past and could always work them out...).
A simple phi = PI/10 rotation about the x-axis produces the following (3x3) rotation matrix:
R =
[[ 1.          0.          0.        ]
 [ 0.          0.99609879 -0.08824514]
 [-0.          0.08824514  0.99609879]]
( where 0.996..=cos(phi) and 0.0882..=sin(phi) )
with corresponding (4x4) homogeneous transformation matrix:
T = | R 0 | =
| 0 1 |
[[ 1. 0. 0. 0. ]
[ 0. 0.99609879 -0.08824514 0. ]
[-0. 0.08824514 0.99609879 0. ]
[ 0. 0. 0. 1. ]]
Conversion of T into angle representation produces the following:
RPY (Roll, pitch, yaw) angles (around z, y and x axes, respectively, ... I presume):
print robot.tr2rpy(T)
[[ 0. 0. 0.08836007]]
Problem: How can the x-rotation be the last element (rather than the first)...?
Further:
Euler angles (around x, y and z axes, respectively, ... I presume):
print robot.tr2eul(T)
[[-1.57079633 0.08836007 1.57079633]]
( = [[ -PI/2, phi, PI/2 ]] )
Problem: My interpretation (sequential rotation around x,y,z axes) tells me the result is dead wrong...?
What am I missing? Thanks.
-- Henk
Solved:
With successive angles a[0], a[1], a[2]:
1) For Euler angles, these are successive rotations around the Z-Y-Z axes;
2) For RPY angles, these are successive Y(aw)-P(itch)-R(oll) rotations (i.e. around the Z-Y-X axes).
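A quick way to sanity-check these conventions is to compose elementary rotations directly in NumPy (a sketch with hand-rolled rotation matrices rather than the toolbox's own helpers; the angle value is taken from the output above):
import numpy as np

def rotx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def roty(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rotz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

phi = 0.08836007

# ZYZ Euler angles [-pi/2, phi, pi/2] reproduce the pure x-rotation:
R_eul = rotz(-np.pi / 2) @ roty(phi) @ rotz(np.pi / 2)
# RPY applied as Z-Y-X with [0, 0, phi] does too:
R_rpy = rotz(0) @ roty(0) @ rotx(phi)

print(np.allclose(R_eul, rotx(phi)))  # True
print(np.allclose(R_rpy, rotx(phi)))  # True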