I have a matrix X with 1000 features (columns) and 100 rows of float elements, and a target vector y with two classes, 0 and 1; the dimension of y is (100,1). I want to find the 10 best features in this matrix for discriminating between the 2 classes. I tried to use the chi-square test defined in scikit-learn, but X contains float elements.
Can you help me and tell me which function I can use?
Thank you.
I am not sure what you mean by X containing float elements. chi2 works for non-negative features such as counts or histogram data (e.g. L1-normalized frequencies). If your data doesn't satisfy this, you have to use another method.
There is a whole module of feature selection algorithms in scikit-learn. Have you read the docs? The simplest one would be using SelectKBest.
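As a minimal sketch of the SelectKBest suggestion: for float features that may be negative, one option is to swap chi2 for the ANOVA F-test (f_classif). The data below is synthetic and only stands in for the matrix described in the question; the choice of f_classif is my assumption, not something the question asked for.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the question's data: 100 samples, 1000 float features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)
# Make the first three features informative so there is something to find.
X[:, :3] += 2.0 * y[:, None]

# f_classif (ANOVA F-test) accepts negative float values, unlike chi2.
selector = SelectKBest(score_func=f_classif, k=10)
# y must be 1-D here; a (100, 1) array would need y.ravel().
X_top10 = selector.fit_transform(X, y)
top10_idx = selector.get_support(indices=True)  # column indices of the kept features
print(top10_idx)
```

get_support(indices=True) gives the column indices, which is usually what you want if you need to know which features were chosen and not just the reduced matrix.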
Recursive Feature Elimination (RFE) has been really effective for me. This method fits a model that assigns a weight to every feature, and removes the feature with the smallest weight. This step is repeated until we reach the desired number of features (in your case 10).
http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination
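A rough sketch of RFE on data shaped like the question's. The estimator (LogisticRegression) and step=50 are assumptions of mine: any estimator exposing coef_ or feature_importances_ should work, and step=50 drops 50 features per round only to keep the run short, whereas step=1 matches the one-at-a-time description above.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 100 samples, 1000 float features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)
X[:, :3] += 2.0 * y[:, None]  # a few informative features

# The estimator supplies the per-feature weights (here: |coef_|).
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=10, step=50)
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)  # indices of the 10 surviving features
print(selected)
```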
As far as I know, if your data is correlated, L1 penalty selection might not be the best idea. Correct me if I'm wrong.
I have a dataset of peak load for a year. It's a simple two-column dataset with the date and load (kWh).
I want to train on the first 9 months and then predict the next three months. I can't get my head around how to implement SVR. I understand my 'y' would be the predicted value in kWh, but what about my X values?
Can anyone help?
given multi-variable regression, y =
Regression is a multi-dimensional separation, which can be hard to visualize in one's head since it is not 3D.
The better question might be: which variables are consequential to the output value `y`?
Since you have the code to the loadavg in the kernel source, you can use the input parameters.
For Python (I suppose, the same way will be for R):
Collect the data in this way:
[x_i-9, x_i-8, ..., x_i] vs [x_i+1, x_i+2, x_i+3]
First vector - your input vector. Second vector - your output vector (or value if you like). Use method fit from here, for example: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.fit
You can try scaling, removing outliers, apply weights and so on. Play :)
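The windowing above can be sketched roughly like this. The series is made up, make_windows is a hypothetical helper, and wrapping SVR in MultiOutputRegressor is my assumption for getting three outputs at once, since SVR itself predicts a single value.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def make_windows(series, n_in=10, n_out=3):
    """Slice a 1-D series into (past window, future window) training pairs."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                   # [x_i-9, ..., x_i]
        Y.append(series[i + n_in:i + n_in + n_out])    # [x_i+1, x_i+2, x_i+3]
    return np.array(X), np.array(Y)

# Made-up load series (kWh), one value per time step.
t = np.arange(120)
load = 50 + 10 * np.sin(2 * np.pi * t / 12)

# Train only on the first three quarters of the series.
X, Y = make_windows(load[:90])
model = MultiOutputRegressor(SVR(kernel="rbf", C=100.0))
model.fit(X, Y)

# Forecast the 3 steps that follow the last training window.
last_window = load[80:90].reshape(1, -1)
forecast = model.predict(last_window)
print(forecast)
```

If you only need one step ahead, you can drop the wrapper, use a single-column target, and feed predictions back in as inputs.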
I am estimating the fundamental matrix and the essential matrix using the built-in functions in OpenCV. I provide input points to the function using ORB and a brute-force matcher. These are the problems I am facing:
1. The essential matrix that I compute with the built-in function does not match the one I compute from the fundamental matrix as E = K.t() * F * K.
2. As I vary the number of points used to compute F and E, the values of F and E keep changing. The function uses the RANSAC method. How do I know which value is the correct one?
3. I am also using a built-in function to decompose E and find the correct R and T from the 4 possible solutions. The values of R and T also change with the changing E. More concerning is the fact that the direction vector T changes without a pattern. Say it was in the X direction for one value of E; if I change E, it changes to Y or Z. Why is this happening? Has anyone else had the same problem?
How do I resolve this problem? My project involves taking measurements of objects from images.
Any suggestions or help would be welcome!
Both F and E are defined up to a scale factor. It may help to normalize the matrices, e. g. by dividing by the last element.
RANSAC is a randomized algorithm, so you will get a different result every time. You can test how much it varies by triangulating the points, or by computing the reprojection errors. If the results vary too much, you may want to increase the number of RANSAC trials or decrease the distance threshold, to make sure that RANSAC converges to the correct solution.
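To illustrate the scale ambiguity, here is a small NumPy-only sketch with made-up intrinsics K and a made-up pose (R, t). It normalizes by the Frobenius norm and fixes the sign, which is a variation on dividing by the last element that I chose because the last element can be close to zero; normalize and skew are hypothetical helper names.

```python
import numpy as np

def normalize(M):
    """Scale M to unit Frobenius norm and make its largest-magnitude entry positive."""
    M = M / np.linalg.norm(M)
    idx = np.unravel_index(np.argmax(np.abs(M)), M.shape)
    return M * np.sign(M[idx])

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Made-up camera intrinsics and relative pose (assumptions for illustration).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(10)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])

E_true = skew(t) @ R
K_inv = np.linalg.inv(K)
F = K_inv.T @ E_true @ K_inv

# An estimator may hand back any nonzero multiple of F.
F_scaled = -3.7 * F
E_from_F = K.T @ F_scaled @ K

# The raw matrices differ, the normalized ones agree.
matches = np.allclose(normalize(E_from_F), normalize(E_true))
print(matches)
```

So before concluding that two estimates of E disagree, compare them after normalization (and keep in mind that -E is just as valid as E).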
Yes, computing the fundamental matrix can give a different matrix every time, as it is only defined up to a scale factor.
It is a rank-2 matrix with 7 degrees of freedom: 9 entries, minus 1 for the overall scale and 1 for the rank constraint det(F) = 0.
The fundamental matrix is a 3x3 matrix; F33 (3rd row, 3rd column) is conventionally used to fix the scale factor.
You may ask why we fix F33 to a constant. Because x_left' * F * x_right = 0 is a homogeneous equation with infinitely many solutions, we add a constraint by making F33 constant.
I have two observations of the same event, say X and Y.
I assume there are nc clusters. I am using sklearn to do the clustering.
x = KMeans(n_clusters=nc).fit_predict(X)
y = KMeans(n_clusters=nc).fit_predict(Y)
Is there a measure that allows me to compare x and y, i.e. one that equals 1 if the clusterings x and y are the same?
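Just extract the cluster centers of your fitted KMeans objects (see the docs). Note that fit_predict returns the labels, so you need to keep the estimators themselves:
km_x = KMeans(n_clusters=nc).fit(X)
km_y = KMeans(n_clusters=nc).fit(Y)
x_centers = km_x.cluster_centers_
y_centers = km_y.cluster_centers_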
Then you have to decide which metric to use to compare these. Keep in mind that the centers are floating-point values, and that the clustering process is a heuristic with random initialization. This means that, with high probability, you will get results that are not exactly the same, even for KMeans objects trained on the same data.
This link discusses some approaches and the problems.
The Rand Index and its adjusted version do exactly this. Two cluster assignments that match (even if the labels themselves, which are treated as arbitrary, are different) get a score of 1. The Adjusted Rand Index is corrected for chance: its baseline is a random assignment of points to clusters, which scores close to 0.
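A minimal sketch with scikit-learn's adjusted_rand_score, using hand-written label arrays in place of the fit_predict outputs. It assumes both labelings refer to the same samples in the same order.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])  # same partition, different label names
labels_c = np.array([0, 1, 0, 1, 0, 1])  # a genuinely different partition

same = adjusted_rand_score(labels_a, labels_b)
different = adjusted_rand_score(labels_a, labels_c)
print(same, different)
```

The score is symmetric and ignores how the clusters are named, so you don't need to match cluster ids between the two runs first.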
I am trying to fit a linear regression Ax = b where A is a sparse matrix and b a sparse vector. I tried scipy.sparse.linalg.lsqr, but apparently b needs to be a (dense) numpy array. Indeed, if I run
A = [list(range(0,10)) for i in range(0,15)]
A = scipy.sparse.coo_matrix(A)
b = list(range(0,15))
b = scipy.sparse.coo_matrix(b)
scipy.sparse.linalg.lsqr(A,b)
I end up with:
AttributeError: squeeze not found
While
scipy.sparse.linalg.lsqr(A,b.toarray())
seems to work.
Unfortunately, in my case b is a 1.5 billion x 1 vector and I simply can't use a dense array. Does anybody know a workaround or other libraries for running linear regression with a sparse matrix and vector?
It seems that the documentation specifically asks for a numpy array. However, given the scale of your problem, maybe it's easier to use the closed-form solution of linear least squares?
Given that you want to solve Ax = b, you can form the normal equations and solve those instead. In other words, you'd solve min ||Ax-b||^2, whose minimizer satisfies A.T*A*x = A.T*b.
The closed form solution would be x = (A.T*A)^{-1} * A.T *b.
Of course, this closed form solution comes with its own requirements (specifically, on the rank of the matrix A).
You can solve for x using spsolve or if that's too expensive, then using an iterative solver (like Conjugate Gradients) to get an inexact solution.
The code would be:
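import scipy.linalg
import scipy.sparse
import scipy.sparse.linalg

A = scipy.sparse.rand(1500, 1000, 0.5)  # create a random instance
b = scipy.sparse.rand(1500, 1, 0.5)
# Solve the normal equations (A.T*A) x = A.T*b with a sparse direct solver
x = scipy.sparse.linalg.spsolve(A.T @ A, A.T @ b)
x_lsqr = scipy.sparse.linalg.lsqr(A, b.toarray())  # just for comparison
print(scipy.linalg.norm(x_lsqr[0] - x))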
which on a few random instances, consistently gave me values less than 1E-7.
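For the iterative route mentioned above, a sketch with scipy.sparse.linalg.cg applied to the normal equations might look like the following. The problem size and density are arbitrary choices of mine, and the right-hand side is kept dense here only so the result can be checked against a dense reference solve.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Small random sparse instance (about 5% nonzeros), purely for illustration.
rng = np.random.default_rng(0)
dense = rng.random((300, 50)) * (rng.random((300, 50)) < 0.05)
A = sp.csr_matrix(dense)
b = rng.random(300)

# Normal equations: (A.T A) x = A.T b. A.T A is symmetric positive definite
# when A has full column rank, which is what CG requires.
AtA = (A.T @ A).tocsr()
Atb = A.T @ b
x_cg, info = cg(AtA, Atb, maxiter=1000)  # info == 0 means CG reported convergence

# Dense reference solution, only feasible because this instance is tiny.
x_ref = np.linalg.lstsq(dense, b, rcond=None)[0]
print(info, np.linalg.norm(x_cg - x_ref))
```

Note that forming A.T A squares the condition number, so on badly conditioned problems CG on the normal equations can converge slowly.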
Apparently billions of observations is too much for my machine. I ended up:
1. Changing the algorithm to Stochastic Gradient Descent (SGD): faster with many observations.
2. Removing completely sparse examples (i.e. features and label both equal to zero).
Indeed, the update rule of SGD with a least-squares loss is always zero for the observations in 2. This reduced the observations from billions to millions, which turned out to be feasible with SGD on my machine.
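The filtering step could be sketched like this on a toy example. It assumes A and b are stored in CSR format, and the tiny matrices are invented for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Toy data: row 0 is all-zero in both features and label.
A = sp.csr_matrix(np.array([[0.0, 0.0, 0.0],
                            [1.0, 0.0, 2.0],
                            [0.0, 0.0, 0.0],
                            [0.0, 3.0, 0.0]]))
b = sp.csr_matrix(np.array([[0.0], [5.0], [4.0], [0.0]]))

# Drop explicitly stored zeros so the stored-entry counts below are reliable.
A.eliminate_zeros()
b.eliminate_zeros()

# In CSR, consecutive differences of indptr give the stored entries per row.
row_has_features = np.diff(A.indptr) > 0
row_has_label = np.diff(b.indptr) > 0
keep = row_has_features | row_has_label

A_small = A[keep]
b_small = b[keep].toarray().ravel()  # dense is affordable once the rows are filtered
print(A_small.shape, b_small)
```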
I'd like to linearly fit the data that were NOT sampled independently. I came across generalized least square method:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
The equation is Matlab format; X and Y are coordinates of the data points, and V is a "variance matrix".
The problem is that, due to its size (1000 rows and columns), the V matrix becomes singular and thus non-invertible. Any suggestions for how to get around this problem? Maybe using a way of solving the generalized linear regression problem other than GLS? The tools that I have available and am (slightly) familiar with are Numpy/Scipy, R, and Matlab.
Instead of:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
Use
b= (X'/V *X)\X'/V*Y
That is, replace all instances of X*(Y^-1) with X/Y. Matlab will skip calculating the inverse (which is hard, and error prone) and compute the divide directly.
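Since Numpy was mentioned as an option, here is roughly the same idea there: np.linalg.solve plays the role of Matlab's divide operators, so V is never inverted explicitly. The data and the covariance matrix V below are made up for illustration.

```python
import numpy as np

# Made-up regression data: intercept plus one predictor.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

# Made-up covariance: correlation decays with distance between samples.
idx = np.arange(n)
V = 0.5 ** np.abs(idx[:, None] - idx[None, :])

# Solve V Z = X and V z = Y in place of forming inv(V).
Vinv_X = np.linalg.solve(V, X)
Vinv_Y = np.linalg.solve(V, Y)

# b = (X' V^-1 X)^-1 X' V^-1 Y, again via a linear solve.
b = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)
print(b)
```

This only avoids the numerical cost and error of the explicit inverse; if V is exactly singular, np.linalg.solve will raise an error just like the inverse would.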
Edit: Even with the best matrix manipulation, some operations are not possible (for example leading to errors like you describe).
An example of that which may be relevant to your problem is if you try to solve a least squares problem under the constraint that multiple measurements are perfectly, 100% correlated. Except in rare, degenerate cases this cannot be accomplished, either in math or physically. You need some independence in the measurements to account for measurement noise or modeling errors. For example, if you have two measurements, each with a variance of 1, and perfectly correlated, then your V matrix would look like this:
V = [1 1; ...
1 1];
And you would never be able to fit to the data. (This generally means you need to reformulate your basis functions, but that's a longer essay.)
However, if you adjust your measurement variance to allow for some small amount of independence between the measurements, then it would work without a problem. For example, 95% correlated measurements would look like this
V = [1 0.95; ...
0.95 1 ];
You can use singular value decomposition as your solver. It'll do the best that can be done.
I usually think about least squares another way. You can read my thoughts here:
http://www.scribd.com/doc/21983425/Least-Squares-Fit
See if that works better for you.
I don't understand how the size is an issue. If you have N (x, y) pairs you still only have to solve for (M+1) coefficients in an M-order polynomial:
y = a0 + a1*x + a2*x^2 + ... + am*x^m
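To make that concrete, here is a small sketch in Numpy: 1000 made-up (x, y) pairs, but only M+1 = 3 unknowns for a second-order polynomial. The true coefficients and the noise level are invented for illustration.

```python
import numpy as np

# 1000 noisy samples of y = 1 + 2x - 3x^2.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 1000)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.01, size=x.size)

m = 2  # polynomial order
# Design matrix with columns 1, x, x^2: shape (1000, m + 1).
design = np.vander(x, m + 1, increasing=True)

# Least-squares solve for a0, a1, a2 (lstsq uses an SVD-based LAPACK routine).
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coeffs)
```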