I've been fiddling around with the scipy.optimize.curve_fit() function today and I can get some pretty good results, but I'm not too sure how to make some data points weigh more than others.
Let me briefly summarize the situation:
We need to fit a decay curve to some data gathered from our experiment. Some data points occur more often than others and these are determined by a weight. So that means that if we have one data point A with weight x and one data point B with weight 2x, this would be equal to fitting a curve to one data point A with weight x and two data point B's with weight x.
The problem is that the curve_fit function can only be weighted using uncertainties, i.e. sigmas. I thought I was smart to translate each weight into its proportion of the sum of all the weights and then translate that proportion into a Z-score (I thought that would be equivalent in terms of uncertainty), and while this gave a MUCH better fit than not weighting anything at all, I still found through some unit testing that it wasn't equivalent when comparing a weight of 0.5 against having two actual data points.
How can I use curve_fit with linear weights?
PS: Through unit testing I've found that fitting data points:
(0,0) with weight 1
(1,0) with weight 1
(1,1) with weight 1
(1,1) with weight 1
yields an equal result as fitting:
(0,0) with weight 1
(1,0) with weight 1
(1,1) with weight 0.70710678118
And peculiarly, sin(0.5*pi) = 1 and sin(0.25*pi) = 0.70710678118!! So there seems to be a sine relation here? Unfortunately my math skills are limiting me in understanding the exact relation.
Also, sin(0.125*pi) unfortunately doesn't equal a weight of 3 or 4...
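To make the PS concrete, here is a minimal sketch of that unit test, using a simple straight-line model rather than my actual decay curve:
import numpy as np
from scipy.optimize import curve_fit

# Reproduce the unit test above: four points with the duplicated (1, 1),
# versus three points where (1, 1) gets sigma = 0.70710678118.
def line(x, m, c):
    return m * x + c

x_dup = np.array([0.0, 1.0, 1.0, 1.0])
y_dup = np.array([0.0, 0.0, 1.0, 1.0])
popt_dup, _ = curve_fit(line, x_dup, y_dup)

x_w = np.array([0.0, 1.0, 1.0])
y_w = np.array([0.0, 0.0, 1.0])
popt_w, _ = curve_fit(line, x_w, y_w, sigma=np.array([1.0, 1.0, 0.70710678118]))

print(popt_dup)  # both calls give (nearly) the same slope and intercept
print(popt_w)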
My dataset has feature columns and a target label of 0 or 1.
When I use an SVM classifier for binary classification it predicts well.
But my question is: how is it predicted mathematically?
The marginal hyperplanes H1 and H2 have the equations:
W^T X + b >= +1 and W^T X + b <= -1
meaning if the value is greater than or equal to +1 the point falls in one class, and if it is less than or equal to -1, it falls in the other class.
But we have given the target labels 0 and 1.
How is it actually done mathematically?
Could any expert please explain?
Basically, SVM wants to find the optimal hyperplane that splits the datapoints in such a way that the margin between the closest datapoints of each class (the so-called support vectors) is maximized. This all breaks down to the following Lagrangian optimization problem:
L(w, b, λ) = 1/2 * ||w||^2 - Σ_i λ_i * [ y_i * (w^T∙x_i + b) - 1 ],   with λ_i >= 0
where:
w: the vector that determines the optimal hyperplane (for intuition, make yourself familiar with the geometrical meaning of a dot product)
(w^T∙x_i + b): a scalar, proportional to the signed geometrical distance between the single datapoint x_i and the maximum-margin hyperplane
b: the bias (offset) term of the hyperplane; more on the derivation can be found here: Stanford University - Computer Science Lecture 3 - SVM
λ_i: the Lagrange multiplier for datapoint x_i
y_i: the class label of datapoint x_i, encoded as -1 or +1
Solving the optimization problem yields all the necessary parameters: w, b, and the λ_i.
To answer your question in one sentence: the class boundaries [-1, 1] are set arbitrarily. It is really just a convention.
The labels of your binary data [0; 1] (so-called dummy variables) have nothing to do with these boundaries. They are just a convenient way to label binary data, and are only needed to link the features to their corresponding class or category.
The only non-parameter in Formula (8) is x_i, your datapoint in feature space.
At least that's how I understand SVM. Feel free to correct me if I am wrong or imprecise.
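To make the label convention concrete, here is a small hedged sketch: the data, w, and b below are made-up values rather than the output of a real solver. The 0/1 labels are mapped to -1/+1 for the math, and the prediction is just the sign of w^T x + b mapped back to 0/1.
import numpy as np

# Hypothetical toy data with 0/1 target labels, as in the question.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Internally the SVM math uses y_i in {-1, +1}; 0/1 are just names for the classes.
y_pm = np.where(y == 0, -1, 1)
print(y_pm)        # [-1 -1 -1  1  1  1]

# Pretend these came out of the optimization (made-up values, not a real solution).
w = np.array([0.4, 0.3])
b = -2.5

# Prediction: the sign of w^T x + b picks the side of the hyperplane,
# and that sign is mapped back to the original 0/1 labels.
decision = X @ w + b
pred = np.where(decision < 0, 0, 1)
print(pred)        # [0 0 0 1 1 1]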
So I'm having a hard time conceptualizing how to write down a mathematical representation of my solution for a simple logistic regression problem. I understand what is happening conceptually and have implemented it, but I am answering a question which asks for a final solution.
Say I have a simple two-column dataset denoting something like the likelihood of getting a promotion per year worked, so the likelihood would increase as the person accumulates experience. X denotes the year and Y is a binary indicator for receiving a promotion:
X | Y
1 | 0
2 | 1
3 | 0
4 | 1
5 | 1
6 | 1
I implement logistic regression to find the probability per year worked of receiving a promotion, and get an output set of probabilities that seem correct.
I get an output weight vector that is two items long, which makes sense as there are only two inputs: the number of years X, and, when I fix the intercept to handle bias, an added column of 1s. So one weight for years, one for the bias.
So I have two questions about this.
Since it is easy to get an equation of the form y = mx + b as a decision boundary for something like linear regression or a PLA, how can I similarly denote a mathematical solution with the weights of the logistic regression model? Say I have a weight vector [0.9, -0.34]; how can I convert this into an equation?
Secondly, I am performing gradient descent, which returns a gradient, and I multiply that by my learning rate. Am I supposed to update the weights at every epoch? My gradient never returns zeros in this case, so I am always updating.
Thank you for your time.
Logistic regression is trying to map the input value (x = years) to the output value (y = likelihood) through this relationship:
y = 1 / (1 + e^(-(theta*x + b)))
where theta and b are the weights you are trying to find.
The decision boundary will then be defined as L(x) > p or L(x) < p, where L(x) is the right-hand side of the equation above. That is the relationship you want.
You can eventually transform it into a more linear form, like the one of linear regression, by moving the exponential to the numerator and taking the log on both sides, which gives log(y / (1 - y)) = theta*x + b.
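As a hedged sketch of what that looks like with the weight vector from the question, assuming the first entry is theta for years and the second is the bias b (swap them if your implementation orders the column of 1s first):
import numpy as np

# Assumed ordering: weights = [theta, b]; adjust if your column of 1s comes first.
theta, b = 0.9, -0.34

def predict_proba(x):
    # P(promotion | years = x) = 1 / (1 + exp(-(theta*x + b)))
    return 1.0 / (1.0 + np.exp(-(theta * x + b)))

years = np.arange(1, 7)
print(predict_proba(years))

# Decision boundary at probability 0.5: theta*x + b = 0  =>  x = -b / theta
print(-b / theta)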
I have a regression model where my target variable (days) is quantitative and ranges between 2 and 30. My RMSE is 2.5, and all the X variables are nominal (categorical), hence I have dummy-encoded them.
I want to know what would be a good value of RMSE. I want to get it within 1-1.5 or even lower, but I am unsure what I should do to achieve that.
Note: I have already tried feature selection and removing features with less importance.
Any ideas would be appreciated.
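For reference, a minimal sketch of the kind of setup described above; the column names and values are made up, the point is just the dummy encoding and reporting RMSE in target units:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data in the spirit of the question: nominal features, target in days (2-30).
df = pd.DataFrame({
    "priority": ["low", "high", "medium", "high", "low", "medium", "low", "high"],
    "team":     ["A", "B", "A", "C", "B", "C", "C", "A"],
    "days":     [3, 25, 12, 30, 5, 14, 8, 22],
})

X = pd.get_dummies(df[["priority", "team"]])   # dummy encoding of the nominal columns
y = df["days"]

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# RMSE is in the same units as the target, so 2.5 means "off by roughly 2.5 days".
# On such a tiny toy set the training RMSE will be near zero; real numbers need a
# proper train/test split.
print(np.sqrt(mean_squared_error(y, pred)))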
If your x values are categorical then it does not necessarily make much sense to bind them to a uniform grid. Who's to say categories A and B should be spaced apart the same as B and C? Assuming that they are will only lead to an incorrect representation of your results.
As your choice of scale is the unknown here, you would do better, in terms of visualisation, to set your uniform x grid to the day number and then see where the categories would fall on the y scale if given a linear relationship.
RMS error doesn't come into it at all if you don't have quantitative data for x and y.
I have a question regarding Active Shape Models. I am using the paper of T. Cootes (which can be found here).
I have done all of the initial steps (Procrustes Analysis to calculate mean shape, PCA to reduce dimensions) but am stuck on fitting.
This is the situation I am in now: I have calculated the mean shape with points X and have also calculated a new set of points Y that X should move to, to better fit my image.
I am using the following algorithm, which can be found on page 23 of the paper previously linked:
To clarify: x̄ is the mean shape calculated with Procrustes Analysis, and P is the matrix containing the eigenvectors calculated with PCA.
Everything goes well up to step 4. I can calculate the pose parameters and invert the transformation onto the points Y.
However, in step 5, something strange happens. Whatever pose parameters are calculated in step 3 and applied in step 4, step 5 always results in almost exactly the same vector y' with very low values (one of them being 1.17747114e-05, for example). So whether I calculated a scale of 1/10 or 1000, y' barely changes.
This results in the algorithm always converging to the same value of b, and thus the same output shape x, no matter what input set of target points Y I want the model points X to match.
This surely is not the goal of the algorithm... Could anyone explain this strange behaviour? Somehow, projecting my calculated vector y in step 5 into the "tangent plane" does not take into account any of the changes made in step 4.
Edit: I have some more reasoning, though no explanation or solution. If, in step 5, I manually set y' to consist only of zeros, then in step 6, b is equal to the matrix of eigenvectors multiplied with the mean shape. And this results in the same b I always get (since y' is always a vector with very low values).
But these eigenvectors are calculated from the mean shape using PCA... So what's expected is that no change should take place, right?
Something you could check is whether your coordinates are scaled properly: the algorithm assumes that all coordinates are scaled so that the mean shape vector has Euclidean norm one. If this is not the case (especially if the norm is much larger than one), you will get extremely small components for y'.
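A minimal sketch of that check on synthetic data, assuming the usual tangent-plane projection y' = y / (y · x̄); the shapes here are random vectors, not real landmarks:
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "mean shape", flattened as (x1, y1, ..., xn, yn) -- not real landmarks.
mean_shape = 100.0 * rng.normal(size=20)       # deliberately NOT unit norm
print(np.linalg.norm(mean_shape))              # much larger than 1

# A candidate shape y already mapped back into the model frame (step 4).
y = mean_shape + rng.normal(size=20)

# Tangent-plane projection as in step 5: with an un-normalised mean shape,
# y . mean_shape is huge, so y' collapses to tiny values regardless of the pose.
y_tangent = y / np.dot(y, mean_shape)
print(y_tangent[:4])

# After rescaling so the mean shape has unit norm, y' stays on a sensible scale.
mean_unit = mean_shape / np.linalg.norm(mean_shape)
y_unit = y / np.linalg.norm(mean_shape)
print((y_unit / np.dot(y_unit, mean_unit))[:4])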
I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23, so the distance algorithm thinks that 0 is very far from 23, because in absolute terms it is. In reality, and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference?
I'm doing something simple, similar to the following:
import numpy as np
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=2)
data = np.vstack(data)       # data: list of [hour, feature2, feature3] rows
fit = clusters.fit(data)
classes = fit.predict(data)
Each data element looks something like [22, 418, 192], where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to redefine your distance measure and therefore use a different library. But what is more important, the concept of the mean used in k-means won't suit the cyclic dimension. Let's consider the following example:
# current cluster X, based on centroid position Xc=24
x1 = 1
x2 = 24
# current cluster Y, based on centroid position Yc=10
y1 = 12
y2 = 13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results:
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k-means + cyclic distance measure results in a degenerated super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2}, {y1,y2}
Solving this problem requires checking, for each point, whether it is better to take the "simple average" or to represent the point as x' = x - 24. Unfortunately, given n points this makes 2^n possibilities.
This seems like a use case for kernelized k-means, where you are actually clustering in the abstract feature space (in your case, a "tube" rolled around the time dimension) induced by the kernel (a "similarity measure", i.e. the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in an arbitrary other distance function, as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (see the sketch after this list).
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
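A minimal sketch of that sin/cos transformation on made-up rows shaped like the question's [hour, feature2, feature3] data; the extra columns and their scaling are placeholders:
import numpy as np
from sklearn.cluster import KMeans

# Made-up rows shaped like the question's data: [hour, feature2, feature3].
data = np.array([[22, 418, 192],
                 [23, 400, 180],
                 [ 0, 410, 185],
                 [ 1, 395, 190],
                 [11,  50,  20],
                 [12,  55,  25]], dtype=float)

hours = data[:, 0]
# Map the cyclic hour onto the unit circle so hour 23 ends up close to hour 0.
hour_sin = np.sin(hours / 12 * np.pi)
hour_cos = np.cos(hours / 12 * np.pi)

# Replace the raw hour column with the two circular features.
# (Scaling the remaining columns is skipped here, but it usually matters for k-means.)
features = np.column_stack([hour_sin, hour_cos, data[:, 1:]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)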
The easiest approach, to me, is to adapt the k-means algorithm to the wraparound dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly (see the distance sketch after the code).
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5
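And a minimal sketch of the matching wrap-around distance for the hour dimension, assuming hours in [0, 24); this is the piece you would swap into the distance-to-centroid calculation:
import numpy as np

def circular_hour_distance(a, b, period=24.0):
    # Wrap-around distance between two hour values on a 24-hour clock.
    d = np.abs(np.asarray(a, dtype=float) - b) % period
    return np.minimum(d, period - d)

print(circular_hour_distance(23, 0))    # 1.0, not 23.0
print(circular_hour_distance(6, 18))    # 12.0, the maximum possible separation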