Is it better to implement my own K-means algorithm in Python or to use a pre-implemented one from a library such as scikit-learn?
Before answering which is better, here is a quick reminder of the algorithm:

1. "Choose" the number of clusters K
2. Initialize your first centroids
3. For each point, find the closest centroid according to a distance function D
4. When all points have been assigned to a cluster, compute the barycenter of each cluster, which becomes its new centroid
5. Repeat steps 3 and 4 until convergence
As the steps above suggest, the algorithm depends on several parameters:
The number of clusters
Your initial centroid positions
A distance function to calculate distance between any point and centroid
A function to calculate the barycenter of each new cluster
A convergence metric
...
If none of the above is familiar to you and you want to understand the role of each parameter, I would recommend re-implementing it yourself on low-dimensional datasets (a minimal sketch follows below). Moreover, the existing Python implementations might not match your specific requirements, even though they provide good tuning options.
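For illustration, here is a minimal NumPy sketch of the steps above. It uses Euclidean distance, random-point initialization, and centroid movement as the convergence metric, and it assumes no cluster ever becomes empty; the function name and defaults are mine, not from any library:

import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids with k distinct random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the barycenter of its cluster
        # (assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids barely move (convergence metric)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids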
If your goal is just to use it quickly with a big-picture understanding, an existing implementation is fine; scikit-learn would be a good choice.
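For comparison, the scikit-learn version takes only a few lines (toy data; n_clusters and random_state are illustrative values):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                           # toy 2-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.cluster_centers_)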
When choosing a model such as LinearRegression or DecisionTreeRegressor, which scoring metric is the best to use? At https://scikit-learn.org/stable/modules/model_evaluation.html we can see the following:
explained_variance, max_error, neg_mean_absolute_error, neg_mean_squared_error, mean_squared_error, neg_root_mean_squared_error, neg_mean_squared_log_error, neg_median_absolute_error, r2, neg_mean_poisson_deviance, neg_mean_gamma_deviance, neg_mean_absol
However, for someone new to the field, it is not easy to decide which one to use. For a simple linear regression I would use r2 (probably because it is the one I am used to from school), but is it the best? And for a DecisionTreeRegressor, is that metric also appropriate, or is it better to use another one?
Also, should the fit be considered good when r^2 = 1 (Anscombe's quartet notwithstanding)? How about the other metrics?
There's no best scoring function. The one you pick should depend on your problem and what you're trying to measure.
I suggest you take a look at the regression metrics section of the page you linked. There you can find descriptions and usage suggestions, for example (for MSLE):
This metric is best to use when targets having exponential growth, such as population counts, average sales of a commodity over a span of years etc.
So, a good question on this topic would be something like: I am studying X and am trying to measure Y; which scoring metric should I be using?
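If it helps to experiment, here is a small sketch showing how to evaluate the same model under several of the scorer names from that page (toy data; the metric choices are just examples):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = DecisionTreeRegressor(random_state=0)
for scoring in ("r2", "neg_mean_squared_error", "neg_mean_absolute_error"):
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    print(scoring, scores.mean())    # higher is always better for sklearn scorers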
I have a few lists of movement tracking data, which look something like this
I want to create a list of outputs where I mark these large spikes, essentially indicating that there is movement at that point.
I applied a rolling standard deviation to the data with a window size of two and got this result
Now I can see the spikes that mark the points of interest, but I am not sure how to detect them in code. Is there a statistical tool that measures these spikes and can be used to flag them?
There are several approaches that you can use for an anomaly detection task.
The choice depends on your data.
If you want to use a statistical approach, you can use some measures like z-score or IQR.
Here you can find a tutorial for these measures.
Here instead, you can find another tutorial for a statistical approach which uses mean and variance.
Last but not least, I also suggest you check how to use a control chart, because in some cases it is enough.
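As a minimal sketch of the z-score idea (toy signal, injected spikes, and a threshold of 2 standard deviations, all chosen for illustration):

import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 0.1, size=50)   # toy tracking signal
signal[[10, 30]] += 5.0                  # inject two artificial spikes

z = (signal - signal.mean()) / signal.std()
spikes = np.where(np.abs(z) > 2.0)[0]    # flag points > 2 standard deviations out
print(spikes)                            # -> [10 30]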
I recently got interested in soccer statistics. Right now I want to implement the famous Dixon-Coles Model in Python 3.5 (paper-link).
The basic problem is that the model described in the paper yields a likelihood function with numerous parameters, which needs to be maximized.
For example, the likelihood function for one Bundesliga season has 37 parameters. Of course, I actually minimize the corresponding negative log-likelihood function. I know that this function is strictly convex, so the optimization should not be too difficult. I also supplied the analytic gradient, but once the number of parameters exceeds ~10, the optimization methods from the SciPy package (scipy.optimize.minimize()) fail.
My question:
Which other optimization techniques are out there, and which are best suited for optimization problems involving ~40 independent parameters?
Some hints to other methods would be great!
You may want to have a look at convex optimization packages like https://cvxopt.org/ or https://www.cvxpy.org/. They are Python-based, hence easy to use!
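For instance, a minimal CVXPY sketch for a generic 40-parameter convex problem looks like this (the least-squares objective is a placeholder, not the Dixon-Coles likelihood):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))       # placeholder data
b = rng.standard_normal(100)

x = cp.Variable(40)                      # the ~40 model parameters
problem = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)))
problem.solve()
print(problem.value, x.value[:5])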
You can make use of metaheuristic algorithms, which work on both convex and non-convex spaces. Probably the most famous of them is the genetic algorithm. It is also easy to implement, and the concept is straightforward. The beautiful thing about genetic algorithms is that you can adapt them to most optimization problems.
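SciPy itself ships one such metaheuristic, differential evolution; here is a minimal sketch on a placeholder 40-parameter objective (the bounds and objective are illustrative only):

import numpy as np
from scipy.optimize import differential_evolution

def objective(x):
    # placeholder convex objective standing in for the negative log-likelihood
    return np.sum((x - 0.5) ** 2)

bounds = [(-5.0, 5.0)] * 40              # one (low, high) pair per parameter
result = differential_evolution(objective, bounds, seed=0)
print(result.fun, result.x[:5])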
How can I calculate the similarity between users based on their scores?
For example, df:
user  score  category_cluster
i     4.5    category1
j     5      category1
k     9.5    category2
I want a result like: the similarity between user i's and user j's scores when they are in the same category_cluster; if they are not in the same cluster, do not compute a similarity. How would you measure the similarity?
You will need to define a score function first. Among others, there are the Manhattan and Euclidean distances, which are probably the most used ones. For more information about distances, I suggest looking into scikit-learn; they have a wide variety of distances (metrics) implemented. Look here for a list (you can research later what each of them measures).
Some of them are distance metrics (how different the elements are; the closer to 0, the more similar), while others measure similarity (like exponential kernels; the closer to 1, the more similar). It is easy to swap between distance and similarity metrics (the most basic conversion being distance = 1. - similarity, assuming both are in the [0, 1] range).
As for your similarity example, similarity[i, j] = 0.9 does not make any sense to me. What would be the similarity of i and k? Which formula did you use to get that 0.9? If you clarify it, I can provide a numpy-based implementation.
For direct similarity metrics, have a look here. You can use any of them if they suit your needs; the page explains what each of them measures.
An example usage of rbf_kernel:

from sklearn.metrics.pairwise import rbf_kernel

data = df['score'].to_numpy()
similarity = rbf_kernel(data.reshape(-1, 1), gamma=1.)  # try different values of gamma
gamma here acts like a threshold: different values of gamma make points count as similar more or less easily.
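Putting it together with your cluster constraint, here is a sketch that computes similarities only within each category_cluster (the DataFrame literal just reproduces your example, and gamma=1. is an arbitrary choice):

import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel

df = pd.DataFrame({'user': ['i', 'j', 'k'],
                   'score': [4.5, 5.0, 9.5],
                   'category_cluster': ['category1', 'category1', 'category2']})

for cluster, group in df.groupby('category_cluster'):
    scores = group['score'].to_numpy().reshape(-1, 1)
    sim = rbf_kernel(scores, gamma=1.)   # pairwise similarities within the cluster
    print(cluster, group['user'].tolist())
    print(sim)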
I'm clustering some data using scikit.
I have the easiest possible task: I know the number of clusters, and I know the size of each cluster. Is it possible to specify this information and pass it to the K-means function?
No. You need some type of constrained clustering algorithm to do this, and none are implemented in scikit-learn. (This is not "the easiest possible task"; I wouldn't even know of a principled algorithm that does this, aside from some heuristic moving of samples from one cluster to another.)
It won't be k-means anymore.
K-means is variance minimization, and it seems your objective is to produce partitions of a predefined size, not partitions of minimum variance.
However, here is a tutorial that shows how to modify k-means to produce clusters of the same size. You can easily extend this to produce clusters of the desired sizes instead of the average size; modifying k-means this way is fairly straightforward. But the results will be even more meaningless than k-means results already are on most data sets: k-means is often no better than random convex partitions.
I can only think of a brute-force approach. If the clusters are well separated, you may try running the clustering several times with different random initializations, providing just the number of clusters as input. After each run, count the size of each cluster, sort the sizes, and compare them with the sorted list of known cluster sizes. If they don't match, rinse and repeat.
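A minimal sketch of that loop with scikit-learn (the data, the target sizes, and the cap of 100 restarts are all illustrative; with arbitrary data a match may never occur):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)               # toy data
target_sizes = sorted([50, 70, 80])      # known cluster sizes

labels = None
for seed in range(100):
    candidate = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    if sorted(np.bincount(candidate).tolist()) == target_sizes:
        labels = candidate               # found a partition with the desired sizes
        break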