When looking for a model such as LinearRegression or DecisionTreeRegressor, which is the best scoring metric to use? In https://scikit-learn.org/stable/modules/model_evaluation.html we can see the following:
explained_variance, max_error, neg_mean_absolute_error, neg_mean_squared_error, mean_squared_error, neg_root_mean_squared_error, neg_mean_squared_log_error, neg_median_absolute_error, r2, neg_mean_poisson_deviance, neg_mean_gamma_deviance, neg_mean_absol
However, for someone new to the field it is not easy to decide which one to use. For a simple linear regression I would use r2 (probably because it is the one I am used to from school), but is it the best choice? And for a DecisionTreeRegressor, is r2 also a good metric, or is it better to use another one?
Also, in the case of r^2 = 1 the fit should be good (although Anscombe's quartet shows this alone is not conclusive). What values indicate a good fit for the other metrics?
There's no best scoring function. The one you pick should depend on your problem and what you're trying to measure.
I suggest you take a look at the regression metrics section of the page you linked. You can find descriptions and usage suggestions there, for example (for MSLE):
This metric is best to use when targets having exponential growth, such as population counts, average sales of a commodity over a span of years etc.
So, a good question on this topic would be something like: "I am studying X and am trying to measure Y; which scoring metric should I be using?"
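Whichever metric you settle on, it plugs into scikit-learn's model selection tools in the same way. A minimal sketch, assuming synthetic data and an arbitrary choice of scorers to compare:

    # Minimal sketch: comparing a few regression scorers via cross-validation.
    # The data is synthetic and the choice of metrics is only illustrative.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
        for scoring in ("r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"):
            scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
            print(type(model).__name__, scoring, scores.mean())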
I have a few lists of movement tracking data, which look something like this
I want to create a list of outputs where I mark these large spikes, essentially indicating that there is a movement at that point.
I applied a rolling standard deviation on the data with a window size of two and got this result
Now I can see the spikes which mark the points of interest, but I am not sure how to detect them in code. Is there a statistical tool to measure these spikes that can be used to flag them?
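For reference, the rolling standard deviation step described above can be reproduced with pandas; a small sketch, where the values are made-up stand-ins for one of the tracking lists:

    # Sketch of the rolling standard deviation step described above.
    # "values" is a made-up stand-in for one movement-tracking list.
    import pandas as pd

    values = [0.1, 0.1, 0.2, 0.1, 3.5, 0.2, 0.1, 0.1, 4.2, 0.1]
    rolling_std = pd.Series(values).rolling(window=2).std()
    print(rolling_std)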
There are several approaches that you can use for an anomaly detection task.
The choice depends on your data.
If you want to use a statistical approach, you can use some measures like z-score or IQR.
Here you can find a tutorial for these measures.
Here instead, you can find another tutorial for a statistical approach which uses mean and variance.
Last but not least, I also suggest you check how to use a control chart, because in some cases it's enough.
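As a rough illustration of the z-score idea, here is a sketch; the data and the threshold of 2.5 are arbitrary and would need tuning for a real tracking signal:

    # Rough sketch: flag points whose z-score exceeds a threshold.
    # Both the data and the threshold (2.5) are arbitrary choices.
    import numpy as np

    values = np.array([0.1] * 9 + [8.0] + [0.1] * 9 + [9.0])
    z_scores = (values - values.mean()) / values.std()
    spikes = np.abs(z_scores) > 2.5
    print(np.where(spikes)[0])  # indices flagged as movement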
We have a requirement where we receive different types of documents from clients, such as student admission documents, marksheets, etc. We want to create an algorithm which identifies which type of document it is. For this we chose some specific keywords to identify the document type: admission documents have keywords like fee, admission, etc., and marksheet documents have keywords like marks, grade, etc. So here we can predict the document type by comparing keyword frequencies.
For the above requirement, which algorithm should we implement? I was planning to implement the multinomial Naive Bayes algorithm, but I cannot fit my data into it.
FYI, I am using the Python sklearn module.
Can anyone please tell me which algorithm is suitable for the above requirement? If possible, could you also provide an example with code so that I can easily figure out the solution?
You are looking for a topic modeling solution, and there are plenty of ways to solve the problem. With Python and scikit-learn, I recommend you take a look at this article.
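Since the question also mentions getting stuck on fitting data into multinomial Naive Bayes, here is a minimal, hedged sketch of that route; the documents and labels below are invented placeholders for real labelled examples:

    # Minimal sketch of the multinomial Naive Bayes route from the question.
    # The documents and labels are invented placeholders for real labelled data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = [
        "admission fee payment form",    # hypothetical admission document text
        "fee receipt for admission",
        "marks and grade statement",     # hypothetical marksheet text
        "grade sheet with subject marks",
    ]
    labels = ["admission", "admission", "marksheet", "marksheet"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["please find the admission fee attached"]))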
So I'm kinda new to some concepts; can someone please briefly explain what the difference is between these two pieces of code?
    regressor = LinearRegression()
    regressor.fit(train_X, train_Y)

versus

    LinearRegression().fit(train_X, train_Y)
The main difference between the two is that the first creates a variable called regressor which you can later access. The second doesn't do this.
Otherwise the two are doing exactly the same thing.
The purpose of fitting (training) the regressor is to use it in the future for prediction. In your second example (LinearRegression().fit(train_X, train_Y)) you create an anonymous regressor, train it, and then immediately discard it. You cannot use it any more, as there are no remaining references to it.
In the first example, you first create a regressor and assign it to a variable, then train the regressor that was previously created. You can later use it for prediction or any other purpose.
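To make the difference concrete, here is a small sketch; the arrays are synthetic stand-ins for train_X and train_Y:

    # Sketch: keeping a reference to the fitted model so it can be reused.
    # X and y are synthetic stand-ins for train_X and train_Y.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.0, 4.0, 6.0, 8.0])

    regressor = LinearRegression()     # keep a reference to the model
    regressor.fit(X, y)                # train it
    print(regressor.predict([[5.0]]))  # reuse the same fitted model later

    LinearRegression().fit(X, y)       # fitted, but immediately unreachable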
I recently got interested in soccer statistics. Right now I want to implement the famous Dixon-Coles Model in Python 3.5 (paper-link).
The basic problem is that the model described in the paper yields a likelihood function with numerous parameters, which needs to be maximized.
For example, the likelihood function for one Bundesliga season involves 37 parameters. Of course, in practice I minimize the corresponding negative log-likelihood function. I know that this function is strictly convex, so the optimization should not be too difficult. I also supplied the analytic gradient, but once the number of parameters exceeds ~10, the optimization methods from the SciPy package (scipy.optimize.minimize()) fail.
My question:
Which other optimization techniques are out there and are best suited for optimization problems involving ~40 independent parameters?
Some hints to other methods would be great!
You may want to have a look at convex optimization packages like https://cvxopt.org/ or https://www.cvxpy.org/. They are Python-based, hence easy to use!
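For orientation, a minimal CVXPY sketch with a toy convex objective in 40 variables; the least-squares objective is only a stand-in for the actual Dixon-Coles negative log-likelihood:

    # Toy CVXPY sketch: a convex objective in 40 variables.
    # The least-squares objective is only a stand-in for the real likelihood.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 40))
    b = rng.standard_normal(100)

    x = cp.Variable(40)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)))
    problem.solve()
    print(problem.status, problem.value)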
You can make use of metaheuristic algorithms, which work on both convex and non-convex spaces. Probably the most famous of them is the genetic algorithm. It is also easy to implement, and the concept is straightforward. The beautiful thing about the genetic algorithm is that you can adapt it to solve most optimization problems.
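A related, ready-made option from the same metaheuristic family (not a genetic algorithm as such) is SciPy's differential evolution; here is a sketch on a toy 40-parameter objective, which is only a placeholder for the real log-likelihood:

    # Sketch: SciPy's differential evolution on a toy 40-parameter objective.
    # The quadratic objective is a placeholder for the real negative log-likelihood.
    import numpy as np
    from scipy.optimize import differential_evolution

    def objective(params):
        # Placeholder objective with its minimum at all ones.
        return np.sum((params - 1.0) ** 2)

    bounds = [(-5.0, 5.0)] * 40
    result = differential_evolution(objective, bounds, seed=0, maxiter=200)
    print(result.fun, result.x[:5])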
Is it better to implement my own K-means algorithm in Python, or to use the pre-implemented K-means algorithm from Python libraries such as Scikit-Learn?
Before answering which is better, here is a quick reminder of the algorithm:
"Choose" the number of clusters K
Initiate your first centroids
For each point, find the closest centroid
according to a distance function D
When all points are attributed to a cluster, calculate the barycenter of the cluster which become its new centroid
Repeat step 3. and step 4. until convergence
As stressed previously, the algorithm depends on various parameters:
The number of clusters
Your initial centroid positions
A distance function to calculate distance between any point and centroid
A function to calculate the barycenter of each new cluster
A convergence metric
...
If none of the above is familiar to you and you want to understand the role of each parameter, I would recommend re-implementing it on low-dimensional data sets (see the sketch below). Moreover, the existing Python libraries might not match your specific requirements, even though they provide good tuning possibilities.
If your goal is to use it quickly with a big-picture understanding, you can use an existing implementation; scikit-learn would be a good choice.
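For the re-implementation route, here is a minimal NumPy sketch of steps 1-5 above; the concrete choices (Euclidean distance, mean as barycenter, random initialization, tolerance-based convergence) are just the simplest defaults, not requirements:

    # Minimal NumPy k-means sketch following the steps above.
    # Euclidean distance, mean barycenters, random initialization and a
    # tolerance-based convergence check are just the simplest default choices.
    import numpy as np

    def kmeans(points, k, n_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Step 2: initialize centroids by picking k distinct random points.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 3: assign each point to the closest centroid.
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Step 4: recompute each centroid as the barycenter (mean) of its cluster.
            new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
            # Step 5: stop when the centroids no longer move.
            if np.allclose(new_centroids, centroids, atol=tol):
                break
            centroids = new_centroids
        return labels, centroids

    # Tiny usage example on made-up 2-D data with two obvious clusters.
    data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids)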