I am converting code from Stata to Python and trying to figure out how to run a regression and compute the R-squared based on Stata's aweight (analytic weights) option.
I have my X variable, my Y variable, and want to weight the regression based on a separate Z variable (population). How do you do this in Python?
sklearn, a.k.a. scikit-learn, is my go-to library for modelling.
This tutorial might be your answer:
SGD: Weighted samples
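For an ordinary linear fit, here is a minimal sketch with made-up data, passing the population column as `sample_weight`. Since weighted least-squares point estimates are invariant to rescaling the weights, this should match Stata's aweight coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: X (predictor), y (response), z (population used as weights)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(size=100)
z = rng.integers(1, 1000, size=100).astype(float)

model = LinearRegression()
model.fit(X, y, sample_weight=z)         # weight each observation by population
r2 = model.score(X, y, sample_weight=z)  # weighted R-squared
print(model.intercept_, model.coef_, r2)
```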
I am working on a machine learning project using sklearn's GridSearchCV. My goal is, from the output of GridSearchCV, to find all the parameter settings that give a positive predictive value (PPV) greater than 0.95 and compute the confusion matrix for each of them. I have implemented a customized score function for the grid search to compute the PPV, and inside the score function I compute the confusion matrix and write it to a text file. However, due to the parallel execution, I can't keep track of them.
Is there a way that I can accomplish this?
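One way to sidestep the parallelism issue (a sketch only, with a made-up dataset and a made-up SVC grid): keep the scorer free of side effects, and after the parallel search finishes, filter `cv_results_` serially and recompute the confusion matrices:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer, precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# PPV is the precision of the positive class
ppv_scorer = make_scorer(precision_score)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring=ppv_scorer,
                    cv=5, n_jobs=-1)
grid.fit(X, y)

# Serial post-processing: no race conditions when recording results
for params, ppv in zip(grid.cv_results_["params"],
                       grid.cv_results_["mean_test_score"]):
    if ppv > 0.95:
        model = SVC(**params).fit(X, y)  # refit to get predictions
        print(params, confusion_matrix(y, model.predict(X)), sep="\n")
```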
What formula does this function use after computing a simple linear regression (OLS) on data? There are many different prediction-interval formulas, some using the RMSE (root mean square error), some using the standard deviation, etc.
http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.get_prediction.html#statsmodels.regression.linear_model.OLSResults.get_prediction
In particular, I want to know if it's using this formula or something else:

$$\hat{y}_h \pm t_{(1-\alpha/2,\,n-2)} \cdot s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{(n-1)\,s_x^2}}$$

Note the standard deviation of x ($s_x$) parameter.
It does use the formula shown above. The standard error of the prediction is calculated as sqrt(variance of the predicted mean + variance of the residuals), which can be simplified as shown in this link.
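A minimal sketch (made-up data) of how to get the interval out of statsmodels; `summary_frame` exposes both the confidence interval of the mean and the wider prediction (observation) interval:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data for a simple OLS fit
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(size=50)

res = sm.OLS(y, X).fit()
pred = res.get_prediction(X)

# obs_ci_* is the prediction interval; mean_ci_* is the narrower CI of the mean
print(pred.summary_frame(alpha=0.05)
          [["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]].head())
```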
Based on the logistic regression function:

$$\pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

I'm trying to extract the following values from my model in scikit-learn:

$\beta_0$ and $\beta_1$

where $\beta_0$ is the intercept and $\beta_1$ is the regression coefficient (as per Wikipedia).

Now, I think I can get $\beta_0$ by doing model.intercept_ but I've been struggling to get $\beta_1$. Any ideas?
You can access the coefficient of the features using model.coef_.
It gives a list of values that correspond to $\beta_1$, $\beta_2$, and so on. The length of the list equals the number of explanatory variables your logistic regression uses.
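A minimal sketch with made-up single-feature data; for a binary problem, coef_ has shape (1, n_features) and intercept_ has shape (1,):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data with one explanatory variable
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
beta0 = model.intercept_[0]  # the intercept, beta_0
beta1 = model.coef_[0, 0]    # coefficient of the single feature, beta_1
print(beta0, beta1)
```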
I'm attempting to translate R code into Python and running into trouble trying to replicate the R lm {stats} function, which has a 'weights' argument that allows weights to be used in the fitting process.
My ultimate goal is to simply run a weighted linear regression in Python using the statsmodels library.
Searching through the Statsmodels issues, I've located "caseweights in linear models" (#743) and "SUMM/ENH rare events, unbalanced sample, matching, weights" (#2701), which make me think this may not be possible with Statsmodels.
Is it possible to add weights to GLM models in Statsmodels or, alternatively, is there a better way to run a weighted linear regression in Python?
WLS has weights for the linear model, where the weights are interpreted as inverse variances for the results statistics.
http://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.WLS.html
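A minimal sketch with made-up data, mirroring R's lm(y ~ x, weights = w):

```python
import numpy as np
import statsmodels.api as sm

# Made-up data and per-observation weights, as in R's lm(..., weights = w)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(size=100)
w = rng.uniform(0.5, 2.0, size=100)

res = sm.WLS(y, X, weights=w).fit()
print(res.params)  # weighted coefficient estimates
```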
The unreleased version of statsmodels has frequency weights for GLM, but no variance weights.
see freq_weights in http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html
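A minimal sketch (made-up data), assuming a statsmodels version that already ships freq_weights; each weight is interpreted as how many times that row occurs:

```python
import numpy as np
import statsmodels.api as sm

# Made-up grouped data: freq says how often each row occurs
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(size=50)
freq = rng.integers(1, 10, size=50)

res = sm.GLM(y, X, family=sm.families.Gaussian(), freq_weights=freq).fit()
print(res.params)
```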
(There are many open issues to expand the types of weights and adding weights to other models, but those are not available yet.)
I am running a logistic regression using statsmodels and am trying to find the score of my regression. The documentation doesn't really provide much information about the score method, unlike sklearn, which allows the user to pass a test dataset with the y values, i.e. lr.score(test_data, target). What parameters should I pass to statsmodels's score function, and how? Documentation: http://statsmodels.sourceforge.net/stable/generated/statsmodels.discrete.discrete_model.Logit.score.html#statsmodels.discrete.discrete_model.Logit.score
In statistics and econometrics, score usually refers to the derivative of the log-likelihood function. That's the definition used in statsmodels.
Prediction performance measures for classification or regression with binary dependent variables have largely been neglected in statsmodels.
An open pull request is here: https://github.com/statsmodels/statsmodels/issues/1577
statsmodels does have performance measures for continuous dependent variables.
You pass it the model parameters, i.e. the coefficients for the predictors. However, that method doesn't do what you think it does: it returns the score vector of the model, not the prediction accuracy that scikit-learn's score method reports.
But you can always check the pseudo R-squared on the fitted results (res.prsquared for a Logit fit).
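A minimal sketch with made-up data contrasting the two meanings of "score": Logit.score returns the gradient of the log-likelihood (roughly zero at the MLE), while the fit quality lives on the results object:

```python
import numpy as np
import statsmodels.api as sm

# Made-up binary-outcome data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = sm.add_constant(x)
y = (x + rng.normal(size=200) > 0).astype(int)

model = sm.Logit(y, X)
res = model.fit()

print(model.score(res.params))  # score vector: gradient of the log-likelihood
print(res.prsquared)            # McFadden's pseudo R-squared
```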