Linear and nonlinear regression combined in Python

I am working on a data set with four predictors. There is a good linear relation with one of the predictors, but for the other three I think a polynomial would fit better. Is there any method in Python where I can predict a single variable by combining linear regression on one predictor with polynomial or other non-linear regression on the other three?
Please help.

You can fit one polynomial expression over all the features, which takes care of the linear one as well. The only difference is that the meaningful coefficient for the linear predictor will be the order-1 term.
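A minimal sketch of that idea with scikit-learn; the data, the degree=3 choice, and the column layout are assumptions for illustration, not taken from the question:

```python
# Sketch: one polynomial expansion over all four predictors, fit with
# ordinary least squares on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))  # 4 predictors, one of them "linear"
y = 2.0 * X[:, 0] + X[:, 1] ** 2 - X[:, 2] ** 3 + rng.standard_normal(200)

# The expansion covers the linear predictor too; its signal simply
# lands on the degree-1 term.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.predict(X[:5]))
```

If you want the expansion applied only to the three nonlinear predictors, a sklearn.compose.ColumnTransformer can restrict PolynomialFeatures to those columns and pass the linear one through unchanged.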

You could also try np.polyfit to check for a potential trend in your data.
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
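For a quick single-predictor trend check, something like this (made-up data; note that np.polyfit handles one predictor at a time):

```python
# Sketch: eyeball a possible polynomial trend in one predictor.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - x + rng.standard_normal(50)

coeffs = np.polyfit(x, y, deg=2)  # highest-degree coefficient first
trend = np.polyval(coeffs, x)     # fitted values along the trend
print(coeffs)
```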

Related

How does sklearn deal with singular matrices when performing Linear Regression

I was trying to implement my own linear regression function, using this formula to calculate the regression coefficients: B = (X^T X)^-1 X^T y
However, given a dataset where a string value was one-hot encoded, it can happen that after the split we end up with an all-zero column in our data, which gives us a singular and therefore non-invertible matrix.
One solution would be to use np.linalg.pinv instead of np.linalg.inv; however, that greatly reduces the model's accuracy compared to sklearn's implementation.
How does sklearn deal with this issue?
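A minimal sketch of the setup being described, assuming the formula is the usual normal equation and using a deliberately rank-deficient design matrix:

```python
# Sketch: normal-equation OLS on a rank-deficient X. The all-zero
# column stands in for a one-hot category that never appears in the split.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X = np.column_stack([X, np.zeros(100)])  # makes X^T X exactly singular
y = X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(100)

# np.linalg.inv(X.T @ X) raises LinAlgError here; pinv returns the
# minimum-norm least-squares solution instead.
beta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(beta)
```

As far as I know, sklearn's LinearRegression avoids forming X^T X at all and solves the least-squares problem with an SVD-based routine (scipy.linalg.lstsq), which handles rank deficiency the same way the pseudo-inverse does.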

Sklearn multi-class logistic regression (one-vs-rest): How to match each coefficient with the iterations in ovr?

I have three classes (supermarkets, convenience stores, and grocery stores) and I want to use logistic regression for classification. I understand how the one-vs-rest method works and why I get three coefficient vectors from LogReg.coef_. What confuses me is how to match each coefficient vector with the corresponding iteration of one-vs-rest.
For example, one of the coefficient vectors must correspond to the fit where the estimator treats supermarkets as one class and everything else as the other. How can I identify that one?
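The rows of coef_ follow the sorted order of clf.classes_, so the matching can be read off directly. A small illustration (labels and data are made up):

```python
# Sketch: each row of coef_ is the label-vs-rest fit for the class at
# the same index in clf.classes_.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 4))
y = rng.choice(["convenience", "grocery", "supermarket"], size=150)

clf = LogisticRegression(multi_class="ovr").fit(X, y)
for label, coefs in zip(clf.classes_, clf.coef_):
    print(label, coefs)  # e.g. the "supermarket" row is supermarket-vs-rest
```

In recent scikit-learn versions the multi_class argument is deprecated; wrapping the estimator in OneVsRestClassifier is the explicit alternative.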

Is it possible to do a restricted VAR-X model using python?

I have seen some similar questions but they didn't work for my situation.
Here is the model I am trying to implement.
[image: VAR model]
I suppose I would need to be able to set the coefficient of stockxSign to 0 when calculating Stock, and do the same for CDSxSign when calculating CDS.
Does someone have any idea how I could do this?
It is possible now with a package that I just wrote:
https://github.com/fstroes/ridge-varx
You can fit coefficients for specific lags only by supplying a list of lags to fit coefficient matrices for. Providing lags=[1,12], for instance, would only use the variables at lags 1 and 12.
In addition, you can use ridge regularization if you are not sure which lags should be included. If ridge is not used, the model is fitted with the usual multivariate least squares approach.
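If you would rather stay with plain numpy, zero restrictions can also be imposed by fitting each equation separately with OLS and leaving the restricted regressor out of that equation's design matrix. A rough sketch, with made-up names and a lag-1 structure that only approximates the model in the question:

```python
# Sketch: restricted VAR-X fitted equation by equation. The stock
# equation omits stock*sign and the CDS equation omits cds*sign,
# which fixes those coefficients at zero.
import numpy as np

rng = np.random.default_rng(0)
T = 200
stock = rng.standard_normal(T)
cds = rng.standard_normal(T)
sign = np.sign(rng.standard_normal(T))

const = np.ones(T - 1)
stock_l1, cds_l1 = stock[:-1], cds[:-1]                    # lag-1 terms
stock_sign, cds_sign = (stock * sign)[:-1], (cds * sign)[:-1]

X_stock = np.column_stack([const, stock_l1, cds_l1, cds_sign])
X_cds = np.column_stack([const, stock_l1, cds_l1, stock_sign])

beta_stock, *_ = np.linalg.lstsq(X_stock, stock[1:], rcond=None)
beta_cds, *_ = np.linalg.lstsq(X_cds, cds[1:], rcond=None)
print(beta_stock, beta_cds)
```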

Residual plot like method to check if linear model applicable for Multiple Linear Regression

For a simple regression model, we can use residual plots to check if a linear model is suitable to establish a relationship between our predictor and our response (by checking if the residuals are randomly spread out).
However, is there a similar method to check if linear regression is applicable when we have multiple predictors and one response (i.e. for a multiple linear regression model)?
The same intuition applies when you have multiple predictor variables. (You could search on either "multivariate" or "multiple" regression; people tend not to agree on when to use which term.)
A quick statement of the theory is this: you want to "partial out" the effects of the other predictors on the response in order to see the effect of just the predictor of interest. To do this, first isolate the effect of every predictor aside from the one you've chosen (via regression, naturally) and take the residuals. But since you also need to understand the effect the other predictors have on the one you're interested in, you then regress the variable of interest against all the rest, from which you get a second set of residuals. Plotting these two sets of residuals against each other reveals the possible (non-)linearity of the relationship between your response and your variable of interest.
More concretely, consider a regression equation with two predictors and an intercept:
y = x_0*B_0 + x_1*B_1 + x_2*B_2 + u,  where x_0 = 1 is the intercept term
Say we want the partial residuals for x_1. First, regress y on x_0 and x_2. This gives you a fitted y that does not involve x_1: the portion of y that is predicted by the other regressors. Let's call it y^. Now take the residuals from that regression: y* = y - y^.
Next you need to estimate the effect the other predictors have on the one of interest, so regress x_1 on x_0 and x_2. The fitted values are the amount of x_1 that is predicted by the other variables; call that x_1^. As with the response, take the residuals: x_1* = x_1 - x_1^.
Now, just plot y* against x_1* to see the relationship.
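A minimal sketch of that manual procedure using statsmodels, on synthetic data with two predictors as in the equation above:

```python
# Sketch: added-variable (partial regression) plot for x_1.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.standard_normal(n)

others = sm.add_constant(x2)  # x_0 (the constant) and x_2

y_star = sm.OLS(y, others).fit().resid    # y* = y - y^
x1_star = sm.OLS(x1, others).fit().resid  # x_1* = x_1 - x_1^

# A clear curve here would suggest a nonlinear relationship with x_1.
plt.scatter(x1_star, y_star)
plt.xlabel("x_1 residuals")
plt.ylabel("y residuals")
plt.show()
```

If I recall the name correctly, statsmodels also automates exactly this as sm.graphics.plot_partregress.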
This page might be a good reference for you.
In Python, the statsmodels package has a plot_ccpr function that will plot partial residuals along with a fitted line. A full description is here.

What is the difference between a linear regression classifier and linear regression to extract a confidence interval?

I am a beginner with machine learning. I want to use time series linear regression to extract a confidence interval for my dataset. I don't need to use linear regression as a classifier. Firstly, what is the difference between the two cases? Secondly, in Python, is there a different way to implement each of them?
The main difference is that a classifier computes a probability for a label, while a regression computes a quantitative output. For instance, if you want to estimate the price of a flat from some criteria you use a regression; if you want to assign a label (luxurious, modest, ...) to the same flat from the same criteria you use a classifier.
That said, using a regression to compute a threshold that separates observed labels is a common technique too. That is the case for the linear SVM, which computes a boundary between labels, called the decision boundary. Be warned that the main drawback of a linear model is exactly that it is linear: the boundary will necessarily be a straight line separating the labels. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception, because it actually computes a probability; its name is misleading.
For regression, where you want a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even with a linear SVM it would be nonsensical. You can use the decision function, but it is difficult to interpret in practice, or you can use the predicted probabilities and check how often the label is wrong to compute an error rate. There are plenty of such metrics depending on your problem; it is frankly the subject of a whole book.
Anyway, if you are working with a time series then, as far as I know, your goal is a quantitative output, so you do not need a classifier, as you said. As for extracting the confidence interval, it depends entirely on the object you used to compute the fit in Python, i.e. on the attributes that object exposes, and therefore on the library. So it would be much easier to answer if you indicated which libraries and objects you are using.
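For instance, with statsmodels (purely illustrative data, since the actual series is not shown):

```python
# Sketch: confidence intervals from a statsmodels OLS fit on a
# simple time trend.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
t = np.arange(100.0)
y = 0.3 * t + rng.standard_normal(100)

X = sm.add_constant(t)
res = sm.OLS(y, X).fit()

print(res.conf_int(alpha=0.05))   # 95% intervals for the coefficients
pred = res.get_prediction(X)
print(pred.summary_frame(alpha=0.05).head())  # mean_ci_* and obs_ci_* columns
```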
