I have a regression model where my target variable (days) is quantitative and ranges between 2 and 30. My RMSE is 2.5. All of my X variables are nominal (categorical), so I have dummy encoded them.
I want to know what would be a good value of RMSE. I would like to get it within 1-1.5 or even lower, but I am not sure what I should do to achieve that.
Note: I have already tried feature selection and removing features with low importance.
Any ideas would be appreciated.
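For reference, the workflow described above looks roughly like this; the toy data, the column names and the choice of LinearRegression are assumptions for illustration only, not the asker's actual code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# toy stand-in: one nominal feature and a target between 2 and 30 days
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": rng.choice(["A", "B", "C"], size=200),
    "days": rng.integers(2, 31, size=200),
})

X = pd.get_dummies(df[["category"]])   # dummy encoding of the nominal features
y = df["days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print("RMSE:", rmse)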
If your x values are categorical then it does not necessarily make much sense to bind them to a uniform grid. Who's to say categories A and B should be spaced the same distance apart as B and C? Assuming that they are will only misrepresent your results.
As the scale of the categories is the unknown, you would be better off, in terms of visualisation, setting the uniform x grid to be the day number and then seeing where the categories would fall on the y scale if given a linear relationship.
RMS error doesn't come into it at all if you don't have quantitative data for both x and y.
So I had to create a linear regression in Python, but this dataset has over 800 columns. Is there any way to see which columns are contributing most to the linear regression model? Thank you.
Look at the coefficients for each of the features (see the sketch after this list). Ignore the sign of the coefficient:
A large absolute value means the feature is heavily contributing.
A value close to zero means the feature is not contributing much.
A value of zero means the feature is not contributing at all.
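As a minimal sketch (the toy data and the use of scikit-learn's LinearRegression are assumptions for illustration), ranking the features by absolute coefficient could look like this; keep in mind that raw coefficients are only directly comparable when the features are on similar scales:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy stand-in for the real 800-column dataset
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] - 0.5 * X["f2"] + rng.normal(size=100)

model = LinearRegression().fit(X, y)

# rank features by the absolute value of their coefficient, largest first
coefs = pd.Series(model.coef_, index=X.columns).abs().sort_values(ascending=False)
print(coefs)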
You can measure the correlation between each independent variable and the dependent variable, for example:
corr(X1, Y)
corr(X2, Y)
...
corr(Xn, Y)
and then you can test the model using the N most correlated variables.
There are more sophisticated methods to perform dimensionality reduction:
PCA (Principal Component Analysis)
(https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
Forward Feature Construction
Use XGBoost to measure the feature importance of each variable and then select the N most important variables (see the sketch below)
(How to get feature importance in xgboost?)
There are many ways to perform this action and each one has pros and cons.
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
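For the XGBoost route, a rough sketch (the feature importance API is standard xgboost, but the toy data and column names are made up for illustration):

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# toy stand-in for the real data
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 2 * X["x0"] - X["x3"] + rng.normal(size=200)

model = XGBRegressor(n_estimators=100).fit(X, y)

# rank the variables by importance and keep the N most important ones
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))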
If you are just looking for variables with a high correlation with Y, I would just do something like this:

import pandas as pd

cols = df.columns
for c in cols:
    # Set the threshold to whatever you would like
    if df['Y'].corr(df[c]) > .7:
        print(c, df['Y'].corr(df[c]))

After you have decided on a threshold and which columns you want, you can append c to a list instead of printing it.
So I'm having a hard time conceptualizing how to write a mathematical representation of my solution for a simple logistic regression problem. I understand what is happening conceptually and have implemented it, but I am answering a question which asks for a final solution.
Say I have a simple two-column dataset denoting something like the likelihood of getting a promotion per year worked, so the likelihood would increase as the person accumulates experience. X denotes the year and Y is a binary indicator for receiving a promotion:
X | Y
1 | 0
2 | 1
3 | 0
4 | 1
5 | 1
6 | 1
I implement logistic regression to find the probability per year worked of receiving a promotion, and get an output set of probabilities that seem correct.
I get an output weight vector that has two items, which makes sense as there are only two inputs: the number of years X and, since I fix the intercept to handle bias, a column of 1s. So one weight for years, one for bias.
So I have two questions about this.
Since it is easy to get an equation of the form y = mx + b as a decision boundary for something like linear regression or a PLA, how can I similarly denote a mathematical solution with the weights of the logistic regression model? Say I have a weight vector [0.9, -0.34]; how can I convert this into an equation?
Secondly, I am performing gradient descent, which returns a gradient, and I multiply that by my learning rate. Am I supposed to update the weights at every epoch? My gradient never returns zeros in this case, so I am always updating.
Thank you for your time.
The logistic regression is trying to map the input value (x = years) to the output value (y = likelihood) through this relationship:

y = 1 / (1 + exp(-(theta * x + b)))

where theta and b are the weights you are trying to find.
The decision boundary will then be defined as L(x) > p or L(x) < p, where L(x) is the right-hand side of the equation above. That is the relationship you want.
You can eventually transform it into a more linear form, like that of linear regression, by moving the exponential to the numerator and taking the log of both sides, which gives log(y / (1 - y)) = theta * x + b.
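For concreteness, here is a minimal sketch of turning the weight vector into numbers, assuming [0.9, -0.34] is ordered as [theta, b] (slope for years, then bias); that ordering is an assumption for illustration:

import numpy as np

theta, b = 0.9, -0.34   # assumed order: weight for years, then bias

def promotion_probability(x):
    # sigmoid of the linear combination theta*x + b
    return 1.0 / (1.0 + np.exp(-(theta * x + b)))

years = np.arange(1, 7)
print(promotion_probability(years))   # probability of promotion per year worked

# the p = 0.5 decision boundary is where theta*x + b = 0, i.e. x = -b / theta
print(-b / theta)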
Please forgive this question if it sounds too trivial, but I want to be sure I'm on the right track.
I have a data frame similar to the following, and I'm interested in understanding whether the two variables A and B vary together or otherwise.
A B
0 34.4534 35.444248
1 34.8915 24.693800
2 0.0000 21.586316
3 34.7767 23.783602
I am asked to plot the covariance between the two. However, from my research, it seems covariance is a single calculated value, just like the mean and standard deviation, not a distribution like a pdf/cdf that one can plot.
Is my perception about covariance right? What advice could you give me for some other way to understand the variability between these variables?
Is your perception right? - Yes
Covariance is a measure of the joint variability of two random variables and is represented by one number. This number is
positive if they "behave similarly" (which means roughly that positive peaks in variable 1 coincide with positive peaks in variable 2)
zero if they do not covary
negative if they "behave similarly" but with an inverse relationship (that is, negative peaks in one align with positive peaks in the other and vice versa)
import numpy as np
import pandas as pd

# create 3 random variables; var3 is based on var1, so they should covary
data = np.random.randint(-9, 9, size=(20, 3))
data[:, 2] = data[:, 0] + data[:, 2] * 0.5
df = pd.DataFrame(data, columns=['var1', 'var2', 'var3'])
df.plot(marker='.')
We see that var1 and var3 seem to covary; so in order to compute the covariance between all variables, pandas comes in handy:
>>> df.cov()
var1 var2 var3
var1 31.326316 -5.389474 30.684211
var2 -5.389474 21.502632 -10.907895
var3 30.684211 -10.907895 37.776316
Since the actual values of covariance depend on the scale of your input variables, you typically normalize the covariance by the respective standard deviations which gives you the correlation as a measure of covariance, ranging from -1 (anticorrelated) to 1 (correlated). With pandas, this reads
>>> df.corr()
var1 var2 var3
var1 1.000000 -0.207657 0.891971
var2 -0.207657 1.000000 -0.382724
var3 0.891971 -0.382724 1.000000
from which it becomes clear that var1 and var3 exhibit a strong correlation, exactly as we expected.
What advice could you give me for some other way to understand the variability between these variables? - Depends on the data
Since we don't know anything about the nature of your data, this is hard to say. Perhaps just as a starter (without intending to be exhaustive), some hints at what you could look at:
Spearman's rank correlation: more robust than the Pearson correlation coefficient, which is what we used above; Pearson essentially only captures linear correlation and gives misleading results if your data exhibits some sort of non-linearity, so in the case of possible non-linear relationships in your data you should go for Spearman (see the short sketch after this list)
Autocorrelation: think about a sinusoidal signal which triggers another signal but with a time lag of 90° (which represents a cosine). In that case, the usual covariance/correlation will tell you that the relationship is weak and may (falsely) lead you to the conclusion that there is no causal effect between the two signals. Autocorrelation is basically the correlation between shifted versions of your time series, thus allowing you to detect lagged correlation.
probably much more, but perhaps that's good for a start
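As a quick illustration (continuing with the df created in the snippet above), pandas makes both of these checks easy:

# Spearman rank correlation: only assumes a monotonic relationship
print(df.corr(method='spearman'))

# crude check for lagged correlation: correlate var1 with shifted copies of var3
for lag in range(4):
    print(lag, df['var1'].corr(df['var3'].shift(lag)))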
I have a dataset of peak load for a year. It's a simple two-column dataset with the date and the load (kWh).
I want to train on the first 9 months and then predict the next three months. I can't get my head around how to implement SVR. I understand my 'y' would be the predicted value in kWh, but what about my X values?
Can anyone help?
Given multi-variable regression, y = f(x1, x2, ..., xn), regression is a multi-dimensional separation which can be hard to visualize in one's head since it is not 3D.
The better question might be: which inputs are consequential to the output value y?
Since you have the code for loadavg in the kernel source, you can use the input parameters.
For Python (I suppose it will be the same for R):
Collect the data in this way:
[x_i-9, x_i-8, ..., x_i] vs [x_i+1, x_i+2, x_i+3]
The first vector is your input vector; the second is your output vector (or value, if you like). Use the fit method from here, for example: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR.fit
You can try scaling, removing outliers, applying weights and so on. Play :)
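A minimal sketch of that windowing idea with scikit-learn's SVR; the load series here is made-up toy data, and only one step ahead is predicted per window, since SVR handles a single output (for a three-month output vector you could wrap it in sklearn.multioutput.MultiOutputRegressor or predict recursively):

import numpy as np
from sklearn.svm import SVR

# toy daily peak-load series (replace with your real kWh values)
rng = np.random.default_rng(0)
load = 50 + 10 * np.sin(np.arange(365) / 365 * 2 * np.pi) + rng.normal(0, 2, 365)

window = 9
X = np.array([load[i - window:i] for i in range(window, len(load))])   # past 9 days
y = load[window:]                                                      # the day to predict

split = 270 - window    # roughly the first 9 months for training
model = SVR(kernel='rbf', C=10.0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])   # predictions for the remaining ~3 months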
I've been fiddling around with the scipy.optimize.curve_fit() function today and I can get some pretty good results, but I'm not too sure how to make some data points weigh more than others.
Let me briefly summarize the situation:
We need to fit a decay curve to some data gathered from our experiment. Some data points occur more often than others, and this is recorded as a weight. That means that fitting one data point A with weight x and one data point B with weight 2x should be equivalent to fitting one data point A with weight x and two copies of data point B, each with weight x.
The problem is that the curve_fit function can only be weighted using uncertainties, i.e. sigmas. I thought I was smart to translate each weight into its proportion of the sum of all the weights and then translate that proportion into a Z-score (I thought that would be equivalent in terms of uncertainty), and while this gave a MUCH better fit than not weighting anything at all, I still found through some unit testing that it wasn't equivalent when comparing a weight of 0.5 to having two actual data points.
How can I use curve_fit with linear weights?
PS: Through unit testing I've found that fitting data points:
(0,0) with weight 1
(1,0) with weight 1
(1,1) with weight 1
(1,1) with weight 1
yields the same result as fitting:
(0,0) with weight 1
(1,0) with weight 1
(1,1) with weight 0.70710678118
And peculiarly, sin(0.5*pi) = 1 and sin(0.25*pi) = 0.70710678118!! So there seems to be a sine relation here? Unfortunately my math skills are limiting me in understanding the exact relation.
Also, sin(0.125*pi) unfortunately doesn't equal a weight of 3 or 4...
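For what it's worth, the numbers in the PS are consistent with how curve_fit uses sigma: with absolute_sigma=False, each residual is divided by its sigma, so a point's effective weight is 1/sigma**2. A point with sigma = 1/sqrt(2) ≈ 0.70710678118 therefore counts twice, which matches the duplicated (1,1) case; the agreement with sin(0.25*pi) is a coincidence, since sin(pi/4) = 1/sqrt(2). A minimal sketch of converting linear weights to sigmas, with an assumed exponential decay model and made-up data:

import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, k):
    # assumed toy model: simple exponential decay
    return a * np.exp(-k * t)

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 6.1, 3.6, 2.3, 1.4])
w = np.array([1.0, 2.0, 1.0, 4.0, 1.0])   # linear weights, as if points were repeated

# sigma = 1/sqrt(weight): curve_fit minimises sum(((y - f(t)) / sigma)**2),
# so each point counts proportionally to its weight
popt, pcov = curve_fit(decay, t, y, sigma=1.0 / np.sqrt(w), absolute_sigma=False)
print(popt)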