Good day! I'd appreciate it if someone could guide me through the doubts below.
I'm working on a predictive modelling problem where I have two independent variables/predictors and one dependent variable.
Most resources on multiple regression only cover linear regression, yet the relationship between my predictors and the response is non-linear.
Is it possible to have:
y = z + a*x1 + b*x1^2 + c*x2 + d*x2^2
?
Let's say
X = [[2.64 0.96]
[3.75 0.88]
[3.74 0.75]
[6.51 1.27]]
Y = [[0.77]
[1.12]
[1.12]
[1.23]]
I know that for multiple linear regression the prediction is regr.predict([[new_x1, new_x2]]). What about multiple polynomial regression?
You can use PolynomialFeatures from sklearn.preprocessing in order to generate the higher order terms. Then you can fit your model on the transformed data.
X = PolynomialFeatures(degree=2).fit_transform(X)
... # use the new X to fit the model
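A minimal end-to-end sketch using the toy X and Y above (LinearRegression is an assumption for the underlying model; the key point is that any new point must be transformed the same way before calling predict):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.64, 0.96], [3.75, 0.88], [3.74, 0.75], [6.51, 1.27]])
Y = np.array([0.77, 1.12, 1.12, 1.23])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)              # columns: 1, x1, x2, x1^2, x1*x2, x2^2

regr = LinearRegression().fit(X_poly, Y)

new_point = [[3.0, 0.9]]                    # hypothetical new (x1, x2)
print(regr.predict(poly.transform(new_point)))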
I'm working in Python (a Jupyter notebook, to be exact), using numpy and sklearn only.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(16)
x = np.arange(100)
yp = 3*x + 3 + 2*(np.random.poisson(3*x+3,100)-(3*x+3))
np.random.seed(12)
# Choose how many outliers
out = np.random.choice(100,15)
yp_wo = np.copy(yp)
np.random.seed(12) #set again
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out]
# With outliers
plt.scatter(x,yp_wo)
# Without outliers
plt.scatter(x,yp)
For the data above (wo means "with outliers"), I need to find:
The best coefficients for two more losses: the MAE and the MAPE (Median Absolute Percentage Error)
Plot the best fit line for the MSE loss, the MAE loss, and the MAPE loss.
Apply Ridge Regression to the same data, and use cross validation to choose the optimal parameter alpha (you can use values of alpha = 10^-5, 10^-4, 10^-3, ... 10^3). Which value gives you the lowest MSE?
What confuses me is having to plot the best-fit line for two or more losses.
I can follow the code from class and try to get the values, but I don't know what is meant by "coefficients".
Any help / guidance?
This is for a homework problem I am trying to figure out (no, I am not asking for the solutions).
Please excuse any formatting errors, I am very new to Stack Overflow.
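Not a full solution, but for orientation: the "coefficients" are presumably just the slope and intercept of the fitted line y = w1*x + w0, and the Ridge-plus-cross-validation part is typically wired up with sklearn's RidgeCV. A hedged sketch under those assumptions (the alpha grid and cv=5 are my guesses, not part of the assignment):
import numpy as np
from sklearn.linear_model import RidgeCV

X_feat = x.reshape(-1, 1)                # single feature: the x values from above
alphas = 10.0 ** np.arange(-5, 4)        # 1e-5, 1e-4, ..., 1e3

ridge = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=5)
ridge.fit(X_feat, yp_wo)

print(ridge.alpha_)                      # alpha with the lowest cross-validated MSE
print(ridge.coef_, ridge.intercept_)     # the "coefficients": slope and intercept of the line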
I'm trying to use an MLPRegressor from scikit-learn to do a non-linear regression on a set of 260 examples (X, Y). Each example consists of 200 features for X and 1 feature for Y.
File containing X
File containing Y
The link between X and Y is not obvious when plotted directly, but if we plot x = log10(sum(X)) and y = log10(Y), the relationship is almost linear.
As a first approach, I tried to apply my neural network directly to X and Y, without success.
I have read that scaling would improve regression. In my case, Y contains data over a very wide range of values (from 10e-12 to 10e-5). When computing the error, 10e-5 of course has much more weight than 10e-12, but I would like my neural network to approximate both correctly. With a linear scaling, say preprocessing.MinMaxScaler from scikit-learn, 10e-8 maps to about -0.99 and 10e-12 to about -1, so I'm losing all the information about my target.
My question here is: what kind of scaling could I use to get consistent results?
The only solution I have found is to apply log10(Y), but of course the error is then increased exponentially.
The best I could get is with the code below:
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=(20,10)
freqIter = []
for i in np.arange(0, 0.2, 0.001):
    freqIter.append([i, i + 0.001])
#############################################################################
X = np.zeros((len(learningFiles),len(freqIter)))
Y = np.zeros(len(learningFiles))
# Import X: loadtxt()
# Import Y: loadtxt
maxy = np.amax(Y)
Y *= 1/maxy
Y = Y.reshape(-1, 1)
maxx = np.amax(X)
X *= 1/maxx
#############################################################################
reg = MLPRegressor(hidden_layer_sizes=(8,2), activation='tanh', solver='adam', alpha=0.0001, learning_rate='adaptive', max_iter=10000, verbose=False, tol = 1e-7)
reg.fit(X, Y)
#############################################################################
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],Y*maxy,label = 'INPUTS',color='blue')
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],reg.predict(X)*maxy,label='Predicted',color='red')
plt.grid()
plt.legend()
plt.show()
Result: (plot comparing the INPUTS and Predicted scatters from the code above, not shown here)
Thanks for your help.
You may want to look at a FunctionTransformer. The example given applies a logarithmic transformation as part of pre-processing, but you can also use it for an arbitrary mathematical function.
I would also suggest trying a ReLU activation function if you scale logarithmically. After the transformation your data looks fairly linear, so it may converge a little faster -- but that's just a hunch.
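A minimal sketch of what that could look like for the target scaling described in the question (the log10 transform, its inverse, and the toy target are my assumptions; FunctionTransformer itself lives in sklearn.preprocessing):
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# hypothetical target spanning many orders of magnitude, as in the question
y = np.logspace(-12, -5, 260).reshape(-1, 1)

log_tf = FunctionTransformer(func=np.log10, inverse_func=lambda z: 10.0 ** z)
y_log = log_tf.transform(y)                 # values between -12 and -5
y_back = log_tf.inverse_transform(y_log)    # back on the original scale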
I've finally found something that works well in my case.
First, I used a log scaling for Y. I think it is the most suitable scaling when the range of values is very wide, such as mine (from 10e-12 to 10e-5); the target then lies between -5 and -12.
Secondly, my mistake with scaling X was to apply the same scaling to all features. My X contains 200 features, and I was dividing by the max over all features of all examples. My solution is to scale feature1 by the max of feature1 across all examples, and then to repeat this for every feature. This gives me feature1 between 0 and 1 for all examples, instead of a much narrower range (feature1 could be between 0 and 0.0001 with my previous scaling).
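In code, the difference between the two scalings might look like this (the random matrix is just a hypothetical stand-in for my 260 x 200 feature matrix):
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((260, 200)) * rng.random(200)   # columns with very different magnitudes

# what I did before: one global maximum squashes most features
X_global = X / X.max()

# what works better: scale each feature by its own maximum over all examples
X_per_feature = X / X.max(axis=0)              # every column now spans (0, 1]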
I get better results; my main issue now is selecting the right parameters (number of layers, tolerance, ...), but that is another problem.
Well, community:
I recently asked how to do exponential regression (Exponential regression function Python), thinking that for that data set the optimal regression would be hyperbolic.
x_data = np.arange(0, 51)
y_data = np.array([0.001, 0.199, 0.394, 0.556, 0.797, 0.891, 1.171, 1.128, 1.437,
1.525, 1.720, 1.703, 1.895, 2.003, 2.108, 2.408, 2.424,2.537,
2.647, 2.740, 2.957, 2.58, 3.156, 3.051, 3.043, 3.353, 3.400,
3.606, 3.659, 3.671, 3.750, 3.827, 3.902, 3.976, 4.048, 4.018,
4.286, 4.353, 4.418, 4.382, 4.444, 4.485, 4.465, 4.600, 4.681,
4.737, 4.792, 4.845, 4.909, 4.919, 5.100])
Now I'm in doubt:
The first is an exponential fit; the second is hyperbolic. I don't know which is better... How do I determine that? Which criteria should I follow? Is there a Python function for this?
Thanks in advance!
One common fit statistic is R-squared (R2), which can be calculated as "R2 = 1.0 - (absolute_error_variance / dependent_data_variance)"; it tells you what fraction of the dependent data's variance is explained by your model. For example, an R-squared value of 0.95 means your model explains 95% of the dependent data variance. Since you are using numpy, the R-squared value is trivially calculated as "R2 = 1.0 - (abs_err.var() / dep_data.var())", since numpy arrays have a var() method to calculate variance.
When fitting your data to the Michaelis-Menten equation "y = ax / (b + x)" with parameter values a = 1.0232217656373191E+01 and b = 5.2016057362771100E+01, I calculate an R-squared value of 0.9967, which means that 99.67 percent of the variance in the "y" data is explained by this model. However, there is no silver bullet: it is always good to check other fit statistics and to visually inspect the model. Here is my plot for the example I used: (plot not shown here)
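A small sketch of that calculation for the x_data/y_data above, using the Michaelis-Menten parameters quoted in this answer (the variable names are mine):
import numpy as np

a, b = 1.0232217656373191e+01, 5.2016057362771100e+01
y_pred = a * x_data / (b + x_data)          # Michaelis-Menten: y = a*x / (b + x)

abs_err = y_data - y_pred
r_squared = 1.0 - abs_err.var() / y_data.var()
print(r_squared)                            # approximately 0.9967 per this answer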
You can take the 2-norm of the difference between the data and the fitted function; Python has np.linalg.norm for this. The R-squared value is really meant for linear regression.
Well, you should calculate an error function that measures how good your fit actually is. There are many different error functions you could use, but to start with, the mean squared error should work (if you're interested in further metrics, have a look at http://scikit-learn.org/stable/modules/model_evaluation.html).
You can compute the mean squared error once you have determined the coefficients for your regression problem:
from sklearn.metrics import mean_squared_error

# a, b, c are the coefficients obtained from your exponential fit
f = lambda x: a * np.exp(b * x) + c
mse = mean_squared_error(y_data, f(x_data))
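To make that concrete for the two candidate models, a hedged sketch (scipy.optimize.curve_fit and the starting guesses are my additions; the hyperbolic form is the Michaelis-Menten one quoted in the other answer):
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import mean_squared_error

def exponential(x, a, b, c):
    return a * np.exp(b * x) + c

def hyperbolic(x, a, b):
    return a * x / (b + x)

p_exp, _ = curve_fit(exponential, x_data, y_data, p0=(-5.0, -0.05, 5.0), maxfev=10000)
p_hyp, _ = curve_fit(hyperbolic, x_data, y_data, p0=(10.0, 50.0))

mse_exp = mean_squared_error(y_data, exponential(x_data, *p_exp))
mse_hyp = mean_squared_error(y_data, hyperbolic(x_data, *p_hyp))
print(mse_exp, mse_hyp)   # the model with the smaller MSE fits these points better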
I have a dataset with about 100+ features. I also have a small set of covariates.
For each feature x, I build an OLS linear model using statsmodels of the form y = x + C1 + C2 + C3 + C4 + ... + Cn, where the Ci are the covariates and y is the dependent variable.
I'm trying to perform hypothesis testing on the regression coefficients to test if the coefficients are equal to 0. I figured a t-test would be the appropriate approach to this, but I'm not quite sure how to go about implementing this in Python, using statsmodels.
I know, particularly, that I'd want to use http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.t_test.html#statsmodels.regression.linear_model.RegressionResults.t_test
But I am not certain I understand the r_matrix parameter. What should I provide for it? I did look at the examples, but it is still unclear to me.
Furthermore, I am not interested in doing the t-tests on the covariates themselves, but only on the regression coefficient of x.
Any help appreciated!
Are you sure you don't want statsmodels.regression.linear_model.OLS? This will perform an OLS regression, making available the parameter estimates and the corresponding p-values (and many other things).
from statsmodels.regression import linear_model
from statsmodels.api import add_constant
Y = [1,2,3,5,6,7,9]
X = add_constant(range(len(Y)))
model = linear_model.OLS(Y, X)
results = model.fit()
print(results.params) # [ 0.75 1.32142857]
print(results.pvalues) # [ 2.00489220e-02 4.16826428e-06]
These p-values are from the t-tests of each fit parameter being equal to 0.
It seems like RegressionResults.t_test would be useful for less conventional hypotheses.
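For the original question about r_matrix: it selects a linear combination of the fitted parameters to test against zero. Continuing the example above (the column name 'x1' is statsmodels' default when plain arrays are passed, so treat that as an assumption):
# test only the coefficient on x, not the constant: 0*const + 1*x == 0
print(results.t_test([0, 1]))

# equivalent, using the default exogenous name assigned by statsmodels
print(results.t_test('x1 = 0'))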
I have one regression function, g1(x) = 5x - 1 for one data point.
I have another regression function, g2(x) = 3x + 4.
I want to add these two models to create a final regression model, G(x).
That means:
G(x) = g1(x) + g2(x)
=> 5x - 1 + 3x + 4
=> 8x + 3
My question is, how can this be done in python? If my dataset is X, I'm using statsmodels like this:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import numpy as np
mod_wls = sm.WLS(y, X)
res_wls = mod_wls.fit()
print(res_wls.params)
And that gives me the coefficients for the regression function that fits the data X.
To add the functions, I can easily grab the coefficients of each and sum them to get the coefficients of a new regression function such as G(x). But now that I've got my own coefficients, how can I turn them into a regression function and use them to predict new data? As far as I know, models have to be "fitted" to data before they can be used for prediction.
Or is there any way to directly add regression functions? I'm going to be adding functions iteratively in my algorithm.
The prediction generated by this model should be exactly
np.dot(X_test, res_wls.params)
Thus, if you want to sum several models, e.g.
summed_params = np.array([res_wls.params for res_wls in all_my_res_wls]).sum(axis=0)
your prediction should be
np.dot(X_test, summed_params)
In this case there would be no need to use the built-in functions of the estimator.
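A self-contained sketch of that idea, using made-up data that roughly follows g1(x) = 5x - 1 and g2(x) = 3x + 4 (the toy data and the WLS setup are my own illustration, not from the question):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.random((50, 1)))
y1 = X @ np.array([-1.0, 5.0]) + rng.normal(scale=0.1, size=50)   # roughly g1(x) = 5x - 1
y2 = X @ np.array([4.0, 3.0]) + rng.normal(scale=0.1, size=50)    # roughly g2(x) = 3x + 4

res1 = sm.WLS(y1, X).fit()
res2 = sm.WLS(y2, X).fit()

summed_params = res1.params + res2.params        # approximately [3, 8], i.e. G(x) = 8x + 3
X_test = sm.add_constant(np.linspace(0, 1, 5).reshape(-1, 1))
print(np.dot(X_test, summed_params))             # predictions from the summed model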