Regression analysis results not in the expected range of numbers - Python

I am trying to run a regression analysis on 25-dimensional data.
My data is in a data frame.
My end objective is to predict a score value, which is a percentage (0, 99, 70, 22, etc.).
1. Do I need to normalize/scale the data, or does linear/polynomial regression handle this?
2. I applied polynomial regression. Although it gives me a good R-squared value, it returns negative values such as -342.54 or large values such as 252 (not at all in the range of the scores I trained on). How do I rectify this?
Is there any other technique I could use to predict the values?
Here is a link to the data:
https://docs.google.com/spreadsheets/d/1swkRwLXklrWEDV3bKic5uxl_uHLjzU0QDHJ2JLSP8zQ/edit?usp=sharing
Also here's the code:
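from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# X = columns D:AC of the spreadsheet
# y = column 'Score' (or column 'Match' in the case of logistic regression)
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
# Reuse the transformer fitted on the training data instead of refitting it
X_test_ = poly.transform(X_test)
# Instantiate
lg = LinearRegression()
# Fit
lg.fit(X_, y)
# Obtain coefficients
lg.coef_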

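1. Do I need to normalize/scale the data, or does linear/polynomial regression handle this?
It is "usually" good practice, and gradient-based solvers converge faster on scaled inputs. If you are using sklearn, older versions of LinearRegression had a parameter called normalize which, when set to True, normalized all variables before fitting the model. That parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so in current versions you scale the features yourself, for example with StandardScaler.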
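A minimal sketch of scaling inside a pipeline, so that the scaler and the polynomial expansion are fitted on the training split only and then reused on the test split. The synthetic data and the variable names are assumptions for illustration, not the asker's spreadsheet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data frame (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=20, size=(300, 4))
y = 0.3 * X[:, 0] + 0.01 * X[:, 1] ** 2 + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, expand to degree-2 polynomial terms, then fit ordinary least squares.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LinearRegression(),
)
model.fit(X_train, y_train)    # every step is fitted on the training data only
pred = model.predict(X_test)   # the test data is only transformed, never refitted
r2 = model.score(X_test, y_test)
```

Keeping the steps in one pipeline also avoids accidentally calling fit_transform on the test set.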
2. I applied polynomial regression. Although it gives me a good R-squared value, it returns negative values such as -342.54 or large values such as 252 (not at all in the range of the scores I trained on). How do I rectify this? Is there any other technique I could use to predict the values?
Polynomial regression is unbounded: its predictions can be anywhere between -inf and +inf. If you want percentage values, map the output into a bounded range with a function like the sigmoid. You can also use logistic regression, whose predict_proba() function outputs probabilities between 0 and 1 (although this model optimizes a different objective entirely).
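One way to sketch the sigmoid idea is to fit the regression on the logit of score/100 and map predictions back with the sigmoid, so they cannot leave the 0 to 100 range. This assumes scores on a 0-100 scale; the helper names to_logit and from_logit, the clipping constant eps, and the synthetic data are all illustrative choices, not part of the original question.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic scores in (0, 100) (assumption, stands in for the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
score = 100 * expit(X @ rng.normal(size=5) + rng.normal(scale=0.3, size=200))

eps = 1e-3  # keeps scores of exactly 0 or 100 away from +/- infinity

def to_logit(y):
    # Map a percentage to the unbounded log-odds scale.
    return logit(np.clip(y / 100.0, eps, 1 - eps))

def from_logit(z):
    # Map a log-odds value back to a percentage between 0 and 100.
    return 100.0 * expit(z)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=to_logit,
    inverse_func=from_logit,
    check_inverse=False,  # clipping makes the round trip inexact at the edges
)
model.fit(X, score)

# Even for inputs far outside the training range the output stays bounded.
pred = model.predict(rng.normal(size=(50, 5)) * 10)
```

The regressor inside can be replaced by a polynomial pipeline; the bounding comes from the target transform, not from the model.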
As @VivekKumar rightly said, we can hardly help further without more specific information.

Related

After normalising my data using DataPreparer while using random forest and SVM, why do my data values become negative?

I am working on predictive modeling where I need to predict whether an online customer ends up purchasing a product on a website or not, and I am using Random Forest Classifier and SVM since it's a classification problem.
After creating the fitting splits for training, testing, and validation sets, I dummify, standardize and normalize my data. However, after I normalize the sets, their values become all negative. Is there a way to change that and why does it happen?
The code that I am using to normalize my fitting sets is as below:
data_preparer = DataPreparer(one_hot_encoder, standard_scaler)
data_preparer.prepare_data(fitting_splits.train_set).head()
data_preparer.prepare_data(fitting_splits.validation_set).head()
I think the documentation from sklearn.preprocessing.StandardScaler can help here:
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if
with_mean=False, and s is the standard deviation of the training
samples or one if with_std=False.
Based on this equation, if x (the individual value currently being scaled) is less than the mean of the variable, then your scaled value will be negative.
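A small illustration of that formula on made-up numbers: values below the mean come out negative after StandardScaler. If non-negative features are required, MinMaxScaler (which rescales to the range 0 to 1 by default) is one alternative; whether it suits the DataPreparer setup in the question is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # mean is 3.0

# z = (x - mean) / std, so 1.0 and 2.0 become negative, 3.0 becomes 0.
z = StandardScaler().fit_transform(x)

# (x - min) / (max - min), so every value lands in [0, 1].
m = MinMaxScaler().fit_transform(x)

print(z.ravel())
print(m.ravel())
```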

Permutation importance larger than 1 for linear regression R^2

I am using sklearn's permutation_importance to estimate the importance of my independent variables. The model I am fitting is a linear regression.
The permutation importance the model returns looks like this: [0.7939618 3.6692722 0.02936469].
The permutation importance is defined as the difference between the baseline metric and the metric obtained after permuting the feature column.
In my case, I think that the baseline metric is the R^2 value if I do not permute any variables (baseline R^2 ~ 0.86). How is it possible that I obtain a value of 3.66 for one of my features in this case? If I manually permute this feature and recalculate R^2, I get a value of ~0.18, so the feature importance would be ~0.68 if I am not mistaken.
If anyone could explain to me why I am observing these high feature importance values, I'd be very grateful!
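The definition above can be sketched on synthetic data. R^2 is not bounded below by zero, so when a dominant feature is permuted the score can turn negative and the difference from the baseline can exceed 1. This is only an illustration under assumed data and does not diagnose the specific numbers in the question.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data in which the second feature dominates (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 0.5 * X[:, 0] + 3.0 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.5, size=500)

model = LinearRegression().fit(X, y)
baseline = r2_score(y, model.predict(X))

# Manual version: shuffle one column, rescore, take the difference.
X_perm = X.copy()
X_perm[:, 1] = rng.permutation(X_perm[:, 1])
permuted = r2_score(y, model.predict(X_perm))
manual_importance = baseline - permuted

# Library version, averaged over several shuffles.
result = permutation_importance(
    model, X, y, scoring="r2", n_repeats=10, random_state=0
)
print(baseline, permuted, manual_importance)
print(result.importances_mean)
```

Comparing the manual value with result.importances_mean on the same data and the same scorer is a quick way to check whether both computations use the same metric.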

Residual plot diagnostic and how to improve the regression model

When creating regression models for this housing dataset, we can plot the residuals as a function of the real values.
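import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# housing is the data frame holding the housing dataset
X = housing[['lotsize']]
y = housing[['price']]
model = LinearRegression()
model.fit(X, y)
# Residual (prediction - real value) against the real value
plt.scatter(y, model.predict(X) - y)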
We can clearly see that the difference (prediction - real value) is mainly positive for lower prices and negative for higher prices.
This makes sense for linear regression, because the model is optimized for RMSE (so the sign of the residual is not taken into account).
But when using KNN:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors = 3)
we find a similar plot.
In this case, what interpretation can we give, and how can we improve the model?
EDIT: we can use all the other predictors; the results are similar.
housing = housing.replace(to_replace='yes', value=1, regex=True)
housing = housing.replace(to_replace='no', value=0, regex=True)
X = housing[['lotsize','bedrooms','stories','bathrms','driveway','recroom',
'fullbase','gashw','airco','garagepl','prefarea']]
The following graph is for KNN with 3 neighbors. With only 3 neighbors one would expect overfitting, so I can't figure out why this trend appears.
If you look at the fit:
plt.scatter(X,y)
plt.plot(X,model.predict(X), '--k')
You get negative residuals for higher values of y because there is a cluster of points around x=8000 whose y values are far above what the fit predicts.
Now if you do a KNN, bear in mind that your independent variable is only one-dimensional. You are defining neighbours based on lotsize alone and using the mean of each neighbourhood as the predicted value. The high outliers around x=8000 get grouped with values lower than themselves, which makes the difference negative.
If you plot this out:
plt.scatter(X,y)
plt.scatter(X,model.predict(X))
How to improve the model? With only one predictor there's not much you can do. You could categorize lotsize, but I doubt it changes much. Most likely you need other variables to see what is causing that bump around lotsize = 8000; then you can model the dependent variable better.
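The KNN effect described above can be reproduced on a toy one-dimensional example. The five points below are made up for illustration: one high value sits among low neighbours, and with 3 neighbours it is averaged with them.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# One predictor, one high outlier at x = 3 (made-up data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([10.0, 10.0, 100.0, 10.0, 10.0])

model = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Each prediction is the mean of the 3 nearest targets, here always
# (10 + 10 + 100) / 3 = 40, so the outlier is pulled down and the rest pulled up.
residuals = model.predict(X) - y
print(residuals)  # [ 30.  30. -60.  30.  30.]
```

The high value gets a negative residual and its low neighbours get positive ones, which is the same trend as in the residual plot.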

Input format for logistic regression in scikit-learn as in R

When using logistic regression in R, the response passed to the 'glm' function (family = binomial) can be given in several formats (see ?family), and specifically in this format:
......
For the binomial and quasibinomial families the response can be
specified in one of three ways:
......
As a numerical vector with values between 0 and 1, interpreted as the
proportion of successful cases (with the total number of cases given
by the weights)....
I have aggregated data that represents the proportion of successes out of trials (a number between 0 and 1) and the corresponding weights. I'm interested in applying logistic regression to it, which would be trivial in R.
Unfortunately I can't use R in this project, and would like to use scikit-learn to estimate the logistic regression coefficients. More precisely, I'm looking to apply sklearn.linear_model.LogisticRegression to an input format that lets me pass the proportions and weights to the model, in a similar fashion to what is available in R.
example:
from sklearn import linear_model
import pandas as pd
df = pd.DataFrame([[1,1,1,0], [1,1,1,0],[1,1,1,1],[2,2,1,1] , [2,2,1,1],[2,2,1,0] , [3,3,1,0] ],columns=['a', 'b','Trials','Success'])
logistic = linear_model.LogisticRegression()
#this works
logistic.fit(X=df[['a','b','Trials']] , y=df.Success)
logistic.predict_proba(df[['a','b','Trials']])
prob_to_success = logistic.predict_proba(df[['a','b','Trials']])[:,1]
prob_to_success
Out[51]: array([ 0.45535843, 0.45535843, 0.45535843, 0.42212169, 0.42212169,
0.42212169, 0.38957565])
# How can I use the following data?
# Selecting several columns after groupby needs a list (double brackets)
df_agg = df.groupby(['a','b'], as_index=False)[['Trials','Success']].sum()
df_agg["Prop"] = df_agg.Success / df_agg.Trials
df_agg
# I want to use Prop as the response and Trials as the weights in df_agg
Thanks in advance!
Convert the proportions to log-odds and run a linear regression on the transformed values. sklearn doesn't seem to have a quasi-binomial option for logistic regression. As you said, it is trivial in R, but sklearn does not appear to offer anything of the sort.
If you want to use weights, you can use them in the fit function of LogisticRegression:
fit(X, y, sample_weight=None)
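A possible way to combine aggregated counts with sample_weight is sketched below: each aggregated row is split into a "success" row weighted by the number of successes and a "failure" row weighted by the number of failures. The column names label and weight are introduced here for illustration, and only a and b are used as features (an assumption). Note that LogisticRegression applies L2 regularization by default, so the coefficients will not match an unpenalized glm fit exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

raw = pd.DataFrame(
    [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [2, 2, 1, 1],
     [2, 2, 1, 1], [2, 2, 1, 0], [3, 3, 1, 0]],
    columns=["a", "b", "Trials", "Success"],
)
agg = raw.groupby(["a", "b"], as_index=False)[["Trials", "Success"]].sum()

# One row per (group, outcome), weighted by how often that outcome occurred.
successes = agg.assign(label=1, weight=agg["Success"])
failures = agg.assign(label=0, weight=agg["Trials"] - agg["Success"])
long = pd.concat([successes, failures], ignore_index=True)
long = long[long["weight"] > 0]  # drop outcomes that never occurred

weighted = LogisticRegression()
weighted.fit(
    long[["a", "b"]], long["label"], sample_weight=long["weight"].to_numpy()
)

# Reference fit on the original one-row-per-trial data.
reference = LogisticRegression().fit(raw[["a", "b"]], raw["Success"])

p_weighted = weighted.predict_proba(agg[["a", "b"]])[:, 1]
p_reference = reference.predict_proba(agg[["a", "b"]])[:, 1]
print(p_weighted, p_reference)
```

Both fits minimize the same weighted log-loss, so the predicted probabilities should agree up to solver tolerance.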

How to find the importance of the features for a logistic regression model?

I have a binary prediction model trained with logistic regression. I want to know which features (predictors) are more important for deciding between the positive and the negative class. I know scikit-learn exposes the coef_ attribute, but I don't know whether that is enough to judge importance. I also don't know how to interpret the coef_ values in terms of importance for the negative and positive classes. I have read about standardized regression coefficients as well, but I don't know what they are.
Let's say there are features like size of tumor, weight of tumor, etc. that are used to decide whether a test case is malignant or not malignant. I want to know which of the features are more important for the malignant and not malignant predictions. Does that make sense?
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic being one of those), is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
import numpy as np
from sklearn.linear_model import LogisticRegression
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])
m = LogisticRegression()
m.fit(X, y)
# The estimated coefficients will all be around 1:
print(m.coef_)
# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
m.fit(X / np.std(X, 0), y)
print(m.coef_)
Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.
