I have the following OLS model from StatsModels:
import statsmodels
import statsmodels.api as sm

X = df['Grade']
y = df['Results']
X = statsmodels.tools.tools.add_constant(X)
mod = sm.OLS(y, X)
results = mod.fit()
When trying to predict a new Y value for an X value of 4, I have to pass the following:
results.predict([1,4])
I don't understand why an array with the first value being '1' needs to be passed in order for the predict function to work correctly. Why do I need to include a 1 instead of just saying:
results.predict([4])
I'm not clear on the concept at work here. Does anybody know what's going on?
You are adding a constant to the regression equation with X = statsmodels.tools.tools.add_constant(X). So your regressor X has two columns, where the first column is an array of ones.
You need to do the same with the regressor that is used in prediction. So, the 1 means include the constant in the prediction. If you use zero instead, then the contribution of the constant (0 * params[0]) is zero and the prediction is only the slope effect.
The formula interface adds the constant automatically both for the regressor in the model and for the regressor in the prediction. However, with the pandas DataFrame or numpy ndarray interface, the constant needs to be added by the user both for the model and for predict.
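For example, here is a minimal sketch of the difference, using made-up data (the df values below are hypothetical):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'Grade': [1, 2, 3, 4, 5], 'Results': [2.1, 3.9, 6.2, 8.1, 9.8]})

# Formula interface: the intercept is handled automatically in fit and predict.
res_f = smf.ols('Results ~ Grade', data=df).fit()
print(res_f.predict(pd.DataFrame({'Grade': [4]})))  # no manual 1 needed

# Array interface: the constant must be added explicitly in both places.
res_a = sm.OLS(df['Results'], sm.add_constant(df['Grade'])).fit()
print(res_a.predict([[1, 4]]))  # the leading 1 multiplies the intercept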
I am running a logistic regression using 2 features, cylinders and years, for a multiclass classification problem.
After training, we apply the model to the test data:
for origin in unique_origins:
    # Select testing features.
    X_test = test[features]
    # Compute probability of observation being in the origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:, 1]
I am confused to what the [:,1] part does to the code, if anyone could kindly explain.
There is a hint which says:
testing_probs[1] should return the probability returned from model 1 for each observation.
However, I do not understand what it is trying to explain.
Please help and thank you.
array[:, 1] uses NumPy's multidimensional indexing: it keeps all the data along the first dimension, but picks only the entry at index 1 along the second dimension. (Note that it is not the same as array[:][1], which selects the second row, not the second column.)
a = np.array([[0, 1], [2, 3]])
a[:, 1]
>>> array([1, 3])
In your case, the first dimension represents the index of the prediction and the second dimension the index of the class.
Also, when calling:
models[origin].predict_proba(X_test)[i,j]
you get the predicted probability of the jth class for the ith input.
So, models[origin].predict_proba(X_test)[:,1] means you want the predictions for all of your inputs but only the result for the 2nd class (as the first class is indexed at 0)!
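As a quick illustration with made-up data (the toy arrays below are hypothetical, not the cylinders/years dataset):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: predict_proba returns one column per class.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X)  # shape (4, 2): columns are P(class 0), P(class 1)
print(probs[:, 1])            # probability of class 1 for every input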
If I have a few classes, e.g. 3, I expect to get 3 arrays of generalized linear model regression coefficients, as in sklearn.linear_model.LogisticRegression, but statsmodels.discrete.discrete_model.MNLogit provides only num_classes - 1 coefficient arrays (in this case, 2).
Example:
import statsmodels.api as st
from sklearn.linear_model import LogisticRegression
iris = st.datasets.get_rdataset('iris','datasets')
y = iris.data.Species
x = iris.data.iloc[:, :-1]
mdl = st.MNLogit(y, x)
# mdl_fit = mdl.fit()
mdl_fit = mdl.fit(method='bfgs' , maxiter=1000)
print(mdl_fit.params.shape) # (4, 2)
model = LogisticRegression(fit_intercept = False, C = 1e9)
mdl = model.fit(x, y)
print(model.coef_.shape) # (3, 4)
How I am supposed to get regression coefficients for all 3 classes using MNLogit?
Those coefficients are omitted in order to force identifiability of the model. In other words, not computing them ensures that the coefficients for the other classes are unique. If you had three sets of coefficients, there would be an infinite number of models giving the same predictions but with different coefficient values. And this is bad if you want to know standard errors, p-values and so on.
The logit of the missing class is assumed to be zero. Demo:
import numpy as np

mm = st.MNLogit(
    np.random.randint(1, 5, size=(100,)),
    np.random.normal(size=(100, 3))
)
res = mm.fit()
xt = np.random.normal(size=(2, 3))
res.predict(xt)
Results in:
array([[0.19918096, 0.34265719, 0.21307297, 0.24508888],
       [0.33974178, 0.21649687, 0.20971884, 0.23404251]])
Now compute the logits, with zeros prepended for the first (reference) class:
logits = np.hstack([np.zeros((xt.shape[0], 1)), xt.dot(res.params)])
array([[ 0.        ,  0.54251673,  0.06742093,  0.20740715],
       [ 0.        , -0.45060978, -0.4824181 , -0.37268309]])
And the predictions via a softmax:
np.exp(logits) / (np.sum(np.exp(logits), axis=1, keepdims=1))
array([[0.19918096, 0.34265719, 0.21307297, 0.24508888],
       [0.33974178, 0.21649687, 0.20971884, 0.23404251]])
which match the predictions from the model.
To reiterate: you cannot find those coefficients; use a constant logit of zero for the first class. And you cannot find how influential the features are for the first class. That is actually an ill-posed question: the features cannot be influential for the reference class, because the reference class is never predicted directly. What the coefficients tell you is how much the log-odds for a given class, compared to the reference class, change as a result of a unit increase in a particular feature.
The predicted probabilities have to add up to 1 over the classes, so we lose one free parameter in the full model in order to impose this constraint.
The predicted probability for the reference class is one minus the sum of probabilities of all other classes.
This is similar to the binary case, where we don't have separate parameters for success and failure because one probability is just one minus the other probability.
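If you nevertheless want a full coefficient array shaped like sklearn's, a minimal sketch continuing the demo above is to prepend the fixed zero column for the reference class (this is a convention, not an extra estimate):

# Prepend a zero column for the reference class to emulate a full matrix.
full_params = np.hstack([np.zeros((np.asarray(res.params).shape[0], 1)),
                         np.asarray(res.params)])
print(full_params.shape)  # (n_features, n_classes); sklearn's coef_ is the transpose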
I am wondering how to predict future time series data after model training. I would like to get the values N steps ahead, and I am not sure whether the time series has been properly learned and predicted. How do I do this correctly to get the following (next) values? I want to get the next values using model.predict or something similar.
I have x_test, and x_test[-1] == t, so the next values I mean are t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below
The results for X_test[-20:] and the following 20 predictions look the same. I'm wondering whether this is the correct way to train and predict, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something is wrong, I tried using another official data so I used the time series in the Tensorflow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]  # remove first
    a = np.append(a, tmp)  # insert predicted value
The predictions came out as a straight, regression-like line, very different from the real data.
The output is an abnormal linear trend that is independent of the real data:
full source (After the 25th line is my code.)
I'm really curious about how to predict the following values of a time series using the TensorFlow predict method.
I'm not asking whether this works theoretically; I just want to know how to get the following n steps using the predict method.
Thank you for reading this long question; I would appreciate your advice.
In the second approach, the output is not as expected because of a small mistake in the code, as per my understanding.
The line of code,
a = y_val[-look_back:]
should be replaced by
look_back = 20
x = x_val_uni
a = x[-look_back:]
a.shape
In other words, we should send the X values as inputs to the model for prediction, not the Y values.
However, we can compare its predictions with the Y values using the code:
y = y_val_uni[-20:]
plt.plot(y)
plt.plot(tmp)
plt.show()
Which would result in the plot shown below:
Please find the Complete Working Code in this Google Colab Gist.
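For completeness, here is a minimal, hedged sketch of the full recursive N-step loop, assuming a trained Keras model and the univariate series x_val_uni from the tutorial (look_back and n_steps are hypothetical settings):

import numpy as np

look_back = 20
n_steps = 20
window = np.asarray(x_val_uni[-look_back:]).reshape(-1)

preds = []
for _ in range(n_steps):
    inp = window.reshape(1, look_back, 1)      # (batch, time steps, features)
    next_val = model.predict(inp)[0, 0]        # one-step-ahead forecast
    preds.append(next_val)
    window = np.append(window[1:], next_val)   # slide the window forward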
I have this Credit Default dataset with head like this:
default student balance income default_Yes
No No 729.526495 44361.625074 0
No Yes 817.180407 12106.134700 0
No No 1073.549164 31767.138947 0
No No 529.250605 35704.493935 0
No No 785.655883 38463.495879 0
I am trying to perform logistic regression for 'default_Yes' based on the 'balance' attribute and used the following function:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
X = cred_def[['balance']]
Y = cred_def['default_Yes']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=76)
logist = LogisticRegression()
logist.fit(X_train,Y_train)
y_pred = logist.predict(X_test)
def model(threshold):
    def_thresh = np.greater(y_pred, threshold).astype(int)
    acc_score = metrics.accuracy_score(Y_test, def_thresh)
    print(acc_score)
    plt.scatter(X_test.values, Y_test.values)
    plt.scatter(X_test.values, def_thresh)
    conf = metrics.confusion_matrix(Y_test, y_pred)
    print(conf)
The problem I am facing is that no matter what value of threshold I pass to the function model, it produces the same output and does not take the passed value into account.
EDIT (in response to the first two edits of this question statement): you don't pass any parameters whatsoever to logist = LogisticRegression(). You pass random_state=True to train_test_split(). Not to LogisticRegression.
random_state is supposed to be an integer (a random seed), not a boolean - read the docs. So by passing True, which gets coerced to 1, you just keep setting random_state = 1.
Try it on some other integer values and you'll get different results.
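For instance, a quick sketch (assuming the X and Y from your snippet):

# Different integer seeds produce different train/test splits.
for seed in [0, 1, 42]:
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=seed)
    print(seed, list(X_tr.index[:3]))  # different rows end up in each split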
EDIT2: Your issue had nothing to do with the random_state parameter as originally titled. It has to do with your predicted values y_pred = logist.predict(X_test), and specifically how they behave as you sweep your threshold parameter across the possible range [0, 1] of LR output values.
Show us a table with at least five different values of threshold, like [0, 0.25, 0.5, 0.75, 1.0], and whatever value you mean by "the result". Next, what do you mean by "the result"? Your accuracy acc_score, your confusion matrix conf, or what? For now, forget the confusion matrix; just look at the effect of applying different values of threshold to the same array of predicted values y_pred.
Also, you want to sanity-check y_pred: inspect it. Is it all ones? All zeros? What are its mean, median, etc.? Please post a table of data; do not just keep saying "it doesn't work".
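To make the point concrete, here is a hedged sketch of a sensible sweep, assuming the logist, X_test, Y_test, and metrics objects from the question. The key change is thresholding the predict_proba outputs rather than the hard 0/1 labels that predict returns (thresholding 0/1 labels with a threshold in (0, 1) just reproduces the labels):

# Threshold the predicted probabilities, not the hard class labels.
probs = logist.predict_proba(X_test)[:, 1]  # P(default_Yes = 1) for each row
for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    labels = (probs > threshold).astype(int)
    print(threshold, metrics.accuracy_score(Y_test, labels))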
I'm confused how to interpret the output of .predict from a fitted CoxnetSurvivalAnalysis model in scikit-survival. I've read through the notebook Intro to Survival Analysis in scikit-survival and the API reference, but can't find an explanation. Below is a minimal example of what leads to my confusion:
import pandas as pd
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis
# load data
data_X, data_y = load_veterans_lung_cancer()
# one-hot-encode categorical columns in X
categorical_cols = ['Celltype', 'Prior_therapy', 'Treatment']
X = data_X.copy()
for c in categorical_cols:
    dummy_matrix = pd.get_dummies(X[c], prefix=c, drop_first=False)
    X = pd.concat([X, dummy_matrix], axis=1).drop(c, axis=1)
# display final X to fit Cox Elastic Net model on
del data_X
print(X.head(3))
so here's the X going into the model:
Age_in_years Celltype Karnofsky_score Months_from_Diagnosis \
0 69.0 squamous 60.0 7.0
1 64.0 squamous 70.0 5.0
2 38.0 squamous 60.0 3.0
Prior_therapy Treatment
0 no standard
1 yes standard
2 no standard
...moving on to fitting model and generating predictions:
# Fit Model
coxnet = CoxnetSurvivalAnalysis()
coxnet.fit(X, data_y)
# What are these predictions?
preds = coxnet.predict(X)
preds has the same number of records as X, but its values are way different from the values in data_y, even when predicting on the same data the model was fit on.
print(preds.mean())
print(data_y['Survival_in_days'].mean())
output:
-0.044114643249153422
121.62773722627738
So what exactly are preds? Clearly .predict means something pretty different here than in scikit-learn, but I can't figure out what. The API Reference says it returns "The predicted decision function," but what does that mean? And how do I get to the predicted estimate in months yhat for a given X? I'm new to survival analysis so I'm obviously missing something.
I posted this question on GitHub, though the author renamed the issue.
I got some helpful explanation of what the predict output is, but I am still not sure how to get a set of predicted survival times, which is what I really want. Here are a couple of helpful explanations from that GitHub thread:
predictions are risk scores on an arbitrary scale, which means you can
usually only determine the sequence of events, but not their exact time.
-sebp (library author)
It [predict] returns a type of risk score. Higher value means higher
risk of your event (class value = True)...You were probably looking
for a predicted time. You can get the predicted survival function with
estimator.predict_survival_function as in the example 00
notebook...EDIT: Actually, I’m trying to extract this but it’s been a
bit of a pain to munge
-pavopax.
There's more explanation at the github thread, though I wasn't really able to follow all of it. I need to play around with predict_survival_function and predict_cumulative_hazard_function and see if I can get to a set of predictions for most likely survival time by row in X, which is what I really want.
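For reference, here is a minimal, hedged sketch of that direction. It assumes predict_survival_function is available for this estimator (in recent scikit-survival versions this requires fitting with fit_baseline_model=True; check your version), and the median-time extraction is my own rough heuristic, not a library API:

# Hedged sketch: survival curves and a rough per-row "median survival time".
coxnet2 = CoxnetSurvivalAnalysis(fit_baseline_model=True)
coxnet2.fit(X, data_y)
surv_funcs = coxnet2.predict_survival_function(X)

# Rough median survival per row: the first time point where the predicted
# survival curve S(t) drops to 0.5 or below.
for fn in surv_funcs[:3]:
    below = fn.x[fn.y <= 0.5]
    print(below[0] if len(below) else "median not reached")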
I'm not going to accept this answer here, in case anyone else has a better one.
With the X input, you get an evaluation of the linear predictor on the input array:
def predict(self, X, alpha=None):
    """The linear predictor of the model.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test data of which to calculate log-likelihood from
    alpha : float, optional
        Constant that multiplies the penalty terms. If the same alpha was used during training, exact
        coefficients are used, otherwise coefficients are interpolated from the closest alpha values that
        were used during training. If set to ``None``, the last alpha in the solution path is used.

    Returns
    -------
    T : array, shape = (n_samples,)
        The predicted decision function
    """
    X = check_array(X)
    coef = self._get_coef(alpha)
    return numpy.dot(X, coef)
The check_array function comes from another library (scikit-learn).
You can review the code of coxnet.
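So the returned values are just the linear predictor X · coef (a risk score on an arbitrary scale), not survival times. As a rough sanity check, a sketch under the assumption that coef_ stores one column of coefficients per alpha in the path, with the last column corresponding to the default alpha used by predict:

import numpy as np

# Hedged check: predict() should match the dot product of X with the
# coefficients for the last alpha in the path (layout assumption; verify
# against your sksurv version).
manual = np.dot(X.values, coxnet.coef_[:, -1])
print(np.allclose(manual, coxnet.predict(X)))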