In Scikit-Learn, How Do You Fix a ValueError When Predicting? - python

The following code gives me this error: ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is produced on the line where the prediction is invoked. I assume there is something wrong with the shape of the DataFrame 'obs_to_pred'; I checked the shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model

# Import Titanic data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)

# Predict missing Age values based on the factors Pclass, SibSp, and Parch.
# The function combines the train and test data.
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    allobs = allobs[~allobs.Age.isnull()]
    y = allobs.Age
    y, X = dmatrices('y ~ Pclass + SibSp + Parch', data=allobs, return_type='dataframe')
    mod = sm.OLS(y, X)
    res = mod.fit()

    predictors = ['Pclass', 'SibSp', 'Parch']
    regr = linear_model.LinearRegression()
    regr.fit(allobs.loc[:, predictors], y)
    obs_to_pred = allobs[allobs.Age.isnull()].loc[:, predictors]
    prediction = regr.predict(obs_to_pred)  # Error produced in this line ***
    return res.summary(), prediction

regressionPred(train, test)
In case you want to look at the dataset, this link will take you there: https://www.kaggle.com/c/titanic/data

In the line
allobs = allobs[~allobs.Age.isnull()]
you redefine allobs as only the cases with no NaN in the Age column.
Later, with
obs_to_pred = allobs[allobs.Age.isnull()].loc[:, predictors]
you have no cases left to predict on: every value of allobs.Age.isnull() evaluates to False, so obs_to_pred is empty. Hence your error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic of what you actually want to predict.
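A minimal sketch of one way to fix it, assuming the intent is to fill in the missing ages: keep the combined data, fit the regression on the rows where Age is known, and predict on the rows where Age is missing. The statsmodels summary from the original function is omitted here for brevity.
import pandas as pd
from sklearn import linear_model

def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    predictors = ['Pclass', 'SibSp', 'Parch']

    known = allobs[~allobs.Age.isnull()]    # rows with Age: used for fitting
    unknown = allobs[allobs.Age.isnull()]   # rows without Age: used for prediction

    regr = linear_model.LinearRegression()
    regr.fit(known.loc[:, predictors], known.Age)

    obs_to_pred = unknown.loc[:, predictors]
    prediction = regr.predict(obs_to_pred)  # obs_to_pred is now non-empty
    return prediction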

Related

LinearRegression TypeError

[Screenshot: a spreadsheet, referred to as sample.xlsx, with stock and ChangePercent columns.] I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()

for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
X and y must have the same number of samples, so you need to reshape y as well, to 1 row and 1 column. In this case it comes down to the following:
np.array(mean).reshape(-1, 1) or np.array(mean).reshape(1, 1)
Given that you are training 5 models, each one with just a single observation, it is not surprising that all 5 will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.34 (the value of y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
    "ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()

for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1, 1)
    y = np.array(mean).reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it doesn't make practical sense, since you are only providing one example to train each model; see the sketch below for what a fit with several observations per stock could look like.
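For illustration only: beta is conventionally estimated by regressing a series of the stock's percentage changes on the corresponding market (index) changes, so each fit gets many observations instead of one. The numbers below are invented, and the market series is an assumption not present in the original data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical daily percentage changes: several observations for one stock.
market = np.array([0.5, -1.2, 0.8, 2.1, -0.3]).reshape(-1, 1)  # market/index changes (X)
stock = np.array([0.9, -2.0, 1.1, 3.5, -0.6])                  # the stock's changes (y)

model = LinearRegression().fit(market, stock)
beta = model.coef_[0]  # slope of the regression = the stock's beta
print(beta, model.intercept_)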

Why does the predict_proba function return 2 columns?

Why does the predict_proba function give 2 columns?
I looked at this page:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
However, it just says it returns T: array-like of shape (n_samples, n_classes), "the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_."
I still don't understand why the output always has 2 columns.
import numpy as np
import pandas as pd
from pylab import rcParams
import seaborn as sb
from sklearn.preprocessing import scale
from collections import Counter
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
rcParams['figure.figsize'] = 5,4
sb.set_style('whitegrid')
from sklearn.linear_model import LogisticRegression
import os
cwd = os.getcwd()
file_path = cwd + '\\Default.xlsx'
default_data = pd.read_excel(file_path)
default_data = default_data.drop(['Unnamed: 0'], axis=1)
default_data['default_factor'] = default_data.default.factorize()[0]
default_data['student_factor'] = default_data.student.factorize()[0]
X = default_data[['balance']]
y = default_data['default_factor']
lr = LogisticRegression()
lr.fit(X, y)
X_pred = np.linspace(start = 0, stop = 3000, num = 2).reshape(-1,1)
y_pred = lr.predict_proba(X_pred)
X_pred
X_pred.shape
y_pred.shape
Short answer
Each column gives you the probability that the sample belongs to the corresponding class (column zero shows the probability of belonging to class 0, column one the probability of belonging to class 1, and so on).
Detailed answer
Let's say that y_pred.shape gives you (2, 2); that means you have 2 samples and 2 classes.
Let's say that your X_pred looks like this:
In: print(X_pred)
Out: [[ 0.],
[3000.]]
That means you have two samples:
sample one, with the single feature x = [0], and
sample two, with the single feature x = [3000].
Let's say that the output of your prediction looks like this:
In: print(y_pred)
Out: [[0.28, 0.72]
[0.65, 0.35]]
So sample one most probably belongs to class 1 (the first row tells you it could be class 0 with probability 28% and class 1 with probability 72%),
and sample two most probably belongs to class 0 (the second row tells you it could be class 0 with probability 65% and class 1 with probability 35%).
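As a small self-contained sketch (the synthetic data below is invented, not from the question), you can check which class each column refers to via classes_ and confirm that each row of predict_proba sums to 1:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic binary problem, just to show the column layout of predict_proba.
X = np.array([[0.], [1.], [2.], [3000.]])
y = np.array([0, 0, 1, 1])

lr = LogisticRegression().fit(X, y)
print(lr.classes_)           # [0 1] -> column 0 is class 0, column 1 is class 1
probs = lr.predict_proba(X)  # shape (n_samples, n_classes) = (4, 2)
print(probs.sum(axis=1))     # every row sums to 1.0
print(lr.predict(X))         # predict() returns the class with the highest probability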

Non-Linearity Test NaN error

I wanted to try statsmodels' linear_harvey_collier test with an easy example. However, I get nan as a result. Can you see where my error lies?
import numpy as np
import statsmodels.stats.api as sms  # provides linear_harvey_collier (missing from the original snippet)
from statsmodels.regression.linear_model import OLS

np.random.seed(44)
n_samples, n_features = 50, 4
X = np.random.randn(n_samples, n_features)
coef = np.random.uniform(-12, 12, 4)
y = np.dot(X, coef)
var = 400
y += var**(1/2) * np.random.normal(size=n_samples)

regr = OLS(y, X).fit()
print(regr.params)
print(regr.summary())
sms.linear_harvey_collier(regr)
I get the result Ttest_1sampResult(statistic=nan, pvalue=nan).
If I perform the test while excluding one variable, I get a result:
X3 = X[:, :3]
regr3 = OLS(y, X3).fit()
In [1]: sms.linear_harvey_collier(regr3)
Out[1]: Ttest_1sampResult(statistic=0.2447803429683807, pvalue=0.806727747845282)
Is there a problem with not adding a constant and intercept? This is just a feeling, and if there is indeed a problem, I don't understand why.
There is a bug in linear_harvey_collier that hard-codes the number of initial observations (skip) to 3; with 4 regressors in the model, the initial recursive residuals are not well defined, which is why the test statistic comes out as nan.
https://github.com/statsmodels/statsmodels/pull/6727
linear_harvey_collier has only two lines of code.
A workaround is to compute the test directly:
from scipy import stats

res = regr
skip = len(res.params)  # work around the skip=3 hard-coded in linear_harvey_collier
rr = sms.recursive_olsresiduals(res, skip=skip, alpha=0.95, order_by=None)
stats.ttest_1samp(rr[3][skip:], 0)
Ttest_1sampResult(statistic=0.03092937323130299, pvalue=0.9754626388210277)

NameError: name 'X' is not defined sklearn

I am working through a multiple regression problem following this walkthrough: https://towardsdatascience.com/what-makes-a-movie-hit-a-jackpot-learning-from-data-with-multiple-linear-regression-339f6c1a7022
The problem is in the code that starts at the section "Treating categorical variables with One-hot-encoding". I ran the code up to this point, but it fails on X.
Actual code:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# LabelEncoder for a number of columns
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of columns to encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

le = MultiColumnLabelEncoder()
X_train_le = le.fit_transform(X)
Here is the error that I get:
Traceback (most recent call last):
File "<ipython-input-63-581cea150670>", line 34, in <module>
X_train_le = le.fit_transform(X)
NameError: name 'X' is not defined
Your code cannot work because you left out the 40 lines of code that the author wrote before that snippet; she defines X earlier. The full code can be obtained from her GitHub.
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
import pyreadr
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import explained_variance_score
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

result = pyreadr.read_r('Movies.RData')  # also works for Rds
print(result.keys())
df = pd.DataFrame(result['movies'], columns=result['movies'].keys())
df.shape
df.shape[0]
df.set_index("title", inplace=True)  # setting the index name
df_1 = df.loc[:, ['imdb_rating', 'genre', 'runtime', 'best_pic_nom',
                  'top200_box', 'director', 'actor1']]

# Let's also check the column-wise distribution of null values
print(df_1.isnull().values.sum())
print(df_1.isnull().sum())

# Dropping missing values from my dataset
df_1.dropna(how='any', inplace=True)
print(df_1.isnull().values.sum())  # checking for missing values after the dropna()

# Splitting into two matrices: independent variables used for prediction and the dependent variable (that is predicted)
X = df_1.drop(["imdb_rating", 'runtime'], axis=1)  # feature matrix
y = df_1["imdb_rating"]  # dependent variable
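With X defined by the snippet above, the encoder call from the question should then run. A minimal usage sketch; it assumes LabelEncoder has also been imported, since MultiColumnLabelEncoder.transform uses it directly:
from sklearn.preprocessing import LabelEncoder  # used inside MultiColumnLabelEncoder.transform

le = MultiColumnLabelEncoder()
X_train_le = le.fit_transform(X)  # no NameError now that X exists
print(X_train_le.head())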

Python statsmodel VARMAX Results

Every time I run a VARMAX model I get different coefficients.
Is there any way I could replicate my previous results without imposing a seed?
Thank you
I tried to replicate the VARMA(p,q) example posted on the statsmodels web page (https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_varmax.html). To check the replicability of the results, I just added a loop to re-estimate the model and a DataFrame (parameters) for saving the results. This is my code:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

dta = sm.datasets.webuse('lutkepohl2', 'https://www.stata-press.com/data/r12/')
dta.index = dta.qtr
endog = dta.loc['1960-04-01':'1978-10-01', ['dln_inv', 'dln_inc', 'dln_consump']]
exog = endog['dln_consump']

parameters = pd.DataFrame()
for p in range(10):
    print(p)
    mod = sm.tsa.VARMAX(endog[['dln_inv', 'dln_inc']], order=(1, 1))
    res = mod.fit(maxiter=1000, disp=False)
    print(res.summary())
    param = pd.DataFrame(res.params, columns=["estimation " + str(p)])
    parameters = pd.concat([parameters, param], axis=1)
print(parameters)
As you can see, the results change every time I re-estimate the model:
estimation 0 estimation 1 estimation 2 \
const.dln_inv 0.010974 0.010934 0.010934
const.dln_inc 0.016554 0.016536 0.016536
L1.dln_inv.dln_inv -0.010164 -0.010087 -0.010087
L1.dln_inc.dln_inv 0.360306 0.362187 0.362187
L1.dln_inv.dln_inc -0.032975 -0.033071 -0.033071
L1.dln_inc.dln_inc 0.230657 0.231421 0.231421
L1.e(dln_inv).dln_inv -0.249916 -0.250307 -0.250307
L1.e(dln_inc).dln_inv 0.125546 0.125581 0.125581
L1.e(dln_inv).dln_inc 0.088878 0.089001 0.089001
L1.e(dln_inc).dln_inc -0.235258 -0.235176 -0.235176
sqrt.var.dln_inv 0.044926 0.044927 0.044927
sqrt.cov.dln_inv.dln_inc 0.001670 0.001662 0.001662
sqrt.var.dln_inc 0.011554 0.011554 0.011554
