The screenshot above is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value (model.coef_) for each stock?
X and y must have the same number of samples, and y cannot be a zero-dimensional array, so you need to reshape y as well as x. In this case it comes down to the following:
np.array(mean).reshape(-1,1) or np.array(mean).reshape(1,1)
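To see why the original call fails: np.array(3.34) is a zero-dimensional array, which scikit-learn cannot treat as a collection of samples. A quick check (a minimal standalone sketch, independent of sample.xlsx):
import numpy as np

y0 = np.array(3.34)                  # 0-d "singleton" array, rejected by fit()
y1 = np.array(3.34).reshape(-1, 1)   # 2-D array with one sample
print(y0.shape)  # ()
print(y1.shape)  # (1, 1)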
Given that you are training 5 models, each with just one sample, it is not surprising that all 5 will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.34 (the single y value): with one training point the centered feature is identically zero, so least squares attributes everything to the intercept.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
"stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
"ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1, 1)
    y = np.array(mean).reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it doesn't make any practical sense, given that you're only providing one example to train each model. A sketch of what a meaningful setup would look like follows.
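Beta is conventionally estimated by regressing a series of per-period stock returns against the corresponding market returns, so each model needs many rows, one per period, not one row per stock. A minimal sketch of that setup, using made-up column names and random data (nothing here comes from sample.xlsx):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical daily returns: one column per stock plus the market index
rng = np.random.default_rng(0)
returns = pd.DataFrame({
    'market': rng.normal(0, 1, 250),
    'ABCD': rng.normal(0, 1, 250),
    'XYZ': rng.normal(0, 1, 250),
})

X = returns[['market']]  # shape (250, 1): one feature, many samples
for stock in ['ABCD', 'XYZ']:
    model = LinearRegression().fit(X, returns[stock])
    print(stock, model.coef_[0])  # the slope is the stock's beta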
Related
Why does the predict_proba function give 2 columns?
I looked at this page:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
However, it just says: Returns: T: array-like of shape (n_samples, n_classes).
Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
I still don't understand why the output always returns 2 columns.
import numpy as np
import pandas as pd
from pylab import rcParams
import seaborn as sb
from sklearn.preprocessing import scale
from collections import Counter
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
rcParams['figure.figsize'] = 5,4
sb.set_style('whitegrid')
from sklearn.linear_model import LogisticRegression
import os
cwd = os.getcwd()
file_path = cwd + '\\Default.xlsx'
default_data = pd.read_excel(file_path)
default_data = default_data.drop(['Unnamed: 0'], axis=1)
default_data['default_factor'] = default_data.default.factorize()[0]
default_data['student_factor'] = default_data.student.factorize()[0]
X = default_data[['balance']]
y = default_data['default_factor']
lr = LogisticRegression()
lr.fit(X, y)
X_pred = np.linspace(start = 0, stop = 3000, num = 2).reshape(-1,1)
y_pred = lr.predict_proba(X_pred)
X_pred
X_pred.shape
y_pred.shape
Short answer
Each column gives you the probability that the sample belongs to the corresponding class (column 0 shows the probability of belonging to class 0, column 1 the probability of belonging to class 1, and so on).
Detailed answer
Let's say y_pred.shape gives you (2, 2). That means you have 2 samples and 2 classes.
Let's say your X_pred looks like this:
In: print(X_pred)
Out: [[ 0.],
[3000.]]
That means you have two samples:
sample one, with a single feature x = [0], and
sample two, with a single feature x = [3000].
Let's say the output of your prediction looks like this:
In: print(y_pred)
Out: [[0.28, 0.72],
      [0.65, 0.35]]
So sample one most probably belongs to class 1 (the first row says it could be class 0 with probability 28% and class 1 with probability 72%), and sample two most probably belongs to class 0 (the second row says class 0 with probability 65% and class 1 with probability 35%).
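Two properties are worth checking yourself: the columns of predict_proba are ordered as in lr.classes_, and each row sums to 1 because it is a probability distribution over the classes. A minimal sketch on toy data (not tied to the Default.xlsx example):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0, 0, 1, 1])

lr = LogisticRegression().fit(X, y)
proba = lr.predict_proba(X)

print(lr.classes_)        # [0 1]: the column order of predict_proba
print(proba.shape)        # (4, 2): one column per class
print(proba.sum(axis=1))  # [1. 1. 1. 1.]: each row sums to 1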
This code manually selects a column from the y table, joins it to the X table, and then runs a linear regression. Any idea how to do this for every column of the y table?
import pandas as pd
from sklearn import linear_model

yDF = pd.read_csv('ytable.csv')
yDF.drop('Dates', axis = 1, inplace = True)
XDF = pd.read_csv('Xtable.csv')
ycolumnDF = yDF.iloc[:,0].to_frame()
regressionDF = pd.concat([XDF,ycolumnDF], axis=1)
X = regressionDF.iloc[:,1:20]
y = regressionDF.iloc[:,20:].squeeze()
lm = linear_model.LinearRegression()
lm.fit(X,y)
cf = lm.coef_
print(cf)
You can regress multiple y's on the same X's at the same time. Something like this should work:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df_X = pd.DataFrame(columns = ['x1','x2','x3'], data = np.random.normal(size = (10,3)))
df_y = pd.DataFrame(columns = ['y1','y2'], data = np.random.normal(size = (10,2)))
X = df_X.iloc[:,:]
y = df_y.iloc[:,:]
lm = LinearRegression().fit(X,y)
print(lm.coef_)
produces
[[ 0.16115884 0.08471495 0.39169592]
[-0.51929011 0.29160846 -0.62106353]]
The first row here ([ 0.16115884 0.08471495 0.39169592]) holds the regression coefficients of y1 on the x's, and the second row holds the coefficients of y2 on the x's.
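Applied to your own tables, a sketch along the same lines (assuming XDF holds only the predictor columns and yDF only the targets, loaded as in your question):
lm = LinearRegression().fit(XDF, yDF)

# one row of coefficients per y column, labeled for readability
coefs = pd.DataFrame(lm.coef_, index=yDF.columns, columns=XDF.columns)
print(coefs)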
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_excel("Book1.xlsx")
for column in df:
    X = df["Row Labels"]
    Y = df[column]
    y1 = Y.values.reshape(-1, 1)
    x1 = X.values.reshape(-1, 1)
    regressor = LinearRegression()
    regressor.fit(x1, y1)
    y_new = []
    y_i = []
    for i in range(12, 24):
        y_new.append(regressor.predict([[i]]))
        y_i.append(i)
    df2 = pd.DataFrame({'column': y_new})
I wrote this code to loop through the dataframe columns, run a simple linear regression for each one, and put all the predicted values in a dataframe, but it keeps only the last column's predictions.
df2 = pd.DataFrame({'column': y_new}) creates a column literally named 'column' (not the name stored in the variable column). Moreover, df2 is recreated in every iteration, so it only ever holds the last y_new.
I think what you want is to create a new column in df2 in each iteration:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_excel("Book1.xlsx")
df2 = pd.DataFrame()
for column in df:
    X = df["Row Labels"]
    Y = df[column]
    y1 = Y.values.reshape(-1, 1)
    x1 = X.values.reshape(-1, 1)
    regressor = LinearRegression()
    regressor.fit(x1, y1)
    y_new = []
    y_i = []
    for i in range(12, 24):
        # .item() extracts the scalar from the (1, 1) array predict() returns
        y_new.append(regressor.predict([[i]]).item())
        y_i.append(i)
    df2[column] = y_new
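Optionally, after the loop you can label the rows of df2 with the x values you predicted for (a small addition, not strictly needed for the fix):
df2.index = y_i  # rows 12..23, matching the inputs passed to predict()
print(df2.head())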
I get ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when I run the following, even though the row counts of x and y match. I load the RCV1 dataset, get the indices of the categories with the most documents, build a list of tuples with equal numbers of randomly selected positives and negatives for each category, and then try to run a logistic regression on one of the categories.
import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse
rcv1 = sklearn.datasets.fetch_rcv1()
def get_top_cat_indices(target_matrix, num_cats):
    cat_counts = target_matrix.sum(axis=0)
    cat_counts = cat_counts.reshape((103,))
    ind_temp = np.argsort(cat_counts)[::-1].tolist()[0]
    ind = [ind_temp[i] for i in range(num_cats)]
    return ind
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1) > 0)[0], :]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1) == 0)[0], :]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size / 2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size / 2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat, :]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat, :]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[idx_cat, :]
        sampled_y_neg = temp.tocsr()[idx_nocat, :]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
ind = get_top_cat_indices(rcv1.target, 5)
train_x, train_y = rcv1.data, rcv1.target  # assumed: the question never shows where these come from
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
print(x.shape)
print(y.shape)
LogisticRegression().fit(x, y)
Could it be an issue with the sparse matrices, or a problem with dimensionality (there are 20k samples and 47k features)?
When I run your code, I get following error:
AttributeError: 'bool' object has no attribute 'any'
That's because y for LogisticRegression needs to be a 1-D numpy array. So, I changed the last line to:
LogisticRegression().fit(x, y.A.flatten())
Here .A converts the sparse matrix to a dense ndarray and flatten() makes it 1-D.
Then I get following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
This is because your sampling code has a bug: you need to subset the y array to the rows belonging to that category before applying the sampling indices. See the code below:
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1) > 0)[0]
        c2 = np.where(temp.sum(axis=1) == 0)[0]
        cat_present = x.tocsr()[c1, :]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2, :]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size / 2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size / 2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat, :]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat, :]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        # subset y to the same rows before sampling, so labels stay aligned with x
        sampled_y_pos = temp.tocsr()[c1][idx_cat, :]
        print(sampled_y_pos.nnz)
        sampled_y_neg = temp.tocsr()[c2][idx_nocat, :]
        print(sampled_y_neg.nnz)
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
Now everything works like a charm.
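As a sanity check before fitting, you can confirm that both classes are actually present in the sampled labels (a small sketch using the variables from the snippets above):
x, y = prepare_data(train_x, train_y, ind, 20000)[0]
print(np.unique(y.A.flatten(), return_counts=True))  # expect counts for both 0 and 1
LogisticRegression().fit(x, y.A.flatten())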
The following code gives me the error ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is raised on the line where the prediction is invoked. I assume there's something wrong with the shape of the dataframe obs_to_pred. I checked its shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model
# Import Titanic Data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)
# Predict Missing Age Values Based on Factors Pclass, SibSp, and Parch.
# In the function, combine train and test data.
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    allobs = allobs[~allobs.Age.isnull()]
    y = allobs.Age
    y, X = dmatrices('y ~ Pclass + SibSp + Parch', data=allobs, return_type='dataframe')
    mod = sm.OLS(y, X)
    res = mod.fit()
    predictors = ['Pclass', 'SibSp', 'Parch']
    regr = linear_model.LinearRegression()
    regr.fit(allobs.ix[:, predictors], y)
    obs_to_pred = allobs[allobs.Age.isnull()].ix[:, predictors]
    prediction = regr.predict(obs_to_pred)  # Error Produced in This Line ***
    return res.summary(), prediction
regressionPred(train,test)
In case you may want to look at the dataset, the link will take you there: https://www.kaggle.com/c/titanic/data
In the line
allobs = allobs[~allobs.Age.isnull()]
you redefine allobs as only the cases with no NaN in the Age column.
Later, with:
obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
you no longer have any cases to predict on: allobs.Age.isnull() now evaluates to False everywhere, so obs_to_pred comes out empty. Hence your error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic of what you want to predict; the rows you fit on and the rows you predict for must come from different subsets. A sketch of one way to restructure follows.
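One way to restructure the function (a minimal sketch keeping your predictors; the key change is holding the known-Age and missing-Age rows in separate variables):
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    predictors = ['Pclass', 'SibSp', 'Parch']

    # rows with a known Age are used for fitting,
    # rows with a missing Age are the ones to predict
    known = allobs[~allobs.Age.isnull()]
    unknown = allobs[allobs.Age.isnull()]

    regr = linear_model.LinearRegression()
    regr.fit(known[predictors], known.Age)
    return regr.predict(unknown[predictors])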