Reshaping error in multivariate normal function with Numpy - Python - python

I have this data (c4), I want to use 4-fold cross validation testing on this matrix.
The way that I'm splitting the data is as follows:
from scipy.stats import multivariate_normal
from sklearn.model_selection import KFold
import math
c4 = np.array([
[5,10,14,18,22,19,21,18,18,19,19,18,15,15,12,4,4,4,3,3,3,3,3,3,3,3,3,3,3,1],
[6,9,11,12,10,10,13,16,18,21,20,19,8,5,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3],
[4,8,12,17,18,21,21,21,17,16,15,13,7,8,8,7,7,4,4,4,3,3,3,3,4,4,3,3,3,2],
[3,7,12,17,19,20,22,20,20,19,19,18,17,16,16,15,14,13,12,9,4,4,4,3,3,3,3,3,2,1],
[2,5,8,10,10,11,11,10,13,17,19,20,22,22,20,16,15,15,13,11,8,3,3,3,3,3,3,3,2,1],
[4,8,10,11,10,15,15,17,18,19,18,20,18,17,15,13,12,7,4,4,4,4,4,4,4,4,3,3,3,2],
[2,8,12,15,18,20,19,20,21,21,23,19,19,16,16,16,14,12,10,7,7,7,7,6,3,3,3,3,2,1],
[2,13,17,18,21,22,20,18,18,17,17,15,13,11,8,8,4,4,4,4,4,4,4,4,4,4,4,4,3,1],
[6,6,9,14,15,18,20,20,22,20,16,16,15,11,8,8,8,5,4,4,4,4,4,4,4,5,5,5,5,4],
[8,13,16,20,20,20,19,17,17,17,17,15,14,13,10,6,3,3,3,4,4,4,3,3,4,3,3,3,2,2],
[5,9,17,18,19,18,17,16,14,13,12,12,11,10,4,4,4,3,3,3,3,3,3,3,4,4,3,3,3,3],
[4,6,8,11,16,17,18,20,16,17,16,17,17,16,14,12,12,10,9,9,8,8,6,4,3,3,3,2,2,2] ])
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(c4):
X_train, X_test = c4[train_index], c4[test_index]
X_train_mean = np.mean(X_train)
X_train_cov = np.cov(X_train.T)
v = multivariate_normal(X_train_mean, X_train_cov)
res = v.pdf(X_test)
print (res)
but it didn't work with me, despite that the splitting loop works well with small sample of data.
The error message that I got:
ValueError: cannot reshape array of size 900 into shape (1,1)
Note: the length of all rows is equal.
Thanks in advance.

You are taking the mean of entire matrix X_train when you do np.mean(X_train). What you should do is take mean across the sample axis i.e. if your features are across columns and different samples are across rows, then replace np.mean(X_train) by np.mean(X_train, axis=0). This should solve the error.
Including this line in the above code makes it work. Basically, np.mean(c4[test_index], axis=0) will given you a 1 x 30 mean vector instead of a scalar mean.
from scipy.stats import multivariate_normal as mvn
v = mvn(np.mean(c4[test_index], axis=0), X_train_cov + np.eye(30))
I had to add an identity matrix because I was getting a singular matrix error. However, that has to do with how c4 is defined and nothing to do with this code. Note that to avoid the singularity, you typically add a very small value on the diagonal and not an identity matrix. This is just for illustration.

What is multivariate_normal ? If it is from scipy.stats, then per the doc you must do
multivariate_normal.pdf(X_test, np.mean(X_train, axis=0), X_train_cov)
The doc is here.

Related

facing problem while running reg.predict in jupyter ntbk says "ValueError"

Trying to learn sklearn in python. But the jupyter ntbk is giving error saying "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
*But I have already defined x to be 2D array using x.values.reshape(-1,1)
You can find the CSV file and screenshot of the Error Code here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, your x has two variables, SAT and Rand 1,2,3. Which means, you need to provide a two dimensional input for your predict method. example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note, if this variable is not important, you should remove it from your x data.
This model is mapping two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements as input for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [1750, 1] would work.

How should I modify the test data for SVM method to be able to use the `precomputed` kernel function without error?

I am using sklearn.svm.SVR for a "regression task" which I want to use my "customized kernel method". Here is the dataset samples and the code:
index density speed label
0 14 58.844020 77.179139
1 29 67.624946 78.367394
2 44 77.679100 79.143744
3 59 79.361877 70.048869
4 74 72.529289 74.499239
.... and so on
from sklearn import svm
import pandas as pd
import numpy as np
density = np.random.randint(0,100, size=(3000, 1))
speed = np.random.randint(20,80, size=(3000, 1)) + np.random.random(size=(3000, 1))
label = np.random.randint(20,80, size=(3000, 1)) + np.random.random(size=(3000, 1))
d = np.hstack((a,b,c))
data = pd.DataFrame(d, columns=['density', 'speed', 'label'])
data.density = data.density.astype(dtype=np.int32)
def my_kernel(X,Y):
return np.dot(X,X.T)
svr = svm.SVR(kernel=my_kernel)
x = data[['density', 'speed']].iloc[:2000]
y = data['label'].iloc[:2000]
x_t = data[['density', 'speed']].iloc[2000:3000]
y_t = data['label'].iloc[2000:3000]
svr.fit(x,y)
y_preds = svr.predict(x_t)
the problem happens in the last line svm.predict which says:
X.shape[1] = 1000 should be equal to 2000, the number of samples at training time
I searched the web to find a way to deal with the problem but many questions alike (like {1}, {2}, {3}) were left unanswered.
Actually, I had used SVM methods with rbf, sigmoid, ... before and the code was working just fine but this was my first time using customized kernels and I suspected that it must be the reason why this error happened.
So after a little research and reading documentation I found out that when using precomputed kernels, the shape of the matrix for SVR.predict() must be like [n_samples_test, n_samples_train] shape.
I wonder how to modify x_test in order to get predictions and everything works just fine with no problem like when we don't use customized kernels?
If possible please describe "the reason that why the inputs for svm.predict function in precomputed kernel differentiates with the other kernels".
I really hope the unanswered questions that are related to this issue could be answered respectively.
The problem is in your kernel function, it doesn't do the job.
As the documentation https://scikit-learn.org/stable/modules/svm.html#using-python-functions-as-kernels says, "Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2, n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2)." The sample kernel on the same page satisfies this criteria:
def my_kernel(X, Y):
return np.dot(X, Y.T)
In your function the second argument of dot is X.T and thus the output will have shape (n_samples_1, n_samples_1) which is not that is expected.
The shape does not match means the test data and train data are of not equal shape, always think about matrix or array in numpy. If you are doing any arithmetic operation you always need a similar shape. That's why we check array.shape.
[n_samples_test, n_samples_train] you can modify shapes but its not best idea.
array.shape, reshape, resize
are used for that

How can I get the feature names from sklearn TruncatedSVD object?

I have the following code
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
df = df = pd.DataFrame(np.random.randn(1000, 25), index=dates, columns=list('ABCDEFGHIJKLMOPQRSTUVWXYZ'))
def reduce(dim):
svd = sklearn.decomposition.TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
return svd.fit(df)
fitted = reduce(5)
how do i get the column names from fitted?
In continuation of Mikhail post.
Assume that you already have feature_names from vectorizer.get_feature_names() and after that you have called svd.fit(X)
Now you can also extract sorted best feature names using the following code:
best_fearures = [feature_names[i] for i in svd.components_[0].argsort()[::-1]]
The above code, try to return the arguement of descending sort of svd.components_[0] and find the relative index from feature_names (all of the features) and construct the best_features array.
Then you can see for example the 10 best features:
In[21]: best_features[:10]
Out[21]:
['manag',
'develop',
'busi',
'solut',
'initi',
'enterprise',
'project',
'program',
'process',
'plan']
fitted column names would be SVD dimensions.
Each dimension is a linear combination of input features. To understand what a particular dimension mean take a look at svd.components_ array - it contains a matrix of coefficients input features are multiplied by.
Your original example, slightly changed:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
feature_names = list('ABCDEF')
df = pd.DataFrame(
np.random.randn(1000, len(feature_names)),
columns=feature_names
)
def reduce(dim):
svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
return svd.fit(df)
svd = reduce(3)
Then you can do something like that to get a more readable SVD dimension name - let's compute it for 0th dimension:
" ".join([
"%+0.3f*%s" % (coef, feat)
for coef, feat in zip(svd.components_[0], feature_names)
])
It shows +0.170*A -0.564*B -0.118*C +0.367*D +0.528*E +0.475*F - this is a "feature name" you can use for a 0th SVD dimension in this case (of course, coefficients depend on data, so feature name also depends on data).
If you have many input dimensions you may trade some "precision" with inspectability, e.g. sort coefficients and use only a few top of them. A more elaborate example can be found in https://github.com/TeamHG-Memex/eli5/pull/208 (disclaimer: I'm one of eli5 maintainers; pull request is not by me).

How to find the features names of the coefficients using scikit linear regression?

I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print model_1.coef_
print model_2.coef_
print model_3.coef_
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
#Robin posted a great answer, but for me I had to make one tweak on it to work the way I wanted, and it was to refer to the dimension of the 'coef_' np.array that I wanted, namely modifying to this: model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great but what personally worked for me was this, as the feature names I needed were the columns of my train_date dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except last one is used in training
columns = data.iloc[:,-1].columns
coef_dict = {}
for i in range(0,len(columns)):
coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coeficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64

Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)

I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know
visualize the result
make predictions based on the result?
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data as such:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print x # prints: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,) but we need it to be (10, 1), see sklearn. Same goes for y. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
Dataset
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
I post an answer that addresses exactly the error that you got:
IndexError: tuple index out of range
Scikit-learn expects 2D inputs. Just reshape the X and Y.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
make predictions based on the result?
To predict,
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
The LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
show the slope and intercept.
You really should have a look at the docs for the fit method which you can view here
For how to visualize a linear regression, play with the example here. I'm guessing you haven't used ipython (Now called jupyter) much either, so you should definitely invest some time into learning that. It's a great tool for exploring data and machine learning. You can literally copy/paste the example from scikit linear regression into an ipython notebook and run it
For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.
Per the docs,
"X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code with this
X = [[x] for x in data['c1'].values]

Categories