LinearDiscriminantAnalysis - Single column output from .transform(X) - python

I have been successfully playing around with replicating one of the sklearn tutorials using the iris dataset in PyCharm using Python 2.7. However, when trying to repeat this with my own data I have been encountering an issue. I have been importing data from a .csv file using 'np.genfromtxt', but for some reason I keep getting a single column output for X_r2 (see below), when I should get a 2 column output. I have therefore replaced my data with some randomly generated variables to post onto SO, and I am still getting the same issue.
I have included the 'problem' code below, and I would be interested to know what I have done wrong. I have extensively used the debugging features in PyCharm to check that the type and shape of my variables are similar to the original sklearn example, but it did not help me with the problem. Any help or suggestions would be appreciated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(2, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

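The array y in the example you posted has values 0, 1 and 2, while yours only has values 0 and 1. LDA can produce at most min(n_classes - 1, n_features) components, so two classes give a single column. With three classes you get the two columns you expect: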
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(3, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
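To see where the single column comes from, here is a minimal sketch on synthetic data (the array names and sizes are made up for illustration). It leaves n_components at its default so that LDA picks the maximum itself, which is min(n_classes - 1, n_features). Newer scikit-learn releases appear to be stricter about asking for more components than that, so requesting n_components=2 with only two classes may raise an error there instead of quietly returning one column.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.integers(1, 1000, size=(500, 6)).astype(float)

# Labels built with a modulo so that every class is guaranteed to be present.
y_two = np.arange(500) % 2     # classes 0 and 1
y_three = np.arange(500) % 3   # classes 0, 1 and 2

# n_components is left at None, so LDA uses min(n_classes - 1, n_features).
shape_two = LinearDiscriminantAnalysis().fit(X, y_two).transform(X).shape
shape_three = LinearDiscriminantAnalysis().fit(X, y_three).transform(X).shape

print(shape_two)    # one discriminant axis for two classes
print(shape_three)  # two discriminant axes for three classes
```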

Related

Facing a "ValueError" when running reg.predict in a Jupyter notebook

I am trying to learn sklearn in Python, but the Jupyter notebook raises the following error: "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
But I have already defined x to be a 2D array using x.values.reshape(-1,1).
You can find the CSV file and screenshot of the Error Code here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, your x has two variables, SAT and Rand 1,2,3, which means you need to provide a two-dimensional input with two values per row to the predict method. For example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note, if this variable is not important, you should remove it from your x data.
This model is mapping two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements as input for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [1750, 1] would work.
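The shape requirement can be checked on a small stand-in for the CSV. This is only a sketch: the synthetic columns and the made-up GPA formula below are assumptions, not the data from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the CSV: the columns play the role of SAT and "Rand 1,2,3".
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=80)
rand = rng.integers(1, 4, size=80)
x = np.column_stack([sat, rand]).astype(float)
y = 0.0017 * sat + rng.normal(scale=0.2, size=80)  # made-up GPA values

reg = LinearRegression().fit(x, y)

# One sample with two features: shape (1, 2), i.e. a list containing one row.
new_sample = np.array([1750, 1], dtype=float).reshape(1, -1)
prediction = reg.predict(new_sample)
print(new_sample.shape, prediction.shape)
```

Passing a bare 1750 fails because it has no row or column structure at all, while a (1, 2) array tells the model it is one sample with both features filled in.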

Output of a statsmodels regression

I would like to perform a simple linear regression using statsmodels. I've tried several different methods by now, but I just can't get it to work. The code that I have constructed now doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the five lags of the returns. df13 contains the lags and the direction for each observed date. I tried this code and, as I mentioned, it doesn't give an error but only prints:
Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance etc.
Also, what would you say, since Direction is a binary variable may it be better to use a logit instead of a linear model? However, in the assignment it appeared as a linear model.
And lastly, I am sorry it's not displayed here correctly, but I don't know how to format text as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your dataframes to numpy arrays. You can do this with
Xnum = X.to_numpy()
ynum = Y.to_numpy()
And try passing those to the regressors.

using dask Parallel post fit on sklearn predictors (ParallelPostFit wrapper)

I am trying to evaluate an sklearn predictor which I have made over a larger-than-memory dask array of inputs. I have read over the parallel post fit documentation https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.ParallelPostFit.html and am still having some problems. The following code illustrates the kind of issue that I am running into:
from dask.base import tokenize
import numpy as np
import dask.array as da
from dask.array import Array
from sklearn.linear_model import LinearRegression
from dask_ml.wrappers import ParallelPostFit
"""
for stack overflow question
"""
x = np.linspace(0,100,100,dtype=np.int32)
y = np.linspace(0,100,100,dtype=np.int32)
z = np.linspace(0,100,100,dtype=np.int32)
Y = np.random.normal(size=(100,))
X = np.stack([x,y,z],axis=1)
reg = LinearRegression().fit(X,Y)
#now try to compute on dask arrays over the whole space
x= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
y= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
z= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
x,y,z = da.meshgrid(x,y,z,sparse=False,indexing='ij')
stacked = da.stack([x.flatten(),y.flatten(),z.flatten()],axis=1)
clf = ParallelPostFit(estimator=reg)
clf.predict(stacked)
Executing clf.predict throws a ValueError: "Can't drop an axis with more than 1 block. Please use atop instead.", which I don't understand how to correct.
Thank you for any help.

Shape error when using PolynomialFeatures

The Issue
To begin with, I'm pretty new to machine learning. I have decided to test out some of the things that I have learned on some financial data. My machine learning model looks like this:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.read_csv("/Users/Documents/Trading.csv")
poly_features = PolynomialFeatures(degree=2, include_bias=False)
linear_reg = LinearRegression(fit_intercept = True)
X = df_copy[["open","volume", "base volume", "RSI_14"]]
X_poly = poly_features.fit_transform(X)[1]
y = df_copy[["high"]]
linear_reg.fit(X_poly, y)
x = linear_reg.predict([[1.905E-05, 18637.07503453,0.35522205, 69.95820948552947]])
print(x)
Everything works great until the moment I try to implement PolynomialFeatures, which gives me the following error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Attempts to solve the issue:
Attempt 1
I've tried adding .values to X but the same error still comes up:
X_poly = poly_features.fit_transform(X.values)[1]
Attempt 2
I tried solving this problem by adding reshape(-1, 1) at the end of X_poly:
X_poly = poly_features.fit_transform(X)[1].reshape(-1, 1)
but it just replaces the previous error with this one:
ValueError: Found input variables with inconsistent numbers of samples: [14, 5696]
Thank you very much in advance for your help.
It wants you to transform your input. Try using X_poly = poly_features.fit_transform(X.values.reshape(1,-1))[1]
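Another way to read the two errors: the [1] index keeps only a single row of the transformed matrix, which is a 1-D array of 14 polynomial features (4 inputs at degree 2 without bias give 14 columns, matching the 14 in the second error message). A hedged sketch of the alternative, on synthetic data rather than the trading CSV, keeps every row of X_poly and sends the prediction input through the same fitted transformer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the four input columns and the target.
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 0.5

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)   # no [1]: one row per sample is kept

linear_reg = LinearRegression(fit_intercept=True).fit(X_poly, y)

# The new sample must get the same polynomial expansion before predict().
new_row = np.array([[0.1, 0.2, 0.3, 0.4]])
new_poly = poly_features.transform(new_row)
pred = linear_reg.predict(new_poly)
print(X_poly.shape, new_poly.shape, pred.shape)
```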

Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)

I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know how to visualize the result and how to make predictions based on it.
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data as such:
import numpy as np
import pandas as pd  # needed for pd.read_csv below
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print(x) # prints: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,) but we need it to be (10, 1), because scikit-learn expects X with shape (n_samples, n_features). y is accepted as 1-D, but we reshape it too for consistency. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
Dataset
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
I am posting an answer that addresses exactly the error that you got:
IndexError: tuple index out of range
Scikit-learn expects 2D inputs. Just reshape the X and Y.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
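A quick way to confirm the shapes before fitting, sketched with made-up arrays in place of the CSV columns (c1 and c2 here are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

c1 = np.arange(10, dtype=float)   # shape (10,), like data['c1'].values
c2 = 2.0 * c1 + 1.0               # shape (10,)

X = c1.reshape(-1, 1)             # shape (10, 1): 10 samples, 1 feature
model = LinearRegression().fit(X, c2)   # a 1-D target is accepted as well

print(c1.shape, X.shape)
```

The -1 lets NumPy infer the number of rows, so the same line works for any column length.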
make predictions based on the result?
To predict,
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
The LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
show the slope and intercept.
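As a small illustration of how those attributes relate to predict, here is a sketch on synthetic data with a known slope of 3 and intercept of 2 (the numbers are arbitrary). With a 2-D target as in the code above, coef_ and intercept_ come back as arrays, so they are indexed here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
Y = 3.0 * X + 2.0                 # 2-D target of shape (10, 1), as in the answer

lr = LinearRegression().fit(X, Y)
slope = lr.coef_[0, 0]            # coef_ has shape (n_targets, n_features) here
intercept = lr.intercept_[0]      # intercept_ has shape (n_targets,) here

# A prediction is just slope * x + intercept.
x_new = np.array([[12.0]])
by_hand = slope * x_new + intercept
by_model = lr.predict(x_new)
print(slope, intercept, by_model)
```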
You really should have a look at the docs for the fit method which you can view here
For how to visualize a linear regression, play with the example here. I'm guessing you haven't used IPython (now called Jupyter) much either, so you should definitely invest some time into learning that. It's a great tool for exploring data and machine learning. You can literally copy/paste the example from scikit-learn's linear regression page into a notebook and run it.
For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.
Per the docs,
"X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code with this
X = [[x] for x in data['c1'].values]
