Trying to learn sklearn in python. But the jupyter ntbk is giving error saying "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
*But I have already defined x to be 2D array using x.values.reshape(-1,1)
You can find the CSV file and screenshot of the Error Code here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, your x has two variables, SAT and Rand 1,2,3. Which means, you need to provide a two dimensional input for your predict method. example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note, if this variable is not important, you should remove it from your x data.
This model is mapping two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements as input for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [1750, 1] would work.
Related
I am creating Accumulated Local Effect plots using Python's PyALE function. I am using a RandomForestRegression function to build the model.
I can create 1D ALE plots. However, I get a Value Error when I try to create a 2D ALE plot using the same model and training data.
Here is my code.
ale(training_data, model=model1, feature=["feature1", "feature2"])
I can plot a 1D ALE plot for feature1 and feature2 with the following code.
ale(training_data, model=model1, feature=["feature1"], feature_type="continuous")
ale(training_data, model=model1, feature=["feature2"], feature_type="continuous")
There are no missing or infinite values for any column in the data frame.
I am getting the following error with the 2D ALE plot command.
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This is a link to the function https://pypi.org/project/PyALE/#description
I am not sure why I am getting this error. I would appreciate some help on this.
Thank you,
Rohin
This issue was addressed in release v1.1.2 of the package PyALE. For those using earlier versions the workaround mentioned in the issue thread in github is to reset the index of the dataset fed to the function ale. For completeness here's a code that reproduces the error and the workaround:
from PyALE import ale
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.ensemble import RandomForestRegressor
# get the raw diamond data (from R's ggplot2)
dat_diamonds = pd.read_csv(
"https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv"
)
X = dat_diamonds.loc[:, ~dat_diamonds.columns.str.contains("price")].copy()
y = dat_diamonds.loc[:, "price"].copy()
features = ["carat","depth", "table", "x", "y", "z"]
# fit the model
model = RandomForestRegressor(random_state=1345)
model.fit(X[features], y)
# sample the data
random.seed(1234)
indices = random.sample(range(X.shape[0]), 10000)
sampleData = X.loc[indices, :]
# get the effects.....
# This throws the error
ale_eff = ale(X=sampleData[features], model=model, feature=["z", "table"], grid_size=100)
# This will work, just reset the index with drop=True
ale_eff = ale(X=sampleData[features].reset_index(drop=True), model=model, feature=["z", "table"], grid_size=100)
Sup guys, I'm new to Python and new to Neural Networks as well. I'm trying to implement a Neural Network to predict the Close price of Bitcoin in a day, based on Open price in the same day. So I get a CSV file, and I'm trying to use 'Open' column as entry, and 'Close' column as target, you can see this in the code below:
from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
dataset = pd.read_csv('BTC_USD.csv')
X = dataset['Open']
y = dataset['Close']
NeuralNetwork = MLPClassifier(verbose = True,
max_iter = 1000,
tol = 0,
activation = 'logistic')
NeuralNetwork.fit(X, y)
When I run the code I get this error:
ValueError: Expected 2D array, got 1D array instead:
array=[4.95100000e-02 4.95100000e-02 8.58400000e-02 ... 6.70745996e+03
6.66883984e+03 7.32675977e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
After this error, I did some research here in stackoverflow, and I tried some solutions proposed in other posts, like this one:
from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
dataset = pd.read_csv('BTC_USD.csv')
X = np.array(dataset[['Open']])
X = X.reshape(-1, 1)
y = np.array(dataset[['Close']])
y = y.reshape(-1, 1)
NeuralNetwork = MLPClassifier(verbose = True,
max_iter = 1000,
tol = 0,
activation = 'logistic')
NeuralNetwork.fit(X, y)
After running this code, I get this new error:
ValueError: Unknown label type: (array([4.95100000e-02, 8.58400000e-02, 8.08000000e-02, ...,
6.66883984e+03, 6.30685010e+03, 7.49379980e+03]),)
and this ''warning'' at the first line (which contains the directory):
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Could you help me please? I tried many solutions, but any of them worked.
You should use the values attribute of a data frame to get the elements of one column. In addition, what you want to achieve is a regression, not a classification, thus you must use a regressor such as MLPRegressor, following
from sklearn.neural_network import MLPRegressor
import numpy as np
import pandas as pd
dataset = pd.read_csv('BTC_USD.csv')
X = dataset["Open"].values.reshape(-1, 1)
y = dataset["Close"].values
NeuralNetwork = MLPRegressor(verbose = True,
max_iter = 1000,
tol = 0,
activation = "logistic")
NeuralNetwork.fit(X, y)
The code works now, but the results are not correct as you will need to work on the features and your network hyperparameters. But this is beyond the scope of SO.
I'm working with the Mnist data set, in order to learn about Machine learning, and as for now I'm trying to display the first digit in the Mnist data set as an image, and I have encountered a problem.
I have a matrix with the dimensions 784x10000, where each column is a digit in the data set. I have created the matrix myself, because the Mnist data set came in the form of a text file, which in itself caused me quite a lot of problems, but that's a question for itself.
The MN_train matrix below, is my large 784x10000 matrix. So what I'm trying to do below, is to fill up a 28x28 matrix, in order to display my image.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
grey = np.zeros(shape=(28,28))
k = 0
for l in range(28):
for p in range(28):
grey[p,l]=MN_train[k,0]
k = k + 1
print grey
plt.show(grey)
But when I try to display the image, I get the following error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Followed by a image plot that does not look like the number five, as I would expect.
Is there something I have overlooked, or does this tell me that my manipulation of the text file, in order to construct the MN_train matrix, has resulted in an error?
The error you get is because you supply the array to show. show accepts only a single boolean argument hold=True or False.
In order to create an image plot, you need to use imshow.
plt.imshow(grey)
plt.show() # <- no argument here
Also note that the loop is rather inefficient. You may just reshape the input column array.
The complete code would then look like
import numpy as np
import matplotlib.pyplot as plt
MN_train = np.loadtxt( ... )
grey = MN_train[:,0].reshape((28,28))
plt.imshow(grey)
plt.show()
I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know
visualize the result
make predictions based on the result?
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data as such:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print x # prints: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,) but we need it to be (10, 1), see sklearn. Same goes for y. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
Dataset
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
I post an answer that addresses exactly the error that you got:
IndexError: tuple index out of range
Scikit-learn expects 2D inputs. Just reshape the X and Y.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
make predictions based on the result?
To predict,
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
The LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
show the slope and intercept.
You really should have a look at the docs for the fit method which you can view here
For how to visualize a linear regression, play with the example here. I'm guessing you haven't used ipython (Now called jupyter) much either, so you should definitely invest some time into learning that. It's a great tool for exploring data and machine learning. You can literally copy/paste the example from scikit linear regression into an ipython notebook and run it
For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.
Per the docs,
"X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code with this
X = [[x] for x in data['c1'].values]
I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using python2.6 and scikit-learn-0.10. Should I try python3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
ValueError: X and y must have the same number of rows. means that in your case XKrn.shape[0] should be equal to yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
My original post was deleted. Thanks, Flexo.
I had the same problem, and number of rows I was passing in was the same in my X and y.
In my case, the problem was in fact that I was passing in a number of features to fit against in my output. Gaussian processes fit to a single output feature.
The "number of rows" error was misleading, and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a GP for each feature.