sklearn issue: Found arrays with inconsistent numbers of samples when doing regression - python

this question seems to have been asked before, but I can't seem to comment for further clarification on the accepted answer and I couldn't figure out the solution provided.
I am trying to learn how to use sklearn with my own data. I essentially just got the annual % change in GDP for 2 different countries over the past 100 years. I am just trying to learn using a single variable for now. What I am essentially trying to do is use sklearn to predict what the GDP % change for country A will be given the percentage change in country B's GDP.
The problem is that I receive an error saying:
ValueError: Found arrays with inconsistent numbers of samples: [ 1
107]
Here is my code:
import sklearn.linear_model as lm
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
def bytespdate2num(fmt, encoding='utf-8'):#function to convert bytes to string for the dates.
strconverter = mdates.strpdate2num(fmt)
def bytesconverter(b):
s = b.decode(encoding)
return strconverter(s)
return bytesconverter
dataCSV = open('combined_data.csv')
comb_data = []
for line in dataCSV:
comb_data.append(line)
date, chngdpchange, ausgdpchange = np.loadtxt(comb_data, delimiter=',', unpack=True, converters={0: bytespdate2num('%d/%m/%Y')})
chntrain = chngdpchange[:-1]
chntest = chngdpchange[-1:]
austrain = ausgdpchange[:-1]
austest = ausgdpchange[-1:]
regr = lm.LinearRegression()
regr.fit(chntrain, austrain)
print('Coefficients: \n', regr.coef_)
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(chntest) - austest) ** 2))
print('Variance score: %.2f' % regr.score(chntest, austest))
plt.scatter(chntest, austest, color='black')
plt.plot(chntest, regr.predict(chntest), color='blue')
plt.xticks(())
plt.yticks(())
plt.show()
What am I doing wrong? I essentially tried to apply the sklearn tutorial (They used some diabetes data set) to my own simple data. My data just contains the date, country A's % change in GDP for that specific year, and country B's % change in GDP for that same year.
I tried the solutions here and here (basically trying to find more out about the solution in the first link), but just receive the exact same error.
Here is the full traceback in case you want to see it:
Traceback (most recent call last):
File "D:\My Stuff\Dropbox\Python\Python projects\test regression\tester.py", line 34, in <module>
regr.fit(chntrain, austrain)
File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\linear_model\base.py", line 376, in fit
y_numeric=True, multi_output=True)
File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 454, in check_X_y
check_consistent_length(X, y)
File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 174, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]

In fit(X,y),the input parameter X is supposed to be a 2-D array. But if X in your data is only one-dimension, you can just reshape it into a 2-D array like this:regr.fit(chntrain_X.reshape(len(chntrain_X), 1), chntrain_Y)

regr.fit(chntrain, austrain)
This doesn't look right. The first parameter to fit should be an X, which refers to a feature vector. The second parameter should be a y, which is the correct answers (targets) vector associated with X.
For example, if you have GDP, you might have:
X[0] = [43, 23, 52] -> y[0] = 5
# meaning the first year had the features [43, 23, 52] (I just made them up)
# and the change that year was 5
Judging by your names, both chntrain and austrain are feature vectors. Judging by how you load your data, maybe the last column is the target?
Maybe you need to do something like:
chntrain_X, chntrain_y = chntrain[:, :-1], chntrain[:, -1]
# you can do the same with austrain and concatenate them or test on them if this part works
regr.fit(chntrain_X, chntrain_y)
But we can't tell without knowing the exact storage format of your data.

Try changing chntrain to a 2-D array instead of 1-D, i.e. reshape to (len(chntrain), 1).
For prediction, also change chntest to a 2-D array.

I have been having similar problems to you and have found a solution.
Where you have the following error:
ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]
The [ 1 107] part is basically saying that your array is the wrong way around. Sklearn thinks you have 107 columns of data with 1 row.
To fix this try transposing the X data like so:
chntrain.T
The re-run your fit:
regr.fit(chntrain, austrain)
Depending on what your "austrain" data looks like you may need to transpose this too.

You may use np.newaxis as well. The example can be X = X[:, np.newaxis]. I found the method at Logistic function

Related

facing problem while running reg.predict in jupyter ntbk says "ValueError"

Trying to learn sklearn in python. But the jupyter ntbk is giving error saying "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
*But I have already defined x to be 2D array using x.values.reshape(-1,1)
You can find the CSV file and screenshot of the Error Code here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, your x has two variables, SAT and Rand 1,2,3. Which means, you need to provide a two dimensional input for your predict method. example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note, if this variable is not important, you should remove it from your x data.
This model is mapping two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements as input for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [1750, 1] would work.

ValueError: matmul when trying to fit sklearn's linear regressor to pandas dataframe instanses

I've been trying to perform a simple multivariate linear regression on some dummy data using sklearn. I initially passed sklearn.linear_model.LinearRegression.fit numpy arrays and kept getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
which I thought was due to some mistake with the transposition of my arrays or something, so I pulled up a tutorial that used pandas dataframes and set out my code in the same way:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
VWC = np.array((0,0.2,0.4,0.6,0.8,1))
Sensor_Voltage = np.array((515,330,275,250,245,240))
X = np.column_stack((VWC,VWC*VWC))
df = pd.DataFrame(X,columns=["VWC","VWC2"])
target = pd.DataFrame(Sensor_Voltage,columns=["Volt"])
model = LinearRegression()
model.fit(df,target["Volt"])
x = np.linspace(0,1,30)
y = model.predict(x[:,np.newaxis])
plt.plot(VWC, Sensor_Voltage)
plt.plot(x,y,dashes=(3,1))
plt.title("Simple Linear Regression")
plt.xlabel("Volumetric Water Content")
plt.ylabel("Sensor response (4.9mV)")
plt.show()
And I still get the following traceback:
Traceback (most recent call last):
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\simple_linear_regression.py", line 16, in <module>
y = model.predict(x[:,np.newaxis])
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_base.py", line 225, in predict
return self._decision_function(X)
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_base.py", line 209, in _decision_function
dense_output=True) + self.intercept_
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\utils\extmath.py", line 151, in safe_sparse_dot
ret = a # b
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
I have been banging my head against this for hours now and I just don't understand what I am doing wrong.
Scikit-learn, numpy, and pandas are all the latest versions; this is on python 3.7.3
SOLVED: I am very silly and misunderstood how np.newaxis worked. The goal here was to fit a quadratic to the data, so I just needed to change:
x = np.linspace(0,1,30)
y = model.predict(x[:,np.newaxis])
to
x = np.columnstack([np.linspace(0,1,30),np.linspace(0,1,30)**2])
y = model.predict(x)
I am sure there is a more elegant way to write that but eh.
You train your model using shape of (6,2) dataset.if you check shape of df
df.shape = (6,2).
And when you try to predict you are trying with different shape of dataset.
x.shape=(30,1)
what you need is to use the correct shape of dataset.Try this
x = np.linspace((0,0),(1,1),30)
y = model.predict(x)
I also ran into this error when using sklearn and LinearRegression, turns out the issue was that I was passing the Y variable to the LinearRegression object in the first position, and the X variables in the second position. But you actually pass the X variables first, and then Y variables - the opposite of the order you use in R with R's lm().
Hopefully this helps someone out someday.

Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)

I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know
visualize the result
make predictions based on the result?
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data as such:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print x # prints: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,) but we need it to be (10, 1), see sklearn. Same goes for y. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
Dataset
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
I post an answer that addresses exactly the error that you got:
IndexError: tuple index out of range
Scikit-learn expects 2D inputs. Just reshape the X and Y.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
make predictions based on the result?
To predict,
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
The LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
show the slope and intercept.
You really should have a look at the docs for the fit method which you can view here
For how to visualize a linear regression, play with the example here. I'm guessing you haven't used ipython (Now called jupyter) much either, so you should definitely invest some time into learning that. It's a great tool for exploring data and machine learning. You can literally copy/paste the example from scikit linear regression into an ipython notebook and run it
For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.
Per the docs,
"X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code with this
X = [[x] for x in data['c1'].values]

Having problems with dimensions in machine learning ( Python Scikit )

I am a bit new to applying machine learning, so I was trying to teach myself how to do linear regression with any kind of data on mldata.org and in the Python scikit package. I tested out the linear regression example code (http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) and the code worked well with the diabetes dataset. However, I tried to use the code with other datasets, such as one about earthquakes on mldata (http://mldata.org/repository/data/viewslug/global-earthquakes/). However, I was not able to do so due to the dimension problems on there.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 55
warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 65
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
Traceback (most recent call last):
File "/home/anthony/Documents/Programming/Python/Machine Learning/Scikit/earthquake_linear_regression.py", line 38, in <module>
regr.fit(earthquake_X_train, earthquake_y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 371, in fit
linalg.lstsq(X, y)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 518, in lstsq
raise ValueError('incompatible dimensions')
ValueError: incompatible dimensions
How do I set up the dimensions of the data?
Size of the data:
earthquake_X.shape
(59209, 1, 4)
earthquake_X_train.shape
(59189, 1)
earthquake_y_test.shape
(3, 59209)
earthquake.target.shape
(3, 59209)
The code:
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
#Experimenting with earthquake data
from sklearn.datasets.mldata import fetch_mldata
import tempfile
test_data_home = tempfile.mkdtemp()
# Load the diabetes dataset
earthquake = fetch_mldata('Global Earthquakes', data_home = test_data_home)
# Use only one feature
earthquake_X = earthquake.data[:, np.newaxis]
earthquake_X_temp = earthquake_X[:, :, 2]
# Split the data into training/testing sets
earthquake_X_train = earthquake_X_temp[:-20]
earthquake_X_test = earthquake_X_temp[-20:]
# Split the targets into training/testing sets
earthquake_y_train = earthquake.target[:-20]
earthquake_y_test = earthquake.target[-20:]
print "Splitting of data for preformance check completed"
# Create linear regression object
regr = linear_model.LinearRegression()
print "Created linear regression object"
# Train the model using the training sets
regr.fit(earthquake_X_train, earthquake_y_train)
print "Dataset trained"
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(earthquake_X_test) - earthquake_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(earthquake_X_test, earthquake_y_test))
# Plot outputs
plt.scatter(earthquake_X_test, earthquake_y_test, color='black')
plt.plot(earthquake_X_test, regr.predict(earthquake_X_test), color='blue',
linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Your array of targets (earthquake_y_train) is of wrong shape. Moreover actually it's empty.
When you do
earthquake_y_train = earthquake.target[:-20]
you select all rows but last 20 among first axis. And, according to the data you posted, earthquake.target has shape (3, 59209), so there are no rows to select!
But even if there were any, it'd be still an error. Why? Because first dimensions of X and y must be the same. According to the sklearn's documentation, LinearRegression's fit expects X to be of shape [n_samples, n_features] and y — [n_samples, n_targets].
In order to fix it change definitions of ys to the following:
earthquake_y_train = earthquake.target[:, :-20].T
earthquake_y_test = earthquake.target[:, -20:].T
P.S. Even if you fix all these problem there's still a problem in your script: plt.scatter can't work with "multidimensional" ys.

sklearn.gaussian_process fit() not working with array sizes greater than 100

I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using python2.6 and scikit-learn-0.10. Should I try python3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
ValueError: X and y must have the same number of rows. means that in your case XKrn.shape[0] should be equal to yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
My original post was deleted. Thanks, Flexo.
I had the same problem, and number of rows I was passing in was the same in my X and y.
In my case, the problem was in fact that I was passing in a number of features to fit against in my output. Gaussian processes fit to a single output feature.
The "number of rows" error was misleading, and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a GP for each feature.

Categories