I am fairly new to applied machine learning, so I have been teaching myself linear regression using datasets from mldata.org and the Python scikit-learn package. I tested the linear regression example code (http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) and it worked well with the diabetes dataset. However, when I tried to use the code with other datasets, such as one about earthquakes on mldata (http://mldata.org/repository/data/viewslug/global-earthquakes/), I was not able to, due to dimension problems.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 55
warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 65
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
Traceback (most recent call last):
File "/home/anthony/Documents/Programming/Python/Machine Learning/Scikit/earthquake_linear_regression.py", line 38, in <module>
regr.fit(earthquake_X_train, earthquake_y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 371, in fit
linalg.lstsq(X, y)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 518, in lstsq
raise ValueError('incompatible dimensions')
ValueError: incompatible dimensions
How do I set up the dimensions of the data?
Size of the data:
earthquake_X.shape
(59209, 1, 4)
earthquake_X_train.shape
(59189, 1)
earthquake_y_test.shape
(3, 59209)
earthquake.target.shape
(3, 59209)
The code:
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
#Experimenting with earthquake data
from sklearn.datasets.mldata import fetch_mldata
import tempfile
test_data_home = tempfile.mkdtemp()
# Load the earthquake dataset
earthquake = fetch_mldata('Global Earthquakes', data_home = test_data_home)
# Use only one feature
earthquake_X = earthquake.data[:, np.newaxis]
earthquake_X_temp = earthquake_X[:, :, 2]
# Split the data into training/testing sets
earthquake_X_train = earthquake_X_temp[:-20]
earthquake_X_test = earthquake_X_temp[-20:]
# Split the targets into training/testing sets
earthquake_y_train = earthquake.target[:-20]
earthquake_y_test = earthquake.target[-20:]
print "Splitting of data for preformance check completed"
# Create linear regression object
regr = linear_model.LinearRegression()
print "Created linear regression object"
# Train the model using the training sets
regr.fit(earthquake_X_train, earthquake_y_train)
print "Dataset trained"
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(earthquake_X_test) - earthquake_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(earthquake_X_test, earthquake_y_test))
# Plot outputs
plt.scatter(earthquake_X_test, earthquake_y_test, color='black')
plt.plot(earthquake_X_test, regr.predict(earthquake_X_test), color='blue',
linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Your array of targets (earthquake_y_train) has the wrong shape. In fact, it's empty.
When you do
earthquake_y_train = earthquake.target[:-20]
you select all rows but the last 20 along the first axis. According to the data you posted, earthquake.target has shape (3, 59209), so there are no rows left to select!
But even if there were, it would still be an error. Why? Because the first dimensions of X and y must be the same. According to sklearn's documentation, LinearRegression's fit expects X to be of shape [n_samples, n_features] and y of shape [n_samples, n_targets].
To fix it, change the definitions of the ys to the following:
earthquake_y_train = earthquake.target[:, :-20].T
earthquake_y_test = earthquake.target[:, -20:].T
P.S. Even if you fix all of these problems, there's still an issue in your script: plt.scatter can't work with "multidimensional" ys.
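For example, here is a minimal sketch of a plot that would still work after the fix above, assuming you only want to visualize one of the three target columns (the choice of column 0 is purely illustrative):
# Plot a single target column against the single feature.
plt.scatter(earthquake_X_test, earthquake_y_test[:, 0], color='black')
# predict() returns one column per target; take the matching column.
plt.plot(earthquake_X_test, regr.predict(earthquake_X_test)[:, 0], color='blue', linewidth=3)
plt.show()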
Related
I have recently been facing a problem where I believe a multiple-output GP might be a good candidate. At the moment I am applying a single-output GP to my data, and as the dimensionality increases, my results keep getting worse. I tried multiple outputs with sklearn and was able to get better results for higher dimensions, but I believe GPy is more complete for such tasks and would give me more control over the model. For the single-output GP I was setting the kernel as follows:
kernel = GPy.kern.RBF(input_dim=4, variance=1.0, lengthscale=1.0, ARD = True)
m = GPy.models.GPRegression(X, Y_single_output, kernel = kernel, normalizer = True)
m.optimize_restarts(num_restarts=10)
In the example above, X has shape (20, 4) and Y has shape (20, 1).
The multiple-output implementation that I am using comes from
Introduction to Multiple Output Gaussian Processes
I prepared the data according to that example, setting X_mult_output to shape (80, 2), with the second column holding the output indices, and rearranging Y to shape (80, 1).
kernel = GPy.kern.RBF(1,lengthscale=1, ARD = True)**GPy.kern.Coregionalize(input_dim=1,output_dim=4, rank=1)
m = GPy.models.GPRegression(X_mult_output,Y_mult_output, kernel = kernel, normalizer = True)
OK, everything seems to work so far. Now I want to predict values. The problem is that I don't seem to be able to. From what I understood, you can predict a single output by specifying the output index in the Y_metadata argument.
As I have 4 inputs, I set up the array that I want to predict as follows:
x_pred = np.array([3,2,2,4])
Then, I imagined I would have to predict each value of my x_pred array separately, as shown in Coregionalized Regression Model (vector-valued regression):
Y_metadata1 = {'output_index': np.array([[0]])}
y1_pred = m.predict(np.array(x[0]).reshape(1,-1),Y_metadata=Y_metadata1)
The problem is that I keep getting the following error:
IndexError: index 1 is out of bounds for axis 1 with size 1
Any suggestions on how to overcome this problem, or is there a mistake in my implementation?
Traceback:
Traceback (most recent call last):
File "<ipython-input-9-edb25bc29817>", line 36, in <module>
y1_pred = m.predict(np.array(x[0]).reshape(1,-1),Y_metadata=Y_metadata1)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\core\gp.py", line 335, in predict
mean, var = self._raw_predict(Xnew, full_cov=full_cov, kern=kern)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\core\gp.py", line 292, in _raw_predict
mu, var = self.posterior._raw_predict(kern=self.kern if kern is None else kern, Xnew=Xnew, pred_var=self._predictive_variable, full_cov=full_cov)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\inference\latent_function_inference\posterior.py", line 276, in _raw_predict
Kx = kern.K(pred_var, Xnew)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\kern\src\kernel_slice_operations.py", line 109, in wrap
with _Slice_wrap(self, X, X2) as s:
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\kern\src\kernel_slice_operations.py", line 65, in __init__
self.X2 = self.k._slice_X(X2) if X2 is not None else X2
File "<decorator-gen-140>", line 2, in _slice_X
File "C:\Users\johndoe\AppData\Roaming\Python\Python37\site-packages\paramz\caching.py", line 283, in g
return cacher(*args, **kw)
File "C:\Users\johndoe\AppData\Roaming\Python\Python37\site-packages\paramz\caching.py", line 172, in __call__
return self.operation(*args, **kw)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\kern\src\kern.py", line 117, in _slice_X
return X[:, self._all_dims_active]
IndexError: index 1 is out of bounds for axis 1 with size 1
Problem
You defined the kernel with X of dimension (-1, 4) and Y of dimension (-1, 1), but you are giving it an X_pred of dimension (1, 1) (the first element of x_pred reshaped to (1, 1)).
Solution
Give the whole x_pred to the model for prediction (an input with dimension (-1, 4)):
Y_metadata1 = {'output_index': np.array([[0]])}
y1_pred = m.predict(np.array(x_pred).reshape(1,-1), Y_metadata=Y_metadata1)
DIY
Before executing your code all together, try running the pieces separately so you can debug them easily; then you can make your code small and clean.
The example below is the debugging code for your problem:
Y_metadata1 = {'output_index': np.array([[0]])}
a = np.array(x_pred[0]).reshape(1,-1)
print(a.shape)
y1_pred = m.predict(a,Y_metadata=Y_metadata1)
The output is (1, 1), followed by the error, which makes it obvious that the error comes from the input dimension.
Reading errors also helps. Your error says there is a problem in kern.K(pred_var, Xnew), so the error probably comes from the kernel. It then says it's from X[:, self._all_dims_active], so the error probably comes from the X dimensions. From there, a little experimenting with the x dimensions gives you the idea.
Hopefully this helps, even seven days later!
I've been trying to perform a simple multivariate linear regression on some dummy data using sklearn. I initially passed sklearn.linear_model.LinearRegression.fit numpy arrays and kept getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
which I thought was due to some mistake with the transposition of my arrays or something, so I pulled up a tutorial that used pandas dataframes and set out my code in the same way:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
VWC = np.array((0,0.2,0.4,0.6,0.8,1))
Sensor_Voltage = np.array((515,330,275,250,245,240))
X = np.column_stack((VWC,VWC*VWC))
df = pd.DataFrame(X,columns=["VWC","VWC2"])
target = pd.DataFrame(Sensor_Voltage,columns=["Volt"])
model = LinearRegression()
model.fit(df,target["Volt"])
x = np.linspace(0,1,30)
y = model.predict(x[:,np.newaxis])
plt.plot(VWC, Sensor_Voltage)
plt.plot(x,y,dashes=(3,1))
plt.title("Simple Linear Regression")
plt.xlabel("Volumetric Water Content")
plt.ylabel("Sensor response (4.9mV)")
plt.show()
And I still get the following traceback:
Traceback (most recent call last):
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\simple_linear_regression.py", line 16, in <module>
y = model.predict(x[:,np.newaxis])
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_base.py", line 225, in predict
return self._decision_function(X)
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_base.py", line 209, in _decision_function
dense_output=True) + self.intercept_
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\utils\extmath.py", line 151, in safe_sparse_dot
ret = a # b
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
I have been banging my head against this for hours now and I just don't understand what I am doing wrong.
Scikit-learn, numpy, and pandas are all the latest versions; this is on python 3.7.3
SOLVED: I am very silly and misunderstood how np.newaxis worked. The goal here was to fit a quadratic to the data, so I just needed to change:
x = np.linspace(0,1,30)
y = model.predict(x[:,np.newaxis])
to
x = np.column_stack([np.linspace(0,1,30),np.linspace(0,1,30)**2])
y = model.predict(x)
I am sure there is a more elegant way to write that but eh.
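For the record, one arguably more elegant sketch uses sklearn's PolynomialFeatures in a pipeline so the squared column is generated automatically (this is a suggested alternative, not part of the original fix):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# The pipeline expands VWC into [VWC, VWC**2] before the regression step.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(VWC[:, np.newaxis], Sensor_Voltage)
y = model.predict(np.linspace(0, 1, 30)[:, np.newaxis])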
You trained your model on a dataset of shape (6, 2); if you check the shape of df, you get
df.shape = (6, 2)
But when you try to predict, you pass a dataset with a different shape:
x.shape = (30, 1)
What you need is an input with the same number of columns as the training set. Try this:
x = np.column_stack((np.linspace(0,1,30), np.linspace(0,1,30)**2))  # second column squared to match the VWC2 training feature
y = model.predict(x)
I also ran into this error when using sklearn's LinearRegression. It turned out that I was passing the Y variable to the LinearRegression object in the first position and the X variables in the second position. But you actually pass the X variables first and then the Y variable - the opposite of the order you use with R's lm().
Hopefully this helps someone out someday.
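To make the order concrete, a minimal sketch with made-up data:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1.0], [2.0], [3.0]])  # made-up feature matrix, shape (3, 1)
y = np.array([2.0, 4.0, 6.0])        # made-up targets, shape (3,)
model = LinearRegression()
model.fit(X, y)  # features first, target second - the reverse of R's lm(y ~ x)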
I have this plot
Now I want to add a trend line to it, how do I do that?
The data looks like this:
I wanted to just plot how the median listing price in California has gone up over the years so I did this:
# Get California data
state_ca = []
state_median_price = []
state_ca_month = []
for state, price, date in zip(data['ZipName'], data['Median Listing Price'], data['Month']):
if ", CA" not in state:
continue
else:
state_ca.append(state)
state_median_price.append(price)
state_ca_month.append(date)
Then I converted the string state_ca_month to datetime:
# Convert state_ca_month to datetime
state_ca_month = [datetime.strptime(x, '%m/%d/%Y %H:%M') for x in state_ca_month]
Then plotted it
# Plot trends
figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(state_ca_month, state_median_price)
plt.show()
I thought of adding a trend line or some type of line, but I am new to visualization. If anyone has any other suggestions, I would appreciate it.
Following the advice in the comments I get this scatter plot
I am wondering if I should further format the data to make a clearer plot to examine.
If by "trend line" you mean a literal line, then you probably want to fit a linear regression to your data. sklearn provides this functionality in python.
From the example hyperlinked above:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
To clarify, "the overall trend" is not a well-defined thing. Many times, by "trend", people mean a literal line that "fits" the data well. By "fits the data", in turn, we mean "predicts the data." Thus, the most common way to get a trend line is to pick a line that best predicts the data that you have observed. As it turns out, we even need to be clear about what we mean by "predicts". One way to do this (and a very common one) is by defining "best predicts" in such a way as to minimize the sum of the squares of all of the errors between the "trend line" and the observed data. This is called ordinary least squares linear regression, and is one of the simplest ways to obtain a "trend line". This is the algorithm implemented in sklearn.linear_model.LinearRegression.
I want to train a random forest on a bunch of matrices (see the first link below for an example). I want to classify them as either "g" or "b" (good or bad; a or b, 1 or 0, it doesn't matter).
I've called the script randfore.py. I am currently using 10 examples, but I will be using a much bigger data set once I actually get this up and running.
Here is the code:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
working_dir = os.getcwd() # Grabs the working directory
directory = working_dir+"/fakesourcestuff/" ## The actual directory where the files are located
sources = list() # Just sets up a list here which is going to become the input for the random forest
for i in range(10):
cutoutfile = pd.read_csv(directory+ "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) +".dat", dtype=object) ## Where we get the input data for the random forest from
sources.append(cutoutfile) # add it to our sources list
targets = pd.read_csv(directory + "faketargets.dat",sep='\n',header=None, dtype=object) # Reads in our target data... either "g" or "b" (Good or bad)
sources = pd.DataFrame(sources) ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary?
# Training sets
X_train = sources[:8] # Inputs
y_train = targets[:8] # Targets
# Random Forest
rf = RandomForestClassifier(n_estimators=10)
rf_fit = rf.fit(X_train, y_train)
Here is the current error output:
Traceback (most recent call last):
File "randfore.py", line 31, in <module>
rf_fit = rf.fit(X_train, y_train)
File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
I tried making dtype = object, but it hasn't helped. I'm just not sure what sort of manipulation I need to perform to make this work.
I think the problem is that the files I'm appending to sources aren't just numbers but a mix of numbers, commas, and various square brackets (each is basically a big matrix). Is there a natural way to import this? The square brackets in particular are probably an issue.
Before I converted sources to a DataFrame I was getting the following error:
ValueError: cannot copy sequence with size 99 to array axis with dimension 1
This is due to the dimensions of my input (100 lines long) versus my target, which has 10 rows and 1 column.
Here is the contents of the first file that's read into cutouts (they're all the exact same style) to be used as the input:
https://pastebin.com/tkysqmVu
And here is the contents of faketargets.dat, the targets:
https://pastebin.com/632RBqWc
Any ideas? Help greatly appreciated. I am sure there is a lot of fundamental confusion going on here.
Try writing:
X_train = sources.values[:8] # Inputs
y_train = targets.values[:8] # Targets
I hope this will solve your problem!
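If the deeper issue is that each cutout is a whole 2-D matrix rather than a single feature vector, here is one minimal sketch (assuming every cutout has the same dimensions and parses as numeric; cutout_list stands for the original Python list of DataFrames, before the pd.DataFrame(sources) conversion):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Flatten each matrix into one row of features: X gets shape (n_matrices, n_cells).
X = np.array([cutout.values.astype(float).ravel() for cutout in cutout_list])
y = targets.values.ravel()
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X[:8], y[:8])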
This question seems to have been asked before, but I can't comment for further clarification on the accepted answer, and I couldn't figure out the solution provided.
I am trying to learn how to use sklearn with my own data: the annual % change in GDP for 2 different countries over the past 100 years. For now I am just trying to learn with a single variable. What I am essentially trying to do is use sklearn to predict what the GDP % change for country A will be, given the % change in country B's GDP.
The problem is that I receive an error saying:
ValueError: Found arrays with inconsistent numbers of samples: [  1 107]
Here is my code:
import sklearn.linear_model as lm
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
def bytespdate2num(fmt, encoding='utf-8'):  # function to convert bytes to strings for the dates
strconverter = mdates.strpdate2num(fmt)
def bytesconverter(b):
s = b.decode(encoding)
return strconverter(s)
return bytesconverter
dataCSV = open('combined_data.csv')
comb_data = []
for line in dataCSV:
comb_data.append(line)
date, chngdpchange, ausgdpchange = np.loadtxt(comb_data, delimiter=',', unpack=True, converters={0: bytespdate2num('%d/%m/%Y')})
chntrain = chngdpchange[:-1]
chntest = chngdpchange[-1:]
austrain = ausgdpchange[:-1]
austest = ausgdpchange[-1:]
regr = lm.LinearRegression()
regr.fit(chntrain, austrain)
print('Coefficients: \n', regr.coef_)
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(chntest) - austest) ** 2))
print('Variance score: %.2f' % regr.score(chntest, austest))
plt.scatter(chntest, austest, color='black')
plt.plot(chntest, regr.predict(chntest), color='blue')
plt.xticks(())
plt.yticks(())
plt.show()
What am I doing wrong? I essentially tried to apply the sklearn tutorial (which used a diabetes dataset) to my own simple data. My data just contains the date, country A's % change in GDP for that year, and country B's % change in GDP for the same year.
I tried the solutions here and here (basically trying to find out more about the solution in the first link), but I just receive the exact same error.
Here is the full traceback in case you want to see it:
Traceback (most recent call last):
File "D:\My Stuff\Dropbox\Python\Python projects\test regression\tester.py", line 34, in <module>
regr.fit(chntrain, austrain)
File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\linear_model\base.py", line 376, in fit
y_numeric=True, multi_output=True)
File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 454, in check_X_y
check_consistent_length(X, y)
File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 174, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]
In fit(X, y), the input parameter X is supposed to be a 2-D array. If the X in your data is only one-dimensional, you can just reshape it into a 2-D array, like this:
regr.fit(chntrain.reshape(len(chntrain), 1), austrain)
regr.fit(chntrain, austrain)
This doesn't look right. The first parameter to fit should be an X, the matrix of feature vectors. The second parameter should be a y, the vector of correct answers (targets) associated with X.
For example, if you have GDP, you might have:
X[0] = [43, 23, 52] -> y[0] = 5
# meaning the first year had the features [43, 23, 52] (I just made them up)
# and the change that year was 5
Judging by your names, both chntrain and austrain are feature vectors. Judging by how you load your data, maybe the last column is the target?
Maybe you need to do something like:
chntrain_X, chntrain_y = chntrain[:, :-1], chntrain[:, -1]
# you can do the same with austrain and concatenate them or test on them if this part works
regr.fit(chntrain_X, chntrain_y)
But we can't tell without knowing the exact storage format of your data.
Try changing chntrain to a 2-D array instead of 1-D, i.e. reshape to (len(chntrain), 1).
For prediction, also change chntest to a 2-D array.
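A minimal sketch of that fix, reusing the question's variable names:
# reshape(-1, 1) turns a 1-D array of n samples into an (n, 1) feature matrix.
regr.fit(chntrain.reshape(-1, 1), austrain)
print(regr.predict(chntest.reshape(-1, 1)))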
I have been having similar problems to you and have found a solution.
Where you have the following error:
ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]
The [ 1 107] part is basically saying that your array is the wrong way around. Sklearn thinks you have 107 columns of data with 1 row.
To fix this, try transposing the X data (note that .T returns a new array rather than transposing in place):
chntrain = chntrain.T
Then re-run your fit:
regr.fit(chntrain, austrain)
Depending on what your "austrain" data looks like you may need to transpose this too.
You may use np.newaxis as well, for example X = X[:, np.newaxis]. I found the method in the Logistic function example.
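A minimal sketch of what np.newaxis does here (the values are made up for illustration):
import numpy as np
x = np.array([1.0, 2.0, 3.0])  # shape (3,)
X = x[:, np.newaxis]           # shape (3, 1): the 2-D, single-feature layout sklearn expects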