Using GPy Multiple-output coregionalized prediction - python

I have recently been facing a problem where I believe a multiple-output GP might be a good candidate. At the moment I am applying a single-output GP to my data, and as the dimensionality increases, my results keep getting worse. I tried multiple outputs with scikit-learn and was able to get better results for higher dimensions, but I believe GPy is more complete for such tasks and would give me more control over the model. For the single-output GP I was setting up the kernel as follows:
kernel = GPy.kern.RBF(input_dim=4, variance=1.0, lengthscale=1.0, ARD = True)
m = GPy.models.GPRegression(X, Y_single_output, kernel = kernel, normalizer = True)
m.optimize_restarts(num_restarts=10)
In the example above, X has shape (20, 4) and Y has shape (20, 1).
The multiple-output implementation I am using comes from
Introduction to Multiple Output Gaussian Processes
I prepare the data according to that example, setting X_mult_output to shape (80, 2), with the second column holding the output indices, and rearranging Y to (80, 1).
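For concreteness, a minimal sketch of that stacking step, assuming the four series are held as lists x_list and y_list of length-20 arrays (those names are illustrative, not from my actual code):
import numpy as np
# Stack the four series; the second column of X holds the output index.
X_mult_output = np.vstack([np.column_stack([x_i, np.full_like(x_i, i)])
                           for i, x_i in enumerate(x_list)])       # shape (80, 2)
Y_mult_output = np.vstack([y_i.reshape(-1, 1) for y_i in y_list])  # shape (80, 1)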
kernel = GPy.kern.RBF(1,lengthscale=1, ARD = True)**GPy.kern.Coregionalize(input_dim=1,output_dim=4, rank=1)
m = GPy.models.GPRegression(X_mult_output,Y_mult_output, kernel = kernel, normalizer = True)
OK, everything seems to work so far. Now I want to predict values, but it seems that I am not able to. From what I understood, you can predict a single output by specifying the output index in the Y_metadata argument.
As I have 4 input values, I set up the array that I want to predict as follows:
x_pred = np.array([3,2,2,4])
Then I imagine I have to predict each value of my x_pred array separately, as shown in Coregionalized Regression Model (vector-valued regression):
Y_metadata1 = {'output_index': np.array([[0]])}
y1_pred = m.predict(np.array(x[0]).reshape(1,-1),Y_metadata=Y_metadata1)
The problem is that I keep getting the following error:
IndexError: index 1 is out of bounds for axis 1 with size 1
Any suggestion on how to overcome this problem, or is there a mistake in my implementation?
Traceback:
Traceback (most recent call last):
File "<ipython-input-9-edb25bc29817>", line 36, in <module>
y1_pred = m.predict(np.array(x[0]).reshape(1,-1),Y_metadata=Y_metadata1)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\core\gp.py", line 335, in predict
mean, var = self._raw_predict(Xnew, full_cov=full_cov, kern=kern)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\core\gp.py", line 292, in _raw_predict
mu, var = self.posterior._raw_predict(kern=self.kern if kern is None else kern, Xnew=Xnew, pred_var=self._predictive_variable, full_cov=full_cov)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\inference\latent_function_inference\posterior.py", line 276, in _raw_predict
Kx = kern.K(pred_var, Xnew)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\kern\src\kernel_slice_operations.py", line 109, in wrap
with _Slice_wrap(self, X, X2) as s:
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\kern\src\kernel_slice_operations.py", line 65, in __init__
self.X2 = self.k._slice_X(X2) if X2 is not None else X2
File "<decorator-gen-140>", line 2, in _slice_X
File "C:\Users\johndoe\AppData\Roaming\Python\Python37\site-packages\paramz\caching.py", line 283, in g
return cacher(*args, **kw)
File "C:\Users\johndoe\AppData\Roaming\Python\Python37\site-packages\paramz\caching.py", line 172, in __call__
return self.operation(*args, **kw)
File "c:\users\johndoe\desktop\modules\sheffieldml-gpy-v1.9.9-0-g92f2e87\sheffieldml-gpy-92f2e87\GPy\kern\src\kern.py", line 117, in _slice_X
return X[:, self._all_dims_active]
IndexError: index 1 is out of bounds for axis 1 with size 1

Problem
You have defined the kernel with X of dimension (-1, 4) and Y of dimension (-1, 1), but you are giving it an X_pred of dimension (1, 1) (the first element of x_pred reshaped to (1, 1)).
Solution
Give the whole x_pred to the model for prediction (an input with dimension (-1, 4)):
Y_metadata1 = {'output_index': np.array([[0]])}
y1_pred = m.predict(np.array(x_pred).reshape(1,-1), Y_metadata=Y_metadata1)
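For reference, GPy's coregionalized regression tutorial predicts by passing the model an input whose last column is the output index, together with a matching Y_metadata dictionary. A minimal sketch of that pattern, assuming the (80, 2) stacked training input described in the question (the query location 3.0 is illustrative):
import numpy as np
# Query output 0 at input location 3.0. The model was trained on stacked
# inputs of shape (80, 2) whose second column holds the output index, so
# the prediction input carries that index column too.
X_new = np.array([[3.0, 0.0]])  # [input location, output index]
noise_dict = {'output_index': X_new[:, 1:].astype(int)}
y_mean, y_var = m.predict(X_new, Y_metadata=noise_dict)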
DIY
Before executing your code all together, try running the pieces separately so you can debug them easily; that also keeps your code small and clean.
The example below is debug code for your problem:
Y_metadata1 = {'output_index': np.array([[0]])}
a = np.array(x_pred[0]).reshape(1,-1)
print(a.shape)
y1_pred = m.predict(a,Y_metadata=Y_metadata1)
The output is (1, 1), followed by the error, which makes it obvious that the error comes from the input dimension.
Reading errors also helps. Your error says the problem is in kern.K(pred_var, Xnew), so it probably comes from the kernel; it then points to X[:, self._all_dims_active], so it probably comes from the dimensions of X. With a little experimenting with the dimensions of x you will get the idea.
Hopefully this helps, even after 7 days!

Related

Having Trouble with numpy.histogramdd

I am trying to create an N-dimensional histogram from a 2D array that has complex values. I want to count the number of occurrences of the real and imaginary parts of the array, given the bins, and store the result in a 3D array. It only works when I hard-code i=0 and remove the for loop; otherwise it only runs for the first iteration. I have never used histograms in Python before and I just cannot understand the error. The code is given below.
xsoft is defined as a 2D array of complex type; I compute bnd_edges from the max and min values of xsoft and create the edges to be given as bins.
xsoft = np.empty((M, MAX), dtype=complex)  # e.g. has dims 4x100
xsoft[:] = np.nan
edges = np.linspace(-bnd_edges, bnd_edges, numbin)  # numbin = 10
pSOFT = np.empty((len(edges) - 1, M, len(edges) - 1))  # len(edges) = 10
pSOFT[:] = np.nan
for i in range(M):
    pSOFT[:, i, :], edges = np.histogramdd((xsoft[i, :].real, xsoft[i, :].imag), bins=(edges, edges))
The code results in the following error
Traceback (most recent call last):
File " ", line 194, in <module>
pSOFT[:, i, :], edges = np.histogramdd((xsoft[i, :].real, xsoft[i, :].imag), bins=(edges, edges))
File "<__array_function__ internals>", line 5, in histogramdd
File " " line 1066, in histogramdd
raise ValueError(
ValueError: `bins[0]` must be a scalar or 1d array
Process finished with exit code 1
You are getting this error because you are overwriting the original definition of edges with the second return value of histogramdd. np.histogramdd returns its bin edges as a list of 1D arrays, one per dimension, so on the second iteration bins=(edges, edges) no longer receives 1D arrays and the check fails.
Replace the last line of your code with this:
pSOFT[:, i, :], edges_i = np.histogramdd((xsoft[i, :].real, xsoft[i, :].imag), bins=(edges, edges))
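A quick standalone sketch of the failure mode:
import numpy as np

edges = np.linspace(-1.0, 1.0, 10)
data = (np.random.randn(100), np.random.randn(100))
H, edges_out = np.histogramdd(data, bins=(edges, edges))
print(type(edges_out))  # <class 'list'>: one edge array per dimension
# Feeding edges_out back in as bins=(edges_out, edges_out) would make
# bins[0] a list of arrays and raise the ValueError shown above.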

librosa.feature.delta() illegal value in 4-th argument of internal None

I am working on .wav signals using Python 3.5 and trying to extract MFCCs, MFCC deltas, MFCC delta-deltas, and other signal features, but an error is raised only for the MFCC delta:
Traceback (most recent call last):
mfcc_delta = librosa.feature.delta(mfcc)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\librosa\feature\utils.py", line 116, in delta
**kwargs)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\signal\_savitzky_golay.py", line 337, in savgol_filter
coeffs = savgol_coeffs(window_length, polyorder, deriv=deriv, delta=delta)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\signal\_savitzky_golay.py", line 139, in savgol_coeffs
coeffs, _, _, _ = lstsq(A, y)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\linalg\basic.py", line 1226, in lstsq
% (-info, lapack_driver))
ValueError: illegal value in 4-th argument of internal None
I am working on the following code:
import numpy as np
import librosa
from scipy import signal
from scipy.signal import butter, filtfilt  # used below; missing from the original imports
import scipy.stats

def preprocess_cough(x, fs, cutoff=6000, normalize=True, filter_=True, downsample=True):
    # Preprocess data
    fs_downsample = cutoff * 2  # assumed definition; fs_downsample was undefined in the original
    fs_new = fs
    if len(x.shape) > 1:
        x = np.mean(x, axis=1)  # convert to mono
    if normalize:
        x = x / (np.max(np.abs(x)) + 1e-17)  # normalize to the range -1 to 1
    if filter_:
        b, a = butter(4, fs_downsample / fs, btype='lowpass')  # 4th-order Butterworth lowpass filter
        x = filtfilt(b, a, x)
    if downsample:
        x = signal.decimate(x, int(fs / fs_downsample))  # downsample for anti-aliasing
        fs_new = fs_downsample
    return np.float32(x), fs_new

audio_data = 'F:/test/'
files = librosa.util.find_files(audio_data, ext=['wav'])
x, fs = librosa.load(files[0], sr=48000)  # myFile was undefined in the original; using the first file found
arr, f = preprocess_cough(x, fs)
mfcc = librosa.feature.mfcc(y=arr, sr=f, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
When I remove the MFCC calculations and compute the other wav signal features, the error does not appear. Also, I have tried removing the n_mfcc=13 parameter, but the error is still raised.
A sample of the output and the shape of the mfcc variable:
[-3.86701782e+02 -4.14421021e+02 -4.67373749e+02 -4.76989105e+02
-4.23713501e+02 -3.71329285e+02 -3.47003693e+02 -3.19309082e+02
-3.29547089e+02 -3.32584625e+02 -2.78399109e+02 -2.43284348e+02
-2.47878128e+02 -2.59308533e+02 -2.71102844e+02 -2.87314514e+02
-2.58869965e+02 -6.01125565e+01 1.66160011e+01 -8.58060551e+00
-8.49179382e+01 -9.29880371e+01 -9.96001358e+01 -1.04499428e+02
-3.65511665e+01 -3.82106819e+01 -8.69802475e+01 -1.22267052e+02
-1.70187592e+02 -2.35996841e+02 -2.96493286e+02 -3.39086365e+02
-3.59514771e+02]
and the shape is (13,33)
Can anyone help me, please?
Thanks in advance
Somewhat similarly to the issue raised in this question, the problem is related to the intricacies of the underlying numerical operations that librosa defers to SciPy. SciPy depends on the LAPACK library being installed, so first I would check whether you have it installed.
Also, you may want to debug the script step by step, stepping into SciPy and examining the actual values that percolate from librosa.feature.delta to scipy.signal.savgol_filter, which may tell you the reason once you cross-check them with the documentation.
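A minimal debugging sketch along those lines (window_length=9, polyorder=1, and deriv=1 mirror librosa's defaults for delta as far as I know; treat them as assumptions):
import numpy as np
import scipy
from scipy.signal import savgol_filter

# Show which BLAS/LAPACK builds SciPy was compiled against.
scipy.show_config()

# librosa.feature.delta defers to scipy.signal.savgol_filter; calling it
# directly on a dummy array of the same shape as the MFCC matrix isolates
# whether the failure lies in SciPy/LAPACK rather than in librosa itself.
dummy = np.random.randn(13, 33)
print(savgol_filter(dummy, window_length=9, polyorder=1, deriv=1, axis=-1).shape)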

ValueError: matmul when trying to fit sklearn's linear regressor to pandas dataframe instanses

I've been trying to perform a simple multivariate linear regression on some dummy data using sklearn. I initially passed numpy arrays to sklearn.linear_model.LinearRegression.fit and kept getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
which I thought was due to some mistake in the transposition of my arrays or something, so I pulled up a tutorial that used pandas dataframes and laid out my code in the same way:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
VWC = np.array((0,0.2,0.4,0.6,0.8,1))
Sensor_Voltage = np.array((515,330,275,250,245,240))
X = np.column_stack((VWC,VWC*VWC))
df = pd.DataFrame(X,columns=["VWC","VWC2"])
target = pd.DataFrame(Sensor_Voltage,columns=["Volt"])
model = LinearRegression()
model.fit(df,target["Volt"])
x = np.linspace(0,1,30)
y = model.predict(x[:,np.newaxis])
plt.plot(VWC, Sensor_Voltage)
plt.plot(x,y,dashes=(3,1))
plt.title("Simple Linear Regression")
plt.xlabel("Volumetric Water Content")
plt.ylabel("Sensor response (4.9mV)")
plt.show()
And I still get the following traceback:
Traceback (most recent call last):
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\simple_linear_regression.py", line 16, in <module>
y = model.predict(x[:,np.newaxis])
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_base.py", line 225, in predict
return self._decision_function(X)
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_base.py", line 209, in _decision_function
dense_output=True) + self.intercept_
File "C:\Users\Vivian Imbriotis\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\utils\extmath.py", line 151, in safe_sparse_dot
ret = a # b
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
I have been banging my head against this for hours now and I just don't understand what I am doing wrong.
Scikit-learn, numpy, and pandas are all the latest versions; this is on Python 3.7.3.
SOLVED: I am very silly and misunderstood how np.newaxis worked. The goal here was to fit a quadratic to the data, so I just needed to change:
x = np.linspace(0,1,30)
y = model.predict(x[:,np.newaxis])
to
x = np.column_stack([np.linspace(0,1,30), np.linspace(0,1,30)**2])
y = model.predict(x)
I am sure there is a more elegant way to write that but eh.
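For anyone confused in the same way: np.newaxis only adds an axis of length one; it does not add a second feature column. A quick illustration:
import numpy as np

x = np.linspace(0, 1, 30)
print(x[:, np.newaxis].shape)            # (30, 1): one feature, not what the model expects
print(np.column_stack([x, x**2]).shape)  # (30, 2): matches the two training features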
You trained your model with a dataset of shape (6, 2): if you check the shape of df,
df.shape == (6, 2).
But when you try to predict, you are using a dataset of a different shape:
x.shape == (30, 1).
What you need is to use a dataset of the correct shape. Try this:
x = np.linspace((0,0),(1,1),30)
y = model.predict(x)
I also ran into this error when using sklearn's LinearRegression; it turned out I was passing the Y variable to the LinearRegression object in the first position and the X variables in the second position. You actually pass the X variables first and then the Y variable, the opposite of the order you use in R's lm().
Hopefully this helps someone out someday.
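A minimal illustration of the argument order, with made-up data:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.randn(10, 2)  # features first ...
y = np.random.randn(10)     # ... target second
model = LinearRegression().fit(X, y)  # fit(X, y), not fit(y, X)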

Having problems with dimensions in machine learning ( Python Scikit )

I am a bit new to applying machine learning, so I was trying to teach myself how to do linear regression with the Python scikit package, using any kind of data from mldata.org. I tested the linear regression example code (http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html) and it worked well with the diabetes dataset. However, when I tried to use the code with other datasets, such as one about earthquakes on mldata (http://mldata.org/repository/data/viewslug/global-earthquakes/), I was not able to, due to dimension problems.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 55
warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 65
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
Traceback (most recent call last):
File "/home/anthony/Documents/Programming/Python/Machine Learning/Scikit/earthquake_linear_regression.py", line 38, in <module>
regr.fit(earthquake_X_train, earthquake_y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 371, in fit
linalg.lstsq(X, y)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 518, in lstsq
raise ValueError('incompatible dimensions')
ValueError: incompatible dimensions
How do I set up the dimensions of the data?
Size of the data:
earthquake_X.shape
(59209, 1, 4)
earthquake_X_train.shape
(59189, 1)
earthquake_y_test.shape
(3, 59209)
earthquake.target.shape
(3, 59209)
The code:
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
#Experimenting with earthquake data
from sklearn.datasets.mldata import fetch_mldata
import tempfile
test_data_home = tempfile.mkdtemp()
# Load the diabetes dataset
earthquake = fetch_mldata('Global Earthquakes', data_home = test_data_home)
# Use only one feature
earthquake_X = earthquake.data[:, np.newaxis]
earthquake_X_temp = earthquake_X[:, :, 2]
# Split the data into training/testing sets
earthquake_X_train = earthquake_X_temp[:-20]
earthquake_X_test = earthquake_X_temp[-20:]
# Split the targets into training/testing sets
earthquake_y_train = earthquake.target[:-20]
earthquake_y_test = earthquake.target[-20:]
print "Splitting of data for preformance check completed"
# Create linear regression object
regr = linear_model.LinearRegression()
print "Created linear regression object"
# Train the model using the training sets
regr.fit(earthquake_X_train, earthquake_y_train)
print "Dataset trained"
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(earthquake_X_test) - earthquake_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(earthquake_X_test, earthquake_y_test))
# Plot outputs
plt.scatter(earthquake_X_test, earthquake_y_test, color='black')
plt.plot(earthquake_X_test, regr.predict(earthquake_X_test), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Your array of targets (earthquake_y_train) has the wrong shape. Moreover, it is actually empty.
When you do
earthquake_y_train = earthquake.target[:-20]
you select all rows but the last 20 along the first axis. And, according to the data you posted, earthquake.target has shape (3, 59209), so there are no rows left to select!
But even if there were, it would still be an error. Why? Because the first dimensions of X and y must be the same. According to sklearn's documentation, LinearRegression's fit expects X to be of shape [n_samples, n_features] and y to be of shape [n_samples, n_targets].
In order to fix it, change the definitions of the ys to the following:
earthquake_y_train = earthquake.target[:, :-20].T
earthquake_y_test = earthquake.target[:, -20:].T
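A quick shape check showing the difference, as a standalone sketch:
import numpy as np

target = np.zeros((3, 59209))   # same shape as earthquake.target
print(target[:-20].shape)       # (0, 59209): slicing axis 0 leaves no rows
print(target[:, :-20].T.shape)  # (59189, 3): samples first, targets second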
P.S. Even if you fix all these problems, there is still an issue in your script: plt.scatter can't work with "multidimensional" ys.

Python zero-size array to ufunc.reduce without identity

I'm trying to make a histogram of some data that is stored in an ndarray. The histogram is part of a set of analyses which I've made into a class in a Python program. The part of the code that isn't working is below.
def histogram(self, iters):
    samples = T.MCMC(iters)  # returns an [iters, 3, 4] ndarray
    histAC = plt.figure(self.ip)  # plt is matplotlib's pyplot
    self.ip += 1  # defined at the beginning of the class to start at 0
    for l in range(0, 4):
        h = histAC.add_subplot(2, (iters + 1) / 2, l + 1)
        for i in range(0, int(0.5 * self.chan_num)):  # range needs an integer bound
            intAvg = mean(samples[:, i, l])
            print intAvg
            for k in range(0, iters):
                samples[k, i, l] = samples[k, i, l] - intAvg
        print "Samples is ", samples
        h.hist(samples, bins=5000, range=[-6e-9, 6e-9], histtype='step')
        h.legend(loc='upper right')
        h.set_title("AC Pulse Integral Histograms: " + str(l))
    figname = 'ACHistograms.png'
    figpath = 'plot' + str(self.ip)
    print "Finished!"
    #plt.savefig(figpath + figname, format='png')
This gives me the following error message:
File "johnmcmc.py", line 257, in histogram
h.hist(samples,bins=5000,range=[-6e-9,6e-9],histtype='step') #removed label=apdlabel
File "/x/tsfit/local/lib/python2.6/site-packages/matplotlib/axes.py", line 7238, in hist
ymin = np.amin(m[m!=0]) # filter out the 0 height bins
File "/x/tsfit/local/lib/python2.6/site-packages/numpy/core/fromnumeric.py", line 1829, in amin
return amin(axis, out)
ValueError: zero-size array to ufunc.reduce without identity
The only search results I've found have been multiple copies of the same two conversations, from which the only thing I learned was that Python histograms don't like being fed empty arrays; that's why I added the print statement right above the line that's giving me trouble, to make sure the array isn't empty.
Has anyone else come across this error before?
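One thing I can still check, given that the failing line filters out zero-height bins (ymin = np.amin(m[m!=0])), is whether any samples actually fall inside range=[-6e-9, 6e-9]; if none do, every bin is empty. A minimal in-context check (it reuses samples from the method above):
import numpy as np

lo, hi = -6e-9, 6e-9
in_range = np.sum((samples >= lo) & (samples <= hi))
print "Samples in range:", in_range, "of", samples.size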
