Error when calculating percent error in Python

So I have a neural network and I am trying to calculate the percent error.
for i in range(len(y_test_predicted)):
    difference = np.array(abs(y_test_predicted[i] - y_test_unscaled[i]))
    print("Difference: ", difference)
    error = np.array(difference / y_test_predicted[i])
    print("error: ", error)
    print("---------------")

av_error = np.mean(error)
av_per_error = av_error * 100
I have predicted values and actual values. I take the absolute value of their difference and divide by the predicted value. However, the error array is only a single value; it gets overwritten each time the loop iterates. I tried using
error[i] = np.array(difference/y_test_predicted[i])
But it throws an error saying that the index is out of bounds. I also tried hard-coding the problem to avoid using arrays, by just keeping a running sum of all the error values, but it keeps returning NaN for some reason.

Assuming that y_test_predicted and y_test_unscaled are numpy arrays, you can use numpy's vectorised operators and avoid the for loop entirely, like so:
difference = np.abs(y_test_predicted - y_test_unscaled)
error = difference / y_test_predicted
av_error = np.mean(error)
For instance:
>>> import numpy as np
>>> y_test_unscaled = np.array([0.11, 0.63, 0.44, 0.54, 0.65])
>>> y_test_predicted = np.array([0.1, 0.5, 0.3, 0.5, 0.7])
>>> difference = np.abs(y_test_predicted - y_test_unscaled)
>>> error = difference / y_test_predicted
>>> av_error = np.mean(error)
>>> av_error
0.19561904761904764
If you're hell-bent on using a loop, then the error you're getting is probably because error is the wrong shape (though I can't tell for sure, as the definition of error isn't included in your question). Something like:
error = np.zeros(y_test_predicted.shape)
before your loop would probably resolve it -- this pre-allocates an array which is the same shape as y_test_predicted.
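A minimal sketch of what the fixed loop might look like, assuming y_test_predicted and y_test_unscaled are 1-D numpy arrays of the same length:
import numpy as np

error = np.zeros(y_test_predicted.shape)  # pre-allocate one slot per prediction
for i in range(len(y_test_predicted)):
    difference = abs(y_test_predicted[i] - y_test_unscaled[i])
    error[i] = difference / y_test_predicted[i]  # store, don't overwrite

av_per_error = np.mean(error) * 100  # average percent error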

Related

Getting ValueError when expanding my GMMHMM from 2 to three states

I am trying to expand my GMMHMM model from two to three states, but I get the error below:
"ValueError: startprob_ must sum to 1 (got nan)"
It states that my initial state distribution does not sum to one, but it does (see Pi). Furthermore, I get the following warning, which might have something to do with it:
"UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1."
If I look into it, I can also see that my state transition matrix returns nan values.
import numpy as np
from hmmlearn.hmm import GMMHMM
import pandas as pd

Pi = np.array([0.24, 0.37, 0.39])
A = np.array([[0.74, 0.20, 0.06],
              [0.20, 0.53, 0.27],
              [0.05, 0.40, 0.54]])

model = GMMHMM(n_components=3, n_mix=1, startprob_prior=Pi, transmat_prior=A,
               min_covar=0.001, tol=0.0001, n_iter=10000)

Obs = df[['gdp','un','inf','inx','itr']].to_numpy()
print(Obs)
model.fit(Obs)
print(model.transmat_)
seq = model.decode(Obs)
print(seq)
I am not a really experienced Python programmer, so it might be an easy fix, but unfortunately I do not see how. Any help would be highly appreciated!

How to create a two-column matrix in rpy2

I'm using rpy2 to run a method from an R library. According to the documentation:
method_name(x, range.x)
x: a two-column numeric matrix.
range.x: a list containing two vectors.
And it includes an example:
data(geyser, package="MASS")
x <- cbind(geyser$duration, geyser$waiting)
est <- method_name(x, range.x)
I checked the type of geyser$duration and geyser$waiting and both are double. I also tried replacing geyser$duration and geyser$waiting by g = c(.016, 2.15, 4.00) and h = c(.012, 2.11, 2.50) in R, and the code still works.
In my current Python code, I have:
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector, FloatVector, ListVector # I tried these before too
from rpy2.robjects import numpy2ri, pandas2ri
numpy2ri.activate()
pandas2ri.activate()
base = rpackages.importr(('base'))
a = np.array([1.2, 2.1, 2.5]); b = np.array([5.2, 1.3, 2.15])
x = base.cbind(base.c(a), base.c(b))
ranges = base.range(x)
result = method_name(x, ranges)
As you can see, I'm trying to make my code as similar to the example as possible. However, I can't make the method work. I get the error Error in seq.default(a[2L], b[2L], length = M[2L]) which probably has to do with a problem in the arguments.
There's an obvious problem with ranges because it contains just two values, the global minimum and maximum of x; however, it should contain two minimum values and two maximum values (one pair for each column of the matrix). I can achieve that by doing this:
ranges = base.cbind(base.range(a), base.range(b))
But this implies that there's a problem with the way I'm creating the matrix. Otherwise, I would get two pairs of values just by using base.range(x).
I also tried x = robjects.r.matrix(x, ncol = 2), but it didn't work; I still get just a global minimum and maximum value for the whole matrix when calling range.
What is the correct way of creating this matrix so that the method can run?
According to the documentation of R's range function, it accepts vectors (one-dimensional arrays) as input. Thus, it works when applied to each column of your matrix, or when applied directly to the a and b vectors, as you mentioned. So your second approach should work:
# Define the a, b vectors
a = np.array([1.2, 2.1, 2.5]); b = np.array([5.2, 1.3, 2.15])

# Calculate the per-column ranges
range_a = base.range(base.c(a))
range_b = base.range(base.c(b))

# Define the matrix
x = base.cbind(base.c(a), base.c(b))
print(x)
>>> [[1.2  5.2 ]
     [2.1  1.3 ]
     [2.5  2.15]]

# Define the ranges
ranges = base.cbind(range_a, range_b)
print(ranges)
>>> [[1.2 1.3]
     [2.5 5.2]]
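As an alternative sketch (assuming x is the matrix built with base.cbind above), you could also let R compute the per-column ranges itself with apply, where the margin 2 means "over columns":
# alternative: per-column ranges via R's apply
r_apply = robjects.r['apply']
r_range = robjects.r['range']
ranges = r_apply(x, 2, r_range)  # 2x2 matrix: one (min, max) pair per column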

Which dtype would be correct to prevent numpy.arange() from getting the wrong length?

I am trying to get a shifting array containing 200 values in a range with a difference of 40.
Therefore I am using numpy.arange(a, b, 0.2) with starting values a=0 and b=40 and going upwards (a=0.2, b=40.2; a=0.4, b=40.4; and so on).
When I reach numpy.arange(25.4, 65.4, 0.2), however, I suddenly get an array with 201 values:
x = numpy.arange(25.2, 65.2, 0.2)
print(len(x))
Returns 200
x = numpy.arange(25.4, 65.4, 0.2)
print(len(x))
Returns 201
I have got so far as to notice that this probably happens due to rounding issues caused by the data type...
I know there is an option dtype in numpy.arange():
numpy.arange(start, stop, step, dtype)
The question is which data type would fit this problem and why? (I am not so confident with data types yet, and https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html#numpy.dtype hasn't helped me to get this issue resolved.) Please help!
np.arange is most useful when you want to precisely control the difference between adjacent elements. np.linspace, on the other hand, gives you precise control over the total number of elements. It sounds like you want to use np.linspace instead:
import numpy as np
offset = 25.4
x = np.linspace(offset, offset + 40, 200)
print(x)
print(len(x))
Here's the documentation page for np.linspace: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
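One caveat worth noting: np.linspace includes the stop value by default, so np.linspace(offset, offset + 40, 200) has a step of 40/199 ≈ 0.201, not exactly 0.2. If you want 200 values with a step of exactly 0.2, matching arange's half-open interval, pass endpoint=False:
x = np.linspace(offset, offset + 40, 200, endpoint=False)
print(x[1] - x[0])  # 0.2 (up to float representation)
print(len(x))       # always 200, regardless of rounding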

Precision Matlab and Python (numpy)

I'm converting a Matlab script to Python and I am getting results that differ on the order of 10**-4.
In matlab:
f_mean=f_mean+nanmean(f);
f = f - nanmean(f);
f_t = gradient(f);
f_tt = gradient(f_t);
if n_loop==1
    theta = atan2( sum(f.*f_tt), sum(f.^2) );
end
theta = -2.2011167e+03
In Python:
f_mean = f_mean + np.nanmean(vel)
vel = vel - np.nanmean(vel)
firstDerivative = np.gradient(vel)
secondDerivative = np.gradient(firstDerivative)
if numberLoop == 1:
    theta = np.arctan2(np.sum(vel * secondDerivative),
                       np.sum([vel**2]))
Although firstDerivative and secondDerivative give the same results in Python and Matlab, f_mean is slightly different: -0.0066412 (Matlab) and -0.0066414 (Python); and so is theta: -0.4126186 (M) and -0.4124718 (P). It is a small difference, but in the end it leads to different results in my scripts.
I know some people have asked about this kind of difference before, but always regarding std (which I understand), not regarding mean values. I wonder why that is.
One possible source of the initial difference you describe (between means) could be numpy's use of pairwise summation which on large arrays will typically be appreciably more accurate than the naive method:
a = np.random.uniform(-1, 1, (10**6,))
a = np.r_[-a, a]
# so the sum should be zero
a.sum()
# 7.815970093361102e-14
# use cumsum to get naive summation:
a.cumsum()[-1]
# -1.3716805469243809e-11
Edit (thanks @sascha): for the last word, and as a "provably exact" reference, you could use math.fsum:
import math
math.fsum(a)
# 0.0
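Since the question is about means rather than sums, note that the same effect carries over directly: a mean is just a sum divided by the length, so the summation order determines its accuracy too. A small sketch using the array a from above:
naive_mean = a.cumsum()[-1] / len(a)   # naive left-to-right accumulation
numpy_mean = a.mean()                  # pairwise summation under the hood
exact_mean = math.fsum(a) / len(a)     # exactly rounded reference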
Don't have matlab, so can't check what they are doing.

Moving average of an array in Python

I have an array where discrete sine-wave values are recorded and stored. I want to find the max and min of the waveform. Since the sine-wave data is voltage recorded with a DAQ, there will be some noise, so I want to do a weighted average. Assuming self.yArray contains my sine-wave values, here is my code so far:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range(0, length-(filtersize+1)):
    for y in range(0, filtersize):
        summation = sum(self.yArray[x+y])
        ave = summation/filtersize
        filterarray.append(ave)
My issue seems to be in the second for loop, where depending on my averaging window size (filtersize), I want to sum up the values in the window to take the average of them. I receive an error saying:
summation = sum(self.yArray[x+y])
TypeError: 'float' object is not iterable
I am an EE with very little experience in programming, so any help would be greatly appreciated!
The other answers correctly describe your error, but this type of problem really calls out for using numpy. Numpy will run faster, be more memory efficient, and is more expressive and convenient for this type of problem. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
# make a sine wave with noise
times = np.arange(0, 10*np.pi, .01)
noise = .1*np.random.ranf(len(times))
wfm = np.sin(times) + noise
# smoothing it with a running average in one line using a convolution
# using a convolution, you could also easily smooth with other filters
# like a Gaussian, etc.
n_ave = 20
smoothed = np.convolve(wfm, np.ones(n_ave)/n_ave, mode='same')
plt.plot(times, wfm, times, -.5+smoothed)
plt.show()
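Since the original goal was the max and min of the waveform, you can then read them straight off the smoothed array (a sketch, using the smoothed and times arrays from above):
print(smoothed.max(), smoothed.min())  # extreme values of the smoothed waveform
print(times[smoothed.argmax()])        # time at which the maximum occurs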
If you don't want to use numpy, it should also be noted that there's a logical error in your program that results in the TypeError. The problem is that in the line
summation = sum(self.yArray[x+y])
you're using sum within the loop where you're also accumulating the sum. Either use sum without the inner loop, or loop through the array and add the elements up yourself, but not both (and it's the attempt to do both, i.e. applying sum to a single indexed array element, that leads to the error in the first place). That is, here are two solutions:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range(0, length-(filtersize+1)):
    summation = sum(self.yArray[x:x+filtersize])  # sum over a slice of the array
    ave = summation/filtersize
    filterarray.append(ave)
or
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range(0, length-(filtersize+1)):
    summation = 0.
    for y in range(0, filtersize):
        summation += self.yArray[x+y]  # accumulate the window element by element
    ave = summation/filtersize
    filterarray.append(ave)
self.yArray[x+y] is returning a single item out of the self.yArray list. If you are trying to get a subset of yArray, you can use the slice operator instead:
summation = sum(self.yArray[x:x+filtersize])
which returns an iterable that the sum builtin can consume.
A bit more information about python slices can be found here (scroll down to the "Sequences" section): http://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy
You could use numpy, like:
import numpy
filtersize = 2
# cumulative sums: the sum of any window is a difference of two cumsum entries
ysums = numpy.cumsum(numpy.array(self.yArray, dtype=float))
ylags = numpy.roll(ysums, filtersize)  # cumsums shifted right by the window size
ylags[0:filtersize] = 0.0              # the first windows have nothing to subtract
moving_avg = (ysums - ylags) / filtersize
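One caveat with this approach: because the first filtersize entries of ylags are zeroed, the first filtersize-1 entries of moving_avg are averages over fewer than filtersize samples. If you only want complete windows, trim them off:
moving_avg = moving_avg[filtersize-1:]  # keep only averages over full windows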
Your original code attempts to call sum on the float value stored at yArray[x+y], where x+y is evaluating to some integer representing the index of that float value.
Try:
summation = sum(self.yArray[x:x+filtersize])
Indeed, numpy is the way to go. One of the nice features of Python is list comprehensions, which let you do away with the typical nested for-loop constructs. Here is an example for your particular problem:
import numpy as np

step = 2
res = [np.sum(myarr[i:i+step], dtype=float)/step for i in range(len(myarr)-step+1)]
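For instance, with a small hypothetical input list:
myarr = [0.0, 1.0, 2.0, 3.0, 4.0]
# res -> [0.5, 1.5, 2.5, 3.5]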
