I want to calculate the root mean square of a function in Python. My function has a simple form like y = f(x), where x and y are arrays.
I searched the NumPy and SciPy docs and couldn't find anything.
I'm going to assume that you want to compute the expression given by the following pseudocode:
ms = 0
for i = 1 ... N
    ms = ms + y[i]^2
ms = ms / N
rms = sqrt(ms)
i.e. the square root of the mean of the squared values of elements of y.
In numpy, you can simply square y, take its mean and then its square root as follows:
rms = np.sqrt(np.mean(y**2))
So, for example:
>>> y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1]) # Six 1's
>>> y.size
10
>>> np.mean(y**2)
0.59999999999999998
>>> np.sqrt(np.mean(y**2))
0.7745966692414834
Do clarify your question if you mean to ask something else.
You could use the sklearn function mean_squared_error with squared=False, comparing your signal against a vector of zeros:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_actual,[0 for _ in y_actual], squared=False)
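Note: in recent scikit-learn (1.4+), squared=False is deprecated in favor of a dedicated function, so the equivalent there would be:
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_actual, [0 for _ in y_actual])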
numpy.std(x) tends to rms(x) as mean(x) tends to 0 (thanks to @Seb), as is common with sound recordings, vibrations, and other signals that fluctuate around zero.
rms = lambda x_seq: (sum(x*x for x in x_seq)/len(x_seq))**(1/2)
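For example, sqrt((3**2 + 4**2) / 2) = sqrt(12.5):
>>> rms([3, 4])
3.5355339059327378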
In case you'd like to frame your array before computing RMS, this is a numpy solution:
nframes = 1000
rms = np.array([
    np.sqrt(np.mean(frame**2))
    for frame in np.array_split(arr, nframes)
])
If you'd like to specify frame length instead of frame counts, you'd do this first:
frame_length = 200
arr_length = arr.shape[0]
nframes = -(-arr_length // frame_length)  # ceiling division, so an evenly divisible array isn't over-split
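Putting the two pieces together as a self-contained sketch (the signal arr here is synthetic; note that np.array_split yields nframes nearly equal pieces, so frames may differ slightly in length when the array doesn't divide evenly):
import numpy as np

arr = np.random.randn(10000)  # hypothetical 1D signal
frame_length = 200
nframes = -(-arr.shape[0] // frame_length)  # ceiling division
rms = np.array([
    np.sqrt(np.mean(frame**2))
    for frame in np.array_split(arr, nframes)
])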
I have an array of magnetometer data with artifacts every two hours due to power cycling.
I'd like to replace those indices with NaN so that the length of the array is preserved.
Here's a code example, adapted from https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html.
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = [x for x in y if (x > mean - 2 * sd)]
    final_list = [x for x in final_list if (x < mean + 2 * sd)]
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
px.line(y=y, x=x)
# px.scatter(y) # It looks like the outliers are successfully dropped.
# px.line(y=reject_outliers(y), x=x) # This is the line I'd like to see work.
When I run 'px.scatter(reject_outliers(y))', it looks like the outliers are successfully getting dropped:
...but that's looking at the culled y vector relative to the index, rather than the datetime vector x as in the above plot. As the debugging text indicates, the vector is shortened because the outlier values are dropped rather than replaced.
How can I edit my reject_outliers() function to assign those values to NaN, or to adjacent values, so that the length of the array stays the same and I can plot my data?
Use else in the list comprehension along the lines of:
[x if x_condition else other_value for x in y]
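Applied to your case, a minimal sketch (assuming y is a 1D float array and keeping the 5-standard-deviation cutoff from your code):
mean, sd, n = np.mean(y), np.std(y), 5
y_clean = np.array([v if abs(v - mean) <= n * sd else np.nan for v in y])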
Got a less compact version to work. Full code:
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']
def reject_outliers(y):  # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    final_list = np.copy(y)
    for n in range(len(y)):
        final_list[n] = y[n] if y[n] > mean - 5 * sd else np.nan
        final_list[n] = final_list[n] if final_list[n] < mean + 5 * sd else np.nan
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
# px.line(y=y, x=x)
px.line(y=reject_outliers(y), x=x) # This is the line I wanted to get working - check!
More compact answer, sent via email by a friend:
In numpy you can select/index based on a Boolean array, and then make assignment with it:
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = y.copy()
    final_list[np.abs(y - mean) > n * sd] = np.nan
    return final_list
I also noticed that you didn’t use the value of n in your example code.
Alternatively, you can use np.where (https://numpy.org/doc/stable/reference/generated/numpy.where.html):
np.where(np.abs(y - mean) > n * sd, np.nan, y)
You don’t need the .copy() if you don’t mind modifying the input array.
Replace np.mean and np.std with np.nanmean and np.nanstd if you want the function to work on arrays that already contain nans, i.e. if you want to use this function recursively.
The answer about using if else in a list comprehension would work, but avoiding the list comprehension makes the function much faster if the arrays are large.
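Putting those points together, a nan-aware sketch (the name reject_outliers_nan is hypothetical):
def reject_outliers_nan(y, n=5):  # y is the data in a 1D float numpy array
    mean = np.nanmean(y)  # nan-aware, so the function can be applied repeatedly
    sd = np.nanstd(y)
    return np.where(np.abs(y - mean) > n * sd, np.nan, y)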
To sample from N(1,2) with sample size 100 and calculate the mean of that sample, we can do this:
import numpy as np
s = np.random.normal(1, 2, 100)
mean = np.mean(s)
Now if we want to produce 10000 samples and save mean of each of them we can do:
sample_means = []
for x in range(10000):
    sample = np.random.normal(1, 2, 100)
    sample_means.append(sample.mean())
How can I do it when we want to sample sequentially from N(1,2) and estimate the distribution mean sequentially?
IIUC you mean cumulative:
sample = np.random.normal(1, 2, (10000, 100))
sample_mean = []
for i, _ in enumerate(sample):
    sample_mean.append(sample[:i+1, :].ravel().mean())
Then sample_mean contains the cumulative sample means:
sample_mean[:10]
[1.1185342714036368,
1.3270808654923423,
1.3266440422140355,
1.2542028664103761,
1.179358517854582,
1.1224645540064788,
1.1416887857272255,
1.1156887336750463,
1.0894328800573165,
1.0878896099712452]
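Note that the loop above recomputes the mean over a growing slice, which is quadratic in total. An equivalent O(n) sketch using a running sum:
running = np.cumsum(sample.ravel())
counts = np.arange(1, sample.size + 1)
sample_mean = (running / counts)[99::100]  # cumulative mean after each complete sample of 100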
Maybe list comprehension?
sample_means = [np.random.normal(1, 2, 100).mean() for i in range(10000)]
Tip: use lower-case names for variables in Python.
I'm working on a custom likelihood function for Theano (attempting to fit a conditional logistic regression).
The likelihood requires summing values by group. In R we have the ave() function; in Python pandas we have groupby(). How would I do something similar in Theano?
Edited for more detail
I want to create a Cox proportional hazards model (equivalent to conditional logistic regression). The log likelihood requires the sum of values by group:
In Pandas, this would be:
temp = df.groupby('groupid')['eta'].aggregate(np.sum)
denominator = np.log(temp).sum()
In the data, we have a column with group ID, and the values to be summed
group  eta
1      2.1
1      1.8
1      0.9
2      1.2
2      0.75
2      1.42
The output for the group sums would then be:
group  sum
1      4.8
2      3.37
Then, the sum of the log of the sums:
log(4.8) + log(3.37) = 2.7835
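For reference, here is the pandas version from above as a runnable check against these numbers:
import numpy as np
import pandas as pd

df = pd.DataFrame({'groupid': [1, 1, 1, 2, 2, 2],
                   'eta': [2.1, 1.8, 0.9, 1.2, 0.75, 1.42]})
temp = df.groupby('groupid')['eta'].aggregate(np.sum)  # group sums: 4.8 and 3.37
denominator = np.log(temp).sum()  # log(4.8) + log(3.37) = 2.7835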
This is quick and easy to do in pandas. How can I do something similar in Theano? Sure, I could write a nested loop, but I try to avoid manually coded loops when possible because they are slow.
Thanks!
Let's say you have X (a list of all your etas) with dimension Nx1 (I guess) and a matrix H, where H is an NxG matrix holding a one-hot encoding of the groups.
Then you write something like:
import numpy as np
from numpy import newaxis as na
import theano.tensor as T

X = T.vector()
H = T.matrix()
tmp = T.sum(X[:, na] * H, axis=0)  # per-group sums: broadcast X against the one-hot columns
O = T.sum(T.log(tmp))              # sum of the logs of the group sums

x = np.array([5, 10, 10, 0.5, 5, 0.5])
# create a one-hot encoding of the group labels
g = np.array([1, 2, 2, 0, 1, 0])
h = np.zeros(shape=(len(x), 3))
for i, j in enumerate(g):
    h[i, j] = 1.0
O.eval({X: x, H: h})
This should work as long as there is at least one eta per group (otherwise the group sum is zero and its log is -inf).
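For comparison, the same group sums can be computed in plain numpy without building the one-hot matrix, e.g. with np.bincount:
group_sums = np.bincount(g, weights=x)  # array([ 1., 10., 20.])
denominator = np.log(group_sums).sum()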
With numpy or scipy, is there any existing method that will return the endpoints of an interval which contains a specified percent of the values in a 1D array? I realize that this is simple to write myself, but it seems like the kind of thing that might be built in, although I can't find it.
E.g.:
>>> import numpy as np
>>> x = np.random.randn(100000)
>>> print(np.bounding_interval(x, 0.68))
Would give approximately (-1, 1)
You can use np.percentile:
In [29]: x = np.random.randn(100000)
In [30]: p = 0.68
In [31]: lo = 50*(1 - p)
In [32]: hi = 50*(1 + p)
In [33]: np.percentile(x, [lo, hi])
Out[33]: array([-0.99206523, 1.0006089 ])
There is also scipy.stats.scoreatpercentile:
In [34]: from scipy.stats import scoreatpercentile
In [35]: scoreatpercentile(x, [lo, hi])
Out[35]: array([-0.99206523,  1.0006089 ])
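If you want it wrapped up like the bounding_interval call in your question, a minimal sketch (the function name is hypothetical):
def bounding_interval(x, p):
    lo, hi = 50 * (1 - p), 50 * (1 + p)  # central interval containing fraction p
    return tuple(np.percentile(x, [lo, hi]))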
I don't know of a built-in function to do it, but you can write one that uses the math package to pick approximate indices, like this:
from __future__ import division
import math
import numpy as np
def bound_interval(arr_in, interval):
    lhs = (1 - interval) / 2  # specify the left-hand chunk to exclude
    rhs = 1 - lhs             # and the right-hand side
    arr_sorted = np.sort(arr_in)  # avoid shadowing the built-in sorted()
    lower = arr_sorted[math.floor(lhs * len(arr_in))]  # use floor to get an index
    upper = arr_sorted[math.floor(rhs * len(arr_in))]
    return (lower, upper)
On your specified array, I got the interval (-0.99072237819851039, 0.98691691784955549). Pretty close to (-1, 1)!
My problem is to draw, as efficiently as possible, N Poisson random values (RVs), each with a different mean/rate Lam. Basically, size(RV) == size(Lam).
Here is a naive (very slow) implementation:
import numpy as NP

def multi_rate_poisson(Lam):
    rv = NP.zeros(NP.size(Lam))
    for i, lam in enumerate(Lam):
        rv[i] = NP.random.poisson(lam=lam, size=1)
    return rv
That, on my laptop, with 1e6 samples gives:
Lam = NP.random.rand(int(1e6)) + 1
timeit multi_rate_poisson(Lam)
1 loops, best of 3: 4.82 s per loop
Is it possible to improve from this?
Although the docstrings don't document this functionality, the source indicates it is possible to pass an array to the numpy.random.poisson function.
>>> import numpy
>>> # 1D array of 1M random variables uniformly distributed between 1 and 2
>>> numpyarray = numpy.random.rand(int(1e6)) + 1
>>> # pass to poisson
>>> poissonarray = numpy.random.poisson(lam=numpyarray)
>>> poissonarray
array([4, 2, 3, ..., 1, 0, 0])
The Poisson random variable takes non-negative integer values, and its distribution approaches a bell curve as lambda grows large.
>>> import matplotlib.pyplot
>>> count, bins, ignored = matplotlib.pyplot.hist(
...     numpy.random.poisson(
...         lam=numpy.random.rand(int(1e6)) + 10),
...     14, density=True)  # normed=True on very old matplotlib
>>> matplotlib.pyplot.show()
This method of passing the array to the poisson generator appears to be quite efficient.
>>> timeit.Timer("numpy.random.poisson(lam=numpy.random.rand(int(1e6)) + 1)",
...              'import numpy').repeat(3, 1)
[0.13525915145874023, 0.12136101722717285, 0.12127304077148438]
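On newer numpy (1.17+), the idiomatic spelling uses the Generator API, which accepts an array of rates in the same way (a sketch):
rng = numpy.random.default_rng()
lam = rng.random(10**6) + 1     # one rate per draw, uniform on [1, 2)
samples = rng.poisson(lam=lam)  # vectorized: one Poisson draw per rate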