Numpy: averaging many datapoints at each time step - python

This question is probably answered somewhere, but I cannot find where, so I will ask here:
I have a set of data consisting of several samples per timestep. So I basically have two arrays: "times", which looks something like (0,0,0,1,1,1,1,1,2,2,3,4,4,4,4,...), and my data array, which holds a value for each of those samples. Each timestep has a random number of samples. I would like to get the average value of the data at each timestep in an efficient manner.
I have prepared the following sample code to show what my data looks like. Basically, I am wondering if there is a more efficient way to write the "average_values" function.
import numpy as np
import matplotlib.pyplot as plt

def average_values(x, y):
    unique_x = np.unique(x)
    averaged_y = [np.mean(y[x == ux]) for ux in unique_x]
    return unique_x, averaged_y

# generate our data
times = []
samples = []
# we have some timesteps:
for time in np.linspace(0, 10, 101):
    # and a random number of samples at each timestep:
    num_samples = np.random.randint(1, 11)  # random_integers is deprecated; randint's upper bound is exclusive
    for i in range(0, num_samples):
        times.append(time)
        samples.append(np.sin(time) + np.random.random() * 0.5)

times = np.array(times)
samples = np.array(samples)

plt.plot(times, samples, 'bo', ms=3, mec=None, alpha=0.5)
plt.plot(*average_values(times, samples), color='r')
plt.show()
Here is what it looks like: the blue sample points with the averaged values plotted in red.

A generic way to do this would be something like the following:
def average_values_bis(x, y):
    unq_x, idx = np.unique(x, return_inverse=True)
    count_x = np.bincount(idx)
    sum_y = np.bincount(idx, weights=y)
    return unq_x, sum_y / count_x
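To see what these two bincount calls return on a tiny input (the values below are illustrative only, not from the original post):
import numpy as np

x = np.array([0, 0, 1, 1, 1, 2])                 # toy timesteps
y = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])     # toy samples

unq_x, idx = np.unique(x, return_inverse=True)   # unq_x -> [0 1 2], idx -> [0 0 1 1 1 2]
count_x = np.bincount(idx)                        # samples per timestep   -> [2 3 1]
sum_y = np.bincount(idx, weights=y)               # summed samples per timestep -> [ 4. 12.  5.]
print(unq_x, sum_y / count_x)                     # [0 1 2] [2. 4. 5.]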
Adding the function above and the following plotting line to your script
plt.plot(*average_values_bis(times, samples), color='g')
produces the same output, with the red line hidden behind the green one (plot not shown).
But timing both approaches reveals the benefits of using bincount, a 30x speed-up:
%timeit average_values(times, samples)
100 loops, best of 3: 2.83 ms per loop
%timeit average_values_bis(times, samples)
10000 loops, best of 3: 85.9 µs per loop

May I propose a pandas solution? It is highly recommended if you are going to be working with time series.
Create test data
import pandas as pd
import numpy as np
times = np.random.randint(0,10,size=50)
values = np.sin(times) + np.random.random_sample((len(times),))
s = pd.Series(values, index=times)
s.plot(linestyle='', marker='o')  # '.' is not a valid linestyle; draw markers only
Calculate averages
avs = s.groupby(level=0).mean()
avs.plot()
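If you want to apply the same idea directly to the times and samples arrays built in the first question's script (just a usage sketch reusing those names as an assumption), the duplicated time values become a non-unique index and groupby(level=0) averages them:
import pandas as pd

# Assumes `times` and `samples` are the numpy arrays from the question's script above.
s = pd.Series(samples, index=times)
averages = s.groupby(level=0).mean()  # one mean value per unique timestep
averages.plot(color='r')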

Related

Any way to speed up runs?

I'm currently trying to do a Monte Carlo simulation. The problem is that it's taking quite a while to run 100,000 runs or more, when I'm told it shouldn't take very long.
Here's my code:
runs = 10000

import matplotlib.pyplot as plt
import random
import numpy as np
from scipy.stats import norm
from scipy.stats import uniform
import seaborn as sns
import pandas

def steadystate():
    p = 0.88
    Cout = 4700000000
    LambdaAER = 0.72
    Vol = 44.5
    Depo = 0.42
    Uptime = 0.1
    Effic = 0.38
    Recirc = 4.3
    x = random.randint(86900000, 2230000000000)
    conc = ((p*Cout*LambdaAER) + (x/Vol)) / (LambdaAER + Depo + (Uptime*Effic*Recirc))
    return conc

x = 0
while x < runs:
    #results = steadystate (Faster)
    results = np.array([steadystate() for _ in range(1000)])
    print(results)
    x += 1

ax = sns.distplot(results,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15, 'alpha': 1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
I'm fairly new to Python, so I'm unsure of where to optimize my code. Any help or suggestions would be much appreciated.
You're not actually benefiting from numpy here, because you produce each value one at a time, doing all the math for that one value, then producing the array from the results. Work with arrays from the get-go, and do all the work on all elements in bulk to derive the benefits of vectorization:
import numpy.random

def steadystate(count):  # Receive desired number of values for bulk generation
    p = 0.88
    Cout = 4700000000
    LambdaAER = 0.72
    Vol = 44.5
    Depo = 0.42
    Uptime = 0.1
    Effic = 0.38
    Recirc = 4.3
    x = numpy.random.randint(86900000, 2230000000000, count)  # Make array of count values all at once
    # Perform all the math in bulk
    conc = ((p*Cout*LambdaAER) + (x/Vol)) / (LambdaAER + Depo + (Uptime*Effic*Recirc))
    return conc

x = 0
while x < runs:
    results = steadystate(1000)  # Just call with number of desired items
    print(results)
    x += 1
Note that this code matches your original code by replacing results each time, rather than accumulating results. I'm not clear on what you want to do instead, so this is just doing the (probably) wrong thing much faster.
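If the goal was actually to accumulate all runs into a single result set (an assumption on my part; the question leaves this open), a minimal sketch using the vectorized steadystate above could be:
import numpy as np

runs = 10000

# Collect every run's values instead of overwriting `results` each iteration...
all_results = np.concatenate([steadystate(1000) for _ in range(runs)])
# ...or skip the Python loop entirely and draw everything in one call:
all_results = steadystate(runs * 1000)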
About 70% of the time you are losing goes into creating the random numbers.
The question is whether you really need new random numbers each time. It may be sufficient to generate the random array just once and reuse it.
That said, the code is already pretty quick: excluding the drawing part, one iteration takes just 1.2 ms.
%timeit results = np.array([steadystate() for _ in range(1000)])
1.24 ms ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
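A minimal sketch of that reuse idea (assuming the random draws really can be shared between iterations, which depends on what the simulation is meant to model):
import numpy as np

p, Cout, LambdaAER, Vol = 0.88, 4700000000, 0.72, 44.5
Depo, Uptime, Effic, Recirc = 0.42, 0.1, 0.38, 4.3
runs = 10000

# Draw the random values once, outside the loop...
x = np.random.randint(86900000, 2230000000000, 1000)
for _ in range(runs):
    # ...and only repeat the (cheap) arithmetic on each iteration.
    conc = ((p*Cout*LambdaAER) + (x/Vol)) / (LambdaAER + Depo + (Uptime*Effic*Recirc))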

Vectorize a for loop over pandas ewm function

So I want to vectorize a for loop to speed things up. My code is the following:
import numpy as np
import pandas as pd

def my_func(array, n):
    return pd.Series(array).ewm(span=n, min_periods=n-1).mean().to_numpy()

np.random.seed(0)
data_size = 120000
data = np.random.uniform(0, 1000, size=data_size) + 29000
loop_size = 1000
step_size = 1

X = np.zeros([data.shape[0], loop_size])
parameter_array = np.arange(1, loop_size + step_size, step_size)

for i in parameter_array:
    X[:, i-1] = my_func(data, i)
The entire for loop takes about a minute to finish, which could be a problem for future applications. I have already checked numpy.vectorize(), but its documentation states clearly that it is provided for convenience only, so using it won't speed up the code by an order of magnitude.
My question is: is there a way to vectorize a for loop like this? If so, can I see a simple example of how this can be done?
Thank you in advance
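This is not from the original thread, but one possible direction is to keep a single loop over time while vectorizing over all spans at once, using the adjust=True recurrence that pandas' ewm uses by default (alpha = 2/(span+1)). A rough sketch, which does not reproduce the min_periods behaviour:
import numpy as np

def ewm_mean_all_spans(data, spans):
    # adjust=True weighting: y_t = sum((1-a)^i * x_{t-i}) / sum((1-a)^i), with a = 2/(span+1)
    alphas = 2.0 / (np.asarray(spans, dtype=float) + 1.0)
    decay = 1.0 - alphas                       # one decay factor per span
    num = np.zeros_like(decay)                 # running weighted sums of x
    den = np.zeros_like(decay)                 # running sums of weights
    out = np.empty((len(data), len(decay)))
    for t, x_t in enumerate(data):             # single loop over time, vectorized over all spans
        num = num * decay + x_t
        den = den * decay + 1.0
        out[t] = num / den
    return out

# usage with the question's sizes (min_periods from the question is not handled here):
# X = ewm_mean_all_spans(data, np.arange(1, 1001))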

Compute rolling z-score in pandas dataframe

Is there an open source function to compute a moving z-score, like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html? I have access to pandas rolling_std for computing the std, but want to see if it can be extended to compute rolling z-scores.
rolling.apply with a custom function is significantly slower than using builtin rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:
def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x - m) / s
    return z
According to the definition given on the page linked in the question, the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.
Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in essentially a Python loop, the advantage of using the Cythonized r.mean() and r.std() functions becomes even more apparent as the size of the loop increases.
Thus, as the length of the Series increases, the speed advantage of zscore increases.
In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 µs per loop
In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop
This is the setup used for the benchmark:
import numpy as np
import pandas as pd
np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x - m) / s
    return z

def zscore_using_apply(x, window):
    def zscore_func(x):
        return (x[-1] - x[:-1].mean()) / x[:-1].std(ddof=0)
    return x.rolling(window=window+1).apply(zscore_func)

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())
result = zscore(x, N)
alt = zscore_using_apply(x, N)
assert not ((result - alt).abs() > 1e-8).any()
You should use native functions of pandas:
# Compute rolling zscore for column ="COL" and window=window
col_mean = df["COL"].rolling(window=window).mean()
col_std = df["COL"].rolling(window=window).std()
df["COL_ZSCORE"] = (df["COL"] - col_mean)/col_std
def zscore(arr, window):
    x = arr.rolling(window=1).mean()       # rolling(window=1).mean() is just arr itself
    u = arr.rolling(window=window).mean()
    o = arr.rolling(window=window).std()
    return (x - u) / o

df['zscore'] = zscore(df['value'], window)
Let us say you have a data frame called data with one or more numeric columns. Then you run the following code,
data_zscore = data.apply(lambda x: (x - x.expanding().mean()) / x.expanding().std())
which computes an expanding (rather than fixed-window rolling) z-score for each column. Please note that the first row will always have NaN values, as a single observation has no standard deviation.
This can be solved in a single line of code. Given that s is the input series and wlen is the window length:
zscore = s.sub(s.rolling(wlen).mean()).div(s.rolling(wlen).std())
If you need to shift the mean and std it can still be done:
zscore = s.sub(s.rolling(wlen).mean().shift()).div(s.rolling(wlen).std().shift())

how to code vector to matrix hamming distance in Python?

I would like to develop a query system that finds the items most similar to a given one, based on a binary signature extracted from the data. I am looking for the most efficient approach, since I have runtime constraints. I tried to use scipy's distance functions, but they were too slow. Do you know any other useful library or trick to do this faster?
As an example scenario:
I have a binary query vector of length 68, and I have a dataset in a matrix of size 3000K x 68. I would like to find the item in this matrix most similar to the given query, using Hamming distance.
Thanks for any comments.
Nice problem. I liked the answers of Alex and Piotr. My first naive attempt also resulted in a solution time of around 800 ms (on my system). My second attempt, using numpy's (un)packbits, resulted in a 4x speed increase.
import numpy as np

LENGTH = 68
K = 1024
DATASIZE = 3000 * K
DATA = np.random.randint(0, 2, (DATASIZE, LENGTH)).astype(bool)  # np.bool is removed in recent NumPy

def RandomVect():
    return np.random.randint(0, 2, (LENGTH,)).astype(bool)

def HammingDist(vec1, vec2):
    return np.sum(np.logical_xor(vec1, vec2))

def SmallestHamming(vec):
    XorData = np.logical_xor(DATA, vec[np.newaxis, :])
    Lengths = np.sum(XorData, axis=1)
    return DATA[np.argmin(Lengths)]  # returns first smallest

def main():
    v1 = RandomVect()
    v2 = SmallestHamming(v1)
    print(HammingDist(v1, v2))

# OK, let's try to make it faster... (using numpy.(un)packbits)
DATA2 = np.packbits(DATA, axis=1)
NBYTES = DATA2.shape[-1]
BYTE2ONES = np.zeros((256,), dtype=np.uint8)
for i in range(256):
    BYTE2ONES[i] = np.sum(np.unpackbits(np.uint8(i)))  # popcount lookup table per byte value

def RandomVect2():
    return np.packbits(RandomVect())

def HammingDist2(vec1, vec2):
    v1 = np.unpackbits(vec1)
    v2 = np.unpackbits(vec2)
    return np.sum(np.logical_xor(v1, v2))

def SmallestHamming2(vec):
    XorData = DATA2 ^ vec[np.newaxis, :]
    Lengths = np.sum(BYTE2ONES[XorData], axis=1)
    return DATA2[np.argmin(Lengths)]  # returns first smallest

def main2():
    v1 = RandomVect2()
    v2 = SmallestHamming2(v1)
    print(HammingDist2(v1, v2))
Use cdist from SciPy:
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'hamming')
Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean.
Reference: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
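For the scenario in the question (one 68-bit query vector against a large binary matrix), a usage sketch might look like this; the array names and the reduced dataset size are illustrative:
import numpy as np
from scipy.spatial.distance import cdist

data = np.random.randint(0, 2, (300000, 68)).astype(bool)  # smaller than the question's 3000K rows, for illustration
query = np.random.randint(0, 2, (1, 68)).astype(bool)      # a single query row

dists = cdist(query, data, 'hamming')[0]  # normalized Hamming distance to every row
best = np.argmin(dists)                    # index of the most similar item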
I would be surprised if there was a significantly faster way than this: put your data into a pandas DataFrame (M), each vector as a column, and your target vector into a pandas Series (x),
import numpy as np
import pandas as pd
rows = 68
columns=3000
M = pd.DataFrame(np.random.rand(rows,columns)>0.5)
x = pd.Series(np.random.rand(rows)>0.5)
then do the following
%timeit M.apply(lambda y: x==y).astype(int).sum().idxmax()
1 loop, best of 3: 746 ms per loop
Edit: Actually, I am surprised that this is a much faster way (note that counting equal elements and taking the maximum is equivalent to minimizing the Hamming distance):
%timeit M.eq(x, axis=0).astype(int).sum().idxmax()
100 loops, best of 3: 2.68 ms per loop

Concurrently reading numpy arrays in parallel

Consider the following:
fine = np.random.uniform(0,100,10)
fine[fine<20] = 0 # introduce some intermittency
coarse = np.sum(fine.reshape(-1,2),axis=1)
fine is a timeseries of magnitudes (e.g. volume of rainfall). coarse is the same timeseries but at a halved resolution, so every 2 timesteps in fine are aggregated to a single value in coarse.
I am then interested in the weighting that determines the proportions of the magnitude of coarse that correspond to each timestep in fine, for the instances where the value of coarse is above zero.
def w_xx(fine, coarse):
    weights = []
    for i, val in enumerate(coarse):
        if val > 0:
            w = fine[i*2:i*2+2]/val  # returns both w1 and w2; w1 is the 1st element, w2 = 1-w1 is the second
            weights.append(w)
    return np.asarray(weights)
So w_xx(fine, coarse) would return an array of shape (5, 2), where the elements along axis=1 are the weights of fine for each value of coarse.
This is all fine for smaller timeseries, but I'm running this analysis on ~60k-sized arrays of fine, plus in a loop of 300+ iterations.
I have been trying to make this run in parallel using the multiprocessing library in Python 2.7, but I've not managed to get far. I need to be reading both timeseries at the same time in order to get the corresponding values of fine for every value in coarse, plus to only work on values above 0, which is what my analysis requires.
I would appreciate suggestions on a better way to do this. I imagine if I can define a mapping function to use with Pool.map in multiprocessing, I should be able to parallelize this? I've only just started out with multiprocessing so I don't know if there is another way?
Thank you.
You can achieve the same result in a vectorized form by simply doing:
>>> (fine / np.repeat(coarse, 2)).reshape(-1, 2)
then you may filter out the rows where coarse is zero by using np.isfinite, since if coarse is zero the output is either inf or nan.
In addition to the NumPy expression proposed by @behzad.nouri, you can use the Pythran compiler to reap extra speedups:
$ cat w_xx.py
#pythran export w_xx(float[], float[])
import numpy as np
def w_xx(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1, 2)
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 1.5 msec per loop
$ pythran w_xx.py -fopenmp -march=native # yes, this generates parallel code
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 867 usec per loop
Disclaimer: I am a Pythran dev.
Excellent! I didn't know about np.repeat, thank you very much.
To answer my original question in the form it was presented, I've then also managed to make this work with multiprocessing:
import numpy as np
from multiprocessing import Pool

fine = np.random.uniform(0, 100, 100000)
fine[fine < 20] = 0
coarse = np.sum(fine.reshape(-1, 2), axis=1)

def wfunc(zipped):
    return zipped[0] / zipped[1]

def wpar(zipped, processes):
    p = Pool(processes)
    calc = np.asarray(p.map(wfunc, zip(fine, np.repeat(coarse, 2))))
    p.close()
    p.join()
    return calc[np.isfinite(calc)].reshape(-1, 2)
However, the suggestion by @behzad.nouri is evidently better:
def w_opt(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1, 2)

# using some IPython magic
%timeit w_opt(fine,coarse)
1000 loops, best of 3: 1.88 ms per loop
%timeit w_xx(fine,coarse)
1 loops, best of 3: 342 ms per loop
%timeit wpar(zip(fine,np.repeat(coarse,2)),6) #I've 6 cores at my disposal
1 loops, best of 3: 1.76 s per loop
Thanks again!
