I'd like to compare each value x of an array with a rolling window of the n previous values. More precisely, I'd like to see at which percentile this new value x would fall if it were added to the previous window:
import numpy as np
A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
print A
n = 4 # window width
for i in range(len(A)-n):
    W = A[i:i+n]
    x = A[i+n]
    q = sum(W <= x) * 1.0 / n
    print 'Value:', x, ' Window before this value:', W, ' Quantile:', q
[ 1. 4. 9. 28. 28.5 2. 283. 3.2 7. 15. ]
Value: 28.5 Window before this value: [ 1. 4. 9. 28.] Quantile: 1.0
Value: 2.0 Window before this value: [ 4. 9. 28. 28.5] Quantile: 0.0
Value: 283.0 Window before this value: [ 9. 28. 28.5 2. ] Quantile: 1.0
Value: 3.2 Window before this value: [ 28. 28.5 2. 283. ] Quantile: 0.25
Value: 7.0 Window before this value: [ 28.5 2. 283. 3.2] Quantile: 0.5
Value: 15.0 Window before this value: [ 2. 283. 3.2 7. ] Quantile: 0.75
Question: What is the name of this computation? Is there a clever numpy way to compute this more efficiently on arrays of millions of items (with n that can be ~5000)?
Note: here is a simulation for 1M items and n=5000 but it would take ~ 2 hours:
import numpy as np
A = np.random.random(1000*1000) # the following is not very interesting with a [0,1]
n = 5000 # uniform random variable, but anyway...
Q = np.zeros(len(A)-n)
for i in range(len(Q)):
    Q[i] = sum(A[i:i+n] <= A[i+n]) * 1.0 / n
    if i % 100 == 0:
        print "%.2f %% already done. " % (i * 100.0 / len(A))
print Q
Note: this is not similar to How to compute moving (or rolling, if you will) percentile/quantile for a 1d array in numpy?
Your code is so slow because you're using Python's own sum() instead of numpy.sum() or numpy.array.sum(); Python's sum() has to convert all the raw values to Python objects before doing the calculations, which is really slow. Just by changing sum(...) to np.sum(...) or (...).sum(), the runtime drops to under 20 seconds.
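For reference, a minimal sketch of that one-line change applied to the loop from the question (A and n as defined there; only the builtin sum() is swapped for the NumPy reduction):
import numpy as np
Q = np.zeros(len(A) - n)
for i in range(len(Q)):
    Q[i] = np.sum(A[i:i+n] <= A[i+n]) * 1.0 / n   # was: sum(...) * 1.0 / n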
You can use np.lib.stride_tricks.as_strided as in the accepted answer of the question you linked. With the first example you give, it is pretty easy to understand:
A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
n=4
print (np.lib.stride_tricks.as_strided(A, shape=(n, A.size-n),
                                       strides=(A.itemsize, A.itemsize)))
# you get the A.size-n columns of the n rolling elements
array([[ 1. , 4. , 9. , 28. , 28.5, 2. ],
[ 4. , 9. , 28. , 28.5, 2. , 283. ],
[ 9. , 28. , 28.5, 2. , 283. , 3.2],
[ 28. , 28.5, 2. , 283. , 3.2, 7. ]])
Now to do the calculation, you can compare this array to A[n:], sum over the rows and divide by n:
print ((np.lib.stride_tricks.as_strided(A, shape=(n, A.size-n),
                                        strides=(A.itemsize, A.itemsize))
        <= A[n:]).sum(0) / (1.*n))
[1. 0. 1. 0.25 0.5 0.75] # same answer
Now the problem is the size of your data (several million values, with n around 5000); I'm not sure you can use this method directly. One way could be to chunk the data. Let's define a function
def compare_strides(arr, n):
    return (np.lib.stride_tricks.as_strided(arr, shape=(n, arr.size-n),
                                            strides=(arr.itemsize, arr.itemsize))
            <= arr[n:]).sum(0)
and do the chunking with np.concatenate, not forgetting to divide by n:
nb_chunk = 1000  # this number depends on the capacity of your computer;
                 # not sure how to optimize it
Q = np.concatenate([compare_strides(A[chunk*nb_chunk:(chunk+1)*nb_chunk+n], n)
                    for chunk in range(0, A[n:].size/nb_chunk+1)]) / (1.*n)
I can't do the 1M / 5000 test, but on 5000 values with n = 100, see the difference with timeit:
A = np.random.random(5000)
n = 100
%%timeit
Q = np.zeros(len(A)-n)
for i in range(len(Q)):
    Q[i] = sum(A[i:i+n] <= A[i+n]) * 1.0 / n
#1 loop, best of 3: 6.75 s per loop
%%timeit
nb_chunk = 100
Q1 = np.concatenate([compare_strides(A[chunk*nb_chunk:(chunk+1)*nb_chunk+n], n)
                     for chunk in range(0, A[n:].size/nb_chunk+1)]) / (1.*n)
#100 loops, best of 3: 7.84 ms per loop
#check for equality
print ((Q == Q1).all())
Out[33]: True
See the difference in time, from 6750 ms down to 7.84 ms. Hopefully it works on bigger data.
Using np.sum instead of sum was already mentioned, so my only suggestion left is to additionally consider using pandas and its rolling window function, which you can apply any arbitrary function to:
import numpy as np
import pandas as pd
A = np.random.random(1000*1000)
df = pd.DataFrame(A)
n = 5000
def fct(x):
    return np.sum(x[:-1] <= x[-1]) * 1.0 / (len(x)-1)
percentiles = df.rolling(n+1).apply(fct)
print(percentiles)
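Note that on recent pandas versions you will likely need raw=True so that fct receives a plain ndarray rather than a Series (with a Series, the positional x[-1] lookup does not work). A minimal sketch of that variant:
percentiles = df.rolling(n + 1).apply(fct, raw=True)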
Additional benchmark: a comparison between the strides-based solution above and the plain np.sum solution:
import numpy as np, time
A = np.random.random(1000*1000)
n = 5000
def compare_strides(arr, n):
    return (np.lib.stride_tricks.as_strided(arr, shape=(n, arr.size-n),
                                            strides=(arr.itemsize, arr.itemsize))
            <= arr[n:]).sum(0)
# Test #1: with strides ===> 11.0 seconds
t0 = time.time()
nb_chunk = 10*1000
Q = np.concatenate([compare_strides(A[chunk*nb_chunk:(chunk+1)*nb_chunk+n], n)
                    for chunk in range(0, A[n:].size/nb_chunk+1)]) / (1.*n)
print time.time() - t0, Q
# Test #2: with just np.sum ===> 18.0 seconds
t0 = time.time()
Q2 = np.zeros(len(A)-n)
for i in range(len(Q2)):
    Q2[i] = np.sum(A[i:i+n] <= A[i+n])
Q2 *= 1.0 / n  # here the multiplication is vectorized; if instead we move this multiplication
               # into the loop, as np.sum(A[i:i+n] <= A[i+n]) * 1.0 / n, it is 6 seconds slower
print time.time() - t0, Q2
print all(Q == Q2)
There's also another (better) way, with numba and the @jit decorator. Then it is much faster: only 5.4 seconds!
from numba import jit
import numpy as np

@jit  # if you remove this line, it is much slower (similar to Test #2 above)
def doit():
    A = np.random.random(1000*1000)
    n = 5000
    Q2 = np.zeros(len(A)-n)
    for i in range(len(Q2)):
        Q2[i] = np.sum(A[i:i+n] <= A[i+n])
    Q2 *= 1.0/n
    print(Q2)

doit()
When adding numba parallelization, it's even faster: 1.8 seconds!
import numpy as np
from numba import jit, prange

@jit(parallel=True)
def doit(A, Q, n):
    for i in prange(len(Q)):
        Q[i] = np.sum(A[i:i+n] <= A[i+n])

A = np.random.random(1000*1000)
n = 5000
Q = np.zeros(len(A)-n)
doit(A, Q, n)
You can use np.quantile instead of sum(A[i:i+n] <= A[i+n]) * 1.0 / n. That may be as good as it gets; I'm not sure there really is a better approach for your question.
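As a hedged sketch of what that kind of rewrite might look like: np.quantile maps a quantile to a value rather than the reverse, so scipy.stats.percentileofscore (with kind='weak', which counts values <= the score) is arguably the closer match to the loop in the question. This is a readability sketch, not a speed-up; per window it does the same O(n) work as the original loop.
from scipy.stats import percentileofscore  # SciPy assumed available
import numpy as np

Q = np.zeros(len(A) - n)  # A and n as defined in the question
for i in range(len(Q)):
    Q[i] = percentileofscore(A[i:i+n], A[i+n], kind='weak') / 100.0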
I am now working on the calculation shown below. I want to update the value of each element based on its adjacent elements. I am currently using two for loops, but the calculation is very slow since there are several outer iterations. I want to know whether there is any way to speed up this calculation:
for i in range(1,nx+1):
    for j in range(1,ny+1):
        p[i,j] = a*p[i-1,j] + b*p[i+1,j] + c*p[i,j-1] + d*p[i,j+1]
a, b, c, d are constants, and p is a numpy array.
Sample input:
import numpy as np
p = np.ones((5,5))
for i in range(1,4):
    for j in range(1,4):
        p[i,j] = p[i-1,j] + p[i+1,j] + 2*p[i,j+1] + 2*p[i,j-1]
print(p)
The final output should be:
[[ 1. 1. 1. 1. 1.]
[ 1. 6. 16. 36. 1.]
[ 1. 11. 41. 121. 1.]
[ 1. 16. 76. 276. 1.]
[ 1. 1. 1. 1. 1.]]
I don't have enough rep to comment and this doesn't fully answer the question, but if you are using NumPy you should definitely look at array broadcasting. It's hard to tell exactly what your code is doing, but broadcasting should make it a lot easier to update the full matrix instead of going value by value.
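As an illustration only, and not a drop-in replacement (the loop in the question reuses freshly updated entries, whereas the vectorized form below uses only the old values, i.e. a Jacobi-style sweep): if updating from the old values were acceptable, both loops could be replaced by a single whole-array expression, e.g. for the 5x5 sample above:
import numpy as np
p = np.ones((5, 5))
q = p.copy()
# interior update computed from the *old* p only; coefficients 1, 1, 2, 2 as in the sample
q[1:4, 1:4] = p[0:3, 1:4] + p[2:5, 1:4] + 2*p[1:4, 2:5] + 2*p[1:4, 0:3]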
We can at least get rid of one nested loop using np.cumsum. In favorable conditions (a large number of columns) this can give a 30-fold speedup. Sample run:
results equal True
original 31.644793 ms
optimized 0.861980 ms
Code:
import numpy as np
n, m = 50, 600
a, b, c, d = np.random.random((4,))
P = np.random.random((n, m))
def f_OP(P):
    p = P.copy()
    for i in range(1, n-1):
        for j in range(1, m-1):
            p[i,j] = a*p[i-1,j] + b*p[i+1,j] + c*p[i,j-1] + d*p[i,j+1]
    return p

def f_pp(P):
    p = P.copy()
    pp = d*p[1:-1, 2:] + b*p[2:, 1:-1]
    pp[0] += a*p[0, 1:-1]
    pp[:, 0] += c*p[1:-1, 0]
    x = np.full((m-2,), c)
    x[0] = 1
    x = np.cumprod(x)[::-1]
    pp = np.cumsum(pp * x, axis=1)
    for i in range(1, n-2):
        pp[i] += a * np.cumsum(pp[i-1])
    p[1:-1, 1:-1] = pp / x
    return p
print('results equal', np.allclose(f_OP(P), f_pp(P)))
from timeit import timeit
kwds = dict(globals=globals(), number=10)
print('original {:10.6f} ms'.format(timeit('f_OP(P)', **kwds)*100))
print('optimized {:10.6f} ms'.format(timeit('f_pp(P)', **kwds)*100))
Say I have 2 numpy 2D arrays, mins and maxs, that will always be the same dimension as one another. I'd like to create a third array, results, that is the result of applying linspace to each corresponding pair of min and max values. Is there some "numpy"/vectorized way to do this? Example non-vectorized code is below to show the results I would like.
import numpy as np
mins = np.random.rand(2,2)
maxs = np.random.rand(2,2)
# Number of elements in the linspace
x = 3
m, n = mins.shape
results = np.zeros((m, n, x))
for i in range(m):
    for j in range(n):
        min = mins[i][j]
        max = maxs[i][j]
        results[i][j] = np.linspace(min, max, num=x)
Here's one vectorized approach, based on this post, to cover generic n-dim cases -
def create_ranges_nd(start, stop, N, endpoint=True):
    if endpoint == 1:
        divisor = N-1
    else:
        divisor = N
    steps = (1.0/divisor) * (stop - start)
    return start[..., None] + steps[..., None]*np.arange(N)
Sample run -
In [536]: mins = np.array([[3,5],[2,4]])
In [537]: maxs = np.array([[13,16],[11,12]])
In [538]: create_ranges_nd(mins, maxs, 6)
Out[538]:
array([[[ 3. , 5. , 7. , 9. , 11. , 13. ],
[ 5. , 7.2, 9.4, 11.6, 13.8, 16. ]],
[[ 2. , 3.8, 5.6, 7.4, 9.2, 11. ],
[ 4. , 5.6, 7.2, 8.8, 10.4, 12. ]]])
As of Numpy version 1.16.0, non-scalar start and stop are now supported.
So, now you can do this:
import numpy as np
assert np.__version__ >= '1.16.0'
mins = np.random.rand(2,2)
maxs = np.random.rand(2,2)
# Number of elements in the linspace
x = 3
results = np.linspace(mins, maxs, num=x)
# And, if required
results = np.rollaxis(results, 0, 3)
I have a D dimensional point and vector, p and v, respectively, a positive number n, and a resolution.
I want to get all points after successively adding vector v*resolution to point p n/resolution times.
Example
p = np.array([3, 5])
v = np.array([-1.5, 3])
n = 10
resolution = 1.5
result:
[[ 3. , 5. ],
[ 0.75, 9.5 ],
[ -1.5 , 14. ],
[ -3.75, 18.5 ],
[ -6. , 23. ],
[ -8.25, 27.5 ],
[-10.5 , 32. ]]
My current approach is to tile the range, given by n and the resolution, by the dimension D, multiply that by v, and add p.
def getPoints(p, v, n, resolution=1.):
    dRange = np.tile(np.arange(0, n, resolution), (v.shape[0], 1))
    return np.multiply(v.reshape(-1, 1), dRange).T + p
Is there a direct way to calculate dRange using np.einsum or another method?
Approach #1
Here's one approach leveraging NumPy broadcasting -
np.arange(0, n, resolution)[:,None] * v + p
Basically, we extend the range array to 2D, keeping the second axis as a singleton, to let it broadcast for elementwise multiplication against the 1D v, giving us a 2D array. Then, we add p to it.
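Plugging the numbers from the question into this one-liner as a quick check (the values below were verified by hand against the expected result listed in the question):
import numpy as np
p = np.array([3, 5])
v = np.array([-1.5, 3])
n, resolution = 10, 1.5
out = np.arange(0, n, resolution)[:, None] * v + p
# out[0] -> [3., 5.], out[1] -> [0.75, 9.5], ..., out[-1] -> [-10.5, 32.]  (7 rows in total)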
Approach #2
There isn't any sum-reduction here, so np.einsum or any dot-based function, even though it would work, won't lend any help on performance. Let's put it out anyway, as it was mentioned in the question -
np.einsum('i,j->ij',np.arange(0, n, resolution), v) + p
I'm looking for a way to calculate the cumulative sum with numpy, but don't want to roll forward the value (or set it to zero) in case the cumulative sum is very close to zero and negative.
For instance
a = np.asarray([0, 4999, -5000, 1000])
np.cumsum(a)
returns [0, 4999, -1, 999]
but I'd like to set the [2]-value (-1) to zero during the calculation. The problem is that this decision can only be made during the calculation, as the intermediate result isn't known a priori.
The expected array is: [0, 4999, 0, 1000]
The reason for this is that I'm getting very small values (floating point, not integers as in the example) which are due to floating point calculations which should in reality be zero. Calculating the cumulative sum compounds those values which leads to errors.
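For reference, a direct (unvectorized) sketch of the behaviour described above: clamp the running total to zero whenever it falls within some tolerance eps of zero. The tolerance is an assumption and has to be chosen for the scale of the data; e.g. eps=2 reproduces the integer example above ([0, 4999, 0, 1000]), while the real floating-point use case would use a much smaller value.
import numpy as np

def clamped_cumsum(a, eps):
    out = np.empty(len(a), dtype=float)
    total = 0.0
    for i, v in enumerate(a):
        total += v
        if abs(total) < eps:  # treat near-zero running totals as exactly zero
            total = 0.0
        out[i] = total
    return out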
The Kahan summation algorithm could solve the problem. Unfortunately, it is not implemented in numpy. This means a custom implementation is required:
def kahan_cumsum(x):
    x = np.asarray(x)
    cumulator = np.zeros_like(x)
    compensation = 0.0
    cumulator[0] = x[0]
    for i in range(1, len(x)):
        y = x[i] - compensation
        t = cumulator[i - 1] + y
        compensation = (t - cumulator[i - 1]) - y
        cumulator[i] = t
    return cumulator
I have to admit, this is not exactly what was asked for in the question. (A value of -1 at the 3rd output of the cumsum is correct in the example). However, I hope this solves the actual problem behind the question, which is related to floating point precision.
I wonder if rounding will do what you are asking for:
np.cumsum(np.around(a,-1))
# the -1 means it rounds to the nearest 10
gives
array([ 0, 5000, 0, 1000])
It is not exactly the expected array you gave in your question, but using around, perhaps with the decimals parameter set to 0, might work when you apply it to the actual problem with floats.
Probably the best way to go is to write this bit in Cython (name the file cumsum_eps.pyx):
cimport numpy as cnp
import numpy as np

cdef inline _cumsum_eps_f4(float *A, int ndim, int dims[], float *out, float eps):
    cdef float sum
    cdef size_t ofs
    N = 1
    for i in xrange(0, ndim - 1):
        N *= dims[i]
    ofs = 0
    for i in xrange(0, N):
        sum = 0
        for k in xrange(0, dims[ndim-1]):
            sum += A[ofs]
            if abs(sum) < eps:
                sum = 0
            out[ofs] = sum
            ofs += 1

def cumsum_eps_f4(cnp.ndarray[cnp.float32_t, mode='c'] A, shape, float eps):
    cdef cnp.ndarray[cnp.float32_t] _out
    cdef cnp.ndarray[cnp.int_t] _shape
    N = np.prod(shape)
    out = np.zeros(N, dtype=np.float32)
    _out = <cnp.ndarray[cnp.float32_t]> out
    _shape = <cnp.ndarray[cnp.int_t]> np.array(shape, dtype=np.int)
    _cumsum_eps_f4(&A[0], len(shape), <int*> &_shape[0], &_out[0], eps)
    return out.reshape(shape)

def cumsum_eps(A, axis=None, eps=np.finfo('float').eps):
    A = np.array(A)
    if axis is None:
        A = np.ravel(A)
    else:
        axes = list(xrange(len(A.shape)))
        axes[axis], axes[-1] = axes[-1], axes[axis]
        A = np.transpose(A, axes)
    if A.dtype == np.float32:
        out = cumsum_eps_f4(np.ravel(np.ascontiguousarray(A)), A.shape, eps)
    else:
        raise ValueError('Unsupported dtype')
    if axis is not None: out = np.transpose(out, axes)
    return out
then you can compile it like this (Windows, Visual C++ 2008 Command Line):
\Python27\Scripts\cython.exe cumsum_eps.pyx
cl /c cumsum_eps.c /IC:\Python27\include /IC:\Python27\Lib\site-packages\numpy\core\include
link /dll cumsum_eps.obj C:\Python27\libs\python27.lib /OUT:cumsum_eps.pyd
or like this (Linux use .so extension/Cygwin use .dll extension, gcc):
cython cumsum_eps.pyx
gcc -c cumsum_eps.c -o cumsum_eps.o -I/usr/include/python2.7 -I/usr/lib/python2.7/site-packages/numpy/core/include
gcc -shared cumsum_eps.o -o cumsum_eps.so -lpython2.7
and use like this:
from cumsum_eps import *
import numpy as np
x = np.array([[1,2,3,4], [5,6,7,8]], dtype=np.float32)
>>> print cumsum_eps(x)
[ 1. 3. 6. 10. 15. 21. 28. 36.]
>>> print cumsum_eps(x, axis=0)
[[ 1. 2. 3. 4.]
[ 6. 8. 10. 12.]]
>>> print cumsum_eps(x, axis=1)
[[ 1. 3. 6. 10.]
[ 5. 11. 18. 26.]]
>>> print cumsum_eps(x, axis=0, eps=1)
[[ 1. 2. 3. 4.]
[ 6. 8. 10. 12.]]
>>> print cumsum_eps(x, axis=0, eps=2)
[[ 0. 2. 3. 4.]
[ 5. 8. 10. 12.]]
>>> print cumsum_eps(x, axis=0, eps=3)
[[ 0. 0. 3. 4.]
[ 5. 6. 10. 12.]]
>>> print cumsum_eps(x, axis=0, eps=4)
[[ 0. 0. 0. 4.]
[ 5. 6. 7. 12.]]
>>> print cumsum_eps(x, axis=0, eps=8)
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 8.]]
>>> print cumsum_eps(x, axis=1, eps=3)
[[ 0. 0. 3. 7.]
[ 5. 11. 18. 26.]]
and so on. Of course, normally eps would be some small value; integers are used here just for the sake of demonstration / ease of typing.
If you need this for double as well the _f8 variants are trivial to write and another case has to be handled in cumsum_eps().
When you're happy with the implementation you should make it a proper part of your setup.py (see the Cython setup.py documentation).
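A minimal sketch of what that setup.py might look like (module and file names assume the cumsum_eps.pyx from above; the NumPy include directory is needed because of the cimport numpy):
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    name='cumsum_eps',
    ext_modules=cythonize('cumsum_eps.pyx'),
    include_dirs=[np.get_include()],  # headers for the cimport numpy declarations
)
Building in place is then just python setup.py build_ext --inplace.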
Update #1: If you have good compiler support in your runtime environment, you could try Theano to implement either the compensation algorithm or your original idea:
import numpy as np
import theano
import theano.tensor as T
from theano.ifelse import ifelse
A = T.vector('A')
sum = T.as_tensor_variable(np.asarray(0, dtype=np.float64))
res, upd = theano.scan(fn=lambda cur_sum, val: ifelse(T.lt(cur_sum + val, 1.0),
                                                      np.asarray(0, dtype=np.float64),
                                                      cur_sum + val),
                       outputs_info=sum, sequences=A)
f = theano.function(inputs=[A], outputs=res)
f([0.9, 2, 3, 4])
will give [0 2 3 4] as output. With either Cython or this approach, you get at least roughly the performance of native code.
I'm trying to evaluate the probabilities of end locations of random walks, but I'm having some trouble with the speed of my program. Basically what I'm trying to do is take as input a dictionary that contains the probabilities for a random walk (e.g. p = {0:0.5, 1:0.2, -1:0.3}, meaning there's a 50% probability X stays at 0, a 20% probability X increases by 1, and a 30% probability X decreases by 1) and then calculate the probabilities of all the possible future states after n iterations.
So for example if p = {0:0.5, 1:0.2, -1:0.3} and n = 2 then it will return {0:0.37, 1:0.2, -1:0.3, 2:0.04, -2:0.09}
if p = {0:0.5, 1:0.2, -1:0.3} and n = 1 then it will return {0:0.5, 1:0.2, -1:0.3}
I have working code, and it runs relatively quickly if n is low and the p dictionary is small, but when n > 500 and the dictionary has around 50 values it takes upwards of 5 minutes to calculate. I'm guessing this is because it runs on only one processor, so I went ahead and modified it to use Python's multiprocessing module (as I read that multithreading doesn't improve parallel computing performance because of the GIL).
My problem is that there is not much improvement with multiprocessing, and I'm not sure whether that's because I'm implementing it wrong or because of the overhead of multiprocessing in Python. I'm just wondering if there's a library somewhere that evaluates all the probabilities of all the possibilities of a random walk when n > 500 in parallel? My next step, if I can't find anything, is to write my own function as an extension in C, but it will be my first time doing that, and although I've coded in C before, it has been a while.
Original Non MultiProcessed Code
def random_walk_predictor(probabilities_tree, period):
    ret = probabilities_tree
    probabilities_leaves = ret.copy()
    for x in range(period):
        tmp = {}
        for leaf in ret.keys():
            for tree_leaf in probabilities_leaves.keys():
                try:
                    tmp[leaf + tree_leaf] = (ret[leaf] * probabilities_leaves[tree_leaf]) + tmp[leaf + tree_leaf]
                except:
                    tmp[leaf + tree_leaf] = ret[leaf] * probabilities_leaves[tree_leaf]
        ret = tmp
    return ret
MultiProcessed code
from multiprocessing import Manager, Pool
from functools import partial

def probability_calculator(origin, probability, outp, reference):
    for leaf in probability.keys():
        try:
            outp[origin + leaf] = outp[origin + leaf] + (reference[origin] * probability[leaf])
        except KeyError:
            outp[origin + leaf] = reference[origin] * probability[leaf]

def random_walk_predictor(probabilities_leaves, period):
    probabilities_leaves = tree_developer(probabilities_leaves)
    manager = Manager()
    prob_leaves = manager.dict(probabilities_leaves)
    ret = manager.dict({0: 1})
    p = Pool()
    for x in range(period):
        out = manager.dict()
        partial_probability_calculator = partial(probability_calculator, probability=prob_leaves,
                                                 outp=out, reference=ret.copy())
        p.map(partial_probability_calculator, ret.keys())
        ret = out
    return ret.copy()
There tend to be analytic solutions to exactly solve this kind of problem that look similar to binomial distributions, but I'll assume you're really asking for a computational solution for a more general class of problem.
Rather than using python dictionaries, it's easier to think about this in terms of the underlying mathematical problem. Build a matrix A that describes the probability of going from one state to another. Build a state x that describes the probability of being at a given location at some time.
Because after n transitions you can step at most n steps from the origin (in either direction) - your state needs to have 2n+1 rows, and A needs to be square and of size 2n+1 by 2n+1.
For a two timestep problem your transition matrix will be 5x5 and look like:
[[ 0.5 0.2 0. 0. 0. ]
[ 0.3 0.5 0.2 0. 0. ]
[ 0. 0.3 0.5 0.2 0. ]
[ 0. 0. 0.3 0.5 0.2]
[ 0. 0. 0. 0.3 0.5]]
And your state at time 0 will be:
[[ 0.]
[ 0.]
[ 1.]
[ 0.]
[ 0.]]
The one step evolution of the system can be predicted by multiplying A and x.
So at t = 1,
x.T = [[ 0. 0.2 0.5 0.3 0. ]]
and at t = 2,
x.T = [[ 0.04 0.2 0.37 0.3 0.09]]
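A quick dense-NumPy sketch that reproduces these two hand-computed states, using the 5x5 matrix and initial state shown above (A.dot(x) is the one-step evolution):
import numpy as np

A = np.array([[0.5, 0.2, 0. , 0. , 0. ],
              [0.3, 0.5, 0.2, 0. , 0. ],
              [0. , 0.3, 0.5, 0.2, 0. ],
              [0. , 0. , 0.3, 0.5, 0.2],
              [0. , 0. , 0. , 0.3, 0.5]])
x = np.array([[0.], [0.], [1.], [0.], [0.]])
x1 = A.dot(x)   # -> [0, 0.2, 0.5, 0.3, 0]
x2 = A.dot(x1)  # -> [0.04, 0.2, 0.37, 0.3, 0.09]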
Because, even for modest numbers of timesteps, this is potentially going to take a fair bit of storage (A requires O(n^2) storage) but is very sparse, we can use sparse matrices to reduce our storage (and speed up our calculations). Doing this means A requires approximately 3n elements.
import scipy.sparse as sp
import numpy as np
def random_walk_transition_probability(n, left=0.3, centre=0.5, right=0.2):
    m = 2*n + 1
    A = sp.csr_matrix((m, m))
    A += sp.diags(centre*np.ones(m), 0)
    A += sp.diags(left*np.ones(m-1), -1)
    A += sp.diags(right*np.ones(m-1), 1)
    x = np.zeros((m, 1))
    x[n] = 1.0
    for i in xrange(n):
        x = A.dot(x)
    return x
print random_walk_transition_probability(4)
Timings
%timeit random_walk_transition_probability(500)
100 loops, best of 3: 7.12 ms per loop
%timeit random_walk_transition_probability(10000)
1 loops, best of 3: 1.06 s per loop