I have a 2d array, where each element is a fourier transform. I'd like to split transform 'logarithmically'. For example, let's take a single one of those arrays and call it a:
a = np.arange(0, 512)
# I want to split a into 'bins' defined by b, below:
b = np.array([0] + [10 * 2**i for i in range(7)]) # [0, 10, 20, 40, 80, 160, 320, 640]
What I'm looking to do is something like using np.split, except I would like to split values into 'bins' based on array b such that all values of a between [0, 10) are in one bin, all values between [10, 20) in another, etc.
I could do this in some sort of convoluted for loop:
split_arr = []
for i in range(1, len(b)):
    fbin = []
    for amp in a:
        if (amp >= b[i-1]) and (amp < b[i]):
            fbin.append(amp)
    split_arr.append(fbin)
I have many arrays to split, and also this is ugly (just my opinion). Is there a better way?
Here is how you can do it, using np.split:
np.split(a, np.searchsorted(a,b))
If your array a is not sorted, sort it before the above command:
a = np.sort(a)
np.searchsorted finds the indices at which the values of b would have to be inserted into the sorted array a to keep it sorted. In other words, it finds the locations where you want to split your array. And if you do not want the empty array at the beginning, simply remove the 0 from b.
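For example, a minimal sketch putting this together with the arrays from the question:

import numpy as np

a = np.arange(0, 512)                               # already sorted here
b = np.array([0] + [10 * 2**i for i in range(7)])   # [0, 10, 20, 40, 80, 160, 320, 640]

bins = np.split(a, np.searchsorted(a, b))
# the leading 0 and any edge beyond a.max() produce empty chunks at the ends;
# drop them (or remove the 0 from b) if you do not want them
bins = [fbin for fbin in bins if fbin.size]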
First you can reduce the 'ugliness' by using list comprehension:
split_arr = [[amp for amp in a if (amp >= b[i-1]) and (amp < b[i])] for i in range(1, len(b))]
Then you can apply the same logic using NumPy's fast vectorized operations (which has the bonus of looking even cleaner):
split_arr = [a[(a >= b[i-1]) & (a < b[i])] for i in range(1, len(b))]
Comparison:
%timeit [[amp for amp in a if (amp >= b[i-1]) and (amp < b[i])] for i in range(1, len(b))]
1.29 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [a[(a >= b[i-1]) & (a < b[i])] for i in range(1, len(b))]
35.9 µs ± 4.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I am trying to efficiently compute a summation of a summation in Python: sum_{x=0}^{n} x^2 * sum_{y=0}^{x} y.
WolframAlpha is able to compute it up to a high n value: sum of sum.
I have two approaches: a for loop method and an np.sum method. I thought the np.sum approach would be faster. However, they are the same until a large n, after which the np.sum has overflow errors and gives the wrong result.
I am trying to find the fastest way to compute this sum.
import numpy as np
import time
def summation(start,end,func):
    sum=0
    for i in range(start,end+1):
        sum+=func(i)
    return sum

def x(y):
    return y

def x2(y):
    return y**2

def mysum(y):
    return x2(y)*summation(0, y, x)
n=100
# method #1
start=time.time()
summation(0,n,mysum)
print('Slow method:',time.time()-start)
# method #2
start=time.time()
w=np.arange(0,n+1)
(w**2*np.cumsum(w)).sum()
print('Fast method:',time.time()-start)
Here's a very fast way:
result = ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120
How I got there:
Rewrite the inner sum as the well-known x*(x+1)//2. So the whole thing becomes sum(x**2 * x*(x+1)//2 for x in range(n+1)).
Rewrite to sum(x**4 + x**3 for x in range(n+1)) // 2.
Look up formulas for sum(x**4) and sum(x**3).
Simplify the resulting mess to (12*n**5 + 45*n**4 + 50*n**3 + 15*n**2 - 2*n) // 120.
Horner it.
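As a quick sanity check, the closed form can be compared against the naive double sum (a minimal sketch):

def naive_sum(n):
    return sum(x**2 * sum(range(x + 1)) for x in range(n + 1))

def closed_form(n):
    return ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120

assert all(naive_sum(n) == closed_form(n) for n in range(200))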
Another way to derive it if after steps 1. and 2. you know it's a polynomial of degree 5:
Compute six values with a naive implementation.
Compute the polynomial from the six equations with six unknowns (the polynomial coefficients). I did it similarly to this, but my matrix A is left-right mirrored compared to that, and I called my y-vector b.
Code:
from fractions import Fraction
import math
from functools import reduce
def naive(n):
    return sum(x**2 * sum(range(x+1)) for x in range(n+1))

def lcm(ints):
    return reduce(lambda r, i: r * i // math.gcd(r, i), ints)

def polynomial(xys):
    xs, ys = zip(*xys)
    n = len(xs)
    A = [[Fraction(x**i) for i in range(n)] for x in xs]
    b = list(ys)
    for _ in range(2):
        for i0 in range(n):
            for i in range(i0 + 1, n):
                f = A[i][i0] / A[i0][i0]
                for j in range(i0, n):
                    A[i][j] -= f * A[i0][j]
                b[i] -= f * b[i0]
        A = [row[::-1] for row in A[::-1]]
        b.reverse()
    coeffs = [b[i] / A[i][i] for i in range(n)]
    denominator = lcm(c.denominator for c in coeffs)
    coeffs = [int(c * denominator) for c in coeffs]
    horner = str(coeffs[-1])
    for c in coeffs[-2::-1]:
        horner += ' * n'
        if c:
            horner = f"({horner} {'+' if c > 0 else '-'} {abs(c)})"
    return f'{horner} // {denominator}'
print(polynomial((x, naive(x)) for x in range(6)))
Output (Try it online!):
((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120
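If you want to use the returned string programmatically, it can be turned into a callable, e.g. (a quick sketch):

expr = polynomial((x, naive(x)) for x in range(6))
fast = eval('lambda n: ' + expr)   # the string uses 'n' as its only free variable
assert all(fast(n) == naive(n) for n in range(100))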
(fastest methods, 3 and 4, are at the end)
In the fast NumPy method you need to specify dtype=object so that NumPy does not convert the Python ints to its own fixed-size dtypes (np.int64 or others). It will now give you correct results (checked up to N=100000).
# method #2
start=time.time()
w=np.arange(0, n+1, dtype=object)
result2 = (w**2*np.cumsum(w)).sum()
print('Fast method:', time.time()-start)
Is your fast solution significantly faster than the slow one? Yes, for large N's, but already at N=100 it is about 8 times faster:
start=time.time()
for i in range(100):
    result1 = summation(0, n, mysum)
print('Slow method:', time.time()-start)
# method #2
start=time.time()
for i in range(100):
    w=np.arange(0, n+1, dtype=object)
    result2 = (w**2*np.cumsum(w)).sum()
print('Fast method:', time.time()-start)
Slow method: 0.06906533241271973
Fast method: 0.008007287979125977
EDIT: An even faster method (by KellyBundy, the Pumpkin) uses pure Python. It turns out NumPy has no advantage here, because it has no vectorized code for object arrays.
# method #3
import itertools
start=time.time()
for i in range(100):
    result3 = sum(x*x * ysum for x, ysum in enumerate(itertools.accumulate(range(n+1))))
print('Faster, pure python:', (time.time()-start))
Faster, pure python: 0.0009944438934326172
EDIT2: Forss noticed that the fast NumPy method can be optimized by using x*x instead of x**2. For N > 200 it is faster than the pure Python method; for N < 200 it is slower (the exact boundary may depend on the machine, on mine it was 200, so it is best to check it yourself):
# method #4
start=time.time()
for i in range(100):
    w = np.arange(0, n+1, dtype=object)
    result2 = (w*w*np.cumsum(w)).sum()
print('Fast method x*x:', time.time()-start)
Comparing Python with WolframAlpha like that is unfair, since Wolfram will simplify the equation before computing.
Fortunately, the Python ecosystem knows no limits, so you can use SymPy:
from sympy import summation
from sympy import symbols
n, x, y = symbols("n,x,y")
eq = summation(x ** 2 * summation(y, (y, 0, x)), (x, 0, n))
eq.evalf(subs={"n": 1000})
It will compute the expected result almost instantly: 100375416791650. This is because SymPy simplifies the equation for you, just like Wolfram does, as you can see by inspecting the value of eq.
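For example (a small sketch; the exact printed form may vary between SymPy versions):

from sympy import expand
print(expand(eq))  # expands to the same degree-5 polynomial in n as the closed form derived in the answer above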
@Kelly Bundy's answer is awesome, but if you are like me and use a calculator to compute 2 + 2, then you will love SymPy ❤. As you can see, it gets you the same result with just 3 lines of code and is a solution that would also work for other, more complex cases.
In a comment, you mention that it's really f(x) and g(y) instead of x^2 and y. If you only need an approximation to that sum, you can pretend the sums are midpoint Riemann sums, so that your sum is approximated by the double integral ∫_{-0.5}^{n+0.5} f(x) ∫_{-0.5}^{x+0.5} g(y) dy dx.
With your original f(x) = x^2 and g(y) = y, this simplifies to n^5/10 + 3n^4/8 + n^3/2 + 5n^2/16 + 3n/32 + 1/160, which differs from the correct result by n^3/12 + 3n^2/16 + 53n/480 + 1/160.
Based on this, I suspect that (actual - integral)/actual would be max(f'', g'') * O(n^-2), but I wasn't able to prove it.
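For instance, a small numeric check of that claim for the original f and g (a sketch; the approximation is kept exact with Fraction):

from fractions import Fraction as F

def exact(n):
    # closed form from the answer above
    return ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120

def midpoint_approx(n):
    # n^5/10 + 3n^4/8 + n^3/2 + 5n^2/16 + 3n/32 + 1/160
    n = F(n)
    return n**5 / 10 + 3 * n**4 / 8 + n**3 / 2 + 5 * n**2 / 16 + 3 * n / 32 + F(1, 160)

for n in (10, 100, 1000):
    print(n, float((midpoint_approx(n) - exact(n)) / exact(n)))  # relative error falls off roughly like 1/n**2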
All the answers use math to simplify the expression or implement the loop in Python trying to be CPU optimal, but they are not memory optimal.
Here is a naive implementation, without any mathematical simplification, which is memory efficient:
def function5():
    inner_sum = float()
    result = float()
    for x in range(0, n + 1):
        inner_sum += x
        result += x ** 2 * inner_sum
    return result
It is quite slow with respect to the other solutions by dankal444:
method 2 | 31 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
method 3 | 116 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
method 4 | 91 µs ± 356 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
function 5 | 217 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
By the way, if you jit the function with numba (there may be better options):
from numba import jit
function5 = jit(nopython=True)(function5)
you get
59.8 ns ± 0.209 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I wrote this code, but it works very slowly.
I'm figuring out how many times I have to run the random generator to get numbers less than or equal to inv, in this case six. I count the number of attempts until a number <= invers is generated, then decrease invers by one and repeat the loop until invers is 0, i.e. until six such numbers have been generated.
And I repeat all of this 10**4 times to find the arithmetic mean.
Help me speed up this code, it works extremely slowly. I would be immensely grateful. Without third-party libraries please. Thanks!
import random

inv = 6

def math(inv):
    n = 10**4
    counter = 0
    while n != 0:
        invers = inv
        count = 0
        while invers > 0:
            count += 1
            random_digit = random.randint(1, 45)
            if random_digit <= invers:
                invers -= 1
                counter += count
                count = 0
        if invers == 0:
            n -= 1
            invers = inv
    print(counter/10**4)
math(inv)
Here is a simple way to accelerate your code as is using numba:
m2 = numba.jit(nopython=True)(math)
Timings in ipython:
%timeit math(inv)
1.44 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit -n 7 m2(inv)
10.4 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This speeds up your code by over 100x.
You don't need all those loops. numpy.random.randint() can generate an array of a given size with random integers between low and high, so
np.random.randint(1, 45, (1000, 10000)) will generate a matrix of 1000 rows and 10k columns filled with random numbers. Then all you need to do is find the first row in each column that contains a value less than inv (Numpy: find first index of value fast). Then
max_tries = 1000

def do_math(inv):
    rands = np.random.randint(1, 45, (max_tries, 10000))
    first_inv = np.argmax(rands < inv, axis=0)
    counter = first_inv.mean()
    return counter
This is not exactly what you want your function to do, but I found all your loops quite convoluted, so this should point you in the right direction; feel free to adapt this code to what you need. This function will give you the number of tries required to get a random number less than inv, averaged over 10000 experiments.
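A hypothetical usage example (the printed value is random and varies from run to run):

print(do_math(6))  # mean index, over the 10000 columns, of the first draw below inv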
I am looking for an "optimal" way to compute all pairwise products of a given vector's elements. If the vector is of size N, the output will be a vector of size N * (N + 1) // 2 and contain x[i] * x[j] values for all (i, j) pairs with i <= j. The naive way to compute this is as follows:
import numpy as np
def get_pairwise_products_naive(vec: np.ndarray):
    k, size = 0, vec.size
    output = np.empty(size * (size + 1) // 2)
    for i in range(size):
        for j in range(i, size):
            output[k] = vec[i] * vec[j]
            k += 1
    return output
Desiderata:
Minimize extra memory allocations/usage: Directly write to the output buffer if possible.
Use vectorized NumPy routines instead of explicit loops.
Avoid extra (unnecessary) calculations.
I have been playing with routines such as outer, triu_indices and einsum as well as some indexing/view tricks, but haven't been able to find a solution that fits the above desiderata.
Approach #1
For a vectorized one with NumPy, you can use masking after getting all the pairwise multiplications with an outer multiplication, like so -
def pairwise_multiply_masking(a):
    return (a[:,None]*a)[~np.tri(len(a),k=-1,dtype=bool)]
Approach #2
For really big input 1D arrays, we might want to resort to an iterative slicing method that uses one loop -
def pairwise_multiply_iterative_slicing(a):
    n = len(a)
    N = (n*(n+1))//2
    out = np.empty(N, dtype=a.dtype)
    c = np.r_[0,np.arange(n,0,-1)].cumsum()
    for ii,(i,j) in enumerate(zip(c[:-1],c[1:])):
        out[i:j] = a[ii:]*a[ii]
    return out
Benchmarking
We will include pairwise_products and pairwise_products_numba from @orlp's solution in the setup.
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
import benchit
funcs = [pairwise_multiply_masking, pairwise_multiply_iterative_slicing, pairwise_products_numba, pairwise_products]
in_ = [np.random.rand(n) for n in [10,50,100,200,500,1000,5000]]
t = benchit.timings(funcs, in_)
t.plot(logx=True, save='timings.png')
t.speedups(-1).plot(logx=True, logy=False, save='speedups.png')
Results (timings and speedups over pairwise_products) -
As can be seen from the plot trends, for really large arrays the slicing-based one starts winning; otherwise the vectorized one does a good job.
Suggestions
We can also look into numexpr for performing the outer multiplications more efficiently for large arrays.
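For example, a possible sketch of that idea (assuming numexpr is installed and broadcasts these shapes; the function name is arbitrary):

import numexpr as ne
import numpy as np

def pairwise_multiply_numexpr(a):
    # outer product evaluated by numexpr (multithreaded); the boolean mask still
    # materialises an n x n intermediate, so this mainly helps speed, not memory
    prod = ne.evaluate('x * y', local_dict={'x': a[:, None], 'y': a})
    return prod[~np.tri(len(a), k=-1, dtype=bool)]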
I would probably compute M = v v^T (the outer product) and then flatten the lower or upper triangular portion of this matrix.
def pairwise_products(v: np.ndarray):
    assert len(v.shape) == 1
    n = v.shape[0]
    m = v.reshape(n, 1) * v.reshape(1, n)
    return m[np.tril_indices_from(m)].ravel()
I would also like to mention numba, which would most likely make your 'naive' approach faster than this one.
import numba

@numba.njit
def pairwise_products_numba(vec: np.ndarray):
    k, size = 0, vec.size
    output = np.empty(size * (size + 1) // 2)
    for i in range(size):
        for j in range(i, size):
            output[k] = vec[i] * vec[j]
            k += 1
    return output
Just testing the above pairwise_products(np.arange(5000)) takes ~0.3 sec whereas the numba version takes ~0.05 sec (ignoring the first run which is used to just-in-time compile the function).
You could also parallelize this algorithm. If it is possible to allocate a large enough array only once and overwrite it afterwards (a smaller view on this array costs almost nothing), larger speedups can be achieved.
Example
@numba.njit(parallel=True)
def pairwise_products_numba_2_with_allocation(vec):
    k, size = 0, vec.size
    k_vec = np.empty(vec.size, dtype=np.int64)
    output = np.empty(size * (size + 1) // 2)
    # precalculate the indices
    for i in range(size):
        k_vec[i] = k
        k += (size-i)
    for i in numba.prange(size):
        k = k_vec[i]
        for j in range(size-i):
            output[k+j] = vec[i] * vec[j+i]
    return output

@numba.njit(parallel=True)
def pairwise_products_numba_2_without_allocation(vec, output):
    k, size = 0, vec.size
    k_vec = np.empty(vec.size, dtype=np.int64)
    # precalculate the indices
    for i in range(size):
        k_vec[i] = k
        k += (size-i)
    for i in numba.prange(size):
        k = k_vec[i]
        for j in range(size-i):
            output[k+j] = vec[i] * vec[j+i]
    return output
Timings
A=np.arange(5000)
k, size = 0, A.size
output = np.empty(size * (size + 1) // 2)
%timeit res_1=pairwise_products_numba_2_without_allocation(A,output)
#7.84 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pairwise_products_numba_2_with_allocation(A)
#16.9 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=pairwise_products_numba(A)  # @orlp
#43.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have to solve a lot of linear systems using the Scipy pivoted QR-decomposition.
Q, R, perm = scipy.linalg.qr(PW, pivoting=True, mode='full')
While solving the system, I reorder the solution with a permutation matrix generated by the function below.
def pvec2pmat(vec):
    n = len(vec)
    P = np.zeros((n, n))
    counter = 0
    for i in range(0, n):
        for j in range(0, n):
            if j == vec[counter]:
                P[i, j] = 1.0
                counter = counter + 1
                break
    return P.T
Unfortunately, this turns out to be very slow and the code spends a lot of time generating these matrices.
Is it possible to speedup this function?
It is very hard to answer your question, as I do not know what your code is supposed to do; furthermore, there seems to be no connection to the title. If I understand you correctly, you ask me to optimize the given function, without even knowing what the input is.
I will assume that vec is expected to be a 1-dimensional integer array. In this case your second loop is quite unnecessary. There are two cases:
vec[counter] in range(0, n)
In this case you set P[i, vec[counter]] to one, and increase the counter (due to the break statement)
vec[counter] not in range(0, n)
The counter will never be increased, as the if statement will never be True. Thus, we effectively ignore the rest of vec and return the matrix.
Therefore a first simplification would be:
def pvec2pmat(vec):
    n = len(vec)
    P = np.zeros((n, n))
    counter = 0
    for i in range(0, n):
        if vec[counter] in range(0, n):
            P[i, vec[counter]] = 1.0
            counter += 1
        else:
            return P.T
    return P.T
So the relevant part of vec only extends until the first time a value not in range(0, n) is reached. We can check this right at the beginning and discard the rest.
We can do this using
invalid = (vec < 0) | (vec >= n)
try:
    first_invalid = np.flatnonzero(invalid)[0]
except IndexError:  # no invalid values
    pass
else:
    vec = vec[:first_invalid]  # keep only the part before the first invalid value
Now we know that we assign exactly one value in each of the first vec.size rows.
So we can simplify the loop
for i, vec_val in enumerate(vec):
    P[i, vec_val] = 1
This can however also be done using indexing:
P[np.arange(vec.size), vec] = 1
Finally we realize that, instead of taking the transpose, we can just assign in the reverse order and get
def pvec2pmat(vec):
    n = len(vec)
    P = np.zeros((n, n))
    invalid = (vec < 0) | (vec >= n)
    try:
        first_invalid = np.flatnonzero(invalid)[0]
    except IndexError:  # no invalid values
        pass
    else:
        vec = vec[:first_invalid]  # keep only the part before the first invalid value
    P[vec, np.arange(vec.size)] = 1
    return P
A quick timing:
vec = np.arange(1000)
np.random.shuffle(vec)
# your version
128 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# my version
%timeit pvec2pmat(vec)
379 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# for this simple example of a permutation, we can of course resort to
# simple indexing
%timeit np.eye(vec.size)[:, vec]
8.89 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For this particular example, my code takes 1/300 of the time of your version. Depending on your needs, another large speedup can be achieved if P is chosen as a boolean matrix and we assign True.
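A minimal sketch of that boolean variant (assuming vec is a valid permutation of range(n), as in the timing example above):

def pvec2pmat_bool(vec):
    n = vec.size
    P = np.zeros((n, n), dtype=bool)   # boolean storage instead of float64
    P[vec, np.arange(n)] = True
    return P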
I am looking to memory optimise np.packbits(A==A[:, None], axis=1), where A is dense array of integers of length n. A==A[:, None] is memory hungry for large n since the resulting Boolean array is stored inefficiently with each Boolean value costing 1 byte.
I wrote the below script to achieve the same result while packing bits one section at a time. It is, however, around 3x slower, so I am looking for ways to speed it up. Or, alternatively, a better algorithm with small memory overhead.
Note: this is a follow-up question to one I asked earlier; Comparing numpy array with itself by element efficiently.
Reproducible code below for benchmarking.
import numpy as np
from numba import jit

@jit(nopython=True)
def bool2int(x):
    y = 0
    for i, j in enumerate(x):
        if j: y += int(j)<<(7-i)
    return y

@jit(nopython=True)
def compare_elementwise(arr, result, section):
    n = len(arr)
    for row in range(n):
        for col in range(n):
            section[col%8] = arr[row] == arr[col]
            if ((col + 1) % 8 == 0) or (col == (n-1)):
                result[row, col // 8] = bool2int(section)
                section[:] = 0
    return result
n = 10000
A = np.random.randint(0, 1000, n)
result_arr = np.zeros((n, n // 8 if n % 8 == 0 else n // 8 + 1)).astype(np.uint8)
selection_arr = np.zeros(8).astype(np.uint8)
# memory efficient version, but slow
packed = compare_elementwise(A, result_arr, selection_arr)
# memory inefficient version, but fast
packed2 = np.packbits(A == A[:, None], axis=1)
assert (packed == packed2).all()
%timeit compare_elementwise(A, result_arr, selection_arr) # 1.6 seconds
%timeit np.packbits(A == A[:, None], axis=1) # 0.460 second
Here is a solution 3 times faster than the NumPy one (a.size must be a multiple of 8; see below):
import numba as nb

@nb.njit
def comp(a):
    res = np.zeros((a.size, a.size//8), np.uint8)
    for i, x in enumerate(a):
        for j, y in enumerate(a):
            if x == y: res[i, j//8] |= 128 >> j%8
    return res
This works because the array is scanned only once, whereas you scan it many times, and almost all terms are zero.
In [122]: %timeit np.packbits(A == A[:, None], axis=1)
389 ms ± 57.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [123]: %timeit comp(A)
123 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If a.size % 8 > 0, the cost of recovering the information afterwards will be higher. The best way in this case is to pad the initial array with some (up to 7) zeros.
For completeness, the padding could be done like so:
if A.size % 8 != 0: A = np.pad(A, (0, 8 - A.size % 8), 'constant', constant_values=0)
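For instance, a quick check (a sketch) that the padded input gives the same packed bits as np.packbits; recovering just the original rows and columns afterwards is a matter of slicing:

A = np.random.randint(0, 1000, 10001)  # length deliberately not a multiple of 8
if A.size % 8 != 0: A = np.pad(A, (0, 8 - A.size % 8), 'constant', constant_values=0)
assert (comp(A) == np.packbits(A == A[:, None], axis=1)).all()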