I have a bottleneck in my code which I am struggling with.
Take an array A, of size (N x M), only containing 1s and 0s. I need an algorithm which takes all combinations of two rows of A and counts the overlaps between them.
More specifically, I need a faster alternative to the following algorithm:
for i in range(A.shape[0]):
    for j in range(A.shape[0]):
        a = b = c = d = 0
        for k in range(A.shape[1]):
            if A[i][k] == 1 and A[j][k] == 1:
                a += 1
            if A[i][k] == 0 and A[j][k] == 0:
                b += 1
            if A[i][k] == 1 and A[j][k] == 0:
                c += 1
            if A[i][k] == 0 and A[j][k] == 1:
                d += 1
        print(a, b, c, d)
Thanks for any replies!
Since a, b, c, d are computed inside the loop, I assume you want them for each pair i, j. I am going to build one matrix for each of them, where the element at [i, j] is the corresponding value of a, b, c, d from your loop over i, j, without ANY loops. For example, a[i, j] is your value of a in iteration i, j:
A_c = 1-A
a = np.dot(A, A.T)
b = np.dot(A_c, A.T)
c = np.dot(A, A_c.T)
d = np.dot(A_c, A_c.T)
If you care even more about speed, you can factor out and reuse some of the calculations in the equations above.
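For instance, every entry of b, c, and d can be recovered from a and the per-row counts of 1s, so only one matrix product is strictly needed. A minimal sketch of that factorization (same labels as above, with A and np as in the snippet, and A.shape[1] playing the role of M):

ones = A.sum(axis=1)            # number of 1s in each row
a = A @ A.T                     # both rows have a 1
b = ones[None, :] - a           # row i has 0, row j has 1  (= A_c @ A.T)
c = ones[:, None] - a           # row i has 1, row j has 0  (= A @ A_c.T)
d = A.shape[1] - a - b - c      # both rows have 0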
While the above answer is absolutely correct, I'd like to follow up with a more technical answer - mostly because I was doing something very similar to the problem in your question last week, and learned some cool stuff along the way.
First of all, yes, matrix multiplications and vectorization are the right way to go. However, these can get a bit expensive when the matrices become large. Let me show a small benchmark for N=100 and M=100:
N,M = 100,100
A = np.random.randint(2,size=(N,M))
def type1():
    A_c = 1 - A
    a = np.dot(A, A.T)
    b = np.dot(A_c, A.T)
    c = np.dot(A, A_c.T)
    d = np.dot(A_c, A_c.T)
    return a, b, c, d
%timeit -n 100 type1()
>>>3.76 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One easy speedup comes from the fact that a + b + c + d = M. We don't actually need to compute d, so we can drop one expensive dot product here!
def type2():
    A_c = 1 - A
    a = np.dot(A, A.T)
    b = np.dot(A_c, A.T)
    c = np.dot(A, A_c.T)
    return a, b, c, M - (a + b + c)
%timeit -n 100 type2()
>>>2.81 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That shaved off almost a millisecond, but we can do even better. Numpy arrays come in two orders: C-Contiguous and F-Contiguous. You can check this by printing A.flags; A is a C-Contiguous array by default. However, its transpose A.T is represented as an F-Contiguous array, and when we pass them to dot, an internal copy is created for A.T since the ordering doesn't match.
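You can check the layout flags directly; for example:

print(A.flags['C_CONTIGUOUS'])    # True:  A is C-ordered by default
print(A.T.flags['F_CONTIGUOUS'])  # True:  the transpose is an F-ordered view
print(A.T.flags['C_CONTIGUOUS'])  # False: so dot may have to copy/reorder it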
One way to bypass this is to go over to scipy and hook our program up with BLAS (https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms), in particular the general matrix multiplication routine, gemm.
from scipy.linalg import blas as B
def type3():
    A_c = 1 - A
    a = B.dgemm(alpha=1.0, a=A, b=A, trans_b=True)
    b = B.dgemm(alpha=1.0, a=A_c, b=A, trans_b=True)
    c = B.dgemm(alpha=1.0, a=A, b=A_c, trans_b=True)
    return a, b, c, M - (a + b + c)
%timeit -n 100 type3()
>>>449 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And the time has gone down directly from milliseconds to microseconds, which is pretty awesome.
Just for the sport of it, here is a method that is three times faster than the current fastest one. We take advantage of matrix computations being faster on floats, in particular float32. Furthermore, we do only one matrix multiplication and infer the other numbers by much cheaper means:
def pp():
    A1 = np.count_nonzero(A, 1)
    Af = A.astype('f4')
    a = Af @ Af.T
    b = A1 - a
    c = b.T
    d = M - a - b - c
    return a, b, c, d
[*map(np.array_equal,pp(),type3())]
# [True, True, True, True]
from timeit import timeit
timeit(pp, number=1000)
# 0.14910832402529195
timeit(type3, number=1000)
# 0.4432948770117946
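The float32 effect can also be seen in isolation, because NumPy only hands floating-point matrix products to BLAS while integer products go through a generic loop; a rough sketch (exact numbers depend on your BLAS build):

Ai = np.random.randint(2, size=(1000, 1000))  # integer array: generic matmul loop
Af = Ai.astype('f4')                          # float32: dispatched to BLAS (sgemm)
%timeit Ai @ Ai.T
%timeit Af @ Af.T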
Related
Multiplying large matrices takes a very long time. How can this problem be solved? I use the galois library and numpy, and I think it should still work stably. I tried to implement my own GF4 arithmetic and multiply the matrices using numpy, but it takes even longer. Thank you for any replies.
For r = 2, 3, 4, 5, 6 the multiplication is quick, but after that it takes a long time. To me these are not very large matrix sizes. This is just a code snippet: given r, I get the sizes n, k of matrices of a certain family, and I need to multiply matrices of those resulting sizes.
import numpy as np
import galois

def family_Hamming(q, r):
    n = int((q**r - 1) / (q - 1))
    k = int((q**r - 1) / (q - 1) - r)
    res = (n, k)
    return res

q = 4
r = 7
n, k = family_Hamming(q, r)
GF = galois.GF(2**2)
# (5454, 5454)
a = GF(np.random.randint(4, size=(k, k)))
# (5454, 5461)
b = GF(np.random.randint(4, size=(k, n)))
c = np.dot(a, b)
print(c)
I'm not sure if it is actually faster, but np.dot should be used for the dot product of two vectors; for matrix multiplication use A @ B. That's as efficient as you can get with Python, as far as I know.
I'm the author of galois. I added performance improvements to matrix multiplication in v0.3.0 by parallelizing the arithmetic over multiple cores. The next performance improvement will come once GPU support is added.
I'm open to other performance improvement suggestions, but as far as I know the algorithm is running as fast as possible on a CPU.
In [1]: import galois
In [2]: GF = galois.GF(2**2)
In [3]: A = GF.Random((300, 400), seed=1)
In [4]: B = GF.Random((400, 500), seed=2)
# v0.2.0
In [5]: %timeit A @ B
1.02 s ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# v0.3.0
In [5]: %timeit A @ B
99 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Try using jax on a CUDA runtime. For example, you can try it out on Google Colab's free GPU. (Open a notebook -> Runtime -> Change runtime type -> GPU).
import numpy as np
import jax.numpy as jnp
from jax import device_put

# GF, k and n as defined in the question
a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))
a, b = device_put(a), device_put(b)
c = jnp.dot(a, b)
c = np.asarray(c)
Timing test:
%timeit jnp.dot(a, b).block_until_ready()
# 765 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have to make a program that multiplies two arrays element by element, like this:
the first number of the first list of the first array with the first number of the first list of the second array, and so on. For example:
Input
array1 = [[1,2,3], [3,2,1]]
array2 = [[4,2,5], [5,6,7]]
So my output must be:
result = [[4,4,15],[15,12,7]]
So far my code is the following:
def multiplyArrays(array1, array2):
    if verifySameSize(array1, array2):
        for i in array1:
            for j in i:
                digitA1 = j
                for x in array2:
                    for a in x:
                        digitA2 = a
                        mult = digitA1 * digitA2
        return mult
    return 'Arrays must be the same size'
It's safe to say it's not working, since the result I'm getting for the example I gave is 7, not even an array. So, what am I doing wrong?
If you want a simple solution, use numpy:
import numpy as np
array1 = np.array([[1,2,3], [3,2,1]])
array2 = np.array([[4,2,5], [5,6,7]])
result = array1 * array2
If you want a general solution for your own understanding, then it becomes a bit harder: how in-depth do you want the implementation to be? There are many checks, for example the same sizes, same types, number of dimensions, etc. (a sketch of one such check follows below).
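As an illustration of one such check: the verifySameSize helper called in the question's code is never shown, so here is a hypothetical version of it (the name and behaviour are assumptions, not part of the original):

def verifySameSize(array1, array2):
    # hypothetical helper: same number of rows and matching row lengths
    if len(array1) != len(array2):
        return False
    return all(len(r1) == len(r2) for r1, r2 in zip(array1, array2))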
The problem in your code is using for-each loops instead of indexing. for i in array1 runs twice, yielding a list each time (first [1,2,3], then [3,2,1]). You then run a for-each loop over each of those lists, yielding a single number, so the output you get is just the result of the last operation (1 * 7 = 7). You should create an empty list and append your results inside ordinary index-based for loops (not for-each loops).
So your function becomes:
def multiplyArrays(array1, array2):
    result = []
    for i in range(len(array1)):
        result.append([])
        for j in range(len(array1[i])):
            result[i].append(array1[i][j] * array2[i][j])
    return result
This is a bad idea though, because it only works with 2D arrays and there are no checks. Avoid writing your own functions unless you absolutely need to.
You can use zip() to iterate over the lists at the same time:
array1 = [[1,2,3], [3,2,1]]
array2 = [[4,2,5], [5,6,7]]
def multiplyArrays(array1, array2):
    result = []
    for inner1, inner2 in zip(array1, array2):
        inner = []
        for item1, item2 in zip(inner1, inner2):
            inner.append(item1 * item2)
        result.append(inner)
    return result
print(multiplyArrays(array1,array2))
Output as requested.
Here are three pure-Python one-liners that yield your expected output, two of which are simply list comprehension versions of the other two answers. List comprehension equivalents are generally more efficient, but you should choose what is most readable for you.
Method 1
@quamrana's, as a list comprehension.
res = [[a * b for a, b in zip(c, d)] for c, d in zip(arr1, arr2)]
Method 2
@OM222O's, as a list comprehension.
res = [[ arr1[i][j] * arr2[i][j] for j in range(len(arr1[0])) ] for i in range(len(arr1))]
Method 3
Similar to Method 1, but makes use of operator.mul(a, b) (which returns a * b) from the operator module and the built-in map(function, iterable, ...) function. The map function "[r]eturn[s] an iterator that applies function to every item of iterable, yielding the results." So given two lists a (from array1) and b (from array2), map(operator.mul, a, b) returns an iterator that yields the results of multiplying each element in a with the element in b at the same index. list() converts the results into a list.
import operator
res = [list(map(operator.mul, a, b)) for a, b in zip(arr1, arr2)]
Simple Benchmark
Input
from random import randint
arr1 = [[randint(1, 25) for i in range(1_000)] for j in range(1_000)]
arr2 = [[randint(1, 25) for i in range(1_000)] for j in range(1_000)]
Ordered from fastest to slowest
# Method 3
29.2 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Method 1
44.4 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Method 2
79.3 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy multiplication (inclusive of time required to convert list to array)
81.7 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can see that Method 3 (the operator.mul approach) appears fastest and the numpy approach appears the slowest. There is a big caveat, of course, as the numpy timings included the time required to convert the lists to arrays. In order to make meaningful comparisons, we need to specify whether the input and/or output is a list and/or an array. Clearly, if the inputs are already lists and the results must also be lists, then we can be happy with standard Python approaches.
However, if arr1 and arr2 are already numpy arrays, element-wise multiplication is incredibly fast:
1.47 ms ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
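To make the distinction concrete, the two numpy figures above roughly correspond to these setups (a sketch of the benchmark harness, which is an assumption on my part; arr1 and arr2 are the lists from the Input section):

import numpy as np

def numpy_including_conversion(arr1, arr2):
    # what the ~81.7 ms figure measures: list-to-array conversion plus multiply
    return np.array(arr1) * np.array(arr2)

np_arr1, np_arr2 = np.array(arr1), np.array(arr2)

def numpy_arrays_only(a, b):
    # what the ~1.47 ms figure measures: inputs are already arrays
    return a * b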
A simpler approach, without using any module:
array1 = [[1, 2, 3], [3, 2, 1]]
array2 = [[4, 2, 5], [5, 6, 7]]
result = []
i = 0
while i < len(array1):
    sub_array1 = array1[i]
    sub_array2 = array2[i]
    a, b, c = sub_array1
    d, e, f = sub_array2
    inner_list = [a * d, b * e, c * f]
    result.append(inner_list)
    i += 1
print(result)
Output:
[[4,4,15],[15,12,7]]
I have an array of two-dimensional arrays named matrices. Each matrix in there is of dimension 1000 x 1000 and consists of positive values. Now I want to take the log of all values in all the matrices (except for 0). How do I do this easily in Python? I have the following code that does what I want, but knowing Python, it can surely be made more brief:
newMatrices = []
for matrix in matrices:
    newMaxtrix = []
    for row in matrix:
        newRow = []
        for value in row:
            if value > 0:
                newRow.append(np.log(value))
            else:
                newRow.append(value)
        newMaxtrix.append(newRow)
    newMatrices.append(newMaxtrix)
You can convert it into a numpy array and use numpy.log to calculate the values.
For 0 values, the result will be -inf. After that you can convert it back to a list and replace the -inf with 0.
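A minimal sketch of that idea, assuming matrix is one of the 1000 x 1000 nested lists:

import numpy as np

arr = np.asarray(matrix, dtype=float)
res = np.log(arr)              # zeros become -inf (with a RuntimeWarning)
res[np.isneginf(res)] = 0      # put the zeros back
newMatrix = res.tolist()       # back to a plain list if needed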
Or you can use where in numpy
Example:
res = np.where(arr != 0, np.log2(arr), 0)
It will ignore all zero elements.
While @Amadan's answer is certainly correct (and much shorter/more elegant), it may not be the most efficient in your case (it depends a bit on the input, of course), because np.where() will generate an integer index for each matching value. A more efficient approach is to generate a boolean mask. This has two advantages: (1) it is typically more memory efficient, and (2) the [] operator is typically faster on masks than on integer lists.
To illustrate this, I reimplemented both the np.where()-based and the mask-based solution on a toy input (but with the correct sizes).
I have also included a np.log.at()-based solution which is also quite inefficient.
import numpy as np

def log_matrices_where(matrices):
    return [np.where(matrix > 0, np.log(matrix), 0) for matrix in matrices]

def log_matrices_mask(matrices):
    arr = np.array(matrices, dtype=float)
    mask = arr > 0
    arr[mask] = np.log(arr[mask])
    arr[~mask] = 0  # if the values are always positive this is not needed
    return [x for x in arr]

def log_matrices_at(matrices):
    arr = np.array(matrices, dtype=float)
    np.log.at(arr, arr > 0)
    arr[~(arr > 0)] = 0  # if the values are always positive this is not needed
    return [x for x in arr]

N = 1000
matrices = [
    np.arange((N * N)).reshape((N, N)) - N
    for _ in range(2)]
(some sanity check to make sure we are doing the same thing)
# check that the result is the same
print(all(np.all(np.isclose(x, y)) for x, y in zip(log_matrices_where(matrices), log_matrices_mask(matrices))))
# True
print(all(np.all(np.isclose(x, y)) for x, y in zip(log_matrices_where(matrices), log_matrices_at(matrices))))
# True
And the timings on my machine:
%timeit log_matrices_where(matrices)
# 33.8 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit log_matrices_mask(matrices)
# 11.9 ms ± 97 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit log_matrices_at(matrices)
# 153 ms ± 831 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
EDIT: additionally included np.log.at() solution and a note on zeroing out the values for which log is not defined
Another alternative using numpy:
arr = np.ndarray((1000,1000))
np.log.at(arr, np.nonzero(arr))
As simple as...
import numpy as np
newMatrices = [np.where(matrix != 0, np.log(matrix), 0) for matrix in matrices]
No need to worry about rows and columns, numpy takes care of it. No need to explicitly iterate over matrices in a for loop when a comprehension is readable enough.
EDIT: I just noticed OP had log, not log2. Not really important for the shape of the solution (though likely very important to not getting a wrong answer :P )
As suggested by @R.yan, you can try something like this:
import numpy as np

newMatrices = []
for matrix in matrices:
    newMaxtrix = []
    for row in matrix:
        newRow = []
        for value in row:
            if value > 0:
                newRow.append(np.log(value))
            else:
                newRow.append(value)
        newMaxtrix.append(newRow)
    newMatrices.append(newMaxtrix)

newArray = np.asarray(newMatrices)
logVal = np.log(newArray)
First of all, sorry for my imperfect English.
My problem is simple to explain, I think.
result = {}
list_tuple = [(float,float,float),(float,float,float),(float,float,float)...]  # 200k tuples
threshold = [float,float,float...]  # max 1k values

for tuple in list_tuple:
    for value in threshold:
        if max(tuple) > value and min(tuple) < value:
            if value in result:
                result[value].append(tuple)
            else:
                result[value] = []
                result[value].append(tuple)
list_tuple contains around 200k tuples, and I have to do this operation very fast (2-3 seconds max on a normal PC).
My first attempt was to do this in Cython with prange() (so I could benefit from the Cython optimization and from the parallel execution), but the problem is (as always) the GIL: in prange() I can manage lists and tuples using Cython memoryviews, but I can't insert my results into a dict.
In Cython I also tried using the unordered_map from the C++ standard library, but then the problem is that I can't make a vector of arrays in C++ (which would be the value of my dict).
The second problem is similar:
list_tuple = [((float,float),(float,float)),((float,float),(float,float))...]  # 200k tuples of tuples
result = {list_tuple[0][0]: []}

for tuple in list_tuple:
    if tuple[0] in result:
        result[tuple[0]].append(tuple)
    else:
        result[tuple[0]] = []
Here I also have another problem: if I want to use prange() I have to use a custom hash function in order to use an array as the key of a C++ unordered_map.
As you can see, my snippets are very simple to run in parallel.
I thought about trying numba, but it would probably run into the same problem because of the GIL, and I prefer to use Cython because I need binary code (this library could become part of a commercial software, so only binary libraries are allowed).
In general I would like to avoid C/C++ functions; what I hope to find is a way to manage something like dicts/lists in parallel, with Cython performance, while remaining as much as possible in the Python domain. But I'm open to any advice.
Thanks
Several performance improvements can be achieved, also by using numpy's vectorization features:
The min and max values are currently computed anew for each threshold. Instead they can be precomputed and then reused for each threshold.
The loop over data samples (list_tuple) is performed in pure Python. This loop can be vectorized using numpy.
In the following tests I used data.shape == (200000, 3); thresh.shape == (1000,) as indicated in the OP. I also omitted modifications to the result dict since depending on the data this can quickly overflow memory.
Applying 1.
v_min = [min(t) for t in data]
v_max = [max(t) for t in data]
for mi, ma in zip(v_min, v_max):
    for value in thresh:
        if ma > value and mi < value:
            pass
This yields a performance increase of a factor of ~5 compared to the OP's code.
Applying 1. & 2.
v_min = data.min(axis=1)
v_max = data.max(axis=1)
mask = np.empty(shape=(data.shape[0],), dtype=bool)
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        pass
This yields a performance increase of a factor of ~30 compared to the OP's code. This approach has the additional benefit that it doesn't rely on incremental appends to lists, which can slow the program down since memory reallocation might be required. Instead it creates each list (per threshold) in a single step (see the sketch below).
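If you do need the result dict (which the timings above deliberately omit), a sketch of building it with the same masking idea, reusing v_min, v_max, data and thresh from the snippet above:

result = {}
for t in thresh:
    mask = (v_min < t) & (v_max > t)
    result[t] = data[mask].tolist()   # each list is created in one go, no appends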
@a_guest's code:
def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
    return result
This code creates a dictionary entry once for each item in thresh.
The OP's code, simplified a bit with defaultdict (from collections):
from collections import defaultdict

def foo3(list_tuple, threeshold):
    result = defaultdict(list)
    for tuple in list_tuple:
        for value in threeshold:
            if max(tuple) > value and min(tuple) < value:
                result[value].append(tuple)
    return result
This one updates a dictionary entry once for each item that meets the criteria.
And with his sample data:
In [27]: foo1(data,thresh)
Out[27]: {0: [], 1: [[0, 1, 2]], 2: [], 3: [], 4: [[3, 4, 5]]}
In [28]: foo3(data.tolist(), thresh.tolist())
Out[28]: defaultdict(list, {1: [[0, 1, 2]], 4: [[3, 4, 5]]})
time tests:
In [29]: timeit foo1(data,thresh)
66.1 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# In [30]: timeit foo3(data,thresh)
# 161 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [31]: timeit foo3(data.tolist(),thresh.tolist())
30.8 µs ± 56.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Iteration on arrays is slower than with lists. Time for tolist() is minimal; np.asarray for lists is longer.
With a larger data sample, the array version is faster:
In [42]: data = np.random.randint(0,50,(3000,3))
...: thresh = np.arange(50)
In [43]:
In [43]: timeit foo1(data,thresh)
16 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %%timeit x,y = data.tolist(), thresh.tolist()
...: foo3(x,y)
...:
83.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit
Since this approach basically performs an outer product between data samples and threshold values, it increases the required memory significantly, which might be undesirable. An improved approach can be found here. I keep this answer nevertheless for future reference, since it was referred to in this answer.
I found the performance increase as compared to the OP's code to be a factor of ~ 20.
This is an example using numpy. The data is vectorized and so are the operations. Note that the resulting dict contains empty lists, as opposed to the OP's example, and hence might require an additional cleaning step, if appropriate.
import numpy as np
# Data setup
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)
# Compute tuples for thresholds.
condition = (
(data.min(axis=1)[:, None] < thresh)
& (data.max(axis=1)[:, None] > thresh)
)
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
I would like to do a 'daxpy' (add to a vector the scalar multiple of a second vector and assign the result to the first) with numpy using numba. Doing the following test, I noticed that writing the loop myself was much faster than doing a += c * b.
I was not expecting this. What is the reason for this behavior?
import numpy as np
from numba import jit
x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4
@jit(nopython=True)
def test1(a, b, c):
    a += c * b
    return a

@jit(nopython=True)
def test2(a, b, c):
    for i in range(len(a)):
        a[i] += c * b[i]
    return a
%timeit -n100 -r10 test1(x, o, c)
>>> 100 loops, best of 10: 2.48 ms per loop
%timeit -n100 -r10 test2(x, o, c)
>>> 100 loops, best of 10: 1.2 ms per loop
One thing to keep in mind is that 'manual looping' in numba is very fast, essentially the same as the C loop used by numpy operations.
In the first example there are two operations: a temporary array (c * b) is allocated and computed, and then that temporary array is added to a. In the second example, both calculations happen in the same loop with no intermediate result.
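Spelled out, the first version behaves roughly like the following (a sketch of the extra work, not code from the question):

tmp = c * b   # allocates a full-size temporary and makes one pass over b
a += tmp      # a second pass over memory to add the temporary into a
# test2 instead computes a[i] += c * b[i] in a single fused pass, with no temporary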
In theory, numba could fuse loops and optimize #1 to do the same as #2, but it doesn't seem to be doing it. If you just want to optimize numpy ops, numexpr may also be worth a look as it was designed for exactly that - though probably won't do any better than the explicit fused loop.
In [17]: import numexpr as ne
In [18]: %timeit -r10 test2(x, o, c)
1000 loops, best of 10: 1.36 ms per loop
In [19]: %timeit ne.evaluate('x + o * c', out=x)
1000 loops, best of 3: 1.43 ms per loop