Why might np.float32 perform slower than np.float64? - python

I am writing my own implementation of the Sinkhorn-Knopp algorithm for the optimal transport problem, based on the implementation from the GitHub repo of POT. The function looks as follows:
# version for the dense matrices
import numpy as np

def sinkhorn_knopp(C, reg, a=None, b=None, max_iter=1e3, eps=1e-9, log=False, verbose=False, log_interval=10):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    C = np.asarray(C, dtype=np.float64)
    # if the weights are not specified, assign them uniformly
    if len(a.shape) == 0:
        a = np.ones((C.shape[0],), dtype=np.float64) / C.shape[0]
    if len(b.shape) == 0:
        b = np.ones((C.shape[1],), dtype=np.float64) / C.shape[1]
    # Faster exponent
    K = np.divide(C, -reg)
    K = np.exp(K)
    # Init data
    dim_a = len(a)
    dim_b = len(b)
    # Set initial values of u and v
    u = np.ones(dim_a) / dim_a
    v = np.ones(dim_b) / dim_b
    r = np.empty_like(b)
    Kp = (1 / a).reshape(-1, 1) * K
    err = 1
    cpt = 0
    if log:
        log = {'err': []}
    while err > eps and cpt < max_iter:
        uprev = u
        vprev = v
        KtransposeU = K.T @ u
        v = np.divide(b, KtransposeU)
        u = 1. / (Kp @ v)
        if (np.any(KtransposeU == 0)
                or np.any(np.isnan(u)) or np.any(np.isnan(v))
                or np.any(np.isinf(u)) or np.any(np.isinf(v))):
            # we have reached the machine precision
            # come back to previous solution and quit loop
            print('Warning: numerical errors at iteration', cpt)
            u = uprev
            v = vprev
            break
        if cpt % log_interval == 0:
            # residual on the iteration
            r = (u @ K) * v
            # violation of the marginal
            err = np.linalg.norm(r - b)
            if log:
                log['err'].append(err)
        cpt += 1
    # return OT matrix
    ot_matrix = u.reshape(-1, 1) * K * v.reshape(1, -1)
    loss = np.sum(C * ot_matrix)
    if log:
        return ot_matrix, loss, log
    else:
        return ot_matrix, loss
I have listed the code for the np.float64 version. However, if one works with np.float32, the algorithm surprisingly performs slower. Naively, one would expect it to work faster with a "smaller" float, as there are fewer bits to operate on. But the measurements show the following numbers:
#np.float64 version
63.6 ms ± 7.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#np.float32 version
71.4 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#np.float64 version
650 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#np.float32 version
2.48 s ± 298 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For the larger problem the difference in time is almost fourfold, which looks rather weird. Why might this be the case?
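For context, here is a minimal sketch of how such a comparison might be set up; the point cloud, cost matrix, and reg value are illustrative assumptions, and sinkhorn_knopp_f32 stands for a hypothetical copy of the function above with every np.float64 replaced by np.float32:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))
# squared Euclidean cost matrix between random points (illustrative only)
C = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)

# In IPython:
# %timeit sinkhorn_knopp(C, reg=1e-1)                          # float64 version
# %timeit sinkhorn_knopp_f32(C.astype(np.float32), reg=1e-1)   # float32 version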

Related

Python More efficient permutation

I have a string that consists of x copies of the letter 'r' and y copies of 'u'. My goal is to print all possible arrangements with those same counts x and y, in different orders. This is my example of working code.
import itertools

RIGHT = 'r'
UP = 'u'

def up_and_right(n, k, lst):
    case = RIGHT*n + UP*k
    return [''.join(p) for p in set(itertools.permutations(case))]
Example input and output:
input:
lst1 = []
up_and_right(2,2,lst1)
lst1 => output:['ruru', 'urur', 'rruu', 'uurr', 'urru', 'ruur']
My problem is that when the inputs are integers greater than 10, the code takes about a minute to compute. How can I improve the compute time?
Thanks in advance!
Try itertools.combinations:
import itertools

RIGHT = "r"
UP = "u"

def up_and_right(n, k):
    out = []
    for right_idxs in itertools.combinations(range(n + k), r=n):
        s = ""
        for idx in range(n + k):
            if idx in right_idxs:
                s += RIGHT
            else:
                s += UP
        out.append(s)
    return out

print(up_and_right(2, 2))
Prints:
['rruu', 'ruru', 'ruur', 'urru', 'urur', 'uurr']
With one-liner:
def up_and_right(n, k):
    return [
        "".join(RIGHT if idx in right_idxs else UP for idx in range(n + k))
        for right_idxs in itertools.combinations(range(n + k), r=n)
    ]
Your problem is about finding permutations of multisets. sympy has a utility method that deals with it.
from sympy.utilities.iterables import multiset_permutations

def up_and_right2(n, k):
    case = RIGHT*n + UP*k
    return list(map(''.join, multiset_permutations(case)))
Test:
y = up_and_right(n,k)
x = up_and_right2(n,k)
assert len(x) == len(y) and set(y) == set(x)
Timings:
n = 5
k = 6
%timeit up_and_right(n,k)
3.57 s ± 52.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit up_and_right2(n,k)
4.17 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
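As a quick sanity check (not part of the original answers): the number of distinct arrangements of n 'r's and k 'u's is the binomial coefficient C(n+k, n), so the length of either function's output can be verified with the standard library:

import math

n, k = 5, 6
assert len(up_and_right2(n, k)) == math.comb(n + k, n)  # 462 strings for n=5, k=6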

Splitting values in an array 'logarithmically' / based on another array

I have a 2D array, where each element is a Fourier transform. I'd like to split each transform 'logarithmically'. For example, let's take a single one of those arrays and call it a:
a = np.arange(0, 512)
# I want to split a into 'bins' defined by b, below:
b = np.array([0] + [10 * 2**i for i in range(6)])  # [0, 10, 20, 40, 80, 160, 320]
What I'm looking to do is something like using np.split, except I would like to split values into 'bins' based on array b, such that all values of a in [0, 10) are in one bin, all values in [10, 20) in another, etc.
I could do this in some sort of convoluted for loop:
split_arr = []
for i in range(1, len(b)):
    fbin = []
    for amp in a:
        if (amp >= b[i-1]) and (amp < b[i]):
            fbin.append(amp)
    split_arr.append(fbin)
I have many arrays to split, and also this is ugly (just my opinion). Is there a better way?
Here is how you can do it, using np.split:
np.split(a, np.searchsorted(a,b))
If your array a is not sorted, sort it before the above command:
a = np.sort(a)
np.searchsorted finds the locations at which the values in b would be inserted into the sorted array a. In other words, it finds the locations where you want to split your array. If you do not want the empty array at the beginning, simply remove 0 from b.
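A quick illustration of what np.searchsorted returns for the arrays from the question:

import numpy as np

a = np.arange(0, 512)
b = np.array([0] + [10 * 2**i for i in range(6)])  # [0, 10, 20, 40, 80, 160, 320]

idx = np.searchsorted(a, b)            # array([  0,  10,  20,  40,  80, 160, 320])
bins = np.split(a, idx)                # first chunk is empty because b starts at 0
print([len(fbin) for fbin in bins])    # [0, 10, 10, 20, 40, 80, 160, 192]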
First you can reduce the 'ugliness' by using a list comprehension:
split_arr = [[amp for amp in a if (amp >= b[i-1]) and (amp < b[i])] for i in range(1, len(b))]
Then you can apply the same logic using NumPy's fast vectorized operations (which has the bonus of looking even cleaner):
split_arr = [a[(a >= b[i-1]) & (a < b[i])] for i in range(1, len(b))]
Comparison:
%timeit [[amp for amp in a if (amp >= b[i-1]) and (amp < b[i])] for i in range(1, len(b))]
1.29 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [a[(a >= b[i-1]) & (a < b[i])] for i in range(1, len(b))]
35.9 µs ± 4.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
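For the 'many arrays to split' part of the question, the same boolean-mask logic can be applied row by row; a small sketch, assuming the 2D array holds one transform per row (split_rows is a hypothetical helper, not from the answers above):

def split_rows(arr2d, edges):
    # bin every row of a 2D array using the same edges
    return [
        [row[(row >= lo) & (row < hi)] for lo, hi in zip(edges[:-1], edges[1:])]
        for row in arr2d
    ]

spectra = np.vstack([a, a])        # stand-in for the real 2D array of transforms
binned = split_rows(spectra, b)    # binned[i][j] holds row i's values in [b[j], b[j+1])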

Why is x *= y slower than x = x * y in Python?

I am attempting to optimise some code I have written in Python, and throughout I have assumed that it was more efficient (or in the worst case equivalent) to write x *= y instead of x = x * y. But using the following basic test, this does not appear to be the case:
import time

t0 = time.time()
x = 1.01
for i in range(1000000):
    x *= 1.000001
print(time.time() - t0)
# Outputs 0.1081240177154541

t0 = time.time()
x = 1.01
for i in range(1000000):
    x = x * 1.000001
print(time.time() - t0)
# Outputs 0.07818889617919922
Why is this the case?
I have been doing the following tests:
def test1(x):
    for i in range(10000000):
        x *= 1.000001

def test2(x):
    for i in range(10000000):
        x = x * 1.000001
%timeit test1(1.01)
# 511 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test2(1.01)
# 591 ms ± 87.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The difference is so small that you could say the performance is the same.
As my results show (the opposite of your conclusion), the results can differ from one machine to another, or from one run to another.
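One way to see where the (small) difference comes from, not covered in the answers above, is to inspect the bytecode with the dis module. On CPython the two loop bodies differ only in the multiply instruction, and since floats are immutable the in-place variant has no in-place slot to use and falls back to an ordinary multiplication anyway:

import dis

def inplace(x):
    x *= 1.000001
    return x

def rebind(x):
    x = x * 1.000001
    return x

# Compare the generated instructions: one uses the in-place multiply opcode,
# the other a plain binary multiply; the rest is identical.
dis.dis(inplace)
dis.dis(rebind)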

Scipy pivoted QR permutation

I have to solve a lot of linear systems using the SciPy pivoted QR decomposition.
Q, R, perm = scipy.linalg.qr(PW, pivoting=True, mode='full')
While solving the system, I reorder the solution using a permutation matrix built by the function below.
import numpy as np

def pvec2pmat(vec):
    n = len(vec)
    P = np.zeros((n, n))
    counter = 0
    for i in range(0, n):
        for j in range(0, n):
            if j == vec[counter]:
                P[i, j] = 1.0
                counter = counter + 1
                break
    return P.T
Unfortunately, this turns out to be very slow and the code spends a lot of time generating these matrices.
Is it possible to speedup this function?
It is very hard to answer your question, as I do not know what your code is supposed to do. Furthermore, there seems to be little connection to the title. If I understand you correctly, you are asking me to optimize the given function, without even knowing what the input is.
I will assume that vec is expected to be a 1-dimensional integer array. In this case your second loop is quite unnecessary. There are two cases:
vec[counter] in range(0, n)
In this case you set P[i, vec[counter]] to one and increase the counter (due to the break statement).
vec[counter] not in range(0, n)
The counter will never be increased, as the if statement never becomes True. Thus, the rest of vec is ignored and the matrix is returned.
Therefore a first simplification would be:
def pvec2pmat(vec):
    n = len(vec)
    P = np.zeros((n, n))
    counter = 0
    for i in range(0, n):
        if vec[counter] in range(0, n):
            P[i, vec[counter]] = 1.0
            counter += 1
        else:
            return P.T
    return P.T
So the relevant part of vec only extends up to the first value not in range(0, n). We can check this right at the beginning and discard the rest.
We can do this using
invalid = (vec < 0) | (vec >= n)  # n = len(vec)
try:
    first_invalid = np.flatnonzero(invalid)[0]
except IndexError:  # no invalid values
    pass
else:
    vec = vec[:first_invalid]  # keep only up to the first invalid entry
Now we know that we assign exactly one value in each row i for i < vec.size.
So we can simplify the loop
for i, vec_val in enumerate(vec):
    P[i, vec_val] = 1
This can, however, also be done using indexing:
P[np.arange(vec.size), vec] = 1
Finally, we realize that instead of taking the transpose, we can just assign in reversed index order and get:
def pvec2pmat(vec):
    n = len(vec)
    P = np.zeros((n, n))
    invalid = (vec < 0) | (vec >= n)
    try:
        first_invalid = np.flatnonzero(invalid)[0]
    except IndexError:  # no invalid values
        pass
    else:
        vec = vec[:first_invalid]  # keep only up to the first invalid entry
    P[vec, np.arange(vec.size)] = 1
    return P
A quick timing:
vec = np.arange(1000)
np.random.shuffle(vec)
# your version
128 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# my version
%timeit pvec2pmat(vec)
379 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# for this simple example of a permutation, we can of course resort to
# simple indexing
%timeit np.eye(vec.size)[:, vec]
8.89 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For this particular example, my code takes about 1/300 of the time of your version. Depending on your needs, another large speedup can be achieved if P is created as a boolean matrix and we assign True.
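A side note beyond the original answer: for solving the linear systems you may not need the permutation matrix at all, since the pivot vector returned by scipy.linalg.qr can be applied with fancy indexing. A sketch, assuming the goal is to solve PW x = rhs (the names PW and rhs here stand in for the matrices from the question):

import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
PW = rng.standard_normal((5, 5))     # illustrative system matrix
rhs = rng.standard_normal(5)

Q, R, perm = scipy.linalg.qr(PW, pivoting=True, mode='full')

# PW[:, perm] == Q @ R, so solve R y = Q.T @ rhs and undo the column pivoting
# by scattering y back with the index vector instead of multiplying by a matrix.
y = scipy.linalg.solve_triangular(R, Q.T @ rhs)
x = np.empty_like(y)
x[perm] = y

assert np.allclose(PW @ x, rhs)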

Why is the for-loop factorial function faster than the recursive one?

I have written two functions to calculate combinations. The first one uses a for loop, the other one uses a recursive factorial function. Why is the first faster than the second?
def combinations(n: int, k: int) -> int:
    # Collection >= Selection
    if n < k:
        raise ValueError(
            "The size of the collection we are selecting items from must be "
            "larger than the size of the selection."
        )
    # Sizes > 0
    if n < 0 or k < 0:
        raise ValueError(
            "Cannot work with negative integers."
        )
    # Compute with standard python only
    numerator = 1
    for i in range(n + 1 - k, n + 1):
        numerator *= i
    denominator = 1
    for i in range(1, k + 1):
        denominator *= i
    return int(numerator / denominator)
The second function needs a factorial function defined as:
def factorial(n: int) -> int:
    if n < 0:
        raise ValueError(
            "Cannot calculate factorial of a negative number."
        )
    # Recursive function down to n = 0
    return n * factorial(n - 1) if n - 1 >= 0 else 1
And the second combinations function is defined as:
def combinations2(n: int, k: int) -> int:
    # Collection >= Selection
    if n < k:
        raise ValueError(
            "The size of the collection we are selecting items from must be "
            "larger than the size of the selection."
        )
    return int(factorial(n) / (factorial(k) * factorial(n - k)))
When I run the following test in the IPython console, it is clear which one is faster:
%timeit combinations(1000, 50)
16.2 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and
%timeit combinations2(1000, 50)
1.6 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
NEW VERSION OF COMBINATIONS2
Okay, following the comments, I agree combinations2 is doing many more operations. So I rewrote both the factorial and combinations2 functions; here are their new versions:
def factorial(n: int, lower: int = -1) -> int:
    # n > 0
    if n < 0:
        raise ValueError(
            "Cannot calculate factorial of a negative number."
        )
    # Recursive function down to n = 0 or down to the lower bound
    if n - 1 >= 0 and n - 1 >= lower:
        return n * factorial(n - 1, lower)
    return 1
which can now take a lower bound. Notice that in general factorial(a, b) = factorial(a) / factorial(b). Also, here is the new version of the combinations2 function:
def combinations2(n: int, k: int) -> int:
    if n < k:
        raise ValueError(
            "The size of the collection we are selecting items from must be "
            "larger than the size of the selection."
        )
    return int(factorial(n, n - k) / factorial(k))
But again, this is their comparison:
%timeit combinations(100, 50)
10.5 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit combinations2(100, 50)
56.1 µs ± 5.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Just count the number of operations:
In combinations you are making (n+1) - (n+1-k) = k multiplications for the numerator, and (k+1) - 1 = k multiplications for the denominator.
Total: 2k multiplications.
In combinations2 you are making n + k + (n-k) multiplications, i.e. 2n multiplications.
And you are also making about 2n function calls for the recursion.
With k=50 and n=1000, no wonder the first solution is faster.
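For reference (not part of the original answer), the standard library's math.comb, available since Python 3.8, computes the same value directly and can serve as a correctness check for either function:

import math

# math.comb gives the exact binomial coefficient
assert combinations(10, 3) == math.comb(10, 3) == 120

# Note: for large arguments, int(numerator / denominator) goes through float
# division and can lose precision; numerator // denominator keeps the result exact.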
