Why do these algorithms differ in their execution time? - Python

For the LeetCode problem 'Top K Frequent Elements' (https://leetcode.com/problems/top-k-frequent-elements/submissions/),
there is a solution that completes the task in just 88 ms, while mine takes 124 ms. I see that as a large difference.
I tried to understand why, but the docs don't explain how the function I use, most_common(), is implemented. If I want to dig into details like that, so that I can write algorithms that run that fast in the future, what should I read (specific books, or any other resource)?
My code (124 ms):
from collections import Counter

def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    c = Counter(nums)
    return [t[0] for t in c.most_common(k)]
The other solution (88 ms, faster):
import heapq
from collections import Counter

def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    count = Counter(nums)
    return heapq.nlargest(k, count.keys(), key=count.get)
Both take nearly the same amount of memory, so there is no difference there.

The implementation of most_common
also uses heapq.nlargest, but it calls it with count.items() instead of count.keys(). This makes it a tiny bit slower, and it also requires the overhead of creating a new list in order to extract the [0] value from each element of the list returned by most_common().
The heapq.nlargest version avoids this extra overhead: it passes count.keys() as the second argument, so it doesn't need to iterate over that result again to extract pieces into a new list.
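For reference, a sketch of how CPython implements Counter.most_common (paraphrased from the collections source; the exact code varies by Python version):
import heapq
from operator import itemgetter

def most_common(self, n=None):
    '''List the n most common elements and their counts, most common first.'''
    if n is None:
        # No limit: fully sort the (element, count) pairs by count.
        return sorted(self.items(), key=itemgetter(1), reverse=True)
    # Limited: a bounded heap selects the n largest pairs without a full sort.
    return heapq.nlargest(n, self.items(), key=itemgetter(1))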

@trincot seems to have answered the question, but if anyone is looking for a faster way to do this, then use NumPy, provided nums can be stored as an np.array:
import numpy as np

def topKFrequent_numpy(nums, k):
    unique, counts = np.unique(nums, return_counts=True)
    return unique[np.argsort(-counts)[:k]]
One speed test:
nums_array = np.random.randint(1000, size=1000000)
nums_list = list(nums_array)
%timeit topKFrequent_Counter(nums_list, 500)
# 116 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_heapq(nums_list, 500)
# 117 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_numpy(nums_array, 500)
# 39.2 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
(Speeds may be dramatically different for other input values)
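If the top k values don't need to come back in sorted order, a further tweak (my addition, not benchmarked above) is np.argpartition, which does a partial selection instead of a full sort:
import numpy as np

def topKFrequent_argpartition(nums, k):
    unique, counts = np.unique(nums, return_counts=True)
    # Move the k largest counts into the first k slots (in arbitrary
    # order) without sorting the whole array.
    idx = np.argpartition(-counts, k - 1)[:k]
    return unique[idx]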

Related

Optimizing while cycle with numba for error tolerance

I have a question about using Numba for optimization. I am coding a fixed-point iteration to calculate the value of a certain array, named gamma, which satisfies the equation f(gamma) = gamma. I am trying to optimize this function with the Python package Numba. It looks as follows:
from numba import jit

@jit
def fixed_point(gamma_guess):
    for i in range(17):
        gamma_guess = f(gamma_guess)
    return gamma_guess
Numba optimizes this function well, because it knows in advance how many times the operation will be performed (17 times), and it runs fast. But I need to control the error tolerance of my desired gamma: the difference between one gamma and the next one obtained by the fixed-point iteration should be less than some number epsilon = 0.01. So I tried:
import numpy as np
from numba import jit

@jit
def fixed_point(gamma_guess):
    err = 1000
    gamma_old = gamma_guess.copy()
    while err > 0.01:
        gamma_guess = f(gamma_guess)
        err = np.max(abs(gamma_guess - gamma_old))
        gamma_old = gamma_guess.copy()
    return gamma_guess
It also works and calculates the desired result, but it is much slower than the previous implementation. I think this is because Numba cannot optimize the while loop well, since we do not know in advance when it will stop. Is there a way to optimize this so it runs as fast as the first implementation?
Edit:
Here is the f that I'm using:
from scipy import fftpack as sp
import numpy as np
from numba import jit

S = 0.01
Amu = 0.7

@jit
def f(gammaa, z, zal, kappa):
    # N (number of points) and h (step size) are assumed to be
    # module-level globals defined elsewhere.
    ka = sp.diff(kappa)
    gamma0 = gammaa
    for i in range(N):
        suma = 0
        for j in range(N):
            if (abs(j - i)) % 2 == 1:
                if (z[i] - z[j]) != 0:  # guard against division by zero
                    suma += gamma0[j] / (z[i] - z[j])
        gamma0[i] = 2.0 * Amu * np.real(-(zal[i] / z[i]) + zal[i] * (1.0 / (2 * np.pi * 1j)) * suma * 2 * h) + S * ka[i]
    return gamma0
I always use np.ones(2048)*0.5 as the initial guess, and the other parameters I pass to my function are z = np.cos(alphas) + 1j*(np.sin(alphas) + 0.1), zal = -np.sin(alphas) + 1j*np.cos(alphas), kappa = np.ones(2048), and alphas = np.arange(0, 2*np.pi, 2*np.pi/2048).
I made a small test script to see if I could reproduce your slowdown:
import numba as nb
from IPython import get_ipython

ipython = get_ipython()

@nb.jit(nopython=True)
def f(x):
    return (x + 1) / x

def fixed_point_for(x):
    for _ in range(17):
        x = f(x)
    return x

@nb.jit(nopython=True)
def fixed_point_for_nb(x):
    for _ in range(17):
        x = f(x)
    return x

def fixed_point_while(x):
    error = 1
    x_old = x
    while error > 0.01:
        x = f(x)
        error = abs(x_old - x)
        x_old = x
    return x

@nb.jit(nopython=True)
def fixed_point_while_nb(x):
    error = 1
    x_old = x
    while error > 0.01:
        x = f(x)
        error = abs(x_old - x)
        x_old = x
    return x

print("for loop without numba:")
ipython.magic("%timeit fixed_point_for(10)")
print("for loop with numba:")
ipython.magic("%timeit fixed_point_for_nb(10)")
print("while loop without numba:")
ipython.magic("%timeit fixed_point_while(10)")
print("while loop with numba:")
ipython.magic("%timeit fixed_point_while_nb(10)")
As I don't know your f, I just used the simplest stabilizing function I could think of. I then ran tests with and without numba, both times with for and while loops. The results on my machine are:
for loop without numba:
3.35 µs ± 8.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
for loop with numba:
282 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop without numba:
1.86 µs ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop with numba:
214 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The following thoughts arise:
It can't be that your function is not optimizable, since your for loop is fast (at least you said so; have you tested it without numba?).
It could be that your function takes far more iterations to converge than you think.
We are using different software versions. Mine are:
numba 0.49.0
numpy 1.18.3
python 3.8.2

Slow `statistics` functions

Why is statistics.mean() so slow compared to the NumPy version, or even to a naive implementation such as:
def mean(items):
    return sum(items) / len(items)
On my system, I get the following timings:
import numpy as np
import statistics
ll_int = [x for x in range(100_000)]
%timeit statistics.mean(ll_int)
# 42 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_int) / len(ll_int)
# 460 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_int)
# 4.62 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ll_float = [x / 10 for x in range(100_000)]
%timeit statistics.mean(ll_float)
# 56.7 ms ± 879 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_float) / len(ll_float)
# 459 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_float)
# 2.7 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I get similar timings for other functions like variance or stdev.
EDIT:
Even an iterative implementation like this:
def next_mean(value, mean_, num):
    return (num * mean_ + value) / (num + 1)

def imean(items, mean_=0.0):
    for i, item in enumerate(items):
        mean_ = next_mean(item, mean_, i)
    return mean_
seems to be faster:
%timeit imean(ll_int)
# 16.6 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit imean(ll_float)
# 16.2 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The statistics module uses interpreted Python code, but numpy is using optimized compiled code for all of its heavy lifting, so it would be surprising if numpy didn't blow statistics out of the water.
Furthermore, statistics is designed to play nice with modules like decimal and fractions and uses code which values numerical accuracy and type safety over speed. Your naive implementation uses sum. The statistics module uses its own function called _sum internally. Looking at its source shows that it does an awful lot more than just add things together:
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    # Built-in sum returns zero.
    >>> _sum([1e50, 1, -1e50] * 1000)
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n, d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
The most surprising thing about this code is that it converts the data to fractions so as to minimize round-off error. There is no reason to expect that code like this would be as quick as a simple sum(nums)/len(nums) approach.
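The round-off example from the docstring above is easy to reproduce: naive float summation cancels catastrophically on this data, while the fraction-based mean keeps the answer.
import statistics

data = [1e50, 1, -1e50] * 1000

print(sum(data) / len(data))    # 0.0 -- each 1 is absorbed by 1e50, then cancelled
print(statistics.mean(data))    # 0.3333333333333333 -- exactly 1000/3000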
The developer of the statistics module made an explicit decision to value correctness over speed:
Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.
and moreover stated that there was no intention to
to replace, or even compete directly with, numpy
However, an enhancement request was raised to add an additional, faster, simpler implementation, statistics.fmean, and this function was released in Python 3.8. According to the enhancement's developer, this function is up to 500 times faster than the existing statistics.mean.
The fmean implementation is pretty much sum/len.
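A minimal sketch of the idea, assuming the data supports len() (the real fmean also handles iterators without a length):
import math

def fmean_sketch(data):
    # math.fsum does error-compensated float summation in C, so this
    # stays accurate without the Fraction machinery shown above.
    return math.fsum(data) / len(data)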

key in s vs key not in s speed

I have this code:
s = set([5, 6, 7, 8])

if key in s:
    return True
if key not in s:
    return False
It seems to me that, in theory, they shouldn't differ time-wise, but I may be missing something under the hood.
Is there any reason to prefer one over the other in terms of processing time or readability?
Perhaps is this an example of:
"Premature optimization is the root of all evil"?
Short Answer: No, no difference. Yes, probably premature optimization.
OK, I ran this test:
import random

s = set([5, 6, 7, 8])
for _ in range(5000000):
    s.add(random.randint(-100000, 100000000))

def test_in():
    count = 0
    for _ in range(50000):
        if random.randint(-100000, 100000000) in s:
            count += 1
    print(count)

def test_not_in():
    count = 0
    for _ in range(50000):
        if random.randint(-100000, 100000000) not in s:
            count += 1
    print(count)
When I time the outputs:
%timeit test_in()
10 loops, best of 3: 83.4 ms per loop
%timeit test_not_in()
10 loops, best of 3: 78.7 ms per loop
BUT, that small difference seems to be a symptom of the counting itself: there are on average 47500 "not ins" but only 2500 "ins". If I change both tests to simply pass, e.g.:
def test_in():
    for _ in range(50000):
        if random.randint(-100000, 100000000) in s:
            pass
The results are nearly identical:
%timeit test_in()
10 loops, best of 3: 77.4 ms per loop
%timeit test_not_in()
10 loops, best of 3: 78.7 ms per loop
In this case, my intuition failed me. I had thought that testing whether a key is not in the set could have added some additional processing time. When I consider what a hashmap does, it seems obvious that this can't be the case.
You shouldn't see a difference. The lookup time in a set is constant: you hash the entry, then look it up in a hashmap. All keys are checked in the same amount of time, and in vs. not in should be comparable.
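A quick way to see this at the bytecode level: both operators compile to a single containment opcode that differs only by an inversion flag (CONTAINS_OP on CPython 3.9+; older versions use COMPARE_OP for both):
import dis

dis.dis("key in s")      # ... CONTAINS_OP 0 ...
dis.dis("key not in s")  # ... CONTAINS_OP 1 ... same test, inverted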
Running a simple performance test in an ipython session with timeit confirms g.d.d.c's statement.
def one(k, s):
    if k in s:
        return True

def two(k, s):
    if k not in s:
        return False

s = set(range(1, 100))
%timeit -r7 -n 10000000 one(50, s)
## 83.7 ns ± 0.874 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit -r7 -n 10000000 two(50, s)
## 86.1 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Optimisations such as this aren't going to gain you a lot and, as has been pointed out in the comments, will in fact reduce the speed at which you can push out bugfixes/improvements/... due to bad readability. For this type of low-level performance gain, I'd suggest looking into Cython or Numba.

Is it efficient to build a list with a generator function

When reading the book 'Effective Python' by Brett Slatkin I noticed that the author suggested that sometimes building a list using a generator function and calling list on the resulting iterator could lead to cleaner, more readable code.
So an example:
num_list = range(100)

def num_squared_iterator(nums):
    for i in nums:
        yield i**2

def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l
Where a user could call
l = list(num_squared_iterator(num_list))
or
l = get_num_squared_list(num_list)
and get the same result.
The suggestion was that the generator function has less noise because it is shorter and does not have the extra code for creating the list and appending values to it.
(NOTE: clearly, for these simple examples, a list comprehension or generator expression would be better; the one-liner equivalents are shown just below. But let us take it as given that this is a simplification of a pattern that can be used for more complex code that would not be clear in a list comprehension.)
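For completeness, the comprehension equivalents of the two functions above:
l = [i**2 for i in num_list]    # list comprehension
g = (i**2 for i in num_list)    # generator expression; list(g) materializes it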
My question is this, is there a cost to wrapping the generator in a list? Would it be equivalent in performance to the list building function?
Seeing this I decided to do a quick test and wrote and ran the following code:
from functools import wraps
from time import time

TEST_DATA = range(100)

def timeit(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        start = time()
        func(*args, **kwargs)
        end = time()
        print(f'running time for {func.__name__} = {end-start}')
    return wrapped

def num_squared_iterator(nums):
    for i in nums:
        yield i**2

@timeit
def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l

@timeit
def get_num_squared_list_from_iterator(nums):
    return list(num_squared_iterator(nums))

if __name__ == '__main__':
    get_num_squared_list(TEST_DATA)
    get_num_squared_list_from_iterator(TEST_DATA)
I ran the test code many times, and each time (much to my surprise) the get_num_squared_list_from_iterator function actually ran (fractionally) faster than the get_num_squared_list function.
Here are results for my first few runs:
1.
running time for get_num_squared_list = 5.2928924560546875e-05
running time for get_num_squared_list_from_iterator = 5.0067901611328125e-05
2.
running time for get_num_squared_list = 5.3882598876953125e-05
running time for get_num_squared_list_from_iterator = 4.982948303222656e-05
3.
running time for get_num_squared_list = 5.1975250244140625e-05
running time for get_num_squared_list_from_iterator = 4.76837158203125e-05
I am guessing that this is due to the expense of doing a list.append in each iteration of the loop in the get_num_squared_list function.
I find this interesting because not only is the code clear and elegant, it also seems more performant.
I can confirm that your generator with list example is faster:
In [4]: def num_squared_iterator(nums):
   ...:     for i in nums:
   ...:         yield i**2
   ...:
   ...: def get_num_squared_list(nums):
   ...:     l = []
   ...:     for i in nums:
   ...:         l.append(i**2)
   ...:     return l
   ...:
In [5]: %timeit list(num_squared_iterator(nums))
320 µs ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit get_num_squared_list(nums)
370 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: nums = range(100000)
In [8]: %timeit list(num_squared_iterator(nums))
33.2 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit get_num_squared_list(nums)
36.3 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, there is more to the story. Conventional wisdom is that generators are slower than iterating over other types of iterables; there is a lot of overhead to generators. However, using list pushes the list-building code down to the C level, so you are seeing a sort of middle ground. Note that a for loop can be optimized as follows:
In [10]: def get_num_squared_list_microoptimized(nums):
    ...:     l = []
    ...:     append = l.append
    ...:     for i in nums:
    ...:         append(i**2)
    ...:     return l
    ...:
In [11]: %timeit list(num_squared_iterator(nums))
33.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: %timeit get_num_squared_list(nums)
36.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit get_num_squared_list_microoptimized(nums)
33.3 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And now you see that a lot of the difference in the approaches can be ameliorated if you "inline" l.append (which is what the list constructor avoids). In general, method resolution is slow in Python. In tight loops, the above micro-optimization is well known and is sort of the first step one would take to make your for-loops more performant.

How to get random.sample() from deque in Python 3?

I have a collections.deque() of tuples from which I want to draw random samples.
In Python 2.7, I can use batch = random.sample(my_deque, batch_size).
But in Python 3.4 this raises TypeError: Population must be a sequence or set. For dicts, use list(d).
What's the best workaround, or recommended way to sample efficiently from a deque in Python 3?
The obvious way – convert to a list.
batch = random.sample(list(my_deque), batch_size)
But you can avoid creating an entire list:
from random import sample

idx_batch = set(sample(range(len(my_deque)), batch_size))
batch = [val for i, val in enumerate(my_deque) if i in idx_batch]
P.S. (Edited)
Actually, random.sample should work fine with deques in Python >= 3.5, because the class has been updated to match the Sequence interface.
In [3]: deq = collections.deque(range(100))
In [4]: random.sample(deq, 10)
Out[4]: [12, 64, 84, 77, 99, 69, 1, 93, 82, 35]
Note! As Geoffrey Irving has correctly stated in the comment below, you'd better convert the deque into a list, because deques are implemented as linked lists, making each index access O(n) in the size of the deque; therefore, sampling m random values takes O(m*n) time.
sample() on a deque works fine in Python ≥3.5, and it's pretty fast.
In Python 3.4, you could use this instead, which runs about as fast:
sample_indices = sample(range(len(deq)), 50)
[deq[index] for index in sample_indices]
On my MacBook using Python 3.6.8, this solution is over 44 times faster than Eli Korvigo's solution. :)
I used a deque with 1 million items, and I sampled 50 items:
from random import sample
from collections import deque
deq = deque(maxlen=1000000)
for i in range(1000000):
    deq.append(i)
sample_indices = set(sample(range(len(deq)), 50))
%timeit [deq[i] for i in sample_indices]
1.68 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sample(deq, 50)
1.94 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sample(range(len(deq)), 50)
44.9 µs ± 549 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [val for index, val in enumerate(deq) if index in sample_indices]
75.1 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That said, as others have pointed out, a deque is not well suited for random access. If you want to implement a replay memory, you could instead use a rotating list like this:
class ReplayMemory:
    def __init__(self, max_size):
        self.buffer = [None] * max_size
        self.max_size = max_size
        self.index = 0
        self.size = 0

    def append(self, obj):
        self.buffer[self.index] = obj
        self.size = min(self.size + 1, self.max_size)
        self.index = (self.index + 1) % self.max_size

    def sample(self, batch_size):
        indices = sample(range(self.size), batch_size)
        return [self.buffer[index] for index in indices]
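To set up the timing below (mem is not defined in the snippet above), fill a buffer with a million items first:
mem = ReplayMemory(1000000)
for i in range(1000000):
    mem.append(i)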
With a million items, sampling 50 items is blazingly fast:
%timeit mem.sample(50)
#58 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
