I have this code:
s = set([5,6,7,8])
if key in s:
    return True
if key not in s:
    return False
It seems to me that it shouldn't, in theory, differ time wise, but I may be missing something under the hood.
Is there any reason to prefer one over the other in terms of processing time or readability?
Or is this perhaps an example of:
"Premature optimization is the root of all evil"?
Short Answer: No, no difference. Yes, probably premature optimization.
OK, I ran this test:
import random
s = set([5,6,7,8])
for _ in range(5000000):
    s.add(random.randint(-100000,100000000))
def test_in():
    count = 0
    for _ in range(50000):
        if random.randint(-100000,100000000) in s:
            count += 1
    print(count)
def test_not_in():
    count = 0
    for _ in range(50000):
        if random.randint(-100000,100000000) not in s:
            count += 1
    print(count)
When I time the outputs:
%timeit test_in()
10 loops, best of 3: 83.4 ms per loop
%timeit test_not_in()
10 loops, best of 3: 78.7 ms per loop
BUT, that small difference seems to be a symptom of counting the components. There are an average of 47500 "not ins" but only 2500 "ins". If I change both tests to pass, e.g.:
def test_in():
    for _ in range(50000):
        if random.randint(-100000,100000000) in s:
            pass
The results are nearly identical
%timeit test_in()
10 loops, best of 3: 77.4 ms per loop
%timeit test_not_in()
10 loops, best of 3: 78.7 ms per loop
In this case, my intuition failed me. I had thought that checking that something is not in the set might add some processing time. When I consider further what a hashmap does, it seems obvious that this can't be the case.
You shouldn't see a difference. The lookup time in a set is constant on average. You hash the entry, then look it up in a hashmap. All keys are checked in the same time, and in vs not in perform the same lookup, so they should be comparable.
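One way to check this claim without timing anything is to look at the bytecode. The sketch below is only an illustration: the function names uses_in and uses_not_in are made up for this example, and the exact opcode names depend on the CPython version. On the versions I am aware of, both operators compile to the same instruction and differ only in its argument.

```python
import dis

def uses_in(key, s):
    return key in s

def uses_not_in(key, s):
    return key not in s

# Collect only the opcode names; the "in" / "not in" distinction lives in
# the instruction argument, not in a separate instruction.
ops_in = [ins.opname for ins in dis.get_instructions(uses_in)]
ops_not_in = [ins.opname for ins in dis.get_instructions(uses_not_in)]

print(ops_in)
print(ops_not_in)
print("same opcode sequence:", ops_in == ops_not_in)
```

If the two sequences match, any timing difference between the two tests comes from the surrounding code or from noise, not from the membership test itself.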
Running a simple performance test in an ipython session with timeit confirms g.d.d.c's statement.
def one(k, s):
    if k in s:
        return True
def two(k, s):
    if k not in s:
        return False
s = set(range(1, 100))
%timeit -r7 -n 10000000 one(50, s)
## 83.7 ns ± 0.874 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit -r7 -n 10000000 two(50, s)
## 86.1 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Optimisations such as this aren't going to gain you a lot and, as has been pointed out in the comments, will in fact slow down how quickly you can push out bugfixes and improvements, because they hurt readability. For this kind of low-level performance gain, I'd suggest looking into Cython or Numba.
For the LeetCode problem 'Top K Frequent Elements' (https://leetcode.com/problems/top-k-frequent-elements/submissions/)
there is a solution that completes the task in just 88 ms, while mine takes 124 ms, which I see as a large difference.
I tried to understand why, but the docs don't describe how the function I use, most_common(), is implemented. If I want to dig into details like that, so that I can write algorithms that run this fast in the future, what should I read (specific books, or any other resource)?
My code (124 ms):
from collections import Counter
def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    c = Counter(nums)
    return [t[0] for t in c.most_common(k)]
The other solution (88 ms, faster):
import heapq
from collections import Counter
def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    count = Counter(nums)
    return heapq.nlargest(k, count.keys(), key=count.get)
Both take nearly the same amount of memory, so there is no difference there.
The implementation of most_common also uses heapq.nlargest, but it calls it with count.items() instead of count.keys(). This makes it a tiny bit slower, and your code then adds the overhead of building a new list to extract the [0] value from each tuple returned by most_common().
The heapq.nlargest version avoids this extra overhead: it passes count.keys() as the second argument, so it doesn't need to iterate over the result again to extract pieces into a new list.
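To make the comparison concrete, here is a minimal sketch of the two approaches side by side. The function names top_k_most_common and top_k_nlargest are hypothetical, chosen for this example only, and the input uses distinct counts so that tie-breaking does not matter.

```python
import heapq
from collections import Counter

def top_k_most_common(nums, k):
    # most_common returns (element, count) pairs, so a second pass is
    # needed to keep only the elements
    return [elem for elem, _ in Counter(nums).most_common(k)]

def top_k_nlargest(nums, k):
    count = Counter(nums)
    # rank the keys directly by their count; the result is already a
    # plain list of elements
    return heapq.nlargest(k, count.keys(), key=count.get)

nums = [1, 1, 1, 2, 2, 3]
print(top_k_most_common(nums, 2))  # [1, 2]
print(top_k_nlargest(nums, 2))     # [1, 2]
```

Both do the same amount of heap work; the difference is only the extra tuple handling and the extra list in the first version.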
@trincot seems to have answered the question, but if anyone is looking for a faster way to do this, use NumPy, provided nums can be stored as an np.array:
import numpy as np
def topKFrequent_numpy(nums, k):
    unique, counts = np.unique(nums, return_counts=True)
    return unique[np.argsort(-counts)[:k]]
One speed test
nums_array = np.random.randint(1000, size=1000000)
nums_list = list(nums_array)
%timeit topKFrequent_Counter(nums_list, 500)
# 116 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_heapq(nums_list, 500)
# 117 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_numpy(nums_array, 500)
# 39.2 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
(Speeds may be dramatically different for other input values)
I have a question about using Numba for optimization. I am coding a fixed-point iteration to calculate the value of a certain array, named gamma, which satisfies the equation f(gamma)=gamma. I am trying to optimize this function with the Python package Numba. It looks as follows.
@jit
def fixed_point(gamma_guess):
    for i in range(17):
        gamma_guess=f(gamma_guess)
    return gamma_guess
Numba is able to optimize this function well, because it knows how many times it will perform the operation (17 times), and it runs fast. But I need to control the error tolerance of the resulting gamma: the difference between one gamma and the next one obtained by the fixed-point iteration should be less than some number epsilon=0.01. So I tried
@jit
def fixed_point(gamma_guess):
    err=1000
    gamma_old=gamma_guess.copy()
    while err>0.01:
        gamma_guess=f(gamma_guess)
        err=np.max(abs(gamma_guess-gamma_old))
        gamma_old=gamma_guess.copy()
    return gamma_guess
It also works and calculates the desired result, but it is much slower than the previous implementation. I think this is because Numba cannot optimize the while loop well, since it is not known in advance when it will stop. Is there a way to optimize this so that it runs as fast as the previous implementation?
Edit:
Here is the f that I'm using
import numpy as np
from numba import jit
from scipy import fftpack as sp
S=0.01
Amu=0.7
# N and h are module-level names defined outside this snippet
@jit
def f(gammaa,z,zal,kappa):
    ka=sp.diff(kappa)
    gamma0=gammaa
    for i in range(N):
        suma=0
        for j in range(N):
            if (abs(j-i))%2 ==1:
                if (z[i]-z[j])!=0:  # skip terms that would divide by zero
                    suma+=(gamma0[j]/(z[i]-z[j]))
        gamma0[i]=2.0*Amu*np.real(-(zal[i]/z[i])+zal[i]*(1.0/(2*np.pi*1j))*suma*2*h)+S*ka[i]
    return gamma0
I always use np.ones(2048)*0.5 as initial guess and the other parameters that I pass to my function are z=np.cos(alphas)+1j*(np.sin(alphas)+0.1) , zal=-np.sin(alphas)+1j*np.cos(alphas) , kappa=np.ones(2048) and alphas=np.arange(0,2*np.pi,2*np.pi/2048)
I made a small test script, to see if I could reproduce your error:
import numba as nb
from IPython import get_ipython
ipython = get_ipython()
@nb.jit(nopython=True)
def f(x):
    return (x+1)/x
def fixed_point_for(x):
    for _ in range(17):
        x = f(x)
    return x
@nb.jit(nopython=True)
def fixed_point_for_nb(x):
    for _ in range(17):
        x = f(x)
    return x
def fixed_point_while(x):
    error=1
    x_old = x
    while error>0.01:
        x = f(x)
        error = abs(x_old-x)
        x_old = x
    return x
@nb.jit(nopython=True)
def fixed_point_while_nb(x):
    error=1
    x_old = x
    while error>0.01:
        x = f(x)
        error = abs(x_old-x)
        x_old = x
    return x
print("for loop without numba:")
ipython.run_line_magic("timeit", "fixed_point_for(10)")
print("for loop with numba:")
ipython.run_line_magic("timeit", "fixed_point_for_nb(10)")
print("while loop without numba:")
ipython.run_line_magic("timeit", "fixed_point_while(10)")
print("while loop with numba:")
ipython.run_line_magic("timeit", "fixed_point_while_nb(10)")
As I don't know your f, I just used the simplest stabilizing function that I could think of. I then ran tests with and without numba, for both the for and the while loops. The results on my machine are:
for loop without numba:
3.35 µs ± 8.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
for loop with numba:
282 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop without numba:
1.86 µs ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop with numba:
214 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The following thoughts arise:
It can't be that your function is not optimizable, since your for loop is fast (at least you said so; have you tested without numba?).
It could be that your function takes far more iterations to converge than you might think.
We may be using different software versions. My versions are:
numba 0.49.0
numpy 1.18.3
python 3.8.2
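The second thought above is easy to check by counting iterations. This is only a sketch in plain Python using my toy f, not your f; the name fixed_point_counted and the max_iter safety cap are assumptions added for this example.

```python
def f(x):
    # toy stabilizing function from the test script, not the asker's f
    return (x + 1) / x

def fixed_point_counted(x, tol=0.01, max_iter=10_000):
    """Iterate x = f(x) until successive values differ by at most tol.

    Returns the final value and the number of iterations performed.
    max_iter is a hypothetical safety cap so the loop always terminates.
    """
    x_old = x
    for n_iter in range(1, max_iter + 1):
        x = f(x)
        if abs(x - x_old) <= tol:
            return x, n_iter
        x_old = x
    return x, max_iter

value, n_iter = fixed_point_counted(10)
print(value, n_iter)
```

For the toy function the loop stops well before 17 iterations, which is why my while versions are faster than my for versions. If your f needs hundreds of iterations to reach the tolerance, the while version will be slower no matter how well Numba compiles it.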
Often, to save some time, I like to store n = len(s) in my local function.
I am curious which check is faster, or whether they are the same:
while i < len(s):
    # do something
vs
while i < n:
    # do something
There should not be much difference, but with len(s) we need to reach s first and then look up its length, which is O(1) + O(1). With n it is a single O(1). I assume so.
It has to be faster.
Using n, you look up a name in the variables (dictionaries) once.
Using len(s), you look up twice (len is also a function that has to be looked up). Then you call the function.
That said, if you write while i < n:, most of the time you can get away with a classical for i in range(len(s)): loop, since the upper bound doesn't change and is evaluated only once, at the start, by range (which may lead you to ask: why not iterate directly over the elements, or use enumerate?).
while i < len(s) lets you compare your index against a list whose length varies. That's the whole point. If you fix the bound, it becomes less attractive.
In a for loop, it's easy to skip iterations with continue (as easy as it is to forget to increment i in a while loop and end up with an infinite loop).
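As a small illustration of the loop forms mentioned above (the list s and the result names are made up for this sketch), all three produce the same index/element pairs when the length of s does not change:

```python
s = ["a", "b", "c"]

# while loop: the bound len(s) is re-evaluated on every iteration and the
# index has to be incremented by hand
via_while = []
i = 0
while i < len(s):
    via_while.append((i, s[i]))
    i += 1

# for loop over range: len(s) is evaluated once, when range is created
via_range = []
for i in range(len(s)):
    via_range.append((i, s[i]))

# enumerate: no explicit indexing at all
via_enumerate = list(enumerate(s))

print(via_while == via_range == via_enumerate)  # True
```

The while form only earns its keep when s grows or shrinks inside the loop.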
You're right, here are some benchmarks:
import numpy as np
s = np.random.rand(100)
n = 100
Above is the setup.
%%timeit
50 < len(s)
86.3 ns ± 2.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Versus:
%%timeit
50 < n
36.8 ns ± 1.15 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
But then again, it's hard to imagine a difference at the ~50 ns level affecting overall speed, unless you're calling len(s) millions of times.
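If you are not in an IPython session, roughly the same measurement can be made with the standard-library timeit module. This is a sketch under my own assumptions: it uses a plain list instead of a NumPy array, and the absolute numbers will vary by machine and Python version.

```python
import timeit

s = list(range(100))
n = len(s)
number = 1_000_000

# Each statement is run `number` times; names are supplied via `globals`
t_len = timeit.timeit("50 < len(s)", globals={"s": s}, number=number)
t_n = timeit.timeit("50 < n", globals={"n": n}, number=number)

print(f"50 < len(s): {t_len / number * 1e9:.1f} ns per check")
print(f"50 < n:      {t_n / number * 1e9:.1f} ns per check")
```

Even at a million checks, the total difference is a fraction of a second, so this only matters in a very hot loop.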
When reading the book 'Effective Python' by Brett Slatkin I noticed that the author suggested that sometimes building a list using a generator function and calling list on the resulting iterator could lead to cleaner, more readable code.
So an example:
num_list = range(100)
def num_squared_iterator(nums):
    for i in nums:
        yield i**2
def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l
Where a user could call
l = list(num_squared_iterator(num_list))
or
l = get_num_squared_list(num_list)
and get the same result.
The suggestion was that the generator function has less noise because it is shorter and does not have the extra code for creating the list and appending values to it.
(NOTE clearly for these simple examples a list comprehension or generator expression would be better, but let us take it as given that this is a simplification of a pattern that can be used for more complex code that would not be clear in a list comprehension)
My question is this: is there a cost to wrapping the generator in a list? Would it be equivalent in performance to the list-building function?
Seeing this I decided to do a quick test and wrote and ran the following code:
from functools import wraps
from time import time
TEST_DATA = range(100)
def timeit(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        start = time()
        result = func(*args, **kwargs)
        end = time()
        print(f'running time for {func.__name__} = {end-start}')
        return result  # pass the wrapped function's result through
    return wrapped
def num_squared_iterator(nums):
    for i in nums:
        yield i**2
@timeit
def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l
@timeit
def get_num_squared_list_from_iterator(nums):
    return list(num_squared_iterator(nums))
if __name__ == '__main__':
    get_num_squared_list(TEST_DATA)
    get_num_squared_list_from_iterator(TEST_DATA)
I ran the test code many times and each time (much to my surprise) the get_num_squared_list_from_iterator function actually ran (fractionally) faster than the get_num_squared_list function.
Here are results for my first few runs:
1.
running time for get_num_squared_list = 5.2928924560546875e-05
running time for get_num_squared_list_from_iterator = 5.0067901611328125e-05
2.
running time for get_num_squared_list = 5.3882598876953125e-05
running time for get_num_squared_list_from_iterator = 4.982948303222656e-05
3.
running time for get_num_squared_list = 5.1975250244140625e-05
running time for get_num_squared_list_from_iterator = 4.76837158203125e-05
I am guessing that this is due to the expense of doing a list.append in each iteration of the loop in the get_num_squared_list function.
I find this interesting because not only is the code clear and elegant, it also seems to perform better.
I can confirm that your generator with list example is faster:
In [4]: def num_squared_iterator(nums):
   ...:     for i in nums:
   ...:         yield i**2
   ...:
   ...: def get_num_squared_list(nums):
   ...:     l = []
   ...:     for i in nums:
   ...:         l.append(i**2)
   ...:     return l
   ...:
In [5]: %timeit list(num_squared_iterator(nums))
320 µs ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit get_num_squared_list(nums)
370 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: nums = range(100000)
In [8]: %timeit list(num_squared_iterator(nums))
33.2 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit get_num_squared_list(nums)
36.3 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, there is more to the story. Conventional wisdom is that generators are slower than iterating over other types of iterables, because generators carry a lot of overhead. However, using list pushes the list-building code down to the C level, so you are seeing something of a middle ground. Note that the for-loop can be optimized like this:
In [10]: def get_num_squared_list_microoptimized(nums):
    ...:     l = []
    ...:     append = l.append
    ...:     for i in nums:
    ...:         append(i**2)
    ...:     return l
    ...:
In [11]: %timeit list(num_squared_iterator(nums))
33.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: %timeit get_num_squared_list(nums)
36.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit get_num_squared_list_microoptimized(nums)
33.3 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And now you see that much of the difference between the approaches disappears if you cache l.append in a local variable (the repeated attribute lookup is what the list constructor avoids). In general, method resolution is slow in Python. In tight loops, the above micro-optimization is well known and is more or less the first step one would take to make a for-loop more performant.
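For anyone who wants to rerun the comparison outside IPython, here is a sketch using the standard-library timeit module. The function names, the input size and the repeat counts are my own choices for this example, and the relative ordering may well differ between Python versions, since newer interpreters have reduced the cost of method calls.

```python
import timeit

def squares_gen(nums):
    for i in nums:
        yield i ** 2

def squares_append(nums):
    out = []
    for i in nums:
        out.append(i ** 2)  # attribute lookup on every iteration
    return out

def squares_cached_append(nums):
    out = []
    append = out.append  # look the bound method up once
    for i in nums:
        append(i ** 2)
    return out

nums = range(10_000)
candidates = {
    "list(generator)": lambda: list(squares_gen(nums)),
    "append": lambda: squares_append(nums),
    "cached append": lambda: squares_cached_append(nums),
}

timings = {}
for label, fn in candidates.items():
    # best of 3 repeats, 50 calls each
    timings[label] = min(timeit.repeat(fn, number=50, repeat=3))
    print(f"{label:>16}: {timings[label] / 50 * 1e3:.3f} ms per call")
```

All three build the same list, so the choice can be made on readability unless profiling shows this loop to be a real bottleneck.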
I'm trying to optimize my code a bit. A single call is pretty fast, but since it is called often, the time adds up.
My input data looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randn(30),
                  index=pd.date_range(pd.Timestamp(2016,1,1), periods=30))
df.iloc[:20] = np.nan
Now I just want to apply a simple function. Here is the part I want to optimize:
s = df >= df.shift(1)
s = s.applymap(lambda x: 1 if x else 0)
Right now I'm getting 1000 loops, best of 3: 1.36 ms per loop. I guess it should be possible to do it much faster. I'm not sure whether I should vectorize, work only with numpy, or maybe use cython. Any idea what the best approach is? I struggle a bit with the shift operator.
You can cast the result of your comparison directly from bool to int:
(df >= df.shift(1)).astype(int)
@Paul H's answer is good, performant, and what I'd generally recommend.
That said, if you want to squeeze out every last bit of performance, this is a decent candidate for numba, which you can use to compute the answer in a single pass over the data.
import numpy as np
import pandas as pd
from numba import njit
@njit
def do_calc(arr):
    N = arr.shape[0]
    ans = np.empty(N, dtype=np.int_)
    ans[0] = 0
    for i in range(1, N):
        # >= to match the pandas expression; comparisons with NaN are False
        ans[i] = 1 if arr[i] >= arr[i-1] else 0
    return ans
a = (df >= df.shift(1)).astype(int)
b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
from pandas.testing import assert_frame_equal
assert_frame_equal(a, b)
Here are timings
In [45]: %timeit b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
135 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [46]: %timeit a = (df >= df.shift(1)).astype(int)
762 µs ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's my current best solution:
values = df.values[1:] >= df.values[:-1]
data = np.array(values, dtype=int)
s = pd.DataFrame(data, df.index[1:])
I'm getting 10000 loops, best of 3: 125 µs per loop, a 10x improvement. But I think it could be done even faster.
PS: this solution isn't exactly correct, since the first zero / NaN row is missing.
PPS: that can be corrected with pd.DataFrame(np.append([[0]],data), df.index)
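Putting the slicing idea and the leading-zero correction together, a NumPy-only sketch could look like this. The function name ge_previous is made up for this example, and it works on a 1-D array of values rather than on the DataFrame itself.

```python
import numpy as np

def ge_previous(values):
    """Return 1 where values[i] >= values[i-1], else 0; the first entry is 0."""
    values = np.asarray(values, dtype=float)
    out = np.zeros(values.shape[0], dtype=int)
    # Comparisons involving NaN evaluate to False, so those rows stay 0
    out[1:] = values[1:] >= values[:-1]
    return out

print(ge_previous([np.nan, 1.0, 2.0, 2.0, 1.5]))  # [0 0 1 1 0]
```

Applied to the frame above, this would be called as ge_previous(df[0].values) and wrapped back into a DataFrame with df.index if the index is needed.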