I'm currently working with generators and factorials in Python.
As an example:
itertools.permutations(range(100))
Meaning, I receive a generator object that yields 100! values.
In reality the code looks a bit more complicated: I'm using a list of sublists instead of range(100), with the goal of finding a permutation of those sublists that meets my conditions.
This is the code:
mylist = [[0, 0, 1], ..., [5, 7, 3]] # random numbers
x = True in (True for combination in itertools.permutations(mylist)
             if compare(combination))
# compare() returns True for only one or a few combinations in that generator
# (note: the genexpr must yield True, not the combination itself, for the
# membership test to succeed)
I realized this is very time-consuming. Is there a more efficient way to do this, and, moreover, a way to estimate how much time it is going to take?
I've done a few %timeit using ipython:
%timeit (combination for combination in itertools.permutations(mylist) if compare(combination))
--> 697 ns
%timeit (combination for combination in itertools.permutations(range(100)) if compare(combination))
--> 572 ns
Note: I do understand that the generator is only created at this point; it is evaluated lazily, so the generator comprehension has to be consumed before any values are actually produced.
I've seen a lot of tutorials explaining how generators work, but I've found nothing about their execution time.
I don't need an exact value (like timing the execution with the time module inside my program); I need a rough estimate before execution.
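That laziness is easy to see directly. A small sketch (the filter condition here is just an arbitrary stand-in, not the real compare()): creating the generator costs essentially nothing, and consuming it stops at the first match.

```python
import itertools

# Creation is O(1): no permutation has been generated yet.
gen = (c for c in itertools.permutations(range(100)) if c[0] + c[1] == 1)

# Work starts only when the generator is consumed, and next() stops
# at the first permutation satisfying the condition.
first = next(gen)
print(first[:3])  # (0, 1, 2) -- the very first permutation already matches
```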
Edit:
I've also tested this for smaller inputs: lists containing 24, 10 and 5 sublists. For those I get an instant result.
This means the program works; it is just a matter of time.
My problem, stated more clearly: how much time is this going to take, and is there a less time-consuming way to do it?
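One way to get the rough pre-execution estimate asked for: time a fixed-size sample of the generator and extrapolate linearly to the full n! count. This is only a sketch (the predicate is a hypothetical stand-in for compare(), and it assumes a roughly constant per-item cost):

```python
import itertools
import math
import time

def seconds_per_item(make_gen, predicate, sample=100_000):
    # Consume `sample` items, applying the predicate to each,
    # and return the measured average cost per item.
    gen = make_gen()
    start = time.perf_counter()
    for item in itertools.islice(gen, sample):
        predicate(item)
    return (time.perf_counter() - start) / sample

per_item = seconds_per_item(
    lambda: itertools.permutations(range(12)),
    lambda c: c[0] == 11,               # stand-in for compare()
)
total = per_item * math.factorial(12)   # extrapolated full-run time
print(f"~{per_item:.2e} s/item, ~{total:.1f} s for all 12! permutations")
```

Running the same extrapolation with math.factorial(100) makes immediately clear why scanning all 100! permutations is infeasible no matter how fast compare() is.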
A comparison of generators, generator expressions, lists and list comprehensions:
In [182]: range(5)
Out[182]: range(0, 5)
In [183]: list(range(5))
Out[183]: [0, 1, 2, 3, 4]
In [184]: (x for x in range(5))
Out[184]: <generator object <genexpr> at 0x7fc18cd88a98>
In [186]: [x for x in range(5)]
Out[186]: [0, 1, 2, 3, 4]
Some timings:
In [187]: timeit range(1000)
248 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [188]: timeit (x for x in range(1000))
802 ns ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [189]: timeit [x for x in range(1000)]
43.4 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [190]: timeit list(range(1000))
23.6 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The time for setting up a generator (or a range object) is practically independent of the parameter; populating a list scales roughly with its size.
In [193]: timeit range(100000)
252 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [194]: timeit list(range(100000))
4.41 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Edit:
Timings show that an in test on a generator is somewhat faster than building a list, but it still scales with the length:
In [264]: timeit True in (True for x in itertools.permutations(range(15),2) if x==(14,4))
17.1 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [265]: timeit list (True for x in itertools.permutations(range(15),2) if x==(14,4))
18.5 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [266]: timeit (14,4) in itertools.permutations(range(15),2)
8.85 µs ± 8.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [267]: timeit list(itertools.permutations(range(15),2))
11.3 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
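The short-circuiting is easy to see by moving the target: an early match returns almost immediately, while a late one has to walk most of the stream. A small sketch:

```python
import itertools
import timeit

# (0, 1) is the first 2-permutation of range(15); (14, 13) is the last.
early = timeit.timeit(
    lambda: (0, 1) in itertools.permutations(range(15), 2), number=2000)
late = timeit.timeit(
    lambda: (14, 13) in itertools.permutations(range(15), 2), number=2000)
print(early < late)  # the in test stops at the first match, so early wins
```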
Related
I'm not sure why comparing on a numpy array sliced with [1:99, 1:99] is a lot slower than slicing twice with [1:99][1:99]. For example:
import time
import numpy as np

start = time.time()
a = np.zeros((100, 100))
for _ in range(1000000):
    a[1:99][1:99] == 1
print(time.time() - start)
# 3.2756259441375732

start = time.time()
a = np.zeros((100, 100))
for _ in range(1000000):
    a[1:99, 1:99] == 1
print(time.time() - start)
# 11.044903039932251
That's over 3 times worse.
The time measurements are approximately the same using timeit.
I'm working on a recursive algorithm (that part is intentional), and this slowdown makes my program take about 10 seconds instead of 1. I just want to know the reason behind it; maybe it's a bug. I'm using Python 3.9.9. Thanks.
The first is the same as a[2:99] == 1: a (98,100) slice followed by a (97,100) slice, and then the == test.
In [177]: timeit (a[1:99][1:99]==1)
8.51 µs ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [178]: timeit (a[1:99][1:99])
383 ns ± 5.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [179]: timeit (a[1:99])
208 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The bulk of the time is the test, not the slicing.
In [180]: a[1:99,1:99].shape
Out[180]: (98, 98)
In [181]: timeit a[1:99,1:99]==1
32.2 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [182]: timeit a[1:99,1:99]
301 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Again the slicing is a minor part of the timing, but the == test is significantly slower. In the first case we selected a subset of the rows, so the test is on a contiguous block of the data-buffer. In the second we select a subset of rows and columns. Iteration through the data-buffer is more complicated.
We can simplify the comparison by testing a slice of columns versus a slice of rows:
In [183]: timeit a[:,2:99]==1
32.3 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [184]: timeit a[2:99,:]==1
8.58 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a further test, make a new array in 'F' (column-major) order. Now "rows" are the slow slice:
In [189]: b = np.array(a, order='F')
In [190]: timeit b[:,2:99]==1
8.83 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [191]: timeit b[2:99,:]==1
32.8 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
===
But why are you comparing these two slices at all? One makes a (97,100) array, the other a (98,98); they pick different parts of a.
I wonder if you really meant to test a row slice followed by a column slice, not two row slices.
In [193]: timeit (a[1:99][:,1:99]==1)
32.6 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparing just the slicing we see that the sequential one is slower - by just a bit.
In [194]: timeit (a[1:99][:,1:99])
472 ns ± 3.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [195]: timeit (a[1:99,1:99])
306 ns ± 3.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
===
The data for a is actually stored in a flat 1-D C array. The numpy code uses strides and shape to iterate through it when doing something like a[...] == 1.
So imagine a (3,6) array whose data buffer looks like
[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
Sliced with [1:3], it will use
[_ _ _ _ _ _ 0 1 2 3 4 5 0 1 2 3 4 5]
Sliced with [:,1:4], it will use
[_ 1 2 3 _ _ _ 1 2 3 _ _ _ 1 2 3 _ _]
Regardless of the processor caching details, the iteration through the 2nd is more complex.
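The difference shows up directly in the array flags and strides; a small illustration:

```python
import numpy as np

a = np.zeros((100, 100))          # float64, C order: rows are contiguous

# A row slice is still one contiguous block of the buffer.
print(a[2:99, :].flags['C_CONTIGUOUS'])   # True

# A column slice keeps the same strides but skips bytes within each row,
# so iteration can no longer stream straight through the buffer.
print(a[:, 2:99].flags['C_CONTIGUOUS'])   # False
print(a.strides, a[:, 2:99].strides)      # (800, 8) for both

# Copying restores contiguity (at the cost of the copy itself).
print(np.ascontiguousarray(a[:, 2:99]).flags['C_CONTIGUOUS'])  # True
```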
I have the following issue: I have a matrix yj of size (m,200) (m = 3683), and I have a dictionary that, for each key, returns a numpy array of row indices for yj (the array size changes from key to key, in case anyone is wondering).
Now, I have to access this matrix lots of times (around 1M times) and my code is slowing down because of the indexing (I've profiled the code and it takes 65% of time on this step).
Here is what I've tried out:
First of all, use the indices for slicing:
>> %timeit yj[R_u_idx_train[1]]
10.5 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The variable R_u_idx_train is the dictionary that has the row indices.
I thought that maybe boolean indexing might be faster:
>> %timeit yj[R_u_idx_train_mask[1]]
10.5 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
R_u_idx_train_mask is a dictionary that returns a boolean array of size m where the indices given by R_u_idx_train are set to True.
I also tried np.ix_
>> cols = np.arange(0,200)
>> %timeit ix_ = np.ix_(R_u_idx_train[1], cols); yj[ix_]
42.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I also tried np.take
>> %timeit np.take(yj, R_u_idx_train[1], axis=0)
2.35 ms ± 88.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And while this seems great, it is not: it gives an array of shape (R_u_idx_train[1].shape[0], R_u_idx_train[1].shape[0]) when it should be (R_u_idx_train[1].shape[0], 200). I guess I'm not using the method correctly.
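For what it's worth, a quick sanity check with made-up stand-ins (yj and the index array here are hypothetical, not the real data) shows that np.take with a 1-D integer index and axis=0 does return the expected shape, matching plain fancy indexing:

```python
import numpy as np

yj = np.zeros((3683, 200))
idx = np.array([0, 5, 10, 42])         # hypothetical stand-in for R_u_idx_train[1]
print(np.take(yj, idx, axis=0).shape)  # (4, 200)
print(yj[idx].shape)                   # (4, 200) -- fancy indexing agrees
```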
I also tried np.compress
>> %timeit np.compress(R_u_idx_train_mask[1], yj, axis=0)
14.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally I tried to index with a boolean matrix
>> %timeit yj[R_u_idx_train_mask2[1]]
244 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, is 10.5 µs ± 79.7 ns per loop the best I can do? I could try Cython, but that seems like a lot of work for just indexing...
Thanks a lot.
A very smart solution was given by V.Ayrat in the comments:
>> newdict = {k: yj[R_u_idx_train[k]] for k in R_u_idx_train.keys()}
>> %timeit newdict[1]
202 ns ± 6.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Anyway, it would still be cool to know whether there is a way to speed this up using numpy!
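A sketch of that caching pattern with toy data (the dict contents here are made up). One caveat worth noting: fancy indexing copies, so the cached arrays will not reflect later updates to yj.

```python
import numpy as np

m, ncols = 3683, 200
yj = np.random.rand(m, ncols)
R_u_idx_train = {1: np.array([3, 17, 256]),    # toy stand-in for the real dict
                 2: np.array([0, 9])}

# Pay the fancy-indexing cost once per key; afterwards every access
# is a plain dict lookup.
newdict = {k: yj[idx] for k, idx in R_u_idx_train.items()}

print(newdict[1].shape)        # (3, 200)
yj[3, 0] = -1.0                # mutate the source matrix...
print(newdict[1][0, 0] == -1)  # False: the cached row is a copy
```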
I timed the set() and list() constructors; set() was significantly slower than list(), even though I benchmarked them on values with no duplicates. I know sets use hash tables; is that the reason they're slower?
I'm using Python 3.7.5 [MSC v.1916 64 bit (AMD64)] on Windows 10, as of this writing (8th March).
# No significant difference observed.
timeit set(range(10))
517 ns ± 4.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
timeit list(range(10))
404 ns ± 4.71 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
As the size increases, set() becomes much slower than list():
# When size is 100
timeit set(range(100))
2.13 µs ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
timeit list(range(100))
934 ns ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# when size is ten thousand.
timeit set(range(10000))
325 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit list(range(10000))
240 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# When size is one million.
timeit set(range(1000000))
86.9 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit list(range(1000000))
37.7 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Both of them are O(n) asymptotically. When there are no duplicates, shouldn't set(...) be approximately equal to list(...)?
To my surprise, set comprehensions and list comprehensions didn't show the huge deviations that set() and list() did.
# When size is 100.
timeit {i for i in range(100)}
3.96 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
timeit [i for i in range(100)]
3.01 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# When size is ten thousand.
timeit {i for i in range(10000)}
434 µs ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit [i for i in range(10000)]
395 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# When size is one million.
timeit {i for i in range(1000000)}
95.1 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit [i for i in range(1000000)]
87.3 ms ± 760 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why should they be the same? Yes, they are both O(n) but set() needs to hash each element and needs to account for elements not being unique. This translates to a higher fixed cost per element.
Big O says nothing about absolute times, only how the time taken will grow as the size of the input grows. Two O(n) algorithms, given the same inputs, can take vastly different amounts of time to complete. All you can say is that when the size of the input doubles, the amount of time taken will (roughly) double, for both functions.
If you want to understand Big O better, I highly recommend Ned Batchelder’s introduction to the subject.
When there are no duplicates, shouldn't set(...) be approximately equal to list(...)?
No, they are not equal, because list() doesn't hash. The absence of duplicates doesn't come into it: set() still has to hash every element to discover there are none.
To my surprise, set comprehensions and list comprehensions didn't show the huge deviations that set() and list() did.
The additional loop executed by the Python interpreter adds overhead that dominates the time taken, so the higher fixed cost of set() is less prominent.
There are other differences that may make a difference:
Given a sequence with a known length, list() can pre-allocate enough memory to fit those elements. Sets can't pre-allocate as they can't know how many duplicates there will be. Pre-allocating avoids the (amortised) cost of having to grow the list dynamically.
List and set comprehensions add one element at a time, so list objects can't preallocate, increasing the fixed per-item cost slightly.
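On CPython, the preallocation difference can be observed through the allocated sizes: list() sizes the list exactly from the length hint, while a comprehension grows it incrementally and usually ends up with spare capacity. A quick check (exact numbers vary by Python version, so only the relation is asserted):

```python
import sys

exact = sys.getsizeof(list(range(1000)))         # preallocated via __length_hint__
grown = sys.getsizeof([i for i in range(1000)])  # grown one append at a time
print(exact <= grown)  # True on CPython: the grown list carries spare slots
```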
I am curious about the fact that, when applying a function to each element of a pd.Series inside a for loop, the execution time is significantly less than N times the single-call cost.
Consider the function below, which rotates a number bit-wise (the code itself is not important here):
import numpy as np

def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)
When executing this function 1000 times in a for loop, it takes roughly 1000 times as long, as expected:
x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)
%timeit rotate(x)
# 13 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i in range(1000):
    rotate(x)
# 9.61 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, when I apply this function inside a for loop over a Series of size 1000, it gets significantly faster:
s = pd.Series(np.random.randint(2 ** 32 - 1, size=1000, dtype=np.uint32))
%%timeit
for x in s:
    rotate(x)
# 2.08 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is the mechanism that makes this happen?
Note that in your first loop you're not actually using the value produced by the iterator. The following is a better comparison:
...: %%timeit
...: for i in range(1000):
...:     rotate(i)
...:
1.46 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
...: %%timeit
...: for x in s:
...:     rotate(x)
...:
1.6 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, they perform more or less the same.
In your original example, by using the variable x declared outside the loop, the interpreter has to load it with LOAD_GLOBAL 2 (x), while if you just use the loop value i, it can use LOAD_FAST 0 (i), which, as the name hints, is faster.
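The opcode difference can be checked with the dis module; the toy functions below stand in for the timeit cell bodies:

```python
import dis

x = 7

def uses_global():
    return x + 1               # x is resolved with LOAD_GLOBAL

def uses_local(i):
    return i + 1               # i is resolved with LOAD_FAST

ops_g = {ins.opname for ins in dis.get_instructions(uses_global)}
ops_l = {ins.opname for ins in dis.get_instructions(uses_local)}
print('LOAD_GLOBAL' in ops_g, 'LOAD_FAST' in ops_l)  # True True
```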
Consider the following code:
import numpy as np
import pandas as pd
a = pd.DataFrame({'case': np.arange(10000) % 100,
                  'x': np.random.rand(10000) > 0.5})
%timeit any(a.x)
%timeit a.x.max()
%timeit a.groupby('case').x.transform(any)
%timeit a.groupby('case').x.transform(max)
13.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
195 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.9 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
b = pd.DataFrame({'x': np.random.rand(100) > 0.5})
%timeit any(b.x)
%timeit b.x.max()
13.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.5 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We see that "any" works faster than "max" on a boolean pandas.Series of size 100 and 10000, but when we try to groupby and transform data in groups of 100, suddenly "max" is a lot faster than "any". Why?
Because any is evaluated lazily: it stops at the first True element.
max, however, can't do that, because it has to inspect every element of the sequence to be sure it hasn't missed a greater one.
That's why max always inspects all elements, while any inspects only the elements up to and including the first True.
The cases where max works faster are probably those involving type coercion: all values in numpy are stored in native machine types and formats, and mathematical operations on them may be faster than Python's any.
As said in a comment, the Python any function has a short-circuit mechanism, while np.any does not; see here.
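The short circuit is easy to observe by counting how often a wrapped predicate is actually called (np.any, by contrast, evaluates its whole input array):

```python
calls = 0

def seen(value):
    # Count every element any() actually looks at.
    global calls
    calls += 1
    return value

data = [False, True] + [False] * 1_000_000
any(seen(v) for v in data)
print(calls)  # 2: any() stopped right after the first True
```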
But True in a.x is even faster. (Careful, though: in on a pandas Series tests membership in the index, not the values, so this is not an equivalent check; True in a.x.values would test the values.)
%timeit any(a.x)
53.6 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit True in (a.x)
3.39 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)