Why is statistics.mean() so slow compared to the NumPy version, or even to a naive implementation such as:
def mean(items):
    return sum(items) / len(items)
On my system, I get the following timings:
import numpy as np
import statistics
ll_int = [x for x in range(100_000)]
%timeit statistics.mean(ll_int)
# 42 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_int) / len(ll_int)
# 460 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_int)
# 4.62 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ll_float = [x / 10 for x in range(100_000)]
%timeit statistics.mean(ll_float)
# 56.7 ms ± 879 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_float) / len(ll_float)
# 459 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_float)
# 2.7 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I get similar timings for other functions like variance or stdev.
EDIT:
Even an iterative implementation like this:
def next_mean(value, mean_, num):
    return (num * mean_ + value) / (num + 1)

def imean(items, mean_=0.0):
    for i, item in enumerate(items):
        mean_ = next_mean(item, mean_, i)
    return mean_
seems to be faster:
%timeit imean(ll_int)
# 16.6 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit imean(ll_float)
# 16.2 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The statistics module uses interpreted Python code, but numpy is using optimized compiled code for all of its heavy lifting, so it would be surprising if numpy didn't blow statistics out of the water.
Furthermore, statistics is designed to play nice with modules like decimal and fractions and uses code which values numerical accuracy and type safety over speed. Your naive implementation uses sum. The statistics module uses its own function called _sum internally. Looking at its source shows that it does an awful lot more than just add things together:
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    # Built-in sum returns zero.
    >>> _sum([1e50, 1, -1e50] * 1000)
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n, d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
The most surprising thing about this code is that it converts the data to fractions so as to minimize round-off error. There is no reason to expect that code like this would be as quick as a simple sum(nums)/len(nums) approach.
The developer of the statistics module made an explicit decision to value correctness over speed:
Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.
and moreover stated that there was no intention
to replace, or even compete directly with, numpy
However, an enhancement request was raised to add an additional, faster, simpler implementation, statistics.fmean, and this function will be released in Python 3.8. According to the enhancement developer this function is up to 500 times faster than the existing statistics.mean.
The fmean implementation is pretty much sum/len.
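For what it's worth, the idea behind fmean can be sketched like this (an approximation of the approach, not the actual CPython source; the real function uses math.fsum and also handles iterators without a length):

from math import fsum

def fmean_sketch(data):
    # Do everything in float: one accurately rounded fsum() plus a division,
    # skipping the Fraction conversion and type coercion that _sum performs.
    data = list(data)
    return fsum(data) / len(data)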
For the LeetCode problem 'Top K Frequent Elements' (https://leetcode.com/problems/top-k-frequent-elements/submissions/),
there is a solution that completes the task in just 88 ms, while mine completes it in 124 ms; I see that as a large difference.
I tried to understand why, but the docs don't describe how the function I use, most_common(), is implemented. If I want to dig into details like that, so that I can write algorithms that run that fast in the future, what should I read (specific books, or any other resources)?
My code (124 ms):
def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    c = Counter(nums)
    return [t[0] for t in c.most_common(k)]
The other solution (88 ms, faster):
def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    count = Counter(nums)
    return heapq.nlargest(k, count.keys(), key=count.get)
Both take nearly the same amount of memory, so there is no difference there.
The implementation of most_common
also uses heapq.nlargest, but it calls it with count.items() instead of count.keys(). This makes it a tiny bit slower, and it also incurs the overhead of creating a new list in order to extract the [0] value from each element of the list returned by most_common().
The heapq.nlargest version avoids this extra overhead: it passes count.keys() as the second argument, so it does not need to iterate over the result again to extract pieces into a new list.
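For reference, Counter.most_common in CPython is essentially the following (a sketch of the idea; see the collections source for the exact code):

import heapq
from operator import itemgetter

def most_common_sketch(counter, n=None):
    # Sort everything when n is None, otherwise take the n (key, count)
    # pairs with the largest counts via a heap.
    if n is None:
        return sorted(counter.items(), key=itemgetter(1), reverse=True)
    return heapq.nlargest(n, counter.items(), key=itemgetter(1))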
@trincot seems to have answered the question, but if anyone is looking for a faster way to do this, then use NumPy, provided nums can be stored as an np.array:
import numpy as np

def topKFrequent_numpy(nums, k):
    unique, counts = np.unique(nums, return_counts=True)
    return unique[np.argsort(-counts)[:k]]
One speed test
nums_array = np.random.randint(1000, size=1000000)
nums_list = list(nums_array)
%timeit topKFrequent_Counter(nums_list, 500)
# 116 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_heapq(nums_list, 500)
# 117 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_numpy(nums_array, 500)
# 39.2 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
(Speeds may be dramatically different for other input values)
Let A be an (N, M, M) array (with N very large); I would like to compute scipy.linalg.expm(A[n, :, :]) for each n in range(N). I can of course just use a for loop, but I was wondering if there is some trick to do this in a better way (something like np.einsum).
I have the same question for other operations like inverting matrices (inverting solved in comments).
Depending on the size and structure of your matrices you can do better than loop.
Assuming your matrices can be diagonalized as A = V D V^(-1) (where D has the eigenvalues in its diagonal and V contains the corresponding eigenvectors as columns), you can compute the matrix exponential as
exp(A) = V exp(D) V^(-1)
where exp(D) simply contains exp(lambda) for each eigenvalue lambda in its diagonal. This is really easy to prove with the power series definition of the exponential function: since (V D V^(-1))^k = V D^k V^(-1), every term of the series factors the same way. If the matrix A is furthermore normal, the matrix V is unitary and thus its inverse can be computed by simply taking its adjoint.
The good news is that numpy.linalg.eig and numpy.linalg.inv both work with stacked matrices just fine:
import numpy as np
import scipy.linalg
A = np.random.rand(1000,10,10)
def loopy_expm(A):
    expmA = np.zeros_like(A)
    for n in range(A.shape[0]):
        expmA[n, ...] = scipy.linalg.expm(A[n, ...])
    return expmA

def eigy_expm(A):
    vals, vects = np.linalg.eig(A)
    return np.einsum('...ik, ...k, ...kj -> ...ij',
                     vects, np.exp(vals), np.linalg.inv(vects))
Note that there's probably some room for optimization in specifying the order of operations in the call to einsum, but I didn't investigate that.
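One low-effort thing worth trying (not benchmarked here) is to let einsum pick a contraction order itself via the optimize keyword, which is available in NumPy 1.12 and later:

def eigy_expm_opt(A):
    vals, vects = np.linalg.eig(A)
    # Same contraction as above; optimize=True lets NumPy choose the
    # order in which the three operands are multiplied.
    return np.einsum('...ik, ...k, ...kj -> ...ij',
                     vects, np.exp(vals), np.linalg.inv(vects),
                     optimize=True)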
Testing the above for the random array:
In [59]: np.allclose(loopy_expm(A),eigy_expm(A))
Out[59]: True
In [60]: %timeit loopy_expm(A)
824 ms ± 55.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit eigy_expm(A)
138 ms ± 992 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's already nice. If you're lucky enough that your matrices are all normal (say, because they are real symmetric):
A = np.random.rand(1000,10,10)
A = (A + A.transpose(0,2,1))/2
def eigy_expm_normal(A):
    vals, vects = np.linalg.eig(A)
    return np.einsum('...ik, ...k, ...jk -> ...ij',
                     vects, np.exp(vals), vects.conj())
Note the symmetric definition of the input matrix and the transpose inside the pattern of einsum. Results:
In [80]: np.allclose(loopy_expm(A),eigy_expm_normal(A))
Out[80]: True
In [79]: %timeit loopy_expm(A)
878 ms ± 89.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [80]: %timeit eigy_expm_normal(A)
55.8 ms ± 868 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is a 15-fold speedup for the above example shapes.
It should be noted though that scipy.linalg.expm uses Padé approximation according to the documentation. This might imply that if your matrices are ill-conditioned, the eigenvalue decomposition may yield different results than scipy.linalg.expm. I'm not familiar with how that function works, but I expect it to be safer for pathological inputs.
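As a quick illustration of that caveat, a defective (non-diagonalizable) matrix is a case where the eigendecomposition route breaks down while expm stays reliable (a sketch, assuming the imports and eigy_expm defined above):

# A 2x2 Jordan block is defective: it has the repeated eigenvalue 0 but
# only one independent eigenvector, so the eigenvector matrix V returned
# by np.linalg.eig is (numerically) singular and V exp(D) V^(-1) cannot
# reproduce the true exponential.
J = np.array([[0.0, 1.0],
              [0.0, 0.0]])

print(scipy.linalg.expm(J))   # the exact answer is [[1., 1.], [0., 1.]]

# eigy_expm(J[None, ...]) relies on inverting V, which is numerically
# singular here, so its output misses the off-diagonal term or is
# otherwise unreliable; avoid the eig route for (nearly) defective input.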
When reading the book 'Effective Python' by Brett Slatkin I noticed that the author suggested that sometimes building a list using a generator function and calling list on the resulting iterator could lead to cleaner, more readable code.
So an example:
num_list = range(100)
def num_squared_iterator(nums):
    for i in nums:
        yield i**2

def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l
Where a user could call
l = list(num_squared_iterator(num_list))
or
l = get_num_squared_list(nums)
and get the same result.
The suggestion was that the generator function has less noise because it is shorter and does not have the extra code for creating the list and appending values to it.
(NOTE: clearly, for these simple examples a list comprehension or generator expression would be better, but let us take it as given that this is a simplification of a pattern that can be used for more complex code that would not be clear in a list comprehension.)
My question is this: is there a cost to wrapping the generator in a list? Would it be equivalent in performance to the list-building function?
Seeing this I decided to do a quick test and wrote and ran the following code:
from functools import wraps
from time import time
TEST_DATA = range(100)
def timeit(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        start = time()
        func(*args, **kwargs)
        end = time()
        print(f'running time for {func.__name__} = {end-start}')
    return wrapped

def num_squared_iterator(nums):
    for i in nums:
        yield i**2

@timeit
def get_num_squared_list(nums):
    l = []
    for i in nums:
        l.append(i**2)
    return l

@timeit
def get_num_squared_list_from_iterator(nums):
    return list(num_squared_iterator(nums))

if __name__ == '__main__':
    get_num_squared_list(TEST_DATA)
    get_num_squared_list_from_iterator(TEST_DATA)
I ran the test code many times and each time (much to my surprise) the get_num_squared_list_from_iterator function actually ran (fractionally) faster than the get_num_squared_list function.
Here are results for my first few runs:
1.
running time for get_num_squared_list = 5.2928924560546875e-05
running time for get_num_squared_list_from_iterator = 5.0067901611328125e-05
2.
running time for get_num_squared_list = 5.3882598876953125e-05
running time for get_num_squared_list_from_iterator = 4.982948303222656e-05
3.
running time for get_num_squared_list = 5.1975250244140625e-05
running time for get_num_squared_list_from_iterator = 4.76837158203125e-05
I am guessing that this is due to the expense of doing a list.append in each iteration of the loop in the get_num_squared_list function.
I find this interesting because not only is the code clear and elegant, it also seems more performant.
I can confirm that your generator with list example is faster:
In [4]: def num_squared_iterator(nums):
   ...:     for i in nums:
   ...:         yield i**2
   ...:
   ...: def get_num_squared_list(nums):
   ...:     l = []
   ...:     for i in nums:
   ...:         l.append(i**2)
   ...:     return l
   ...:
In [5]: %timeit list(num_squared_iterator(nums))
320 µs ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit get_num_squared_list(nums)
370 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: nums = range(100000)
In [8]: %timeit list(num_squared_iterator(nums))
33.2 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit get_num_squared_list(nums)
36.3 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, there is more to the story. Conventional wisdom is that generators are slower than iterating over other types of iterables; there is a lot of overhead to generators. But using list pushes the list-building code down to the C level, so you are sort of seeing a middle ground. Note that the for loop can be optimized thusly:
In [10]: def get_num_squared_list_microoptimized(nums):
    ...:     l = []
    ...:     append = l.append
    ...:     for i in nums:
    ...:         append(i**2)
    ...:     return l
    ...:
In [11]: %timeit list(num_squared_iterator(nums))
33.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: %timeit get_num_squared_list(nums)
36.5 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit get_num_squared_list_microoptimized(nums)
33.3 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And now you see that a lot of the difference between the approaches can be ameliorated if you "inline" l.append (which is what the list constructor avoids). In general, method resolution is slow in Python. In tight loops, the above micro-optimization is well known and is sort of the first step one would take to make a for loop more performant.
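For completeness (not timed above), a list comprehension also keeps the whole loop at the C level, so it should be at least as fast as either version here; this is essentially the point the question's note about comprehensions makes:

def get_num_squared_list_comprehension(nums):
    # The comprehension avoids both the generator overhead and the
    # repeated append-method lookups.
    return [i**2 for i in nums]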
Is there some difference between NumPy np.inf and float('Inf')?
float('Inf') == np.inf returns True, so it seems they are interchangeable. I was therefore wondering why NumPy has defined its own "inf" constant, and when should I use one constant instead of the other (considering style concerns too)?
TL;DR: There is no difference and they can be used interchangeably.
Besides having the same value as math.inf and float('inf'):
>>> import math
>>> import numpy as np
>>> np.inf == float('inf')
True
>>> np.inf == math.inf
True
It also has the same type:
>>> import numpy as np
>>> type(np.inf)
float
>>> type(np.inf) is type(float('inf'))
True
That's interesting because NumPy also has its own floating point types:
>>> np.float32(np.inf)
inf
>>> type(np.float32(np.inf))
numpy.float32
>>> np.float32('inf') == np.inf # nevertheless equal
True
So it has the same value and the same type as math.inf and float('inf') which means it's interchangeable.
Reasons for using np.inf
It's less to type:
np.inf (6 chars)
math.inf (8 chars; new in python 3.5)
float('inf') (12 chars)
That means if you already have NumPy imported you can save yourself 6 (or 2) chars per occurrence compared to float('inf') (or math.inf).
Because it's easier to remember.
At least for me, it's far easier to remember np.inf than that I need to call float with a string.
Also, NumPy defines some additional aliases for infinity:
np.Inf
np.inf
np.infty
np.Infinity
np.PINF
It also defines an alias for negative infinity:
np.NINF
Similarly for nan:
np.nan
np.NaN
np.NAN
Constants are constants
This point is based on CPython and could be completely different in another Python implementation.
A CPython float instance requires 24 bytes:
>>> import sys
>>> sys.getsizeof(np.inf)
24
If you can re-use the same instance you might save a lot of memory compared to creating lots of new instances. Of course, this point is moot if you create your own inf constant, but if you don't, then:
a = [np.inf for _ in range(1000000)]
b = [float('inf') for _ in range(1000000)]
b would use 24 * 1000000 bytes (~23 MB) more memory than a.
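A quick way to see this in CPython (a small sketch, not part of the original benchmark):

import numpy as np

a = [np.inf for _ in range(1000)]
b = [float('inf') for _ in range(1000)]

# Every element of `a` is the very same float object, while `b` typically
# holds a distinct 24-byte float object per element.
print(len({id(x) for x in a}))   # 1
print(len({id(x) for x in b}))   # usually 1000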
Accessing a constant is faster than creating a new float from a string:
%timeit np.inf
37.9 ns ± 0.692 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit float('inf')
232 ns ± 13.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [np.inf for _ in range(10000)]
552 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [float('inf') for _ in range(10000)]
2.59 ms ± 78.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Of course, you can create your own constant to counter that point. But why bother if NumPy has already done that for you?
I have a collections.deque() of tuples from which I want to draw random samples.
In Python 2.7, I can use batch = random.sample(my_deque, batch_size).
But in Python 3.4 this raises TypeError: Population must be a sequence or set. For dicts, use list(d).
What's the best workaround, or recommended way to sample efficiently from a deque in Python 3?
The obvious way – convert to a list.
batch = random.sample(list(my_deque), batch_size)
But you can avoid creating an entire list.
idx_batch = set(random.sample(range(len(my_deque)), batch_size))
batch = [val for i, val in enumerate(my_deque) if i in idx_batch]
P.S. (Edited)
Actually, random.sample should work fine with deques in Python >= 3.5, because the class has been updated to match the Sequence interface.
In [3]: deq = collections.deque(range(100))
In [4]: random.sample(deq, 10)
Out[4]: [12, 64, 84, 77, 99, 69, 1, 93, 82, 35]
Note: as Geoffrey Irving has correctly stated in the comment below, you'd better convert the deque into a list, because deques are implemented as linked lists, making each index access O(n) in the size of the deque, so sampling m random values takes O(m*n) time.
sample() on a deque works fine in Python ≥3.5, and it's pretty fast.
In Python 3.4, you could use this instead, which runs about as fast:
sample_indices = sample(range(len(deq)), 50)
[deq[index] for index in sample_indices]
On my MacBook using Python 3.6.8, this solution is over 44 times faster than Eli Korvigo's solution. :)
I used a deque with 1 million items, and I sampled 50 items:
from random import sample
from collections import deque
deq = deque(maxlen=1000000)
for i in range(1000000):
    deq.append(i)
sample_indices = set(sample(range(len(deq)), 50))
%timeit [deq[i] for i in sample_indices]
1.68 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sample(deq, 50)
1.94 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sample(range(len(deq)), 50)
44.9 µs ± 549 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [val for index, val in enumerate(deq) if index in sample_indices]
75.1 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That said, as others have pointed out, a deque is not well suited for random access. If you want to implement a replay memory, you could instead use a rotating list like this:
class ReplayMemory:
    def __init__(self, max_size):
        self.buffer = [None] * max_size
        self.max_size = max_size
        self.index = 0
        self.size = 0

    def append(self, obj):
        self.buffer[self.index] = obj
        self.size = min(self.size + 1, self.max_size)
        self.index = (self.index + 1) % self.max_size

    def sample(self, batch_size):
        indices = sample(range(self.size), batch_size)
        return [self.buffer[index] for index in indices]
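For the timing below, the buffer needs to be filled with a million items first; a minimal setup sketch (the name mem is taken from the timing call):

mem = ReplayMemory(1000000)
for i in range(1000000):
    mem.append(i)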
With a million items, sampling 50 items is blazingly fast:
%timeit mem.sample(50)
# 58 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)