I am trying to get this code running fast in Python, but I am having trouble getting it anywhere near the speed it runs at in MATLAB. The problem seems to be this for loop, which takes about 2 seconds to run when the number "SRpixels" is approximately 25000.
I can't seem to find any way to trim this down further, and I am looking for suggestions.
The datatypes for the NumPy arrays below are float32 for all except the RIGrid*_Location arrays, which are uint32.
for j in range(0, SRpixels):
    # Skip data if outside valid range
    if (abs(SR_pointCloud[j,0]) > SR_xMax or SR_pointCloud[j,2] > SR_zMax or SR_pointCloud[j,2] < 0):
        pass
    else:
        RIGrid1_Location[j,0] = np.floor(((SR_pointCloud[j,0] + xPosition + 5) - xGrid1Center) / gridSize)
        RIGrid1_Location[j,1] = np.floor(((SR_pointCloud[j,2] + yPosition) - yGrid1LowerBound) / gridSize)
        RIGrid1_Count[RIGrid1_Location[j,0], RIGrid1_Location[j,1]] += 1
        RIGrid1_Sum[RIGrid1_Location[j,0], RIGrid1_Location[j,1]] += SR_pointCloud[j,1]
        RIGrid1_SumofSquares[RIGrid1_Location[j,0], RIGrid1_Location[j,1]] += SR_pointCloud[j,1] * SR_pointCloud[j,1]

        RIGrid2_Location[j,0] = np.floor(((SR_pointCloud[j,0] + xPosition + 5) - xGrid2Center) / gridSize)
        RIGrid2_Location[j,1] = np.floor(((SR_pointCloud[j,2] + yPosition) - yGrid2LowerBound) / gridSize)
        RIGrid2_Count[RIGrid2_Location[j,0], RIGrid2_Location[j,1]] += 1
        RIGrid2_Sum[RIGrid2_Location[j,0], RIGrid2_Location[j,1]] += SR_pointCloud[j,1]
        RIGrid2_SumofSquares[RIGrid2_Location[j,0], RIGrid2_Location[j,1]] += SR_pointCloud[j,1] * SR_pointCloud[j,1]
I did attempt to use Cython, where I replaced j with a cdef int j and compiled. There was no noticeable performance gain. Anyone have suggestions?
Vectorization is almost always the best way to speed up numpy code, and much of this seems vectorizable. To start, for example, the location arrays seem quite simple to do:
# these are all of your j values
inds = np.arange(0,SRpixels)
# these are the j values you don't want to skip
sel = np.invert((abs(SR_pointCloud[inds,0]) > SR_xMax) | (SR_pointCloud[inds,2] > SR_zMax) | (SR_pointCloud[inds,2] < 0))
RIGrid1_Location[sel,0] = np.floor(((SR_pointCloud[sel,0] + xPosition + 5) - xGrid1Center) / gridSize)
RIGrid1_Location[sel,1] = np.floor(((SR_pointCloud[sel,2] + yPosition) - yGrid1LowerBound) / gridSize)
RIGrid2_Location[sel,0] = np.floor(((SR_pointCloud[sel,0] + xPosition + 5) - xGrid2Center) / gridSize)
RIGrid2_Location[sel,1] = np.floor(((SR_pointCloud[sel,2] + yPosition) - yGrid2LowerBound) / gridSize)
This has no python loop.
The rest are trickier and will depend upon what you are doing, but should also be vectorizable if you think about them in this way.
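For instance, the count/sum/sum-of-squares accumulations can be expressed with np.add.at, which (unlike a plain fancy-indexed +=) accumulates correctly when the same grid cell appears more than once. A minimal sketch, reusing the variable names from the question and assuming the location arrays have already been filled as above:

rows1 = RIGrid1_Location[sel, 0]
cols1 = RIGrid1_Location[sel, 1]
vals = SR_pointCloud[sel, 1]

np.add.at(RIGrid1_Count, (rows1, cols1), 1)                   # histogram of hits per cell
np.add.at(RIGrid1_Sum, (rows1, cols1), vals)                  # per-cell sum of the values
np.add.at(RIGrid1_SumofSquares, (rows1, cols1), vals * vals)  # per-cell sum of squares
# ...and the same three calls again for the RIGrid2_* arrays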
If you really have something that can't be vectorized and must be done with a loop—I've only had this happen a few times—I'd suggest Weave over Cython. It's harder to use, but should give speeds comparable to C.
Try vectorizing the calculation first. If you must do the calculation element by element, here are some hints for speeding it up:
Calculations with NumPy scalars are much slower than with builtin scalars. array[i, j] returns a NumPy scalar, while array.item(i, j) returns a builtin scalar.
Functions in the math module are faster than NumPy's when doing scalar calculations.
Here is an example:
import numpy as np
import math
a = np.array([[1.1, 2.2, 3.3],[4.4, 5.5, 6.6]])
%timeit np.floor(a[0,0]*2)
%timeit math.floor(a[0,0]*2)
%timeit np.floor(a.item(0,0)*2)
%timeit math.floor(a.item(0,0)*2)
output:
100000 loops, best of 3: 10.2 µs per loop
100000 loops, best of 3: 3.49 µs per loop
100000 loops, best of 3: 6.49 µs per loop
1000000 loops, best of 3: 851 ns per loop
So changing np.floor to math.floor and SR_pointCloud[j,0] to SR_pointCloud.item(j,0) will speed up the loop a lot.
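Applied to the loop in the question, the body might look something like this (a sketch only, reusing the variable names from the question; the RIGrid2 half follows the same pattern):

import math

for j in range(SRpixels):
    x = SR_pointCloud.item(j, 0)  # builtin floats instead of NumPy scalars
    y = SR_pointCloud.item(j, 1)
    z = SR_pointCloud.item(j, 2)
    if abs(x) > SR_xMax or z > SR_zMax or z < 0:
        continue
    r = int(math.floor((x + xPosition + 5 - xGrid1Center) / gridSize))
    c = int(math.floor((z + yPosition - yGrid1LowerBound) / gridSize))
    RIGrid1_Location[j, 0] = r
    RIGrid1_Location[j, 1] = c
    RIGrid1_Count[r, c] += 1
    RIGrid1_Sum[r, c] += y
    RIGrid1_SumofSquares[r, c] += y * y
    # ...repeat with xGrid2Center / yGrid2LowerBound for the RIGrid2_* arrays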
I am trying to efficiently compute a summation of a summation in Python:
WolframAlpha is able to compute it up to a high n value: sum of sum.
I have two approaches: a for loop method and an np.sum method. I thought the np.sum approach would be faster. However, they give the same result until n gets large, after which the np.sum approach has overflow errors and gives the wrong result.
I am trying to find the fastest way to compute this sum.
import numpy as np
import time
def summation(start, end, func):
    sum = 0
    for i in range(start, end + 1):
        sum += func(i)
    return sum

def x(y):
    return y

def x2(y):
    return y**2

def mysum(y):
    return x2(y)*summation(0, y, x)
n=100
# method #1
start=time.time()
summation(0,n,mysum)
print('Slow method:',time.time()-start)
# method #2
start=time.time()
w=np.arange(0,n+1)
(w**2*np.cumsum(w)).sum()
print('Fast method:',time.time()-start)
Here's a very fast way:
result = ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120
How I got there:
Rewrite the inner sum as the well-known x*(x+1)//2. So the whole thing becomes sum(x**2 * x*(x+1)//2 for x in range(n+1)).
Rewrite to sum(x**4 + x**3 for x in range(n+1)) // 2.
Look up formulas for sum(x**4) and sum(x**3): they are n*(n+1)*(2*n+1)*(3*n**2+3*n-1)/30 and (n*(n+1)//2)**2, respectively.
Simplify the resulting mess to (12*n**5 + 45*n**4 + 50*n**3 + 15*n**2 - 2*n) // 120.
Horner it.
Another way to derive it if after steps 1. and 2. you know it's a polynomial of degree 5:
Compute six values with a naive implementation.
Compute the polynomial from the six equations with six unknowns (the polynomial coefficients). I did it similarly to this, but my matrix A is left-right mirrored compared to that, and I called my y-vector b.
Code:
from fractions import Fraction
import math
from functools import reduce
def naive(n):
    return sum(x**2 * sum(range(x+1)) for x in range(n+1))

def lcm(ints):
    return reduce(lambda r, i: r * i // math.gcd(r, i), ints)

def polynomial(xys):
    xs, ys = zip(*xys)
    n = len(xs)
    A = [[Fraction(x**i) for i in range(n)] for x in xs]
    b = list(ys)
    for _ in range(2):
        for i0 in range(n):
            for i in range(i0 + 1, n):
                f = A[i][i0] / A[i0][i0]
                for j in range(i0, n):
                    A[i][j] -= f * A[i0][j]
                b[i] -= f * b[i0]
        A = [row[::-1] for row in A[::-1]]
        b.reverse()
    coeffs = [b[i] / A[i][i] for i in range(n)]
    denominator = lcm(c.denominator for c in coeffs)
    coeffs = [int(c * denominator) for c in coeffs]
    horner = str(coeffs[-1])
    for c in coeffs[-2::-1]:
        horner += ' * n'
        if c:
            horner = f"({horner} {'+' if c > 0 else '-'} {abs(c)})"
    return f'{horner} // {denominator}'
print(polynomial((x, naive(x)) for x in range(6)))
Output (Try it online!):
((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120
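A quick sanity check of the closed form against a naive double sum (small values only, since the naive version is O(n^2) per call):

def naive(n):
    return sum(x**2 * sum(range(x + 1)) for x in range(n + 1))

def closed_form(n):
    return ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120

assert all(naive(n) == closed_form(n) for n in range(50))
print(closed_form(1000))  # 100375416791650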
(fastest methods, 3 and 4, are at the end)
For the fast NumPy method you need to specify dtype=np.object so that NumPy does not convert the Python ints to its own dtypes (np.int64 or others). It then gives correct results (checked up to N=100000).
# method #2
start=time.time()
w=np.arange(0, n+1, dtype=np.object)
result2 = (w**2*np.cumsum(w)).sum()
print('Fast method:', time.time()-start)
Your fast solution is significantly faster than the slow one, and not only for large N: already at N=100 it is about 8 times faster:
start=time.time()
for i in range(100):
    result1 = summation(0, n, mysum)
print('Slow method:', time.time()-start)

# method #2
start=time.time()
for i in range(100):
    w=np.arange(0, n+1, dtype=np.object)
    result2 = (w**2*np.cumsum(w)).sum()
print('Fast method:', time.time()-start)
Slow method: 0.06906533241271973
Fast method: 0.008007287979125977
EDIT: An even faster method (by KellyBundy, the Pumpkin) uses pure Python. It turns out NumPy has no advantage here, because it has no vectorized code for object arrays.
# method #3
import itertools
start=time.time()
for i in range(100):
    result3 = sum(x*x * ysum for x, ysum in enumerate(itertools.accumulate(range(n+1))))
print('Faster, pure python:', (time.time()-start))
Faster, pure python: 0.0009944438934326172
EDIT2: Forss noticed that the NumPy fast method can be optimized by using x*x instead of x**2. For N > 200 it is faster than the pure Python method; for N < 200 it is slower (the exact boundary may depend on the machine; on mine it was 200, so it is best to check yourself):
# method #4
start=time.time()
for i in range(100):
    w = np.arange(0, n+1, dtype=np.object)
    result2 = (w*w*np.cumsum(w)).sum()
print('Fast method x*x:', time.time()-start)
Comparing Python with WolframAlpha like that is unfair, since Wolfram will simplify the equation before computing.
Fortunately, the Python ecosystem knows no limits, so you can use SymPy:
from sympy import summation
from sympy import symbols
n, x, y = symbols("n,x,y")
eq = summation(x ** 2 * summation(y, (y, 0, x)), (x, 0, n))
eq.evalf(subs={"n": 1000})
It will compute the expected result almost instantly: 100375416791650. This is because SymPy simplifies the equation for you, just like Wolfram does. See the value of eq:
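The exact printed form of eq may vary between SymPy versions, but it is equivalent to the closed-form polynomial derived above:

print(eq)
# n**5/10 + 3*n**4/8 + 5*n**3/12 + n**2/8 - n/60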
@Kelly Bundy's answer is awesome, but if you are like me and use a calculator to compute 2 + 2, then you will love SymPy ❤. As you can see, it gets you to the same result with just 3 lines of code and is a solution that would also work for other, more complex cases.
In a comment, you mention that it's really f(x) and g(y) instead of x^2 and y. If you only need an approximation to that sum, you can pretend the sums are midpoint Riemann sums, so that your sum is approximated by the double integral ∫_{-0.5}^{n+0.5} f(x) ∫_{-0.5}^{x+0.5} g(y) dy dx.
With your original f(x)=x^2 and g(y)=y, this simplifies to n^5/10 + 3n^4/8 + n^3/2 + 5n^2/16 + 3n/32 + 1/160, which differs from the correct result by n^3/12 + 3n^2/16 + 53n/480 + 1/160.
Based on this, I suspect that (actual - integral)/actual would be max(f'', g'') * O(n^-2), but I wasn't able to prove it.
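A numerical check of this approximation (a sketch assuming SciPy is available; dblquad integrates y first, then x):

from scipy import integrate

n = 100
f = lambda x: x**2
g = lambda y: y

approx, _ = integrate.dblquad(lambda y, x: f(x) * g(y),
                              -0.5, n + 0.5,        # outer limits for x
                              lambda x: -0.5,       # inner lower limit for y
                              lambda x: x + 0.5)    # inner upper limit for y

exact = ((((12 * n + 45) * n + 50) * n + 15) * n - 2) * n // 120
print(approx, exact)  # relative error is roughly 8e-5 at n=100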
All the answers use math to simplify the sum or implement the loop in Python trying to be CPU-optimal, but they are not memory-optimal.
Here is a naive implementation without any mathematical simplification which is memory-efficient:
def function5():
    inner_sum = float()
    result = float()
    for x in range(0, n + 1):
        inner_sum += x
        result += x ** 2 * inner_sum
    return result
It is quite slow with respect to the other solutions by dankal444:
method 2 | 31 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
method 3 | 116 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
method 4 | 91 µs ± 356 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
function 5 | 217 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
By the way, if you jit the function with numba (there may be better options):
from numba import jit
function5 = jit(nopython=True)(function5)
you get
59.8 ns ± 0.209 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I am looking for an "optimal" way to compute all pairwise products of a given vector's elements. If the vector is of size N, the output will be a vector of size N * (N + 1) // 2 and contain x[i] * x[j] values for all (i, j) pairs with i <= j. The naive way to compute this is as follows:
import numpy as np
def get_pairwise_products_naive(vec: np.ndarray):
    k, size = 0, vec.size
    output = np.empty(size * (size + 1) // 2)
    for i in range(size):
        for j in range(i, size):
            output[k] = vec[i] * vec[j]
            k += 1
    return output
Desiderata:
Minimize extra memory allocations/usage: Directly write to the output buffer if possible.
Use vectorized NumPy routines instead of explicit loops.
Avoid extra (unnecessary) calculations.
I have been playing with routines such as outer, triu_indices and einsum as well as some indexing/view tricks, but haven't been able to find a solution that fits the above desiderata.
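For concreteness, a triu_indices-based attempt (vectorized, but it allocates two index arrays of length N*(N+1)//2, so it falls short of the memory desideratum) could look like this:

def get_pairwise_products_triu(vec: np.ndarray):
    i, j = np.triu_indices(vec.size)  # all (i, j) pairs with i <= j, in row-major order
    return vec[i] * vec[j]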
Approach #1
For a vectorized NumPy solution, you can use masking after getting all the pairwise multiplications with an outer multiplication, like so:
def pairwise_multiply_masking(a):
    return (a[:,None]*a)[~np.tri(len(a), k=-1, dtype=bool)]
Approach #2
For really big input 1D arrays, we might want to resort to an iterative slicing method that uses a single loop:
def pairwise_multiply_iterative_slicing(a):
    n = len(a)
    N = (n*(n+1))//2
    out = np.empty(N, dtype=a.dtype)
    c = np.r_[0, np.arange(n, 0, -1)].cumsum()
    for ii, (i, j) in enumerate(zip(c[:-1], c[1:])):
        out[i:j] = a[ii:] * a[ii]
    return out
Benchmarking
We will include pairwise_products and pairwise_products_numba from @orlp's solution in the setup.
Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.
import benchit
funcs = [pairwise_multiply_masking, pairwise_multiply_iterative_slicing, pairwise_products_numba, pairwise_products]
in_ = [np.random.rand(n) for n in [10,50,100,200,500,1000,5000]]
t = benchit.timings(funcs, in_)
t.plot(logx=True, save='timings.png')
t.speedups(-1).plot(logx=True, logy=False, save='speedups.png')
Results (timings and speedups over pairwise_products) -
As can be seen from the plot trends, for really large arrays the slicing-based method starts winning; otherwise the vectorized one does a good job.
Suggestions
We can also look into numexpr for performing the outer multiplications more efficiently for large arrays.
I would probably compute M = v^T v and then flatten the lower or upper triangular portion of this matrix.
def pairwise_products(v: np.ndarray):
    assert len(v.shape) == 1
    n = v.shape[0]
    m = v.reshape(n, 1) @ v.reshape(1, n)
    return m[np.tril_indices_from(m)].ravel()
I would also like to mention numba, which would make your 'naive' approach most likely faster than this one.
import numba

@numba.njit
def pairwise_products_numba(vec: np.ndarray):
    k, size = 0, vec.size
    output = np.empty(size * (size + 1) // 2)
    for i in range(size):
        for j in range(i, size):
            output[k] = vec[i] * vec[j]
            k += 1
    return output
Just testing the above, pairwise_products(np.arange(5000)) takes ~0.3 sec whereas the numba version takes ~0.05 sec (ignoring the first run, which is used to just-in-time compile the function).
You could also parallelize this algorithm. If it is possible to allocate a large enough array only once and overwrite it afterwards (a smaller view on this array costs almost nothing), larger speedups can be achieved.
Example
@numba.njit(parallel=True)
def pairwise_products_numba_2_with_allocation(vec):
    k, size = 0, vec.size
    k_vec = np.empty(vec.size, dtype=np.int64)
    output = np.empty(size * (size + 1) // 2)

    # precalculate the indices
    for i in range(size):
        k_vec[i] = k
        k += (size - i)

    for i in numba.prange(size):
        k = k_vec[i]
        for j in range(size - i):
            output[k + j] = vec[i] * vec[j + i]
    return output

@numba.njit(parallel=True)
def pairwise_products_numba_2_without_allocation(vec, output):
    k, size = 0, vec.size
    k_vec = np.empty(vec.size, dtype=np.int64)

    # precalculate the indices
    for i in range(size):
        k_vec[i] = k
        k += (size - i)

    for i in numba.prange(size):
        k = k_vec[i]
        for j in range(size - i):
            output[k + j] = vec[i] * vec[j + i]
    return output
Timings
A=np.arange(5000)
k, size = 0, A.size
output = np.empty(size * (size + 1) // 2)
%timeit res_1=pairwise_products_numba_2_without_allocation(A,output)
#7.84 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=pairwise_products_numba_2_with_allocation(A)
#16.9 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=pairwise_products_numba(A)  # @orlp
#43.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am trying to write functions which emulate math.sin and math.tan but, instead of using the math library, performing the calculation using a series expansion.
The formulae are from Mathematics SE, How would you calculate the Tangent without a calculator?:
sin(x) = x − x^3/3! + x^5/5! −...
tan(x) = sin(x) / √(1 − sin(x)^2)
This is my attempt, but I could not figure out how to perform the sign flipping + / - / + / ... part of the series expansion for sin:
from math import factorial
res = 0
for i in [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]:
    res += 1**i/factorial(i)
print(res) # 1.1752011936438016
The result is not correct because I have not applied a + / - switch. I could add an if / else clause but this seems messy. Is there a better way?
Note: This question is an embellished version of a now-deleted question that was posted yesterday by @Lana.
You can avoid recalculating x**n and the factorial at each step by calculating the next term of the sum using the previous one:
def sin2(x, n=20):
    curr = x
    res = curr
    for i in range(2, n, 2):
        curr *= - x**2/(i*(i+1))
        res += curr
    return res
Compared to jpp's version, it's about twice as fast:
from math import factorial
def sin(x, n=20):
    return sum(x**j/factorial(j)*(1 if i%2==0 else -1)
               for i, j in enumerate(range(1, n, 2)))
%timeit sin(0.7)
# 100000 loops, best of 3: 8.52 µs per loop
%timeit sin2(0.7)
# 100000 loops, best of 3: 4.54 µs per loop
And it can get a bit faster if we calculate - x**2 once and for all:
def sin3(x, n=20):
    curr = x
    res = 0
    minus_x_squared = - x**2
    for i in range(2, n, 2):
        res += curr
        curr *= minus_x_squared/(i*(i+1))
    return res
%timeit sin2(0.7)
# 100000 loops, best of 3: 4.6 µs per loop
%timeit sin3(0.7)
# 100000 loops, best of 3: 3.54 µs per loop
You are close. Below is one way using sum with enumerate for your series expansion.
enumerate works by taking each value of an iterable and attaching an index, i.e. 0 for the first item, 1 for the second item, etc. Then we only need to test whether the index is even or odd and use a ternary statement.
In addition, you can use range instead of listing the odd numbers required in your expansion.
from math import factorial
def sin(x, n=20):
    return sum(x**j/factorial(j)*(1 if i%2==0 else -1)
               for i, j in enumerate(range(1, n, 2)))

def tan(x):
    return sin(x) / (1-(sin(x))**2)**0.5

print(tan(1.2)) # 2.572151622126318
You can avoid the need for a ternary statement and enumerate altogether:
def sin(x, n=20):
    return sum((-1)**i * x**(2*i+1) / factorial(2*i+1) for i in range(n))
If you write out the first few terms by hand (i=0 gives x, i=1 gives -x^3/3!, i=2 gives +x^5/5!), the equivalence will become clear.
Notes:
The sign of the tan function is only correct for 1st and 4th quadrants. This is consistent with the formulae you have provided. You can perform a trivial transformation to the input to account for this.
You can improve accuracy by increasing parameter n.
You can also calculate factorial without a library, but I'll leave that as an exercise.
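A quick comparison of the series versions against the math module (using the sin and tan defined above):

import math

for x in (0.1, 0.7, 1.2):
    print(x, sin(x) - math.sin(x), tan(x) - math.tan(x))
# differences are at the level of floating-point rounding for these inputs with n=20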
I'm experimenting with NumPy to see how and where it is faster than using generic list comprehensions in Python. Here's a standard coding question I'm using for this experiment.
Find the sum of all the multiples of 3 or 5 below 1000000.
I have written three functions to compute this number.
def fA(M):
    sum = 0
    for x in range(M):
        if x % 3 == 0 or x % 5 == 0:
            sum += x
    return sum

def fB(M):
    multiples_3 = range(0, M, 3)
    multiples_5 = range(0, M, 5)
    multiples_15 = range(0, M, 15)
    return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)

def fC(M):
    arr = np.arange(M)
    return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
I first did a quick sanity check to see that the three functions produced the same answer.
I then used timeit to compare the runtimes for the three functions.
%timeit -n 100 fA(1000000)
100 loops, best of 3: 182 ms per loop
%timeit -n 100 fB(1000000)
100 loops, best of 3: 14.4 ms per loop
%timeit -n 100 fC(1000000)
100 loops, best of 3: 44 ms per loop
It's no surprise that fA is the slowest. But why is fB so much better than fC? Is there a better way to compute this answer using NumPy?
I don't think size is an issue here. In fact, if I change the 1e6 to 1e9, fC becomes even slower when compared to fB.
fB is so much faster than fC because fC is not the NumPy equivalent of fB. fC is the NumPy equivalent of fA. This is the NumPy equivalent of fB:
def fD(M):
    multiples_3 = np.arange(0, M, 3)
    multiples_5 = np.arange(0, M, 5)
    multiples_15 = np.arange(0, M, 15)
    return multiples_3.sum() + multiples_5.sum() - multiples_15.sum()
It runs way faster:
In [4]: timeit fB(1000000)
100 loops, best of 3: 9.96 ms per loop
In [5]: timeit fD(1000000)
1000 loops, best of 3: 637 µs per loop
In fB you are constructing ranges containing exactly the multiples you want. Their sizes decrease going from step 3 to 5 to 15, so each takes less time to construct than the one before; after they are constructed you only need to take the sums and do some arithmetic.
In fC you are constructing a 100000 element array; the size isn't really the issue as much as the two modulo comparisons, which must look at every single element of the array. This takes the lion's share of the execution time (about 90%) for fC.
You're only really using NumPy there to generate an array. You'd see a much bigger difference if you were performing operations on arrays as opposed to lists or tuples. With regards to this particular problem, take a look at the function fD in the code below, which just calculates how many multiples there should be in each range and then calculates their sum, rather than generating the array. Actually, if you run the snippet below, you'll see how the times change as a function of M. Also, fC breaks down for M >= 100000; I couldn't tell you why.
import numpy as np
from time import time

def fA(M):
    sum = 0
    for x in range(M):
        if x % 3 == 0 or x % 5 == 0:
            sum += x
    return sum

def fB(M):
    multiples_3 = range(0, M, 3)
    multiples_5 = range(0, M, 5)
    multiples_15 = range(0, M, 15)
    return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)

def fC(M):
    arr = np.arange(M)
    return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])

def fD(M):
    return sum_mult(M, 3) + sum_mult(M, 5) - sum_mult(M, 15)

def sum_mult(M, n):
    instances = (M - 1) // n
    check = len(range(n, M, n))  # sanity check: should equal instances
    return (n * instances * (instances + 1)) // 2

for x in range(5, 20):
    print("*" * 20)
    M = 2**x
    print(M)
    answers = []
    T = []
    for f in (fA, fB, fC, fD):
        ts = time()
        answers.append(f(M))
        for i in range(20):
            f(M)
        T.append(time() - ts)
    if not all([x == answers[0] for x in answers]):
        print("Warning! Answers do not match!", answers)
    print(T)
If I have a list of numbers such as l = [3,5,3,6,47,89], I can calculate the minimum, maximum and average using the following Python code:
minimum = min(l)
maximum = max(l)
avg = sum(l) / len(l)
Since all of these iterate over the entire list, this is slow for large lists, and it takes a fair amount of code. Is there any Python module which can calculate all these values together?
Cython function:
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def minmaxAvg(list x):
    cdef int i
    cdef int _min, _max, total
    _min = x[0]
    _max = x[0]
    total = 0
    for i in x:
        if i < _min: _min = i
        elif i > _max: _max = i
        total += i
    return _min, _max, total/len(x)
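The Cython function has to be compiled before it can be called from Python. One low-friction way, assuming the code above is saved as minmaxavg.pyx (the filename is only for illustration), is pyximport:

import pyximport
pyximport.install()  # compiles .pyx modules automatically on import

from minmaxavg import minmaxAvg
print(minmaxAvg([3, 5, 3, 6, 47, 89]))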
pure python function to compare against:
def builtinfuncs(x):
    a = min(x)
    b = max(x)
    avg = sum(x) / len(x)
    return a, b, avg
In [16]: x = [random.randint(0,1000) for _ in range(10000)]
In [17]: %timeit minmaxAvg(x)
10000 loops, best of 3: 34 µs per loop
In [18]: %timeit builtinfuncs(x)
1000 loops, best of 3: 460 µs per loop
Disclaimer:
- Speed result from cython will be dependent on computer hardware.
- Not as flexible and foolproof as using builtins. You would have to change the function to handle anything but integers for example.
- Before going down this path, you should ask yourself if this operation really is a big bottleneck in your application. It's probably not.
If you have pandas installed, you can do something like this:
import numpy as np
import pandas
s = pandas.Series(np.random.normal(size=37))
stats = s.describe()
stats will be another Series that behaves like a dictionary:
print(stats)
count 37.000000
mean 0.072138
std 0.932000
min -1.267888
25% -0.688728
50% -0.048624
75% 0.784244
max 2.501713
dtype: float64
stats['max']
2.501713
...etc. However, I don't recommend this unless you're striving simply for concise code. Here's why:
%%timeit
stats = s.describe()
# 100 loops, best of 3: 1.44 ms per loop
%%timeit
mymin = min(s)
mymax = max(s)
myavg = sum(s)/len(s)
# 10000 loops, best of 3: 89.5 µs per loop
I just can't imagine that you'll be able to squeeze any more performance out of the built-in functions with your own implementations (barring some cython voodoo, maybe).
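For reference, here is what a single-pass pure-Python version might look like (my sketch, not from the answer above); in practice it tends to be slower than the three separate builtin calls, because min, max and sum each loop in C while this loop runs in the interpreter:

def one_pass(values):
    lo = hi = values[0]
    total = 0
    for v in values:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
        total += v
    return lo, hi, total / len(values)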