Here is an example:
4 digits
The first and second digits' range is 0 ~ 4 (five possible values each).
The third and fourth digits' range is 0 ~ 5 (six possible values each).
So 0000, 0040, 0111, 4455 are OK, but 5555, 4555, 4466 are not.
What I want is to find the ordinal of 2345 (indexed from zero).
For example, 0001 is "1" in ordinal. Likewise, 0010 is "6".
It could be calculated by,
(5*6*6*1)*2 + (6*6*1)*3 + (6*1)*4 + (1)*5 = 497
I made a function in Python
import numpy as np
def find_real_index_of_state(state, num_cnt_in_each_digit):
    """
    Parameters
    ==========
    state (str)
    num_cnt_in_each_digit (list): the number of possible values in each digit
    """
    num_of_digit = len(state)
    digit_list = [int(i) for i in state]
    num_cnt_in_each_digit.append(1)
    real_index = 0
    for i in range(num_of_digit):
        real_index += np.product(num_cnt_in_each_digit[num_of_digit-i:]) * digit_list[num_of_digit-i-1]
    return real_index
find_real_index_of_state("2345", [5,5,6,6])
Its result is 497, the same as the hand calculation.
The problem is that this function is really slow. I need a much faster version, but this is the best I can think of.
I really need your advice on improving its performance (e.g. vectorization, etc.).
Thanks
Hope I understood you correctly.
The first thing I notice is that you do not need to recalculate everything on each loop iteration: you compute (5*6*6*1), (6*6*1), (6*1), (1) individually, when each factor only needs to be calculated once.
def find_real_index_of_state(state, num_cnt_in_each_digit):
    factor = 1
    total = 0
    for digit, num_cnt in zip(reversed(state), reversed(num_cnt_in_each_digit)):
        digit = int(digit)
        total += digit * factor
        factor *= num_cnt
    return total
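A quick check against the original example (the running factor takes the place values 1, 6, 36, 180 as the loop walks from the rightmost digit to the left):
find_real_index_of_state("2345", [5, 5, 6, 6])   # -> 497, matching the hand calculation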
Here's one vectorized approach making use of np.cumprod to perform the iterative np.product and then np.dot for the sum-reductions -
def real_index_vectorized(n, count):
    num = [int(d) for d in str(n)]
    # Or np.array([n]).view((str,1)).astype(int)  # thanks to @Eric
    # Or (int(n)//(10**np.arange(len(n)-1,-1,-1)))%10
    return np.dot(np.cumprod(count[:0:-1]), num[-2::-1]) + num[-1]
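To see what the expression computes for the original example, here is a trace of the pieces with count = [5, 5, 6, 6] and n = "2345" (values reproduced from the code above, nothing new assumed):
import numpy as np

count = [5, 5, 6, 6]
num = [2, 3, 4, 5]
np.cumprod(count[:0:-1])                       # array([  6,  36, 180]) -- place values of all but the last digit
np.dot(np.cumprod(count[:0:-1]), num[-2::-1])  # 4*6 + 3*36 + 2*180 = 492
# adding the last digit (5) gives 497, as before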
Runtime test -
1) Original sample:
In [66]: %timeit find_real_index_of_state("2345",[5,5,6,6])
100000 loops, best of 3: 14.1 µs per loop
In [67]: %timeit real_index_vectorized("2345",[5,5,6,6])
100000 loops, best of 3: 8.19 µs per loop
2) A bit bigger sample:
In [69]: %timeit find_real_index_of_state("234532321321323",[5,5,6,6,3,5,4,6,4,5,2,3,5,3,3])
10000 loops, best of 3: 52.7 µs per loop
In [70]: %timeit real_index_vectorized("234532321321323",[5,5,6,6,3,5,4,6,4,5,2,3,5,3,3])
100000 loops, best of 3: 12.5 µs per loop
Being a vectorized solution, it would scale well when it competes against a loopy version that has a good number of loop iterations.
For performance, I propose you vectorize your states first:
base=np.array([5*6*6,6*6,6,1])
states=np.array(["2345","0010"])
numbers=np.frombuffer(states,np.uint32).reshape(-1,4)-48 # faster
ordinals=(base*numbers).sum(1)
#array([497, 6], dtype=int64)
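If the digit counts ever change, the hard-coded base can be derived from the count list with the same cumprod trick; a minimal sketch (the variable names here are mine):
import numpy as np

counts = np.array([5, 5, 6, 6])
# place value of each digit = product of the counts of all digits to its right
base = np.r_[np.cumprod(counts[:0:-1])[::-1], 1]    # array([180,  36,   6,   1])
states = ["2345", "0010"]
numbers = np.array([[int(c) for c in s] for s in states])
ordinals = (base * numbers).sum(1)                  # array([497,   6])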
I have a long list, let's call it y. len(y) = 500. I'm not including y in the code on purpose.
For each item in y, I want to find the average value of the item and its 5 preceding values. I run into a problem when I get to the last item in the list, because I need to use 'a+1' in one of the lines below.
a = 0
SMAlist = []
for each_item in y:
    if a > 4 and a < ((len(y))-1): # finding my averages begin at 6th item
        b = (y[a-5:a+1]) # this line doesn't work for the last item in y
        SMAsix = round((sum(b)/6),2)
        SMAlist.append(SMAsix)
    if a > ((len(y))-2): # this line seems unnecessary. How can I avoid it?
        b = (y[-6:-1]+[y[a]]) # Should I just use negative values in general?
        SMAsix = round((sum(b)/6),2)
        SMAlist.append(SMAsix)
    a = a+1
A little warning with regard to @Vivek Kalyanarangan's "zipper" solution:
For longer sequences it is vulnerable to loss of significance. Let's use float32 for clarity:
>>> y = (1000 + np.sin(np.arange(1000000))).astype(np.float32)
>>> window=6
>>>
# naive zipper solution
>>> s=np.insert(np.cumsum(np.array(y)), 0, [0])
>>> output = (s[window :] - s[:-window]) * (1. / window)
# towards the end the result is clearly wrong
>>> print(output[-10:])
[1024. 1024. 1024. 1024. 1024. 1024. 1024. 1024. 1024. 1024.]
>>>
# this can be alleviated by first taking the difference and then summing
>>> np.cumsum(np.r_[y[:window].sum(), y[window:]-y[:-window]])/window
array([1000.02936, 999.98285, 999.9521 , ..., 1000.0247 , 1000.05304,
1000.0367 ], dtype=float32)
>>>
# compare to last value calculated directly for reference
>>> np.mean(y[-6:])
1000.03217
To further reduce the error one could chunk y and anchor the cumsum every so-and-so many terms without losing much speed.
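A minimal sketch of that anchoring idea (the chunk size is an arbitrary choice of mine, not something from the original answer):
import numpy as np

def chunked_moving_average(y, window, chunk=10000):
    """Moving average whose cumulative sum is restarted every `chunk`
    windows, so float32 round-off cannot grow across the whole array."""
    y = np.asarray(y)
    n = len(y) - window + 1                      # number of full windows
    out = np.empty(n, dtype=y.dtype)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        seg = y[start:stop + window - 1]         # data covering these windows
        s = np.cumsum(np.insert(seg, 0, 0))      # fresh (anchored) cumsum per chunk
        out[start:stop] = (s[window:] - s[:-window]) / window
    return out
With this, chunked_moving_average(y, 6)[-1] should stay close to np.mean(y[-6:]) even for long float32 inputs, since round-off can only accumulate within one chunk.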
Option 1: Pandas
import pandas as pd
y = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43]
series = pd.Series(y)
print(series.rolling(window=6, center=True).mean().dropna().tolist())
Option 2: Numpy
import numpy as np
window=6
s=np.insert(np.cumsum(np.array(y)), 0, [0])
output = (s[window :] - s[:-window]) * (1. / window)
print(list(output))
Output
[11271.731666666667, 11850.111666666666, 13099.355, 14056.930000000002, 14725.218333333332]
Timings (subject to size of data)
# Pandas
59.5 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Numpy
19 µs ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# @PatrickArtner's solution
16.1 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Update
Code to check the timings (works in a Jupyter notebook):
%%timeit
import pandas as pd
y = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43]
series = pd.Series(y)
series.rolling(window=6, center=True).mean().dropna().tolist()  # assumed timed statement (from Option 1)
You chunkify your list and build averages over the chunks. The linked answer uses full chunks; I adapted it to build incremental (overlapping) ones:
Sliding avg via list comprehension:
# Inspiration for a "full" chunk I adapted: https://stackoverflow.com/a/312464/7505395
def overlappingChunks(l, n):
    """Yield overlapping n-sized chunks from l."""
    for i in range(0, len(l)):
        yield l[i:i + n]

somenums = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,
            18491.18,16908,15266.43]

# avg over sublist-lengths
slideAvg5 = [round(sum(part)/(len(part)*1.0), 2) for part in overlappingChunks(somenums, 6)]

print(slideAvg5)
Output:
[11271.73, 11850.11, 13099.36, 14056.93, 14725.22, 15343.27, 16135.52,
16888.54, 16087.22, 15266.43]
I was going to partition the list with an incremental range(len(yourlist)) before averaging the partitions, but full partitioning was already solved here: How do you split a list into evenly sized chunks? I adapted it to yield incremental chunks and applied it to your problem.
Which partitions are used for averaging?
explained = {(idx, tuple(part)): round(sum(part)/(len(part)*1.0), 2)
             for idx, part in enumerate(overlappingChunks(somenums, 6))}

import pprint
pprint.pprint(explained)
Output (reformatted):
# Input:
# [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43]
# Index   partitioned part of the input list                                  avg
{(0, (10406.19, 10995.72, 11162.55, 11256.7, 11634.98, 12174.25)) : 11271.73,
(1, (10995.72, 11162.55, 11256.7, 11634.98, 12174.25, 13876.47)) : 11850.11,
(2, (11162.55, 11256.7, 11634.98, 12174.25, 13876.47, 18491.18)) : 13099.36,
(3, (11256.7, 11634.98, 12174.25, 13876.47, 18491.18, 16908)) : 14056.93,
(4, (11634.98, 12174.25, 13876.47, 18491.18, 16908, 15266.43)) : 14725.22,
(5, (12174.25, 13876.47, 18491.18, 16908, 15266.43)) : 15343.27,
(6, (13876.47, 18491.18, 16908, 15266.43)) : 16135.52,
(7, (18491.18, 16908, 15266.43)) : 16888.54,
(8, (16908, 15266.43)) : 16087.22,
(9, (15266.43,)) : 15266.43}
Is there a faster way to write the "compute_optimal_weights" function in Python? I run it hundreds of millions of times, so any speed increase would help. The arguments of the function are different each time I run it.
import random
import time

c1 = 0.25
c2 = 0.67

def compute_optimal_weights(input_prices):
    input_weights_optimal = {}
    for i in input_prices:
        price = input_prices[i]
        input_weights_optimal[i] = c2 / sum([(price/n) ** c1 for n in input_prices.values()])
    return input_weights_optimal

input_sellers_ID = range(10)
input_prices = {}
for i in input_sellers_ID:
    input_prices[i] = random.uniform(0,1)

t0 = time.time()
for i in xrange(1000000):
    compute_optimal_weights(input_prices)
t1 = time.time()
print "old time", (t1 - t0)
The number of elements in the list and dictionary varies, but on average there are about 10 elements. The keys in input_prices are the same across all calls but the values change, so the same key will have different values over different runs.
Using a little bit of math, you can calculate part of your sum_price_ratio_scaled once, as a constant, before the loop and speed up your program by ~80% (for the average input size of 10).
Optimized Implementation (Python 3):
def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result
Edit, in response to this answer: While using numpy will prove more performant with massive data sets, given that "on average there are about 10 elements" in your input_sellers_ID list, I doubt that this approach is worth its own weight for your particular application.
Although it might be tempting to leverage the terseness of generator expressions and dictionary comprehensions, I noticed when running on my machine that the best performance was obtained by using regular for-in loops and avoiding function calls like sum(...). For the sake of completeness, though, here is what the above implementation would look like in a more 'pythonic' style:
def compute_optimal_weights(ids, prices):
    scaled_sum = sum(prices[i] ** -0.25 for i in ids)
    return {i: 0.67 * (prices[i] ** -0.25) / scaled_sum for i in ids}
Reasoning / Math:
Based on your posted algorithm, you are trying to create a dictionary whose values are given by the function f(i) below, where i is one of the elements in your input_sellers_ID list.
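In symbols (reconstructed from the posted code, with c1 = 0.25 and c2 = 0.67):
f(i) = c2 / sum_j (prices[i] / prices[j]) ** c1
     = c2 / (prices[i] ** c1 * sum_j prices[j] ** -c1)
     = c2 * prices[i] ** -c1 / sum_j prices[j] ** -c1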
When you first write out the formula for f(i), it appears as though prices[i] must be recalculated at every step of the summation, which is costly. Simplifying the expression with the rules of exponents, however, shows that the only summation needed to determine f(i) is independent of i (only the index j is ever used inside it), so that sum is a constant and can be computed outside of the loop which sets the dictionary values.
Note that above I refer to input_prices as prices and input_sellers_ID as ids.
Performance Profile (~80% speed improvement on my machine, size 10):
import time
import random

def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result

def compute_optimal_weights_old(input_sellers_ID, input_prices):
    input_weights_optimal = {}
    for i in input_sellers_ID:
        sum_price_ratio_scaled = 0
        for j in input_sellers_ID:
            price_ratio = input_prices[i] / input_prices[j]
            scaled_price_ratio = price_ratio ** c1
            sum_price_ratio_scaled += scaled_price_ratio
        input_weights_optimal[i] = c2 / sum_price_ratio_scaled
    return input_weights_optimal

c1 = 0.25
c2 = 0.67
input_sellers_ID = range(10)
input_prices = {i: random.uniform(0,1) for i in input_sellers_ID}

start = time.clock()
for _ in range(1000000):
    compute_optimal_weights_old(input_sellers_ID, input_prices) and None
old_time = time.clock() - start

start = time.clock()
for _ in range(1000000):
    compute_optimal_weights(input_sellers_ID, input_prices) and None
new_time = time.clock() - start

print('Old:', compute_optimal_weights_old(input_sellers_ID, input_prices))
print('New:', compute_optimal_weights(input_sellers_ID, input_prices))
print('New algorithm is {:.2%} faster.'.format(1 - new_time / old_time))
I believe we could speed up the function by factoring the sum out of the loop. Let a = price, b = n, and c = c1; if my maths are not wrong (e.g. (5/6)**3 == 5**3 / 6**3):
(5./6.)**2 + (5./4.)**2
==
5**2 / 6.**2 + 5**2 / 4.**2
==
5**2 * (1/6.**2 + 1/4.**2)
With variables:
sum( (a / b) ** c for each b)
==
sum( a**c * (1/b) ** c for each b)
==
a**c * sum((1./b)**c for each b)
The second factor, the sum, is constant and can be computed once, which leaves:
Faster implementation - Raw Python
Using generators and dict-comprehension:
def compute_optimal_weights(input_prices):
    sconst = sum(1/w**c1 for w in input_prices.values())
    return {k: c2 / (v**c1 * sconst) for k, v in input_prices.items()}
NOTE: if you are using Python2 replace .values() and .items() with .itervalues() and .iteritems() for extra speedup (few ms with large lists).
Even Faster - Numpy
Additionally, if you don't care that much about the dictionary and just want the values, you could speed it up using numpy (for large inputs >100):
def compute_optimal_weights_np(input_prices):
    # list(...) so this also works on Python 3, where .values() is a view
    data = np.asarray(list(input_prices.values())) ** c1
    return c2 / (data * np.sum(1./data))
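If you still need the result keyed by seller ID, a small wrapper (my addition, not part of the benchmarked code) can zip the keys back on; .keys() and .values() iterate in the same order as long as the dict is not modified in between:
def compute_optimal_weights_np_dict(input_prices):
    # pair each key with the corresponding weight from the vectorized version
    weights = compute_optimal_weights_np(input_prices)
    return dict(zip(input_prices.keys(), weights))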
A few timings for different input sizes:
N = 10 inputs:
MINE: 100000 loops, best of 3: 6.02 µs per loop
NUMPY: 100000 loops, best of 3: 10.6 µs per loop
YOURS: 10000 loops, best of 3: 23.8 µs per loop
N = 100 inputs:
MINE: 10000 loops, best of 3: 49.1 µs per loop
NUMPY: 10000 loops, best of 3: 22.6 µs per loop
YOURS: 1000 loops, best of 3: 1.86 ms per loop
N = 1000 inputs:
MINE: 1000 loops, best of 3: 458 µs per loop
NUMPY: 10000 loops, best of 3: 121 µs per loop
YOURS: 10 loops, best of 3: 173 ms per loop
N = 100000 inputs:
MINE: 10 loops, best of 3: 54.2 ms per loop
NUMPY: 100 loops, best of 3: 11.1 ms per loop
YOURS: didn't finish in a couple of minutes
Both options here are considerably faster than the one presented in the question. The benefit of numpy, provided you can supply the input in a consistent form (an array instead of a dictionary), becomes apparent as the size grows.
What I need is an array of all the indices at which my data array (filled with zeros and ones) steps from zero to one. I need a very quick solution, because I have to work with millions of arrays, each hundreds of millions of elements long. It will be running in a computing centre. For instance:
data_array = np.array([1,1,0,1,1,1,0,0,0,1,1,1,0,1,1,0])
result = [3,9,13]
try this:
In [23]: np.where(np.diff(a)==1)[0] + 1
Out[23]: array([ 3, 9, 13], dtype=int64)
Timing for 100M element array:
In [46]: a = np.random.choice([0,1], 10**8)
In [47]: %timeit np.nonzero((a[1:] - a[:-1]) == 1)[0] + 1
1 loop, best of 3: 1.46 s per loop
In [48]: %timeit np.where(np.diff(a)==1)[0] + 1
1 loop, best of 3: 1.64 s per loop
Here's the procedure:
Compute the diff of the array
Find the index where the diff == 1
Add 1 to the results (b/c len(diff) = len(orig) - 1)
So try this:
index = numpy.nonzero((data_array[1:] - data_array[:-1]) == 1)[0] + 1
index
# [3, 9, 13]
Well, thanks a lot to all of you. The solution with nonzero is probably better for me, because I need to know the steps from 0->1 and also 1->0, and finally calculate the differences. So this is my solution. Any other advice appreciated. :)
i_in = np.nonzero( (data_array[1:] - data_array[:-1]) == 1 )[0] +1
i_out = np.nonzero( (data_array[1:] - data_array[:-1]) == -1 )[0] +1
i_return_in_time = (i_in - i_out[:i_in.size] )
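For reference, both index sets can also be obtained from a single diff pass, so the shifted difference is only built once (a sketch using the example data_array from the question):
import numpy as np

data_array = np.array([1,1,0,1,1,1,0,0,0,1,1,1,0,1,1,0])
d = np.diff(data_array)              # +1 at 0 -> 1 steps, -1 at 1 -> 0 steps
i_in  = np.flatnonzero(d == 1) + 1   # array([ 3,  9, 13])
i_out = np.flatnonzero(d == -1) + 1  # array([ 2,  6, 12, 15])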
Since it's an array filled with 0s and 1s, you can benefit from simply comparing the one-shifted versions rather than performing an arithmetic operation on them. That directly gives a boolean array, which can be fed to np.flatnonzero to get the indices and the final output.
Thus, we would have an implementation like so -
np.flatnonzero(data_array[1:] > data_array[:-1])+1
Runtime test -
In [26]: a = np.random.choice([0,1], 10**8)
In [27]: %timeit np.nonzero((a[1:] - a[:-1]) == 1)[0] + 1
1 loop, best of 3: 1.91 s per loop
In [28]: %timeit np.where(np.diff(a)==1)[0] + 1
1 loop, best of 3: 1.91 s per loop
In [29]: %timeit np.flatnonzero(a[1:] > a[:-1])+1
1 loop, best of 3: 954 ms per loop
I recently stumbled on an interesting problem when computing the Fourier transform of a signal with np.fft.fft. The reproduced problem is:
%timeit np.fft.fft(np.random.rand(59601))
1 loops, best of 3: 1.34 s per loop
I found that the amount of time was unexpectedly long. For instance, let's look at some other FFTs, but with slightly longer/shorter signals:
%timeit np.fft.fft(np.random.rand(59600))
100 loops, best of 3: 6.18 ms per loop
%timeit np.fft.fft(np.random.rand(59602))
10 loops, best of 3: 61.3 ms per loop
%timeit np.fft.fft(np.random.rand(59603))
10 loops, best of 3: 113 ms per loop
%timeit np.fft.fft(np.random.rand(59604))
1 loops, best of 3: 182 ms per loop
%timeit np.fft.fft(np.random.rand(59605))
100 loops, best of 3: 6.53 ms per loop
%timeit np.fft.fft(np.random.rand(59606))
1 loops, best of 3: 2.17 s per loop
%timeit np.fft.fft(np.random.rand(59607))
100 loops, best of 3: 8.14 ms per loop
We can observe that the times are now in milliseconds, except for np.random.rand(59606), which takes 2.17 s.
Note, the numpy documentation states:
FFT (Fast Fourier Transform) refers to a way the discrete Fourier Transform (DFT) can be calculated efficiently, by using symmetries in the calculated terms. The symmetry is highest when n is a power of 2, and the transform is therefore most efficient for these sizes.
However, these vectors do not have lengths that are powers of 2. Could someone explain how to avoid/predict the cases where computation times are considerably higher?
As some comments have pointed out, the prime-factor decomposition allows you to predict the time needed to calculate the FFT. The following graph shows your results. Note the logarithmic scale!
The graph is generated with the following code:
import numpy as np
import matplotlib.pyplot as plt

def prime_factors(n):
    """Returns all the prime factors of a positive integer"""
    # from http://stackoverflow.com/questions/23287/largest-prime-factor-of-a-number/412942#412942
    factors = []
    d = 2
    while n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d  # integer division (the original used n /= d)
        d = d + 1
    return factors
times = []
decomp = []
for i in range(59600, 59613):
    print(i)
    t = %timeit -o np.fft.fft(np.random.rand(i))
    times.append(t.best)
    decomp.append(max(prime_factors(i)))

plt.loglog(decomp, times, 'o')
plt.ylabel("best time")
plt.xlabel("largest prime in prime factor decomposition")
plt.title("FFT timings")
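To actually avoid the slow cases rather than just predict them, one common workaround, sketched here assuming SciPy is available, is to zero-pad the signal to the next length whose prime factors are all small; note that zero-padding changes the frequency grid of the result, so it is not always acceptable:
import numpy as np
from scipy.fftpack import next_fast_len   # SciPy >= 0.18

x = np.random.rand(59601)   # 59601 = 3 * 19867, so its largest prime factor is huge -- hence the slow FFT
n = next_fast_len(len(x))   # smallest length >= 59601 whose prime factors are only 2, 3 and 5
X = np.fft.fft(x, n=n)      # zero-pads x to length n before transforming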
I have a matrix m where I would like to calculate the number of zeros.
m=((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
My current code is as follows:
def zeroCount(M):
    # the list of lists is flattened to form a single list, and the number of 0s is counted
    return [item for row in M for item in row].count(0)
Is there any way to do this quicker? Currently, it takes 0.4 s to execute the function 20,000 times on 4-by-4 matrices, where the matrices are as likely to contain zeros as not.
Some possible starting points (which I could not make faster than my code) are these other questions: counting non-zero elements in numpy array, finding the indices of non-zero elements, and counting non-zero elements in iterable.
The fastest so far:
def count_zeros(matrix):
    total = 0
    for row in matrix:
        total += row.count(0)
    return total
For a 2D tuple you could use a generator expression:
def count_zeros_gen(matrix):
    return sum(row.count(0) for row in matrix)
Time comparison:
%timeit [item for row in m for item in row].count(0) # OP
1000000 loops, best of 3: 1.15 µs per loop
%timeit len([item for row in m for item in row if item == 0]) # @thefourtheye
1000000 loops, best of 3: 913 ns per loop
%timeit sum(row.count(0) for row in m)
1000000 loops, best of 3: 1 µs per loop
%timeit count_zeros(m)
1000000 loops, best of 3: 775 ns per loop
For the baseline:
def f(m): pass
%timeit f(m)
10000000 loops, best of 3: 110 ns per loop
Here is my answer.
reduce(lambda a, b: a + b, m).count(0)
Time:
%timeit count_zeros(m) # @J.F. Sebastian
1000000 loops, best of 3: 813 ns per loop
%timeit len([item for row in m for item in row if item == 0]) # @thefourtheye
1000000 loops, best of 3: 974 ns per loop
%timeit reduce(lambda a, b: a + b, m).count(0) # mine
1000000 loops, best of 3: 1.02 us per loop
%timeit countzeros(m) # @frostnational
1000000 loops, best of 3: 1.07 us per loop
%timeit sum(row.count(0) for row in m) # @J.F. Sebastian
1000000 loops, best of 3: 1.28 us per loop
%timeit [item for row in m for item in row].count(0) # OP
1000000 loops, best of 3: 1.53 us per loop
@thefourtheye's is the fastest. This is because of fewer function calls.
Update: @J.F. Sebastian's is the fastest in my environment. I don't know why...
The problem with your solution is that you have to iterate over the list again to get the count, which is O(N), whereas the len function can get the count in O(1).
You can make this a lot quicker with this:
def zeroCount(M):
    return len([item for row in M for item in row if item == 0])
Check this out:
from itertools import chain, filterfalse  # ifilterfalse for Python 2

def zeroCount(m):
    total = 0
    for x in filterfalse(bool, chain(*m)):
        total += 1
    return total
Performance tests on Python 3.3.3:
from timeit import timeit
from itertools import chain, filterfalse
import functools

m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))

def zeroCountOP():
    return [item for row in m for item in row].count(0)

def zeroCountTFE():
    return len([item for row in m for item in row if item == 0])

def zeroCountJFS():
    return sum(row.count(0) for row in m)

def zeroCountuser2931409():
    # `reduce` is in `functools` in Py3k
    return functools.reduce(lambda a, b: a + b, m).count(0)

def zeroCount():
    total = 0
    for x in filterfalse(bool, chain(*m)):
        total += 1
    return total

print('Original code   ', timeit(zeroCountOP, number=100000))
print('@J.F.Sebastian  ', timeit(zeroCountJFS, number=100000))
print('@thefourtheye   ', timeit(zeroCountTFE, number=100000))
print('@user2931409    ', timeit(zeroCountuser2931409, number=100000))
print('@frostnational  ', timeit(zeroCount, number=100000))
The above gives me these results:
Original code   0.244224319984056
@thefourtheye   0.22169152169497108
@user2931409    0.19247795242092186
@frostnational  0.18846473728790825
@J.F.Sebastian  0.1439318853410907
@J.F.Sebastian's solution is the winner; mine is the runner-up (about 20% slower).
Comprehensive solution for both Python 2 and Python 3:
import sys
import itertools

if sys.version_info < (3, 0, 0):
    filterfalse = getattr(itertools, 'ifilterfalse')
else:
    filterfalse = getattr(itertools, 'filterfalse')

def countzeros(matrix):
    '''Make good use of `itertools.filterfalse`
    (`itertools.ifilterfalse` in the case of Python 2) to count
    all 0s in `matrix`.'''
    counter = 0
    for _ in filterfalse(bool, itertools.chain(*matrix)):
        counter += 1
    return counter

if __name__ == '__main__':
    # Benchmark
    from timeit import repeat
    print(repeat('countzeros(((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0)))',
                 'from __main__ import countzeros',
                 repeat=10,
                 number=100000))
Use numpy:
import numpy
m=((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
numpy_m = numpy.array(m)
print numpy.sum(numpy_m == 0)
How does the above work? First, your "matrix" is converted to a numpy array (numpy.array(m)). Then, each entry is checked for equality with zero (numpy_m == 0). This yields a binary array. Summing over this binary array gives the number of zero elements in the original array.
Note that numpy is clearly more efficient for larger matrices; 4x4 might be too small to see a large performance difference versus ordinary Python code, especially if you are initializing a Python "matrix" as above.
One numpy solution is:
import numpy as np

m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
mm = np.array(m)

def zeroCountSmci():
    return (mm == 0).sum()  # sums across all axes, by default
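A closely related alternative, not benchmarked in the answers above, is np.count_nonzero on the boolean mask, which counts the True entries without building an intermediate sum:
import numpy as np

mm = np.array(((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0)))
zero_count = np.count_nonzero(mm == 0)   # -> 4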