z3 much slower than ortools SAT. Why?

z3 much slower than ortools SAT. Why? - python

I am trying out different solvers for a toy clique problem and was surprised to find that ortools using SAT seems much faster than z3. I am wondering if I am doing something wrong given that z3 does so well in the published benchmarks. Here is an MWE:
First create a random graph with 150 vertices:
# First make a random graph and find the largest clique size in it
import igraph as ig
import random
from tqdm import tqdm
random.seed(7)
num_variables = 150 # bigger gives larger running time gap between z3 and ortools grow
print("Making graph")
g = ig.Graph.Erdos_Renyi(num_variables, 0.6)
# Make a set of edges. Maybe this isn't necessary
print("Making set of edges")
edges = set()
for edge in tqdm(g.es):
edges.add((edge.source, edge.target))
Now use z3 to find the max clique size:
import z3
z3.set_option(verbose=1)
myVars = []
for i in range(num_variables):
myVars += [z3.Int('v%d' % i)]
opt = z3.Optimize()
for i in range(num_variables):
opt.add(z3.Or(myVars[i]==0, myVars[i] == 1))
for i in tqdm(range(num_variables)):
for j in range(i+1, num_variables):
if not (i, j) in edges:
opt.add(myVars[i] + myVars[j] <= 1)
t = time()
h = opt.maximize(sum(myVars))
opt.check()
print(round(time()-t,2))
This takes around 70 seconds on my PC.
Now do the same thing using SAT from ortools.
from ortools.linear_solver import pywraplp
solver = pywraplp.Solver.CreateSolver('SAT')
solver.EnableOutput()
myVars = []
for i in range(num_variables):
myVars += [solver.IntVar(0.0, 1.0, 'v%d' % i)]
for i in tqdm(range(num_variables)):
for j in range(i+1, num_variables):
if not (i, j) in edges:
solver.Add(myVars[i] + myVars[j] <= 1)
print("Solving")
solver.Maximize(sum(myVars))
t = time()
status = solver.Solve()
print(round(time()-t,2))
This takes about 4 seconds on my PC.
If you increase num_variables the gap grows even larger.
Am I doing something wrong or is this just a really bad case for the z3 optimizer?
Update
It turns out that ortools using SAT is multi-core by default using 8 cores so the timings are unfair. Using #AxelKemper's improvement I now get (on a different machine):
Z3 time 88 seconds. If we assume perfect parallelisation that is 11 seconds on 8 cores.
ortools 4.3 seconds
So Z3 is only 2.5 times slower than ortools.
Update 2
Using #alias's improved code that uses s.add(z3.PbGe([(x, 1) for x in myVars], k)) it now takes 30.4 seconds which when divided by 8 is faster than ortools!

A custom tool like ortools is usually hard to beat, as it understands only a fixed number of domains, and they take advantage of parallel hardware. An SMT solver shines not at speed, but rather at what it allows: Combination of many many theories (arithmetic, data-structures, booleans, reals, integers, floats, etc.), so it's not a fair comparison.
Having said that, we can make z3 go faster by using three ideas:
Use Bool instead of Int as Axel suggested. z3 is much better dealing with booleans as is, instead of coding them via integers. See this earlier answer for general advice on coding in z3.
z3's regular solver is much more adept at handling many constraints compared to the optimizer. While the optimizer definitely has its applications, avoid it if you can. If your problem can use an iterative approach (i.e., get a solution and keep improving by adding new constraints), it's definitely worth trying. Not all optimization problems are amenable to this sort of iteration based optimization of course, but whenever applicable, it can pay off. (One reason for this is that the optimization is a much more difficult problem that can get the solver bogged down in irrelevant details. Second reason is that z3's optimizer hasn't received much attention in recent years compared to the solver engines, so it's not keeping up with the improvements that were done to the mainline solver. At least this is my impression.)
Use pseudo-boolean equalities, for which z3 has internal tactics to deal with much more efficiently. For pseudo-boolean constraints, see this answer.
Putting all these ideas together, here's how I'd code your problem in z3:
import igraph as ig
import random
from time import *
import z3
# First make a random graph and find the largest clique size in it
from tqdm import tqdm
random.seed(7)
num_variables = 150 # bigger gives larger running time gap between z3 and ortools grow
print("Making graph")
g = ig.Graph.Erdos_Renyi(num_variables, 0.6)
# Make a set of edges. Maybe this isn't necessary
print("Making set of edges")
edges = set()
for edge in tqdm(g.es):
edges.add((edge.source, edge.target))
myVars = []
for i in range(num_variables):
myVars += [z3.Bool('v%d' % i)]
total = sum([z3.If(i, 1, 0) for i in myVars])
s = z3.Solver()
for i in tqdm(range(num_variables)):
for j in range(i+1, num_variables):
if not (i, j) in edges:
s.add(z3.Not(z3.And(myVars[i], myVars[j])))
t = time()
curTotal = 0
while True:
s.add(z3.PbGe([(x, 1) for x in myVars], curTotal+1))
r = s.check()
if r == z3.unsat:
break
curTotal = s.model().evaluate(total, model_completion=True).as_long()
print(r, curTotal, round(time()-t,2))
print("DONE total time =", round(time()-t,2))
Note the use of z3.PbGe which introduces the pseudo-boolean constraints. In my experiments this runs faster than the regular constraint version; and if you also assume 8-core full parallelization speed-up, I think it compares favorably to the ortools solution.
Note that the iterative loop above is coded in a linear fashion, i.e., we ask for curTotal+1 solution, incrementing the last solution found by z3 only by one in each call to check. At the expense of some extra programming, you can also implement a binary-search like routine, i.e., ask for double the last result, and if unsat scale back to find the optimum, using push/pop to manage the assertion stack. This trick can perform even better when amortized over many runs.
Please report the results of your own experiments!

Replacing the Int variables by Bool variables reduced the z3py runtime by 28% on my machine:
z3.set_option(verbose=1)
myVars = []
for i in range(num_variables):
myVars += [z3.Bool('v%d' % i)] # Bool rather than Int
opt = z3.Optimize()
for i in tqdm(range(num_variables)):
for j in range(i+1, num_variables):
if not (i, j) in edges:
opt.add(Not(And(myVars[i], myVars[j])))
t = time()
h = opt.maximize(Sum([If(myVars[i], 1, 0) for i in range(num_variables)]))
opt.check()
print(round(time()-t,2))

Related

Does this manipulation in python save more time for my highly inefficient program?

I am using the following code unchanged in form but changed in content:
import numpy as np
import matplotlib.pyplot as plt
import random
from random import seed
from random import randint
import math
from math import *
from random import *
import statistics
from statistics import *
n=1000
T_plot=[0];
X_relm=[0];
class Objs:
def __init__(self, xIn, yIn, color):
self.xIn= xIn
self.yIn = yIn
self.color = color
def yfT(self, t):
return self.yIn*t+self.yIn*t
def xfT(self, t):
return self.xIn*t-self.yIn*t
xi=np.random.uniform(0,1,n);
yi=np.random.uniform(0,1,n);
O1 = [Objs(xIn = i, yIn = j, color = choice(["Black", "White"])) for i,j
in zip(xi,yi)]
X=sorted(O1,key=lambda x:x.xIn)
dt=1/(2*n)
T=20
iter=40000
Black=[]
White=[]
Xrelm=[]
for i in range(1,iter+1):
t=i*dt
for j in range(n-1):
check=X[j].xfT(t)-X[j+1].xfT(t);
if check<0:
X[j],X[j+1]=X[j+1],X[j]
if check<-10:
X[j].color,X[j+1].color=X[j+1].color,X[j].color
if X[j].color=="Black":
Black.append(X[j].xfT(t))
else:
White.append(X[j].xfT(t))
Xrel=mean(Black)-mean(White)
Xrelm.append(Xrel)
plot1=plt.figure(1);
plt.plot(T_plot,Xrelm);
plt.xlabel("time")
plt.ylabel("Relative ")
and it keeps running (I left it for 10 hours) without giving output for some parameters simply because it's too big I guess. I know that my code is not faulty totally (in the sense that it should give something even if wrong) because it does give outputs for fewer time steps and other parameters.
So, I am focusing on trying to optimize my code so that it takes lesser time to run. Now, this is a routine task for coders but I am a newbie and I am coding simply because the simulation will help in my field. So, in general, any inputs of a general nature that give insights on how to make one's code faster are appreciated.
Besides that, I want to ask whether defining a function a priori for the inner loop will save any time.
I do not think it should save any time since I am doing the same thing but I am not sure maybe it does. If it doesn't, any insights on how to deal with nested loops in a more efficient way along with those of general nature are appreciated.
(I have tried to shorten the code as far as I could and still not miss relevant information)

There are several issues in your code:
the mean is recomputed from scratch based on the growing array. Thus, the complexity of mean(Black)-mean(White) is quadratic to the number of elements.
The mean function is not efficient. Using a basic sum and division is much faster. In fact, a manual mean is about 25~30 times faster on my machine.
The CPython interpreter is very slow so you should avoid using loops as much as possible (OOP code does not help either). If this is not possible and your computation is expensive, then consider using a natively compiled code. You can use tools like PyPy, Numba or Cython or possibly rewrite a part in C.
Note that strings are generally quite slow and there is no reason to use them here. Consider using enumerations instead (ie. integers).
Here is a code fixing the first two points:
dt = 1/(2*n)
T = 20
iter = 40000
Black = []
White = []
Xrelm = []
cur1, cur2 = 0, 0
sum1, sum2 = 0.0, 0.0
for i in range(1,iter+1):
t = i*dt
for j in range(n-1):
check = X[j].xfT(t) - X[j+1].xfT(t)
if check < 0:
X[j],X[j+1] = X[j+1],X[j]
if check < -10:
X[j].color, X[j+1].color = X[j+1].color, X[j].color
if X[j].color == "Black":
Black.append(X[j].xfT(t))
else:
White.append(X[j].xfT(t))
delta1, delta2 = sum(Black[cur1:]), sum(White[cur2:])
sum1, sum2 = sum1+delta1, sum2+delta2
cur1, cur2 = len(Black), len(White)
Xrel = sum1/cur1 - sum2/cur2
Xrelm.append(Xrel)
Consider resetting Black and White to an empty list if you do not use them later.
This is several hundreds of time faster. It now takes 2 minutes as opposed to >20h (estimation) for the initial code.
Note that using a compiled code should be at least 10 times faster here so the execution time should be no more than dozens of seconds.

As mentioned in earlier comments, this one is a bit too broad to answer.
To illustrate; your iteration itself doesn't take very long:
import time
start = time.time()
for i in range(10000):
for j in range(10000):
pass
end = time.time()
print (end-start)
On my not-so-great machine that takes ~2s to complete.
So the looping portion is only a tiny fraction of your 10h+ run time.
The detail of what you're doing in the loop is the key.
Whilst very basic, the approach I've shown in the code above could be applied to your existing code to work out which bit(s) are the least performant and then raise a new question with some more specific, actionable detail.

Improve performance of constraint-adding in Gurobi (Python-Interface)

i got this decision variable:
x={}
for j in range(10):
for i in range(500000):
x[i,j] = m.addVar(vtype=GRB.BINARY, name="x%d%d" %(i,j))
so i need to add constraints for each x[i,j] variable like this:
for p in range(10):
for u in range(500000):
m.addConstr(x[u,p-1]<=x[u,p])
this is taking me so much time, more that 12hrs and then a lack of memory pop-up appears at my computer.
Can someone helpme to improve this constraint addition problem

Most likely, you are running out of physical memory and using virtual (swap) memory. This would not cause your computer to report an out-of-memory warning or error.
I rewrote your code as follows:
from gurobipy import *
m = Model()
x={}
for j in range(10):
for i in range(500000):
x[i,j] = m.addVar(vtype=GRB.BINARY, name="x%d%d" %(i,j))
m.update()
for p in range(10):
for u in range(500000):
try:
m.addConstr(x[u,p-1]<=x[u,p])
except:
pass
m.update()
I tested this using Gurobi Optimizer 6.5.2 on a computer with an Intel Xeon E3-1240 processor (3.40 GHz) and 32 GB of physical memory. It was able to formulate the variables and constraints in 1 minute 14 seconds. You might be able to save a small amount of memory using a list, but I believe that Gurobi Var and Constr objects require far more memory than a Python dict or list.

General Remark:
It looks quite costly to add 5 million constraints in general
Specific Remark:
Approach
You are wasting time and space by using dictionaries
Despite having constant-access complexity, these constants are big
They are also wasting memory
In a simple 2-dimensional case like this: stick to arrays!
Validity
Your indexing is missing the border-case of the first element, so indexing breaks!
Try this (much more efficient approach; using numpy's arrays):
import numpy as np
from gurobipy import *
N = 10
M = 500000
m = Model("Testmodel")
x = np.empty((N, M), dtype=object)
for i in range(N):
for j in range(M):
x[i,j] = m.addVar(vtype=GRB.BINARY, name="x%d%d" %(i,j))
m.update()
for u in range(M): # i switched the loop-order
for p in range(1,N): # i'm handling the border-case
m.addConstr(x[p-1,u] <= x[p,u])
Result:
~2 minutes
~2.5GB memory (complete program incl. Gurobi's internals)

Speeding up dynamic programming in python/numpy

I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
def optimalcost(cost, P1=10):
width1,width2 = cost.shape
M = array(cost)
for i in range(0,width1):
for j in range(0,width2):
try:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
except:
M[i,j] = inf
return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10
void optimalcost(double** costin, double** costout){
/*
We assume that costout is initially
filled with costin's values.
*/
float a,b,c,prevcost;
for(i=0;i<400;i++){
for(j=0;j<400;j++){
a = prevcost+P1;
b = costout[i][j-1]+P1;
c = costout[i-1][j-1];
costout[i][j] += min(prevcost,min(b,c));
prevcost = costout[i][j];
}
}
}
return;
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np
#autojit
def cost(left, right):
height,width = left.shape
cost = np.zeros((height,width,width))
for row in range(height):
for x in range(width):
for y in range(width):
cost[row,x,y] = abs(left[row,x]-right[row,y])
return cost
#autojit
def optimalcosts(initcost):
costs = zeros_like(initcost)
for row in range(height):
costs[row,:,:] = optimalcost(initcost[row])
return costs
#autojit
def optimalcost(cost):
width1,width2 = cost.shape
P1=10
prevcost = 0.0
M = np.array(cost)
for i in range(1,width1):
for j in range(1,width2):
M[i,j] += min(M[i-1,j-1],prevcost+P1,M[i,j-1]+P1)
prevcost = M[i,j]
return M
prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)
print '---------- Numba Time ----------'
t = time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t
print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. Note for anyone interested in writing Numba code, you need to explicitly express loops in your code. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.

If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.

Numpy is usually not very good at iterative jobs (though it do have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.* and etc). But for simple tasks like finding the shortest path (or lowest energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (also try to avoid making copy:
Suppose we are finding a shortest path in the "row" direction (i.e. horizontally), we can first create our algorithm input:
# The problem, 300 400*400 matrices
# Create infinitely high boundary so that we dont need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
A[i].min(2, T)
B[i] += T
The timing result on my (super old laptop) machine is 1.78s, which is already way faster than 3 minute. I believe you can improve even more (while stick to numpy) by optimize the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split to smaller problems (by dividing on the batch axis).

Need help vectorizing code or optimizing

I am trying to do a double integral by first interpolating the data to make a surface. I am using numba to try and speed this process up, but it's just taking too long.
Here is my code, with the images needed to run the code located at here and here.

Noting that your code has a quadruple-nested set of for loops, I focused on optimizing the inner pair. Here's the old code:
for i in xrange(K.shape[0]):
for j in xrange(K.shape[1]):
print(i,j)
'''create an r vector '''
r=(i*distX,j*distY,z)
for x in xrange(img.shape[0]):
for y in xrange(img.shape[1]):
'''create an ksi vector, then calculate
it's norm, and the dot product of r and ksi'''
ksi=(x*distX,y*distY,z)
ksiNorm=np.linalg.norm(ksi)
ksiDotR=float(np.dot(ksi,r))
'''calculate the integrand'''
temp[x,y]=img[x,y]*np.exp(1j*k*ksiDotR/ksiNorm)
'''interpolate so that we can do the integral and take the integral'''
temp2=rbs(a,b,temp.real)
K[i,j]=temp2.integral(0,n,0,m)
Since K and img are each about 2000x2000, the innermost statements need to be executed sixteen trillion times. This is simply not practical using Python, but we can shift the work into C and/or Fortran using NumPy to vectorize. I did this one careful step at a time to try to make sure the results will match; here's what I ended up with:
'''create all r vectors'''
R = np.empty((K.shape[0], K.shape[1], 3))
R[:,:,0] = np.repeat(np.arange(K.shape[0]), K.shape[1]).reshape(K.shape) * distX
R[:,:,1] = np.arange(K.shape[1]) * distY
R[:,:,2] = z
'''create all ksi vectors'''
KSI = np.empty((img.shape[0], img.shape[1], 3))
KSI[:,:,0] = np.repeat(np.arange(img.shape[0]), img.shape[1]).reshape(img.shape) * distX
KSI[:,:,1] = np.arange(img.shape[1]) * distY
KSI[:,:,2] = z
# vectorized 2-norm; see http://stackoverflow.com/a/7741976/4323
KSInorm = np.sum(np.abs(KSI)**2,axis=-1)**(1./2)
# loop over entire K, which is same shape as img, rows first
# this loop populates K, one pixel at a time (so can be parallelized)
for i in xrange(K.shape[0]):
for j in xrange(K.shape[1]):
print(i, j)
KSIdotR = np.dot(KSI, R[i,j])
temp = img * np.exp(1j * k * KSIdotR / KSInorm)
'''interpolate so that we can do the integral and take the integral'''
temp2 = rbs(a, b, temp.real)
K[i,j] = temp2.integral(0, n, 0, m)
The inner pair of loops is now completely gone, replaced by vectorized operations done in advance (at a space cost linear in the size of the inputs).
This reduces the time per iteration of the outer two loops from 340 seconds to 1.3 seconds on my Macbook Air 1.6 GHz i5, without using Numba. Of the 1.3 seconds per iteration, 0.68 seconds are spent in the rbs function, which is scipy.interpolate.RectBivariateSpline. There is probably room to optimize further--here are some ideas:
Reenable Numba. I don't have it on my system. It may not make much difference at this point, but easy for you to test.
Do more domain-specific optimization, such as trying to simplify the fundamental calculations being done. My optimizations are intended to be lossless, and I don't know your problem domain so I can't optimize as deeply as you may be able to.
Try to vectorize the remaining loops. This may be tough unless you are willing to replace the scipy RBS function with something supporting multiple calculations per call.
Get a faster CPU. Mine is pretty slow; you can probably get a speedup of at least 2x simply by using a better computer than my tiny laptop.
Downsample your data. Your test images are 2000x2000 pixels, but contain fairly little detail. If you cut their linear dimensions by 2-10x, you'd get a huge speedup.
So that's it for me for now. Where does this leave you? Assuming a slightly better computer and no further optimization work, even the optimized code would take about a month to process your test images. If you only have to do this once, maybe that's fine. If you need to do it more often, or need to iterate on the code as you try different things, you probably need to keep optimizing--starting with that RBS function which consumes more than half the time now.
Bonus tip: your code would be a lot easier to deal with if it didn't have nearly-identical variable names like k and K, nor used j as a variable name and also as a complex number suffix (0j).

Parralelized code runs much slower in Python than in Matlab

I have a piece of code that does the following:
for each file (already read in the RAM):
call a function and obtain a result
add the results up and disply
Each file can be analyzed in parallel. The function that analyzes each file is as following:
# Complexity = 1000*19*19 units of work
def fun(args):
(a, b, p) = args
for itr in range(1000):
for i in range(19):
for j in range(19):
# The following random number generated depends on
# latest values in (i-1, j), (i+1, j), (i, j-1) & (i, j+1)
# cells of latest a and b arrays
u = np.random.rand();
if (u < p):
a[i, j] += -1
else:
b[i, j] += 1
return a+b
I am using multiprocessing package to achieve parallelism:
import numpy as np
import time
from multiprocessing import Pool, cpu_count
if __name__ == '__main__':
t = time.time()
pool = Pool(processes=cpu_count())
args = [None]*100
for i in range(100):
a = np.random.randint(2, size=(19, 19))
b = np.random.randint(2, size=(19, 19))
p = np.random.rand()
args[i] = (a, b, p)
result = pool.map(fun, args)
for i in range(2, 100):
result[0] += result[i]
print result[0]
print time.time() - t
I have written equivalent MATLAB code that uses parfor and calls fun in each iteration of parfor:
tic
args = cell(100, 1);
r = cell(100, 1);
parfor i = 1:100
a = randi(2, 19, 19);
b = randi(2, 19, 19);
p = rand();
args{i}.a = a;
args{i}.b = b;
args{i}.p = p;
r{i} = fun(args{i});
end
for i = 2:100
r{1} = r{1} + r{i};
end
disp(r{1});
toc
The implementation of fun is as follows:
function [ ret ] = fun( args )
a = args.a;
b = args.b;
p = args.p;
for itr = 1:1000
for i = 1:19
for j = 1:19
u = rand();
if (u < p)
a(i, j) = a(i, j) + -1;
else
b(i, j) = b(i, j) + 1;
end
end
end
end
ret = a + b;
end
I find that MATLAB is blazingly fast, it takes around 1.5 seconds on a dual core processor compared to around 33-34 seconds taken by Python program. Why is this so?
EDIT: A lot of answers suggested that I should vectorize the random number generation. Actually the way it works is, random number generated depends on the latest a and b 2D arrays. I just put a simple rand() call to keep program simple and readable. In actuality of my program, random number is always generated by looking at certain horizontally and vertically neighbouring cells of (i, j) cell. So it is not possible to vectorize that.

Have you benchmarked both implementations of fun in a non-parallel context? One might just be a lot faster. In particular, those nested loops in the Python fun look like they might turn in to a faster vectorized solution in Matlab, or may be optimized by Matlab's JIT.
Stick both implementations in profilers to see where they're spending their time. Convert both implementations to non-parallel and profile those first, to make sure they're equivalent in performance before introducing the complexities of the parallelization stuff.
And one last check - you're setting up Matlab's Parallel Computation Toolbox with a local pool of workers, right, and not hooking in to a remote machine or picking up some other resources? How many workers on the Matlab side?

I did some tests on your Python code, though without the multiprocessing part and achieved a roughly 25x speedup by making the following changes:
Used Python lists instead of a NumPy array, because the latter is really slow when you need to do a lot of indexing. My timings are including the time it takes to do ndarray.tolist() so this might actually be a viable option, as long as the array's are not to big I think.
Ran it in PyPy instead of the regular Python interpreter, because PyPy has a JIT-compiler, to make the comparison with MATLAB a bit more fair. Regular CPython doesn't have such a feature.
Made the "random" call local to the function, i.e. do rand = np.random.rand and later u = rand(), because in Python lookups in the local namespace are faster, which can be significant in tight loops such as this.
Used Python's random.random instead of np.random.rand (also bound to a name local to the function).
Exchanged range for the generator-based xrange.
(This list is sorted from most significant speedup to only very small gain)
Of course there's also the parallel computing aspect. With multiprocessing all the Python objects that are passed around between processes are "pickled", meaning they have to be serialized and de-serialized on top of being copied between processes. MATLAB also copies data between processes (or threads?), but probably does so in a less wasteful way. Next to that, setting up a multiprocessing.Pool also takes a short while, which may not be fair to your MATLAB benchmark, but I'm not sure about this.
Based on your timings and mine, I would say that Python and MATLAB could be about equally fast for this specific task. But, unfortunately you have to jump through some hoops to get that speed with Python. Maybe using the #autojit capability of Numba could be interesting in this regard, if you have that available.

Try this version of fun and see if it gives you a speedup.
def fun(args):
a, b, p = args
n = 1000
u = np.random.random((n, 19, 19))
msk = u < p
msk_sum = msk.sum(0)
a -= msk_sum
b += (n - msk_sum)
return a + b
This is the more efficient way of implement this kind of function using numpy.
These kinds of nested loops can have quite high overhead in interpreted languages like matlab and python, but I suspect the JIT is compensating, at least partially, in matlab so the performance difference between the vectorized and loop implementations will be smaller. Cpython does not currently have any optimizations for these kinds of loops (that I know of) but but at least one python implementation, pypy, does have a JIT. Unfortunately pypy currently only has limited numpy support.
Update:
It seems like you have an iterative algorithm, and at least in my experience, those are the hardest to optimize with numpy/cpython. Consider using cython, this tutorial might also be useful, to write your nested loops. Other people might have other suggestions, but that's the best I can think of.

Given the available information, you would probably benefit from using Cython, which let's you also use some parallelism. Depending on the random numbers you need you could for instance use GSL to generate them.
Original question
There's no need to use multiprocessing, because fun can be easily vectorized, which yields a massive speed up (over 50x).
I'm not sure why matlab is not hit so hard as numpy, but its JIT may be saving it. Python does not like doing two . look ups in the inner loop, nor calling expensive functions there.
def fun_fast(args):
a, b, p = args
for i in xrange(19):
for j in xrange(19):
u = np.random.rand(1000)
msk = u < p
msk_sum = msk.sum()
a[i, j] -= msk_sum
b[i, j] += msk.size - msk_sum
return a + b

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.