I've been using a pre-existing assignment tool to allocate undergraduate students to projects. I am now keen to try to build an assignment tool in Python which will allow me to add and adjust constraints, as we are faced with extraordinary pressure on space usage due to COVID.
The basis of the task is to place as many students as possible with their favored supervisors. I have 85 students, each of whom provided a 1-to-5 rank order of preferred supervisors, which allows me to tune a cost variable. In addition, there are 40 supervisors with varying capacity; some can take 2 students, some 3, etc., with a total capacity of ca. 100.
I have been using Google OR-Tools (Python implementation) and have attempted the "Assignment with Task Sizes" strategy so far, with both CP-SAT and the MIP solver. I can produce a solution using CP-SAT for a small dummy data set, but when I use last year's data with a cost matrix of size 85x40, I haven't been able to generate an assignment solution. In contrast, the MIP solver approach produces an assignment with "0 cost" and no actual assignments. I have also started to construct a minimum-cost-flow model, but so far I have not been able to get the program to run on my input data.
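The network I have in mind for the min-cost-flow version is sketched below, assuming cost and sizes are the arrays built in the EDIT further down and that SimpleMinCostFlow lives in ortools.graph.pywrapgraph in my OR-Tools version (this is a sketch of the intended structure, not my actual non-working script): a source node supplies one unit per student, each student has a unit-capacity arc to every supervisor with the cost-matrix value as the arc cost, and each supervisor has an arc to the sink capped at their capacity.
from ortools.graph import pywrapgraph

def solve_flow(cost, sizes):
    num_students, num_staff = len(cost), len(cost[0])
    source, sink = 0, num_students + num_staff + 1
    mcf = pywrapgraph.SimpleMinCostFlow()

    for i in range(num_students):
        # source -> student: every student needs exactly one place
        mcf.AddArcWithCapacityAndUnitCost(source, 1 + i, 1, 0)
        for j in range(num_staff):
            # student -> staff: unit capacity, cost = squared choice rank (or the 100 filler)
            mcf.AddArcWithCapacityAndUnitCost(1 + i, 1 + num_students + j, 1, int(cost[i][j]))
    for j in range(num_staff):
        # staff -> sink: capped at the supervisor's capacity
        mcf.AddArcWithCapacityAndUnitCost(1 + num_students + j, sink, int(sizes[j]), 0)

    mcf.SetNodeSupply(source, num_students)
    mcf.SetNodeSupply(sink, -num_students)

    if mcf.Solve() == mcf.OPTIMAL:
        print('Minimum cost =', mcf.OptimalCost())
        for arc in range(mcf.NumArcs()):
            if mcf.Flow(arc) == 1 and mcf.Tail(arc) != source and mcf.Head(arc) != sink:
                print('Student', mcf.Tail(arc) - 1, 'assigned to staff', mcf.Head(arc) - 1 - num_students)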
My general question is: what is the best general approach for an assignment problem with more agents than tasks, where each task has the capacity to accept more than one agent?
Happy to provide what code I have if useful.
Thanks, Dave
EDIT
The code I have been using for the CP-SAT attempt is below. I first create a NumPy array with dimensions given by the number of students and the number of staff and fill it with the value 100; this is the value used for elements of the cost matrix that are not among a student's choices. I have taken each student's choices (1 to 5) and squared them to give a bit of differentiation:
from __future__ import print_function
from ortools.sat.python import cp_model
import time
import pandas as pd
import numpy as np
capacity=pd.read_excel(r'C:\Users\Dave\Documents\assign_2019.xlsx',
index_col=0, sheet_name='capacity')
capacity=capacity.reset_index()
choices=pd.read_excel(r'C:\Users\Dave\Documents\assign_2019.xlsx',
index_col=0, sheet_name='choices')
choices=choices.reset_index()
array=np.empty((len(choices), len(capacity)))
array.fill(100)
cost = pd.DataFrame(data=array)
cost.index=choices['student']
cost.columns=capacity['staff']
choices=choices.set_index(['student'])
for i in choices.index:
    for j in choices.columns:
        s = choices.loc[i, j]
        cost.loc[i, s] = j**2
cost=cost.to_numpy()
cost=cost.astype(int)
sizes = capacity['capacity']
sizes=sizes.to_numpy()
sizes=sizes.astype(int)
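# optional sanity check: the "one project per student" constraint below is only
# feasible if total staff capacity covers every student
assert sizes.sum() >= len(cost), "total capacity is below the number of students"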
def main():
    model = cp_model.CpModel()
    start = time.time()

    num_workers = len(cost)
    num_tasks = len(cost[1])

    # Variables
    x = []
    for i in range(num_workers):
        t = []
        for j in range(num_tasks):
            t.append(model.NewIntVar(0, 1, "x[%i,%i]" % (i, j)))
        x.append(t)

    x_array = [x[i][j] for i in range(num_workers) for j in range(num_tasks)]

    # Constraints
    # Each staff is allocated no more than capacity.
    [model.Add(sum(x[i][j] for i in range(num_workers)) <= sizes[j])
     for j in range(num_tasks)]

    # Number of projects allocated to a student is 1.
    [model.Add(sum(x[i][j] for j in range(num_tasks)) == 1)
     for i in range(num_workers)]

    model.Minimize(sum([np.dot(x_row, cost_row) for (x_row, cost_row) in zip(x, cost)]))

    solver = cp_model.CpSolver()
    status = solver.Solve(model)

    if status == cp_model.OPTIMAL:
        print('Minimum cost = %i' % solver.ObjectiveValue())
        print()
        for i in range(num_workers):
            for j in range(num_tasks):
                if solver.Value(x[i][j]) >= 1:
                    print('Worker ', i, ' assigned to task ', j, ' Cost = ', cost[i][j])
        print()

    end = time.time()
    print("Time = ", round(end - start, 4), "seconds")


if __name__ == '__main__':
    main()
EDIT 2:
Code used for MIP solver (data input is the same as the CP-SAT attempt described above):
from ortools.linear_solver import pywraplp

def main():
    solver = pywraplp.Solver('SolveAssignmentProblem',
                             pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
    start = time.time()

    num_workers = len(cost)
    num_tasks = len(cost[1])

    # Variables
    x = {}
    for i in range(num_workers):
        for j in range(num_tasks):
            x[i, j] = solver.IntVar(0, 1, 'x[%i,%i]' % (i, j))

    # Constraints
    # Staff can accept students up to capacity.
    for i in range(num_workers):
        solver.Add(solver.Sum([x[i, j] for j in range(num_tasks)]) <= sizes[j])

    # Student is allocated 1 project.
    for j in range(num_tasks):
        solver.Add(solver.Sum([x[i, j] for i in range(num_workers)]) == 1)

    solver.Minimize(solver.Sum([cost[i][j] * x[i, j]
                                for i in range(num_workers)
                                for j in range(num_tasks)]))

    print('Minimum cost = ', solver.Objective().Value())
    print()
    for i in range(num_workers):
        for j in range(num_tasks):
            if x[i, j].solution_value() > 0:
                print('Worker', i, ' assigned to task', j, ' Cost = ', cost[i][j])
    print()

    end = time.time()
    print("Time = ", round(end - start, 4), "seconds")


if __name__ == '__main__':
    main()
The outcome of the MIP attempt is:
Minimum cost = 0.0
Time = 0.553 seconds
I am using Google OR-Tools and I'm trying to use it to generate a timetable for a high school.
The variables are lessons, rooms, and timeslots, and the goal, of course, is to assign every lesson to a room and a timeslot while respecting the given constraints.
The problem is that I don't see the documentation discussing soft versus hard constraints, and all the constraints I've added are definitely hard ones. Is there a way to implement a soft constraint for this example, even just a simple one?
Thanks in advance.
Probably a duplicate of Do Google Optimization Tools support Soft Constraints?, but I'll add some examples with CP-SAT.
Here's a simple soft limit example:
from ortools.sat.python import cp_model
model = cp_model.CpModel()
solver = cp_model.CpSolver()
x = [model.NewBoolVar("") for i in range(10)]
# hard constraint: number of 1's >= 4
model.Add(sum(x) >= 4)
# soft constraint: number of 1's <= 5
delta = model.NewIntVar(-5, 5, "")
excess = model.NewIntVar(0, 5, "")
model.Add(delta == sum(x) - 5)
model.AddMaxEquality(excess, [delta, 0])
model.Minimize(excess)
solver.Solve(model)
print([solver.Value(i) for i in x])
print(solver.Value(excess))
See a more complex example here
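Along the same lines, a soft lower bound can be modelled with a shortfall variable that gets penalised in the objective. A small sketch in the same style (the weight of 3 is arbitrary):
from ortools.sat.python import cp_model

model = cp_model.CpModel()
solver = cp_model.CpSolver()
x = [model.NewBoolVar("") for i in range(10)]

# hard constraint: number of 1's <= 4
model.Add(sum(x) <= 4)

# soft constraint: number of 1's >= 6, penalised by 3 per unit of shortfall
shortfall = model.NewIntVar(0, 6, "")
model.Add(sum(x) + shortfall >= 6)
model.Minimize(3 * shortfall)

solver.Solve(model)
print([solver.Value(i) for i in x])
print(solver.Value(shortfall))  # 2 here: the hard cap keeps the sum at 4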
And here's one concerning fulfilled requests:
from ortools.sat.python import cp_model
model = cp_model.CpModel()
solver = cp_model.CpSolver()
x = [model.NewIntVar(0, 10, "") for i in range(10)]
# request: sum() <= 10
req1 = model.NewBoolVar("")
model.Add(sum(x) <= 10).OnlyEnforceIf(req1)
# request: sum() >= 5
req2 = model.NewBoolVar("")
model.Add(sum(x) >= 5).OnlyEnforceIf(req2)
# request: sum() >= 100
req3 = model.NewBoolVar("")
model.Add(sum(x) >= 100).OnlyEnforceIf(req3)
model.Maximize(req1 + req2 + req3)
solver.Solve(model)
print(solver.Value(sum(x)))
print(solver.ObjectiveValue())
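If some requests matter more than others, integer weights can go straight into the maximised expression (a sketch; the weights here are arbitrary):
model.Maximize(5 * req1 + 2 * req2 + req3)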
I am trying to code the secretary problem in Python with a Monte Carlo simulation (without using e). The essence of the problem is here: https://en.wikipedia.org/wiki/Secretary_problem
It is described as: "Imagine an administrator who wants to hire the best secretary out of n rankable applicants for a position. The applicants are interviewed one by one in random order. A decision about each particular applicant is to be made immediately after the interview. Once rejected, an applicant cannot be recalled. During the interview, the administrator can rank the applicant among all applicants interviewed so far but is unaware of the quality of yet unseen applicants. The question is about the optimal strategy (stopping rule) to maximize the probability of selecting the best applicant." Taken from: https://www.geeksforgeeks.org/secretary-problem-optimal-stopping-problem/
Table that I'm checking my code against:
Here is my python code so far:
import numpy as np

n = 7      # of applicants
m = 10000  # of repeats
plot = np.zeros(1)

for i in range(2, m):  # multiple runs
    array = np.random.randint(1, 1000, n)
    for j in range(2, n):  # over range of array
        test = 0
        if array[j] > array[1] and array[j] == array.max():
            plot = plot + 1
            test = 1
            break
        if array[j] > array[1]:
            test = 2
            break

print(plot/m)
print(array)
print("j = ", j)
print("test = ", test)
I am doing something wrong in my code here, because I'm unable to replicate the table. In the code above I've set the number of applicants to 7 and "take the best after" to 2.
plot/m should output the percentage in column three for a given number of applicants and "take the best after" value.
Answered! As below.
Additional code:
import numpy as np
import matplotlib.pyplot as plt
import time

plt.style.use('seaborn-whitegrid')

n = 150  # total number of applicants
nplot = np.empty([1, 1])
#take = 3 #not necessary, turned into J below:

for k in range(2, n):
    m = 10000  # number of repeats
    plot = np.empty([1, 1])
    for j in range(1, k):
        passed = 0
        for i in range(0, m):  # multiple runs
            array = np.random.rand(k)
            picked = np.argmax(array[j:] > max(array[0:j])) + j
            best = np.argmax(array)
            if best == picked:
                passed = passed + 1
        #print(passed/m)
        plot = np.append(plot, [passed/m])
    #print(plot)
    plot = plot[1:]
    x = range(1, k)
    y = plot
    #print("N = ", k)
    print("Check ", plot.argmax(), " if you have ", k, " applicants", round(100 * plot.max(), 2), "% chance of finding the best applicant")
    nplot = np.append(nplot, plot.max())

# Plot:
nplot = nplot[1:]
x = range(2, n)
y = nplot
plt.plot(x, y, 'o', color='black')
plt.xlabel("Number of Applicants")
plt.ylabel("Probability of Best Applicant")
Here is something that seems to do the job and is a bit simpler. Comments:
Use argmax to determine who the best secretary is, and to pick the first one whose grade beats everyone in the initial group.
Draw grades from a real-valued distribution to reduce the odds of two secretaries having the same grade.
Hence:
import numpy as np

n = 7
take = 2
m = 100000
passed = 0

for i in range(0, m):  # multiple runs
    array = np.random.rand(n)
    picked = np.argmax(array[take:] > max(array[0:take])) + take
    best = np.argmax(array)
    if best == picked:
        passed = passed + 1

print(passed/m)
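To reproduce the table from the question, the same loop can simply be swept over the cutoff; a sketch, where take runs over every possible size of the initial, always-rejected group:
import numpy as np

n = 7
m = 100000

for take in range(1, n):
    passed = 0
    for i in range(0, m):
        array = np.random.rand(n)
        picked = np.argmax(array[take:] > max(array[0:take])) + take
        best = np.argmax(array)
        if best == picked:
            passed = passed + 1
    print(take, passed / m)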
I am currently working on a data mining project that builds a similarity matrix that is 18000x18000.
Here are the two methods which build the matrix:
def CreateSimilarityMatrix(dbSubsetData, distancePairsList):
    global matrix
    matrix = [[0.0 for y in range(dbSubsetData.shape[0])] for x in range(dbSubsetData.shape[0])]
    for i in range(len(dbSubsetData)):  #record1
        SimilarityArray = []
        start = time.time()
        for j in range(i+1, len(dbSubsetData)):  #record2
            Similarity = GetDistanceBetweenTwoRecords(dbSubsetData, i, j, distancePairsList)
            #The similarities are all very small numbers which might be why the preference value needs to be so precise.
            #Let's multiply the value by a scalar 10 to give the values more range.
            matrix[i][j] = Similarity * 10.0
            matrix[j][i] = Similarity * 10.0
        end = time.time()
    return matrix
def GetDistanceBetweenTwoRecords(dbSubsetData, i, j, distancePairsList):
    Record1 = dbSubsetData.iloc[i]
    Record2 = dbSubsetData.iloc[j]
    columns = dbSubsetData.columns
    distancer = 0.0
    distancec = 0.0
    for i in range(len(Record1)):
        columnName = columns[i]
        Record1Value = Record1[i]
        Record2Value = Record2[i]
        if(Record1Value != Record2Value):
            ob = distancePairsList[distancePairsDict[columnName]-1]
            if(ob.attributeType == "String"):
                strValue = Record1Value+":"+Record2Value
                strValue2 = Record2Value+":"+Record1Value
                if strValue in ob.distancePairs:
                    val = ((ob.distancePairs[strValue])**2)
                    val = val * -1
                    distancec = distancec + val
                elif strValue2 in ob.distancePairs:
                    val = ((ob.distancePairs[strValue2])**2)
                    val = val * -1
                    distancec = distancec + val
            elif(ob.attributeType == "Number"):
                val = ((Record1Value - Record2Value)*ob.getSignificance())**2
                val = val * -1
                distancer = distancer + val
    distance = distancer + distancec
    return distance
Each outer iteration loops 18000 x 19 times (18000 for each other row and 19 for each attribute). Since the matrix is symmetric and I only have to fill one half, the total number of inner iterations is (18000 x 18000 x 19)/2, roughly 3 billion. This will take around 36 hours to complete, which is a timeframe I obviously want to shave down.
I figured multiprocessing was the trick. Since each row independently generates its numbers and writes them into the matrix, I should be able to run CreateSimilarityMatrix in multiple processes. So I created this code, which creates my processes:
from multiprocessing import Process

matrix = [[0.0 for y in range(SubsetDBNormalizedAttributes.shape[0])] for x in range(SubsetDBNormalizedAttributes.shape[0])]

if __name__ == '__main__':
    procs = []
    for i in range(4):
        proc = Process(target=CreateSimilarityMatrix, args=(SubsetDBNormalizedAttributes, distancePairsList, i, 4))
        procs.append(proc)
        proc.start()
        proc.join()
CreateSimilarityMatrix is now changed to
def CreateSimilarityMatrix(dbSubsetData, distancePairsList, counter=0, iteration=1):
    global Matrix
    for i in range(counter, len(dbSubsetData), iteration):  #record1
        SimilarityArray = []
        start = time.time()
        for j in range(i+1, len(dbSubsetData)):  #record2
            Similarity = GetDistanceBetweenTwoRecords(dbSubsetData, i, j, distancePairsList)
            #print("Similarity Between Records",i,":",j," is ", Similarity)
            #The similarities are all very small numbers which might be why the preference value needs to be so precise.
            #Let's multiply the value by a scalar 10 to give the values more range.
            Matrix[i][j] = Similarity * 10.0
            Matrix[j][i] = Similarity * 10.0
        end = time.time()
        print("Iteration", i, "took", end-start, "(s)")
Currently this goes s-l-o-w. It's really slow. It takes minutes to start one process, and then it takes minutes to start the next one. I thought these were supposed to run concurrently? Am I using Process incorrectly?
If you are using CPython, there is something called the global interpreter lock (GIL) that makes it difficult for multithreading to actually speed things up; it can instead slow things down substantially.
If you are dealing with matrices, use numpy, which is definitely a lot faster than plain Python loops.
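For example, the numeric attributes can be compared for all record pairs at once with broadcasting instead of a Python loop per pair. A sketch, assuming numeric is an (n_records, n_numeric_cols) float array holding those columns and weights holds the corresponding significance factors (both names are made up here):
import numpy as np

def numeric_similarity(numeric, weights):
    scaled = numeric * weights  # apply the per-column significance
    sq_norms = (scaled ** 2).sum(axis=1)
    # squared Euclidean distance between every pair of rows
    dist_sq = sq_norms[:, None] + sq_norms[None, :] - 2 * scaled @ scaled.T
    return -dist_sq * 10.0  # same sign convention and x10 scaling as the loop version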
I want to use multiprocessing in Python to speed up a while loop.
More specifically:
I have a matrix (samples x features). I want to select x subsets of samples whose values at a random subset of features are all unequal to a certain value (-1 in this case).
My serial code:
np.random.seed(43)
datafile = '...'
df = pd.read_csv(datafile, sep=" ", nrows = 89)
no_feat = 500
no_samp = 5
no_trees = 5
i=0
iter=0
samples = np.zeros((no_trees, no_samp))
features = np.zeros((no_trees, no_feat))
while i < no_trees:
rand_feat = np.random.choice(df.shape[1], no_feat, replace=False)
iter_order = np.random.choice(df.shape[0], df.shape[0], replace=False)
samp_idx = []
a=0
#--------------
#how to run in parallel?
for j in iter_order:
pot_samp = df.iloc[j, rand_feat]
if len(np.where(pot_samp==-1)[0]) == 0:
samp_idx.append(j)
if len(samp_idx) == no_samp:
print a
break
a+=1
#--------------
if len(samp_idx) == no_samp:
samples[i,:] = samp_idx
features[i, :] = rand_feat
i+=1
iter+=1
if iter>1000: #break if subsets cannot be found
break
Searching for suitable samples is the potentially expensive part (the j for loop), which in theory could run in parallel. In some cases it is not necessary to iterate over all samples to find a large enough subset, which is why I break out of the loop as soon as the subset is large enough.
I am struggling to find an implementation that would let me check how many valid results have been generated already. Is that even possible?
I have used joblib before. If I understand correctly, it uses the pool methods of multiprocessing as a backend, which only works for separate tasks? I am thinking that queues might be helpful, but so far I have failed at implementing them.
I found a working solution. I decided to run the while loop in parallel and have the different processes interact over a shared counter. Furthermore, I vectorized the search for suitable samples.
The vectorization yielded a ~300x speedup and running on 4 cores speeds up the computation ~twofold.
First I tried to implement separate processes and put the results into a queue. Turns out these aren't made to store large amounts of data.
If someone sees another bottleneck in this code, I would be glad to have it pointed out.
With my basically nonexistent knowledge about parallel computing I found it really hard to puzzle this together, especially since the examples on the internet are all very basic. I learnt a lot though =)
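In short, the shared-counter pattern boils down to the sketch below (stripped of the actual sampling; each worker just bumps a multiprocessing.Value under a Lock, exactly as the full code further down does):
from multiprocessing import Pool, Value, Lock

counter = Value('i', 0)  # shared integer counter
lock = Lock()

def init(shared_counter, shared_lock):
    # make the shared objects visible inside every worker process
    global counter, lock
    counter, lock = shared_counter, shared_lock

def work(_):
    with lock:
        counter.value += 1  # e.g. one valid subset found
    return counter.value

if __name__ == '__main__':
    p = Pool(4, initializer=init, initargs=(counter, lock))
    p.map(work, range(8))
    p.close()
    p.join()
    print(counter.value)  # 8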
My code:
import numpy as np
import pandas as pd
import itertools
from multiprocessing import Pool, Lock, Value
from datetime import datetime
import settings

val = Value('i', 0)
worker_ID = Value('i', 1)
lock = Lock()

def findSamp(no_trees, df, no_feat, no_samp):
    lock.acquire()
    print 'starting worker - {0}'.format(worker_ID.value)
    worker_ID.value += 1
    worker_ID_local = worker_ID.value
    lock.release()

    max_iter = 100000
    samp = []
    feat = []
    iter_outer = 0
    iter = 0
    while val.value < no_trees and iter_outer < max_iter:
        rand_feat = np.random.choice(df.shape[1], no_feat, replace=False)
        #get samples with random features from dataset;
        #find and select samples that don't have missing values in the random features
        samp_rand = df.iloc[:, rand_feat]
        nan_idx = np.unique(np.where(samp_rand == -1)[0])
        all_idx = np.arange(df.shape[0])
        notnan_bool = np.invert(np.in1d(all_idx, nan_idx))
        notnan_idx = np.where(notnan_bool == True)[0]

        if notnan_idx.shape[0] >= no_samp:
            #if enough samples for random feature subset, select no_samp samples randomly
            notnan_idx_rand = np.random.choice(notnan_idx, no_samp, replace=False)
            rand_feat_rand = rand_feat

            lock.acquire()
            val.value += 1
            #x = val.value
            lock.release()
            #print 'no of trees generated: {0}'.format(x)

            samp.append(notnan_idx_rand)
            feat.append(rand_feat_rand)
        else:
            #increase iter_outer counter if no sample subset could be found for random feature subset
            iter_outer += 1
        iter += 1

    if iter >= max_iter:
        print 'exiting worker{0} because iter >= max_iter'.format(worker_ID_local)
    else:
        print 'worker{0} - finished'.format(worker_ID_local)
    return samp, feat

def initialize(*args):
    global val, worker_ID, lock
    val, worker_ID, lock = args

def star_findSamp(i_df_no_feat_no_samp):
    return findSamp(*i_df_no_feat_no_samp)

if __name__ == '__main__':
    np.random.seed(43)
    datafile = '...'
    df = pd.read_csv(datafile, sep=" ", nrows=89)
    df = df.fillna(-1)
    df = df.iloc[:, 6:]

    no_feat = 700
    no_samp = 10
    no_trees = 5000

    startTime = datetime.now()
    print 'starting multiprocessing'

    ncores = 4
    p = Pool(ncores, initializer=initialize, initargs=(val, worker_ID, lock))
    args = itertools.izip([no_trees]*ncores, itertools.repeat(df), itertools.repeat(no_feat), itertools.repeat(no_samp))

    result = p.map(star_findSamp, args)#, callback=log_result)
    p.close()
    p.join()

    print '{0} sample subsets for tree training have been found'.format(val.value)
    samples = [x[0] for x in result if x != None]
    samples = np.vstack(samples)
    features = [x[1] for x in result if x != None]
    features = np.vstack(features)

    print datetime.now() - startTime
I want to make changes to the coefficients in an existing model. Currently (with the Python API) I'm looping through the constraints and calling model.chgCoeff but it's quite slow. Is there a faster way, perhaps accessing the constraint matrix directly, in the Python and/or C API?
Example code below. The reason it's slow seems to be mostly because of the loop itself; replacing chgCoeff with any other operation is still slow. Normally I would get around this by using vector operations rather than for loops, but without access to the constraint matrix I don't think I can do that.
from __future__ import division
import gurobipy as gp
import numpy as np
import time
N = 300
M = 2000
m = gp.Model()
m.setParam('OutputFlag', False)
masks = [np.random.rand(N) for i in range(M)]
p = 1/np.random.rand(N)
rets = [p * masks[i] - 1 for i in range(M)]
v = np.random.rand(N)*10000 * np.round(np.random.rand(N))
t = m.addVar()
x = [m.addVar(vtype=gp.GRB.SEMICONT, lb=1000, ub=v[i]) for i in range(N)]
m.update()
cons = [m.addConstr(t <= gp.LinExpr(ret, x)) for ret in rets]
m.setObjective(t, gp.GRB.MAXIMIZE)
m.update()
start_time = time.time()
m.optimize()
solve_ms = int(((time.time() - start_time)*1000))
print('First solve took %s ms' % solve_ms)
p = 1/np.random.rand(N)
rets = [p * masks[i] - 1 for i in range(M)]
start_time = time.time()
for i in range(M):
    for j in range(N):
        if rets[i][j] != -1:
            m.chgCoeff(cons[i], x[j], -rets[i][j])
m.update()
update_ms = int(((time.time() - start_time)*1000))
print('Model update took %s ms' % update_ms)
start_time = time.time()
m.optimize()
solve_ms = int(((time.time() - start_time)*1000))
print('Second solve took %s ms' % solve_ms)
k = 2
start_time = time.time()
for i in range(M):
    for j in range(N):
        if rets[i][j] != -1:
            k *= rets[i][j]
solve_ms = int(((time.time() - start_time)*1000))
print('Plain loop took %s ms' % solve_ms)
R = np.array(rets)
start_time = time.time()
S = np.copy(R)
copy_ms = int(((time.time() - start_time)*1000))
print('np.copy() took %s ms' % copy_ms)
Output:
First solve took 1767 ms
Model update took 2051 ms
Second solve took 1872 ms
Plain loop took 1103 ms
np.copy() took 3 ms
A call to np.copy on the size (2000, 300) constraint matrix takes 3ms. Is there a fundamental reason I'm missing that the whole model update can't be that fast?
You can't access the constraint matrix directly in Gurobi through the Python interface. Even if you could, you couldn't do an np.copy operation, because the matrix is stored in CSR (compressed sparse row) format, not a dense format. To make this sort of wholesale change to a constraint, it is better to remove the constraint and add a new one. In your case the changes are so significant that you aren't going to get much benefit from a warm start, so you won't lose anything by not keeping the same constraint objects.
Assuming you adjust the rets array in the code above for the -1 special case, the following code will do what you want and be much faster.
for con in cons:
    m.remove(con)
new_cons = [m.addConstr(t <= gp.LinExpr(ret, x)) for ret in rets]
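After the rebuild you would update and re-solve just as in the original script; a sketch of the same timing pattern, assuming the rest of the question's script is still in scope:
m.update()
start_time = time.time()
m.optimize()
print('Solve after rebuild took %s ms' % int((time.time() - start_time) * 1000))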