Portfolio Selection in Python with constraints from a fixed set

I am working on a project where I am trying to select the optimal subset of players from a set of 125 players (example below)
The constraints are:
a) Number of players = 3
b) Sum of prices <= 30
The optimization function is Max(Sum of Votes)
Player            Vote   Price
William Smith     0.67   8.6
Robert Thompson   0.31   6.7
Joseph Robinson   0.61   6.2
Richard Johnson   0.88   4.3
Richard Hall      0.28   9.7
I looked at the scipy.optimize package, but I can't find a way to constrain the universe to this subset. Can anyone point me to a library that would do that?
Thanks

The problem is well suited to being formulated as a mathematical program and can be solved with a variety of optimization libraries.
It is known as the exact k-item knapsack problem.
You can use the package PuLP, for example. It has interfaces to several optimization software packages, but also comes bundled with a free solver.
pip install pulp
Free solvers are often way slower than commercial ones, but I think PuLP should be able to solve reasonably large versions of your problem with its standard solver.
Your problem can be solved with PuLP as follows:
from pulp import *
# Data input
players = ["William Smith", "Robert Thompson", "Joseph Robinson", "Richard Johnson", "Richard Hall"]
vote = [0.67, 0.31, 0.61, 0.88, 0.28]
price = [8.6, 6.7, 6.2, 4.3, 9.7]
P = range(len(players))
# Declare problem instance, maximization problem
prob = LpProblem("Portfolio", LpMaximize)
# Declare decision variable x, which is 1 if a
# player is part of the portfolio and 0 else
x = LpVariable.matrix("x", list(P), 0, 1, LpInteger)
# Objective function -> Maximize votes
prob += lpSum(vote[p] * x[p] for p in P)
# Constraint definition
prob += lpSum(x[p] for p in P) == 3
prob += lpSum(price[p] * x[p] for p in P) <= 30
# Start solving the problem instance
prob.solve()
# Extract solution
portfolio = [players[p] for p in P if x[p].varValue]
print(portfolio)
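On the five-player sample above, the optimal portfolio should be William Smith, Joseph Robinson and Richard Johnson (total vote 2.16 at a total price of 19.1), so the script prints something like:
['William Smith', 'Joseph Robinson', 'Richard Johnson']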
The runtime to draw 3 players from 125 with the same random data as used by Brad Solomon is 0.5 seconds on my machine.

Your problem is a discrete optimization task because of constraint a). You should introduce discrete variables to represent taken/not-taken players. Consider the following MiniZinc pseudocode:
array[players_num] of var bool: taken_players;
array[players_num] of float: votes;
array[players_num] of float: prices;
constraint sum (taken_players * prices) <= 30;
constraint sum (taken_players) = 3;
solve maximize sum (taken_players * votes);
As far as I know, you can't use scipy to solve such problems (e.g. this).
You can solve your problem in these ways:
You can generate a MiniZinc problem in Python and solve it by calling an external solver. This seems to be more scalable and robust.
You can use simulated annealing (see the sketch after this answer)
A mixed-integer programming approach
The second option seems simpler for you, but personally I prefer the first one: it lets you introduce a wide range of constraints, and the problem formulation feels more natural and clear.
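For the simulated-annealing route, here is a rough, self-contained sketch (my own illustration, not part of the original answer), using only the standard library and the five-player sample from the question; the penalty weight and cooling schedule are arbitrary choices:
import math, random

players = ["William Smith", "Robert Thompson", "Joseph Robinson", "Richard Johnson", "Richard Hall"]
vote = [0.67, 0.31, 0.61, 0.88, 0.28]
price = [8.6, 6.7, 6.2, 4.3, 9.7]

def score(sel):
    # total votes, minus a heavy penalty if the price budget is violated
    total_price = sum(price[i] for i in sel)
    return sum(vote[i] for i in sel) - 100 * max(0, total_price - 30)

current = set(random.sample(range(len(players)), 3))   # always exactly 3 players
best = current
temp = 1.0
while temp > 1e-3:
    # neighbour move: swap one selected player for an unselected one
    out_player = random.choice(list(current))
    in_player = random.choice([i for i in range(len(players)) if i not in current])
    candidate = (current - {out_player}) | {in_player}
    if score(candidate) > score(current) or random.random() < math.exp((score(candidate) - score(current)) / temp):
        current = candidate
        if score(current) > score(best):
            best = current
    temp *= 0.995

print([players[i] for i in sorted(best)])
For a 125-player universe you would load the real vote/price arrays and probably tune the schedule; if you need a guaranteed optimum, the MiniZinc/MIP route above is the better fit.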

@CaptainTrunky is correct, scipy.minimize will not work here.
Here is an awfully crappy workaround using itertools; please ignore it if one of the other methods has worked. Consider that drawing 3 players from 125 creates 317,750 combinations, n! / ((n - k)! * k!). Runtime on the main loop is roughly 6 minutes.
from itertools import combinations
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'Player': np.arange(0, 125),
                'Vote': 10 * np.random.random(125),
                'Price': np.random.randint(1, 10, 125)})
df
Out[109]:
Player Price Vote
0 0 4 7.52425
1 1 6 3.62207
2 2 9 4.69236
3 3 4 5.24461
4 4 4 5.41303
.. ... ... ...
120 120 9 8.48551
121 121 8 9.95126
122 122 8 6.29137
123 123 8 1.07988
124 124 4 2.02374
players = df.Player.values
idx = pd.MultiIndex.from_tuples([i for i in combinations(players, 3)])
votes = []
prices = []
for i in combinations(players, 3):
    vote = df[df.Player.isin(i)].sum()['Vote']
    price = df[df.Player.isin(i)].sum()['Price']
    votes.append(vote); prices.append(price)
result = DataFrame({'Price': prices, 'Vote': votes}, index=idx)
# The index below is (first player, second player, third player)
result[result.Price <= 30].sort_values('Vote', ascending=False)
Out[128]:
Price Vote
63 87 121 25.0 29.75051
64 121 20.0 29.62626
64 87 121 19.0 29.61032
63 64 87 20.0 29.56665
65 121 24.0 29.54248
... ...
18 22 78 12.0 1.06352
23 103 20.0 1.02450
22 23 103 20.0 1.00835
18 22 103 15.0 0.98461
23 14.0 0.98372
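If you do stick with brute force, most of that ~6 minute runtime goes into filtering the DataFrame once per combination; iterating over plain NumPy arrays instead should be substantially faster. A rough sketch (my addition, reusing df and combinations from above):
votes_arr = df.Vote.values
prices_arr = df.Price.values
best_vote, best_combo = -1.0, None
for combo in combinations(range(len(df)), 3):
    idx = list(combo)
    if prices_arr[idx].sum() <= 30:
        v = votes_arr[idx].sum()
        if v > best_vote:
            best_vote, best_combo = v, combo
print(best_combo, best_vote)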

Related

How to get the number pairs on the left of a DataFrameGroupBy? Python pandas

I have a DataFrameGroupBy object and I print all pairs of groupby keys like:
('br17', 'CBC')
   instances nbMaxVilles Solver Formulation Temps_Calculs Optimal
0 br17 10 CBC fillMTZ 0.46119860999999673 39
2 br17 10 CBC fillDL 0.565562113999988 39
('br17', 'Gurobi')
instances nbMaxVilles Solver Formulation Temps_Calculs Optimal
1 br17 10 Gurobi fillMTZ 0.4650516840000023 39
3 br17 10 Gurobi fillDL 0.5595316250000337 39
('ft53', 'CBC')
instances nbMaxVilles Solver Formulation Temps_Calculs Optimal
6 ft53 13 CBC fillDL 0.7997463710000261 1918
4 ft53 13 CBC fillMTZ 1.214644144000033 1918
and then I want to get all the number pairs on the left, such as:
[(0,2),(1,3),(6,4)]
Assuming the groupby object is called grouper, try this -
[tuple(i) for i in grouper.indices.values()]
Another way -
grouper.apply(lambda x:x.index.tolist()).tolist()
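Here is a minimal, self-contained illustration (toy data of my own) of what the two one-liners above return, assuming the default sorted grouping:
import pandas as pd
df = pd.DataFrame({'instances': ['br17', 'br17', 'br17', 'br17', 'ft53', 'ft53'],
                   'Solver': ['CBC', 'Gurobi', 'CBC', 'Gurobi', 'CBC', 'CBC']})
grouper = df.groupby(['instances', 'Solver'])
print([tuple(i) for i in grouper.indices.values()])
# roughly [(0, 2), (1, 3), (4, 5)] (the exact repr depends on your NumPy version)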

Need help dividing the ratio of elements in a Python list

I am working on a problem where I have been asked to a) output Fibonacci numbers in a sequence based on user input, as I have done below, and b) divide and print the ratio of the two most recent terms.
fixed_start = [0, 1]

def fib(fixed_start, n):
    if n == 0:
        return fixed_start
    else:
        fixed_start.append(fixed_start[-1] + fixed_start[-2])
        return fib(fixed_start, n - 1)

numb = int(input('How many terms: '))
fibonacci_list = fib(fixed_start, numb)
print(fibonacci_list[:-1])
I would like for my output to look something like the below:
"How many terms:" 3
1 1
the ratio is 1.0
1 2
the ratio is 2.0
2 3
the ratio is 1.5
Are you looking for the ratio of the last 2 items in the list? If yes, this should work:
print(fibonacci_list[-2:])
print(float(fibonacci_list[-1]/fibonacci_list[-2]))
Or, if you are looking for the ratio between every 2 consecutive numbers (except the 0 and 1 right at the start), the code below should do the trick:
for x, y in zip(fibonacci_list[1:], fibonacci_list[2:]):
    print(x, y)
    print('the ratio is ' + str(round((y/x), 3)))
The output is something like the below for a Fibonacci list of 15 terms:
1 1
the ratio is 1.0
1 2
the ratio is 2.0
2 3
the ratio is 1.5
3 5
the ratio is 1.667
5 8
the ratio is 1.6
8 13
the ratio is 1.625
13 21
the ratio is 1.615
21 34
the ratio is 1.619
34 55
the ratio is 1.618
55 89
the ratio is 1.618
89 144
the ratio is 1.618
144 233
the ratio is 1.618
233 377
the ratio is 1.618
377 610
the ratio is 1.618
610 987
the ratio is 1.618
As you have already solved part one by generating the Fibonacci series as a list, you can access the last two (most recent) elements and take their ratio. Python allows you to access list elements from the back using negative indexing:
def fibonacci_ratio(fibonacci_list):
    last_element = fibonacci_list[-1]
    second_last_element = fibonacci_list[-2]
    ratio = last_element / second_last_element
    return ratio
Note that a single / in Python 3 performs floating-point division; // would be floor division and truncate the result.
Hope this helps!
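For completeness, a minimal usage sketch tying the two parts together (it assumes the fib function from the question and the fibonacci_ratio function above):
fibonacci_list = fib([0, 1], 5)         # [0, 1, 1, 2, 3, 5, 8]
print(fibonacci_list[-2:])              # [5, 8]
print(fibonacci_ratio(fibonacci_list))  # 1.6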

Finding all combinations based on multiple conditions for a large list

I am trying to calculate the optimal team for a fantasy cycling game. I have a CSV file containing 176 cyclists, their teams, the number of points they have scored, and the price to put them in my team. I am trying to find the highest-scoring team of 16 cyclists.
The rules that apply to the composition of any team are:
The total cost of the team can't exceed 100.
No more than 4 cyclists from the same team can be in a fantasy team.
A short excerpt of my csv-file can be found below.
THOMAS Geraint,Team INEOS,142,13
SAGAN Peter,BORA - hansgrohe,522,11.5
GROENEWEGEN Dylan,Team Jumbo-Visma,205,11
FUGLSANG Jakob,Astana Pro Team,46,10
BERNAL Egan,Team INEOS,110,10
BARDET Romain,AG2R La Mondiale,21,9.5
QUINTANA Nairo,Movistar Team,58,9.5
YATES Adam,Mitchelton-Scott,40,9.5
VIVIANI Elia,Deceuninck - Quick Step,273,9.5
YATES Simon,Mitchelton-Scott,13,9
EWAN Caleb,Lotto Soudal,13,9
The simplest way of solving this problem would be to generate a list of all possible combinations of teams and then exclude the teams that do not comply with the aforementioned rules. After this it is simple to calculate the total score for each team and pick the highest-scoring one. In theory, generating the list of usable teams can be achieved by the code below.
import itertools
from itertools import groupby
import pandas as pd

database_csv = pd.read_csv('renner_db_example.csv')
renners = database_csv.to_dict(orient='records')
budget = 100
max_same_team = 4
team_total = 16
combos = itertools.combinations(renners, team_total)
usable_combos = []
for i in combos:
    # groupby only groups consecutive equal keys, so sort the team names first
    teams = sorted(persoon["Ploeg"] for persoon in i)
    if (sum(persoon["Waarde"] for persoon in i) < budget
            and all(len(list(group)) <= max_same_team for key, group in groupby(teams))):
        usable_combos.append(i)
However, enumerating all combinations of 16 out of 176 cyclists is simply too many computations for my computer to handle, even though the answer to this question implies otherwise. I have done some research and could not find any way to apply the aforementioned conditions without iterating through every option.
Is there a way to find the optimal team, either by making the above approach work or by using an alternative approach?
Edit:
In text, the full CSV-file can be found here
This is a classical operations research problem.
There are many algorithms that can find an optimal solution (or just a very good one, depending on the algorithm):
Mixed-Integer Programming
Metaheuristics
Constraint Programming
...
Here is code that finds the optimal solution using MIP, the ortools library and the default COIN-OR solver:
from ortools.linear_solver import pywraplp
import pandas as pd

solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
cyclist_df = pd.read_csv('cyclists.csv')

# Variables
variables_name = {}
variables_team = {}
for _, row in cyclist_df.iterrows():
    variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
    if row['Ploeg'] not in variables_team:
        variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))

# Constraints
# Link cyclist <-> team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, solver.infinity())
    constraint.SetCoefficient(var, 1)
    for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
        constraint.SetCoefficient(variables_name[cyclist], -1)

# Max 4 cyclists per team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, 4)
    constraint.SetCoefficient(var, 1)

# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
    constraint_max_cyclists.SetCoefficient(cyclist, 1)

# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
    constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])

# Objective
objective = solver.Objective()
objective.SetMaximization()
for _, row in cyclist_df.iterrows():
    objective.SetCoefficient(variables_name[row['Naam']], row['Punten totaal:'])

# Solve and retrieve solution
solver.Solve()
chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]
print(cyclist_df[cyclist_df.Naam.isin(chosen_cyclists)])
This prints:
Naam Ploeg Punten totaal: Waarde
1 SAGAN Peter BORA - hansgrohe 522 11.5
2 GROENEWEGEN Dylan Team Jumbo-Visma 205 11.0
8 VIVIANI Elia Deceuninck - Quick Step 273 9.5
11 ALAPHILIPPE Julian Deceuninck - Quick Step 399 9.0
14 PINOT Thibaut Groupama - FDJ 155 8.5
15 MATTHEWS Michael Team Sunweb 323 8.5
22 TRENTIN Matteo Mitchelton-Scott 218 7.5
24 COLBRELLI Sonny Bahrain Merida 238 6.5
25 VAN AVERMAET Greg CCC Team 192 6.5
44 STUYVEN Jasper Trek - Segafredo 201 4.5
51 CICCONE Giulio Trek - Segafredo 153 4.0
82 TEUNISSEN Mike Team Jumbo-Visma 255 3.0
83 HERRADA Jesús Cofidis, Solutions Crédits 255 3.0
104 NIZZOLO Giacomo Dimension Data 121 2.5
123 MEURISSE Xandro Wanty - Groupe Gobert 141 2.0
151 TRATNIK Jan Bahrain Merida 87 1.0
How does this code solve the problem? As @KyleParsons said, it looks like the knapsack problem and can be modeled using integer programming.
Let's define variables Xi (0 <= i <= nb_cyclists) and Yj (0 <= j <= nb_teams).
Xi = 1 if cyclist n°i is chosen, =0 otherwise
Yj = n where n is the number of cyclists chosen within team j
To define the relation between those variables, you can model these constraints:
# Link cyclist <-> team
For all j, Yj >= sum(Xi, for all i where cyclist i is part of team j)
To select at most 4 cyclists per team, you create these constraints:
# Max 4 cyclists per team
For all j, Yj <= 4
To select exactly 16 cyclists, here are the associated constraints:
# Min 16 cyclists
sum(Xi, 1<=i<=n_cyclists) >= 16
# Max 16 cyclists
sum(Xi, 1<=i<=n_cyclists) <= 16
The cost constraint:
# Max cost
sum(ci * Xi, 1<=i<=n_cyclists) <= 100
# where ci = cost of cyclist i
Then you can maximize:
# Objective
max sum(pi * Xi, 1<=i<=n_cyclists)
# where pi = nb_points of cyclist i
Notice that we model the problem with a linear objective and linear inequality constraints. If Xi and Yj were continuous variables, this problem would be polynomial (linear programming) and could be solved using:
Interior point methods (polynomial)
Simplex (not polynomial, but more effective in practice)
Because these variables are integers (Integer Programming or Mixed Integer Programming), the problem is NP-complete (no polynomial-time algorithm is known). Solvers like COIN-OR use sophisticated Branch & Bound or Branch & Cut methods to solve such problems efficiently. ortools provides a nice wrapper to use COIN from Python. These tools are free and open source.
All these methods have the advantage of finding an optimal solution without iterating over all possible solutions (which considerably reduces the combinatorics).
I am adding another answer for your follow-up question:
The CSV I posted was actually modified; my original one also contains, for each rider, a list with their score for each stage. This list looks like [0, 40, 13, 0, 2, 55, 1, 17, 0, 14]. I am trying to find the team that performs the best overall. So I have a pool of 16 cyclists, from which the scores of 10 cyclists count towards the score of each day. The scores for each day are then summed to get a total score. The purpose is to get this final total score as high as possible.
If you think I should edit my first post, please let me know; I think it is clearer like this because my first post is quite dense and already answers the initial question.
Let's introduce a new variable:
Zik = 1 if cyclist i is selected and is one of the top 10 in your team on day k
You need to add these constraints to link the variables Zik and Xi (Zik cannot be 1 if cyclist i is not selected, i.e. if Xi = 0):
For all i, sum(Zik, 1<=k<=n_days) <= n_days * Xi
And these constraints to select 10 cyclists per day:
For all k, sum(Zik, 1<=i<=n_cyclists) <= 10
Finally, your objective could be written like this:
Maximize sum(pik * Xi * Zik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k
And here is the thinking part. An objective written like this is not linear (notice the multiplication between the two variables X and Z). Fortunately, they are both binary and there is a trick to transform this formula into a linear one.
Let's again introduce new variables Lik (Lik = Xi * Zik) to linearize the objective.
The objective can now be written like this and is linear:
Maximize sum(pik * Lik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k
And we now need to add these constraints to make Lik equal to Xi * Zik:
For all i,k : Xi + Zik - 1 <= Lik
For all i,k : Lik <= 1/2 * (Xi + Zik)
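A quick way to convince yourself that these two constraints really pin Lik to the product Xi * Zik is to enumerate the four binary cases (this check is mine, not part of the original answer):
from itertools import product
for X, Z in product((0, 1), repeat=2):
    feasible = [L for L in (0, 1) if X + Z - 1 <= L and L <= 0.5 * (X + Z)]
    print(X, Z, feasible)   # always exactly [X * Z]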
And voilà. This is the beauty of mathematics: you can model a lot of things with linear equations. These are fairly advanced notions, so it is normal if you don't understand them at first glance.
I simulated the score-per-day column on this file.
Here is the Python code to solve the new problem:
import ast
from ortools.linear_solver import pywraplp
import pandas as pd

solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
cyclist_df = pd.read_csv('cyclists_day.csv')
cyclist_df['Punten_day'] = cyclist_df['Punten_day'].apply(ast.literal_eval)

# Variables
variables_name = {}
variables_team = {}
variables_name_per_day = {}
variables_linear = {}
for _, row in cyclist_df.iterrows():
    variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
    if row['Ploeg'] not in variables_team:
        variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))
    for k in range(10):
        variables_name_per_day[(row['Naam'], k)] = solver.IntVar(0, 1, 'z_{}_{}'.format(row['Naam'], k))
        variables_linear[(row['Naam'], k)] = solver.IntVar(0, 1, 'l_{}_{}'.format(row['Naam'], k))

# Link cyclist <-> team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, solver.infinity())
    constraint.SetCoefficient(var, 1)
    for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
        constraint.SetCoefficient(variables_name[cyclist], -1)

# Max 4 cyclists per team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, 4)
    constraint.SetCoefficient(var, 1)

# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
    constraint_max_cyclists.SetCoefficient(cyclist, 1)

# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
    constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])

# Link Zik and Xi
for name, cyclist in variables_name.items():
    constraint_link_cyclist_day = solver.Constraint(-solver.infinity(), 0)
    constraint_link_cyclist_day.SetCoefficient(cyclist, -10)
    for k in range(10):
        constraint_link_cyclist_day.SetCoefficient(variables_name_per_day[name, k], 1)

# Min/Max 10 cyclists per day
for k in range(10):
    constraint_cyclist_per_day = solver.Constraint(10, 10)
    for name in cyclist_df.Naam:
        constraint_cyclist_per_day.SetCoefficient(variables_name_per_day[name, k], 1)

# Linearization constraints
for name, cyclist in variables_name.items():
    for k in range(10):
        constraint_linearization1 = solver.Constraint(-solver.infinity(), 1)
        constraint_linearization2 = solver.Constraint(-solver.infinity(), 0)
        constraint_linearization1.SetCoefficient(cyclist, 1)
        constraint_linearization1.SetCoefficient(variables_name_per_day[name, k], 1)
        constraint_linearization1.SetCoefficient(variables_linear[name, k], -1)
        constraint_linearization2.SetCoefficient(cyclist, -1/2)
        constraint_linearization2.SetCoefficient(variables_name_per_day[name, k], -1/2)
        constraint_linearization2.SetCoefficient(variables_linear[name, k], 1)

# Objective
objective = solver.Objective()
objective.SetMaximization()
for _, row in cyclist_df.iterrows():
    for k in range(10):
        objective.SetCoefficient(variables_linear[row['Naam'], k], row['Punten_day'][k])

solver.Solve()

chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]
print('\n'.join(chosen_cyclists))
for k in range(10):
    print('\nDay {} :'.format(k + 1))
    chosen_cyclists_day = [name for (name, day), variable in variables_name_per_day.items()
                           if (day == k and variable.solution_value() > 0.5)]
    assert len(chosen_cyclists_day) == 10
    assert all(chosen_cyclists_day[i] in chosen_cyclists for i in range(10))
    print('\n'.join(chosen_cyclists_day))
Here are the results:
Your team:
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
BENOOT Tiesj
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro
GRELLIER Fabien
Selected cyclists per day
Day 1 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
Day 2 :
SAGAN Peter
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
NIZZOLO Giacomo
MEURISSE Xandro
Day 3 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
MATTHEWS Michael
TRENTIN Matteo
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
Day 4 :
SAGAN Peter
VIVIANI Elia
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
Day 5 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
CICCONE Giulio
HERRADA Jesús
Day 6 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
Day 7 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro
Day 8 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo
MEURISSE Xandro
Day 9 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
TEUNISSEN Mike
HERRADA Jesús
Day 10 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
PINOT Thibaut
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo
Let's compare the results of answer 1 and answer 2 using print(solver.Objective().Value()):
You get 3738.0 with the first model and 3129.087388325567 with the second one. The value is lower because you select only 10 cyclists per stage instead of 16.
Now, if we keep the first solution but score it with the new method, we get 3122.9477585307413.
We could consider the first model good enough: we didn't have to introduce new variables or constraints, the model stays simple, and we got a solution almost as good as the one from the more complex model. Sometimes it is not necessary to be 100% accurate, and a model can be solved more easily and quickly with some approximations.
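For reference, the new-method score of the first solution can be recomputed outside the solver by applying the best-10-per-day rule to the fixed 16-rider team; a sketch of my own, reusing chosen_cyclists and the parsed Punten_day column from the snippets above:
selected = cyclist_df[cyclist_df.Naam.isin(chosen_cyclists)]
day_scores = []
for k in range(10):
    points_k = sorted((row['Punten_day'][k] for _, row in selected.iterrows()), reverse=True)
    day_scores.append(sum(points_k[:10]))   # only the 10 best riders count each day
print(sum(day_scores))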

Python PuLP performance issue - taking too much time to solve

I am using PuLP to create an allocator function which packs items into trucks based on weight and volume. It works fine (takes 10-15 seconds) for 10-15 items, but when I double the number of items it takes more than half an hour to solve.
import pulp
from pulp import LpProblem, LpMinimize, LpInteger, lpSum

def allocator(item_mass, item_vol, truck_mass, truck_vol, truck_cost, id_series):
    n_items = len(item_vol)
    set_items = range(n_items)
    n_trucks = len(truck_cost)
    set_trucks = range(n_trucks)
    print("working1")
    y = pulp.LpVariable.dicts('truckUsed', set_trucks,
                              lowBound=0, upBound=1, cat=LpInteger)
    x = pulp.LpVariable.dicts('itemInTruck', (set_items, set_trucks),
                              lowBound=0, upBound=1, cat=LpInteger)
    print("working2")
    # Model formulation
    prob = LpProblem("Truck allocation problem", LpMinimize)
    # Objective
    prob += lpSum([truck_cost[i] * y[i] for i in set_trucks])
    print("working3")
    # Constraints
    for j in set_items:
        # Every item must be taken in one truck
        prob += lpSum([x[j][i] for i in set_trucks]) == 1
    for i in set_trucks:
        # Respect the mass constraint of trucks
        prob += lpSum([item_mass[j] * x[j][i] for j in set_items]) <= truck_mass[i] * y[i]
        # Respect the volume constraint of trucks
        prob += lpSum([item_vol[j] * x[j][i] for j in set_items]) <= truck_vol[i] * y[i]
    print("working4")
    # Ensure y variables have to be set to make use of x variables:
    for j in set_items:
        for i in set_trucks:
            # NOTE: this builds a constraint expression but never adds it to the
            # problem; it would need "prob +=" in front to have any effect
            x[j][i] <= y[i]
    print("working5")
    s = id_series  # id_series
    prob.solve()
    print("working6")
This is the data I am running it on.
items:
Name Pid Quantity Length Width Height Volume Weight t_type
0 A 1 1 4.60 4.30 4.3 85.05 1500 Open
1 B 2 1 4.60 4.30 4.3 85.05 1500 Open
2 C 3 1 6.00 5.60 9.0 302.40 10000 Container
3 D 4 1 8.75 5.60 6.6 441.00 1000 Open
4 E 5 1 6.00 5.16 6.6 204.33 3800 Open
5 C 6 1 6.00 5.60 9.0 302.40 10000 All
6 C 7 1 6.00 5.60 9.0 302.40 10000 Container
7 D 8 1 8.75 5.60 6.6 441.00 6000 Open
8 E 9 1 6.00 5.16 6.6 204.33 3800 Open
9 C 10 1 6.00 5.60 9.0 302.40 10000 All
.... times 5
trucks (this is just the top 5 rows; I have 54 types of trucks in total):
Category Name TruckID Length(ft) Breadth(ft) Height(ft) Volume \
0 LCV Tempo 407 0 9.5 5.5 5.5 287.375
1 LCV Tempo 407 1 9.5 5.5 5.5 287.375
2 LCV Tempo 407 2 9.5 5.5 5.5 287.375
3 LCV 13 Feet 3 13.0 5.5 7.0 500.500
4 LCV 14 Feet 4 14.0 6.0 6.0 504.000
Weight Price
0 1500 1
1 2000 1
2 2500 2
3 3500 3
4 4000 3
where ItemId is this:
data["ItemId"] = data.index + 1
id_series = data["ItemId"].tolist()
PuLP can use multiple solvers. See which ones you have available with:
pulp.pulpTestAll()
This will give a list like:
Solver pulp.solvers.PULP_CBC_CMD unavailable.
Solver pulp.solvers.CPLEX_DLL unavailable.
Solver pulp.solvers.CPLEX_CMD unavailable.
Solver pulp.solvers.CPLEX_PY unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.COIN_CMD passed.
Solver pulp.solvers.COINMP_DLL unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.GLPK_CMD passed.
Solver pulp.solvers.XPRESS unavailable.
Solver pulp.solvers.GUROBI unavailable.
Solver pulp.solvers.GUROBI_CMD unavailable.
Solver pulp.solvers.PYGLPK unavailable.
Solver pulp.solvers.YAPOSIB unavailable.
You can then solve using, e.g.:
lp_prob.solve(pulp.COIN_CMD())
Gurobi and CPLEX are commercial solvers that tend to work quite well. Perhaps you could access them? Gurobi has a good academic license.
Alternatively, you may wish to look into an approximate solution, depending on your quality constraints.
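If an approximate solution is acceptable, one concrete option is to give the bundled CBC solver a time limit and/or a relative optimality gap. The exact keyword names depend on your PuLP version (timeLimit/gapRel in recent releases, maxSeconds/fracGap in older ones), so check help(pulp.PULP_CBC_CMD); for example:
# stop after 60 seconds or once the solution is within 2% of the optimum
prob.solve(pulp.PULP_CBC_CMD(timeLimit=60, gapRel=0.02))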

Python computing error

I'm using the mpmath library to compute the following sequence.
Let us consider the sequence u0, u1, u2, ... defined by:
u0 = 3/2 = 1.5
u1 = 5/3 = 1.6666666…
u(n+1) = 2003 - 6002/u(n) + 4000/(u(n)*u(n-1))
The sequence converges to 2, but because of rounding problems it seems to converge to 2000.
n    Calculated value   Rounded-off exact value
2    1.800001           1.800000000
3    1.890000           1.888888889
4    3.116924           1.941176471
5    756.3870306        1.969696970
6    1996.761549        1.984615385
7    1999.996781        1.992248062
8    1999.999997        1.996108949
9    2000.000000        1.998050682
10   2000.000000        1.999024390
My code:
from mpmath import *
mp.dps = 50
u0 = mpf(3/2.0)
u1 = mpf(5/3.0)
u = []
u.append(u0)
u.append(u1)
for i in range(2, 11):
    un1 = (2003 - 6002/u[i-1] + (mpf(4000)/mpf((u[i-1]*u[i-2]))))
    u.append(un1)
print u
My bad results:
[mpf('1.5'),
mpf('1.6666666666666667406815349750104360282421112060546875'),
mpf('1.8000000000000888711326751945268011597589466120961647'),
mpf('1.8888888889876302386905492787148253684796100079942617'),
mpf('1.9411765751351638992775070422559330255517747908588059'),
mpf('1.9698046831709839591526211645628191427874374792786951'),
mpf('2.093979191783975876606205176530675127058752077926479'),
mpf('106.44733511712489354422046139349654833300787666477228'),
mpf('1964.5606972399290690749220686397494349501387742896911'),
mpf('1999.9639916238009625032390578545797067344576357100626'),
mpf('1999.9999640260895343960004614025893194430187653900418')]
I tried using some other functions (fdiv, …) and changing the precision: same bad result.
What's wrong with this code?
Question:
How do I change my code to find the value 2.0 with the formula
u(n+1) = 2003 - 6002/u(n) + 4000/(u(n)*u(n-1))?
Thanks
Using the decimal module, you can see that the sequence also has a solution converging to 2000:
from decimal import Decimal, getcontext
getcontext().prec = 100
u0 = Decimal(3) / Decimal(2)
u1 = Decimal(5) / Decimal(3)
u = [u0, u1]
for i in range(100):
    un1 = 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
    u.append(un1)
    print un1
The recurrence relation has multiple fixed points (one at 2 and the other at 2000):
>>> u = [Decimal(2), Decimal(2)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
Decimal('2')
>>> u = [Decimal(2000), Decimal(2000)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
Decimal('2000.000')
The solution at 2 is an unstable fixed-point. The attractive fixed-point is at 2000.
The convergence gets very close to two and when the round-off causes the value to slightly exceed two, that difference gets amplified again and again until hitting 2000.
Your (non-linear) recurrence sequence has three fixed points: 1, 2 and 2000. The values 1 and 2 are close to each other compared to 2000, which is usually an indication of unstable fixed points because they are "almost" double roots.
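To see where those fixed points come from: a fixed point u must satisfy u = 2003 - 6002/u + 4000/u^2, i.e. u^3 - 2003u^2 + 6002u - 4000 = 0, which factors as (u - 1)(u - 2)(u - 2000) = 0. A quick numerical check (my addition):
import numpy as np
print(np.roots([1, -2003, 6002, -4000]))   # approximately [2000., 2., 1.]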
You need to do some maths in order to diverge less early. Let v(n) be an auxiliary sequence:
v(n) = (1+2^n)*u(n)
The following then holds:
v(n+1) = (1+2^(n+1)) * (2003*v(n)*v(n-1) - 6002*(1+2^n)*v(n-1) + 4000*(1+2^n)*(1+2^(n-1))) / (v(n)*v(n-1))
You can then simply compute v(n) and deduce u(n) from u(n) = v(n)/(1+2^n):
#!/usr/bin/env python
from mpmath import *
mp.dps = 50
v0 = mpf(3)
v1 = mpf(5)
v = []
v.append(v0)
v.append(v1)
u = []
u.append(v[0]/2)
u.append(v[1]/3)
for i in range(2, 25):
    vn1 = (1+2**i) * (2003*v[i-1]*v[i-2] \
                      - 6002*(1+2**(i-1))*v[i-2] \
                      + 4000*(1+2**(i-1))*(1+2**(i-2))) \
                   / (v[i-1]*v[i-2])
    v.append(vn1)
    u.append(vn1/(1+2**i))
print u
And the result:
[mpf('1.5'),
mpf('1.6666666666666666666666666666666666666666666666666676'),
mpf('1.8000000000000000000000000000000000000000000000000005'),
mpf('1.8888888888888888888888888888888888888888888888888892'),
mpf('1.9411764705882352941176470588235294117647058823529413'),
mpf('1.969696969696969696969696969696969696969696969696969'),
mpf('1.9846153846153846153846153846153846153846153846153847'),
mpf('1.992248062015503875968992248062015503875968992248062'),
mpf('1.9961089494163424124513618677042801556420233463035019'),
mpf('1.9980506822612085769980506822612085769980506822612089'),
mpf('1.9990243902439024390243902439024390243902439024390251'),
mpf('1.9995119570522205954123962908735968765251342118106393'),
mpf('1.99975591896509641200878691725652916768367097876495'),
mpf('1.9998779445868424264616135725619431221774685707311133'),
mpf('1.9999389685688129386634116570033567287152883735123589'),
mpf('1.9999694833531691537733833806341359211449845890933504'),
mpf('1.9999847414437645909944001098616048949448403192089965'),
mpf('1.9999923706636759668276456631037666033431751771913355'),
...
Note that this will still diverge eventually. In order to really converge, you need to compute v(n) with arbitrary precision. But this is now a lot easier since all the values are integers.
You calculate your initial values to 53 bits of precision and then assign that rounded value to the high-precision mpf variable. You should use u0=mpf(3)/mpf(2) and u1=mpf(5)/mpf(3). You'll stay close to 2 for a few more iterations, but you'll still end up converging to 2000. This is due to rounding error. One alternative is to compute with fractions. I used gmpy and the following code converges to 2.
from __future__ import print_function
import gmpy
u = [gmpy.mpq(3,2), gmpy.mpq(5,3)]
for i in range(2, 300):
    temp = (2003 - 6002/u[-1] + 4000/(u[-1]*u[-2]))
    u.append(temp)
for i in u:
    print(gmpy.mpf(i, 300))
If you compute with infinite precision then you get 2 otherwise you get 2000:
import itertools
from fractions import Fraction

def series(u0=Fraction(3, 2), u1=Fraction(5, 3)):
    yield u0
    yield u1
    while u0 != u1:
        un = 2003 - 6002/u1 + 4000/(u1*u0)
        yield un
        u1, u0 = un, u1

for i, u in enumerate(itertools.islice(series(), 100)):
    err = (2-u)/2  # relative error
    print("%d\t%.2g" % (i, err))
Output
0 0.25
1 0.17
2 0.1
3 0.056
4 0.029
5 0.015
6 0.0077
7 0.0039
8 0.0019
9 0.00097
10 0.00049
11 0.00024
12 0.00012
13 6.1e-05
14 3.1e-05
15 1.5e-05
16 7.6e-06
17 3.8e-06
18 1.9e-06
19 9.5e-07
20 4.8e-07
21 2.4e-07
22 1.2e-07
23 6e-08
24 3e-08
25 1.5e-08
26 7.5e-09
27 3.7e-09
28 1.9e-09
29 9.3e-10
30 4.7e-10
31 2.3e-10
32 1.2e-10
33 5.8e-11
34 2.9e-11
35 1.5e-11
36 7.3e-12
37 3.6e-12
38 1.8e-12
39 9.1e-13
40 4.5e-13
41 2.3e-13
42 1.1e-13
43 5.7e-14
44 2.8e-14
45 1.4e-14
46 7.1e-15
47 3.6e-15
48 1.8e-15
49 8.9e-16
50 4.4e-16
51 2.2e-16
52 1.1e-16
53 5.6e-17
54 2.8e-17
55 1.4e-17
56 6.9e-18
57 3.5e-18
58 1.7e-18
59 8.7e-19
60 4.3e-19
61 2.2e-19
62 1.1e-19
63 5.4e-20
64 2.7e-20
65 1.4e-20
66 6.8e-21
67 3.4e-21
68 1.7e-21
69 8.5e-22
70 4.2e-22
71 2.1e-22
72 1.1e-22
73 5.3e-23
74 2.6e-23
75 1.3e-23
76 6.6e-24
77 3.3e-24
78 1.7e-24
79 8.3e-25
80 4.1e-25
81 2.1e-25
82 1e-25
83 5.2e-26
84 2.6e-26
85 1.3e-26
86 6.5e-27
87 3.2e-27
88 1.6e-27
89 8.1e-28
90 4e-28
91 2e-28
92 1e-28
93 5e-29
94 2.5e-29
95 1.3e-29
96 6.3e-30
97 3.2e-30
98 1.6e-30
99 7.9e-31
Well, as casevh said, I just applied the mpf function to the initial terms in my code:
u0=mpf(3)/mpf(2)
u1=mpf(5)/mpf(3)
and the values stay at the correct value 2.0 for 16 steps before diverging again (see below).
So, even with a good Python library for arbitrary-precision floating-point arithmetic and only basic operations, the result can become totally wrong; it is not an algorithmic, mathematical or recurrence problem, as I have sometimes read.
So it is necessary to remain watchful and critical! (I'm now a bit wary of the mpmath.lerchphi(z, s, a) function ;-)
2  1.8000000000000000000000000000000000000000000000022
3  1.8888888888888888888888888888888888888888888913205
4  1.9411764705882352941176470588235294117647084569125
5  1.9696969696969696969696969696969696969723495083846
6  1.9846153846153846153846153846153846180779422496889
7  1.992248062015503875968992248062018218070968279944
8  1.9961089494163424124513618677070049064461141667961
9  1.998050682261208576998050684991268132991329645551
10 1.9990243902439024390243929766241359876402781522945
11 1.9995119570522205954151303455889283862002420414092
12 1.9997559189650964147435086295745928366095548127257
13 1.9998779445868451615169464386495752584786229236677
14 1.9999389685715481608370784691478769380770569091713
15 1.9999694860884747554701272066241108169217231319376
16 1.9999874767910784720428384947047783821702386000249
17 2.0027277350948824117795762659330557916802871427763
18 4.7316350177463946015607576536159982430500337286276
19 1156.6278675611076227796014310764287933259776352198
20 1998.5416721291457644804673979070312813731252347786
21 1999.998540608689366669273522363692463645090555294
22 1999.9999985406079725746311606572627439743947878652
The exact solution to your recurrence relation (with initial values u_0 = 3/2, u_1 = 5/3) is easily verified to be
u_n = (2^(n+1) + 1) / (2^n + 1). (*)
The problem you're seeing is that although the solution is such that
lim_{n -> oo} u_n = 2,
this limit is a repelling fixed point of your recurrence relation. That is, any departure from the correct values of u_{n-1}, u_{n-2}, for some n, will result in further values diverging from the correct limit. Consequently, unless your implementation of the recurrence relation represents every u_n value exactly, it can be expected to eventually diverge from the correct limit, converging instead to the incorrect value of 2000, which just happens to be the only attracting fixed point of your recurrence relation.
(*) In fact, u_n = (2^(n+1) + 1) / (2^n + 1) is the solution to any recurrence relation of the form
u_n = C + (7 - 3C)/u_{n-1} + (2C - 6)/(u_{n-1} u_{n-2})
with the same initial values as given above, where C is an arbitrary constant. If I haven't made a mistake finding the roots of the characteristic polynomial, this will have the set of fixed points {1, 2, C - 3}\{0}. The limit 2 can be either a repelling fixed point or an attracting fixed point, depending on the value of C. E.g., for C = 2003 the set of fixed points is {1, 2, 2000} with 2 being a repellor, whereas for C = 3 the fixed points are {1, 2} with 2 being an attractor.
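A quick sanity check of the closed form (my addition), using exact rational arithmetic so no rounding can creep in:
from fractions import Fraction

def u_closed(n):
    # closed form claimed above: u_n = (2^(n+1) + 1) / (2^n + 1)
    return Fraction(2**(n + 1) + 1, 2**n + 1)

u_prev, u_curr = Fraction(3, 2), Fraction(5, 3)   # u_0, u_1
for n in range(2, 30):
    u_prev, u_curr = u_curr, 2003 - 6002/u_curr + 4000/(u_curr*u_prev)
    assert u_curr == u_closed(n)
print("closed form matches the exact recurrence up to n = 29")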
