My objective is:
to find which teams have the easiest game for each game week
to select at most 2 teams in the solution (permuting the score between those 2 teams for each game week)
My data set looks like this:
event (1) being the first game week, event (2) the second game week, etc.
I am currently able to select the best game for each fixture using:
for event_id in np.unique(events):
    model += sum(decisions[i] for i in range(event) if events[i] == event_id) == 1  # pick one fixture for each game week
But I do not know how to build a constraint asking the model to pick only 2 teams for the season while handling the permutation. I have tried a few things with no success.
My LP function is as follows:
import numpy as np
import pulp

def fixtures_analyser(team, events, expected_scores):
    event = len(events)
    objFunction = pulp.LpMaximize
    model = pulp.LpProblem("Constrained value maximisation", objFunction)

    decisions = [
        pulp.LpVariable("x{}".format(i), lowBound=0, upBound=1, cat='Integer')
        for i in range(event)
    ]

    # objective function:
    model += sum((decisions[i]) * (float(expected_scores[i]))
                 for i in range(event)), "Objective"

    # event constraint
    for event_id in np.unique(events):
        model += sum(decisions[i] for i in range(event) if events[i] == event_id) == 1  # pick one fixture for each game week

    model.solve()
    print("Total expected score = {}".format(model.objective.value()))
    return decisions
Output is currently like this:
Expected outcome would be to see only 2 teams, e.g. Liverpool and Man City, and not Chelsea etc.
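To make the question concrete, what I imagine is needed is one extra binary variable per team plus a cardinality constraint, roughly like the sketch below (assuming the team argument of my function gives the team playing fixture i; this is only a sketch of the idea, not working code from my script):
team_used = {
    t: pulp.LpVariable("team_{}".format(t), cat='Binary')
    for t in np.unique(team)
}
for i in range(event):
    # a fixture can only be selected if its team is one of the chosen teams
    model += decisions[i] <= team_used[team[i]]
# allow at most 2 teams over the whole season
model += sum(team_used.values()) <= 2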
I have been trying to build this simulation using SimPy, but I just can't figure out how it works. If you have any tips on how to learn it from example code (starting at the bottom and working up through the functions, or the other way around?), or any good sources, that would already be of great help.
What I want to simulate:
A bike rental service with S rental stations and T bikes at t=0.
Customer arrivals and rental times are exponentially distributed. When a bike is rented, there is a given probability of returning it to each of the rental stations. For example, with S=2, the probabilities are [[0.9,0.1],[0.5,0.5]].
I tried to do it without SimPy, but I don't know how to keep track of the number of bikes at the stations and handle new arrivals while rentals are in progress.
Any help is more than welcome as I am starting to get kind of desperate.
Thank you!
Here is one way to do it
"""
Simple simulation of several bike rental stations
Stations are modeled with containers so bikes can be returned
to a station different from where it was rented from
programer: Michael R. Gibbs
"""
import simpy
import random
# scenario attributes
station_names = ['A','B']
rent_probs = [.9,.1]
return_probs = [.5,.5]
bikes_per_station = 5
def rent_proc(env, id, station_names, rent_probs, return_probs, station_map):
    """
    Models the process of:
        selecting a station
        renting a bike
        using a bike
        returning a bike (can be a different station)
    """
    # select a station
    name = random.choices(station_names, weights=rent_probs)
    name = name[0]
    station = station_map[name]
    print(f'{env.now}: id:{id} has arrived at station {name} q-len:{len(station.get_queue)} and {station.level} bikes')

    # get a bike
    yield station.get(1)
    print(f'{env.now}: id:{id} has rented bike at station {name} q-len:{len(station.get_queue)} and {station.level} bikes')

    # use bike
    yield env.timeout(random.triangular(1,5,3))

    # return bike
    name = random.choices(station_names, weights=return_probs)
    name = name[0]
    station = station_map[name]
    yield station.put(1)
    print(f'{env.now}: id:{id} has returned bike at station {name} q-len:{len(station.get_queue)} and {station.level} bikes')
def gen_arrivals(env, station_names, rent_probs, return_probs, station_map):
    """
    Generates arrivals to the rental stations
    """
    cnt = 0
    while True:
        yield env.timeout(random.expovariate(2.5))
        cnt += 1
        env.process(rent_proc(env, cnt, station_names, rent_probs, return_probs, station_map))
# set up
env = simpy.Environment()
# create stations based on the name list
cap = len(station_names) * bikes_per_station
station_map = {
    name: simpy.Container(env, init=bikes_per_station, capacity=cap)
    for name in station_names
}

# start generating arrivals
env.process(gen_arrivals(env, station_names, rent_probs, return_probs, station_map))
# start sim
env.run(100)
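Note that rent_probs and return_probs above are single weight lists shared by all stations. If the return probabilities depend on the station the bike was rented from, as in the [[0.9,0.1],[0.5,0.5]] example, one way is to key the return weights by the rental station, along the lines of this sketch (the dict values are assumptions matching that example):
# per-origin return probabilities, keyed by the station the bike was rented from
return_probs_by_origin = {
    'A': [0.9, 0.1],  # rented at A -> returned to A with prob 0.9, to B with 0.1
    'B': [0.5, 0.5],  # rented at B -> returned to A or B with equal probability
}

def pick_return_station(rented_at):
    # draw the return station using the weights of the origin station
    return random.choices(station_names, weights=return_probs_by_origin[rented_at])[0]
Inside rent_proc, the return-station draw would then become name = pick_return_station(name) (at that point name still holds the station where the bike was rented) instead of drawing from the flat return_probs list.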
I'm working on a naive multinomial Bayes classifier for articles in Pandas and have run into a bit of an issue with performance. My repo is here if you want the full code and the dataset I'm using: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb
Here's my current setup with two dataframes: df for the articles with lists of tokenized words and word_freq to store precomputed frequency and P(word | category) values.
for category in df['category'].unique():
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]  # The number of categorized articles
    p_cat = cat_articles / df.shape[0]  # P(Cat) = # of articles per category / # of articles
    df[category] = df['content'].apply(lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat
Example data:
df
category content
0 QUEER VOICES [online, dating, thoughts, first, date, grew, ...
1 COLLEGE [wishes, class, believe, generation, better, j...
2 RELIGION [six, inspiring, architectural, projects, revi...
3 WELLNESS [ultramarathon, runner, micah, true, died, hea...
4 ENTERTAINMENT [miley, cyrus, ball, debuts, album, art, cyrus...
word_freq
category word freq p_given_cat
46883 MEDIA seat 1.0 0.333333
14187 CRIME ends 1.0 0.333333
81317 WORLD NEWS seat 1.0 0.333333
12463 COMEDY living 1.0 0.200000
20868 EDUCATION director 1.0 0.500000
Please note that the word_freq table is the cross product of categories x words, so each (category, word) pair appears exactly once, although every word is repeated once per category. Also, the freq column has been increased by 1 to avoid zero values (Laplace smoothing).
After running the above, I do this to find the max category P (each category's P is stored in a column named after it) and get the following:
df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop = True)
category content \
0 POLITICS [bernie, sanders, campaign, split, whether, fi...
1 COMEDY [bill, maher, compares, police, unions, cathol...
2 WELLNESS [busiest, people, earth, find, time, relax, th...
3 ENTERTAINMENT [lamar, odom, gets, standing, ovation, first, ...
4 GREEN [lead, longer, life, go, gut]
predicted_category
0 ARTS
1 ARTS
2 ARTS
3 TASTE
4 GREEN
This method seems to work well, but it is unfortunately really slow. I am using a large dataset of 200,000 articles with short descriptions, and operating on only 1% of it takes almost a minute. I know it's because I am looping over the categories instead of relying on vectorization, but I am very, very new to Pandas, and how to formulate this succinctly with a groupby escapes me (especially with the two data tables, which also might be unnecessary), so I'm looking for suggestions here.
Thanks!
Just in case someone happens to come across this later...
Instead of representing my categories x words table as a cross product of every possible word in every category, which inflated to over 3 million rows in my data set, I reduced it to only the necessary words per category and provided a default value for the ones that did not exist, which ended up being about 600k rows.
But the biggest speedup came from changing to the following:
import numpy as np

for category in df['category'].unique():
    # Calculate P(Category)
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]
    p_cat = cat_articles / df.shape[0]

    # Create a word -> P(word | category) dictionary for quick lookups
    category_dict = category_filter.set_index('word').to_dict()['p_given_cat']

    # For every article, take the product of the P(word | category) values of its words,
    # then multiply by P(category) to get the Bayes score.
    df[category] = df['content'].apply(lambda x: np.prod([category_dict.get(y, 0.001 / (cat_articles + 0.001)) for y in x])) * p_cat
I created a dictionary from the two columns, with word as the key and P(word | category) as the value. This reduced the problem to a quick dictionary lookup for each element of each list and computing the product.
This ended up being about 100x faster, parsing the whole dataset in ~40 seconds.
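As an aside, since the original question mentioned groupby: the per-category lookup dictionaries could also be built in a single pass over word_freq, something like this sketch (not what I benchmarked above):
# build all word -> P(word | category) lookups at once
lookup_by_cat = {
    cat: grp.set_index('word')['p_given_cat'].to_dict()
    for cat, grp in word_freq.groupby('category')
}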
I am new to BeautifulSoup and I wanted to try out some web scraping. For my little project, I wanted to get the Golden State Warriors' win rate from Wikipedia. I was planning to get the table that has that information and turn it into a pandas DataFrame so I could graph it over the years. However, my code selects the Table Key table instead of the Seasons table. I know this is because they are the same type of table (wikitable), but I don't know how to solve this problem. I am sure there is an easy explanation that I am missing. Can someone please explain how to fix my code, and how I could choose which tables to scrape in the future? Thanks!
import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem: find() only returns the first wikitable
c_year = []
c_rate = []
for row in c_table.findAll('tr'):  # setup for dataframe
    cells = row.findAll('td')
    if len(cells) == 13:
        c_year.append(cells[0].find(text=True))
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)
Use pd.read_html to get all the tables
This function returns a list of dataframes
tables[0] through tables[17], in this case
import pandas as pd
# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')
print(len(tables))
>>> 18
tables[0]
0 1
0 AHC NBA All-Star Game Head Coach
1 AMVP All-Star Game Most Valuable Player
2 COY Coach of the Year
3 DPOY Defensive Player of the Year
4 Finish Final position in division standings
5 GB Games behind first-place team in division[b]
6 Italics Season in progress
7 Losses Number of regular season losses
8 EOY Executive of the Year
9 FMVP Finals Most Valuable Player
10 MVP Most Valuable Player
11 ROY Rookie of the Year
12 SIX Sixth Man of the Year
13 SPOR Sportsmanship Award
14 Wins Number of regular season wins
# display all dataframes in tables
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
Select specific table
df_i_want = tables[x] # x is the specified table, 0 indexed
# delete tables
del(tables)
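If you know a piece of text that occurs only in the table you want, you can also filter at read time with the match parameter of pd.read_html; 'Conference' below is an assumption about a header that appears in the Seasons table but not in the key table:
# keep only the tables whose text matches the given string/regex
seasons_tables = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons',
    match='Conference'  # assumed to appear only in the Seasons table
)
df_seasons = seasons_tables[0]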
I am trying to solve a MIP problem: finding the number of exams to be done by each tech on each date of a week, while minimizing the total number of techs used.
I have the demand, the time taken by each tech, the list of techs, etc. in separate dataframes.
Initially, I was using a cost function that minimized the total time used to finish the demand, which #kabdulla helped me solve (link here).
Now, with the new cost function, the script gets stuck and doesn't seem to converge, and I am not able to identify the reason.
Below is my code so far:
# Instantiate problem class
model = pulp.LpProblem("Time minimizing problem", pulp.LpMinimize)
capacity = pulp.LpVariable.dicts("capacity",
                                 ((examdate, techname, region) for examdate, techname, region in tech_data_new.index),
                                 lowBound=0,
                                 cat='Integer')

tech_used = pulp.LpVariable.dicts("techs",
                                  ((examdate, techname) for examdate, techname, region in tech_data_new.index.unique()),
                                  cat='Binary')

model += pulp.lpSum(tech_used[examdate, techname] for examdate, techname in date_techname_index.index.unique())

for date in demand_data.index.get_level_values('Exam Date').unique():
    for i in demand_data.loc[date].index.tolist():
        model += pulp.lpSum([capacity[examdate, techname, region] for examdate, techname, region in tech_data_new.index if (date == examdate and i == region)]) == demand_data.loc[(demand_data.index.get_level_values('Exam Date') == date) & (demand_data.index.get_level_values('Body Region') == i), shiftname].item()

for examdate, techname, region in tech_data_new.index:
    model += (capacity[examdate, techname, region]) <= tech_data_new.loc[(examdate, techname, region), 'Max Capacity'] * tech_used[examdate, techname]

# Number of techs used in a day should be less than 8
for examdate in tech_data_new.index.get_level_values('Exam Date').unique():
    model += pulp.lpSum(tech_used[examdate, techname] for techname in tech_data_new.index.get_level_values('Technologist Name').unique()) <= 8

# Max time each tech should work in a day should be less than 8 hours (28800 secs)
for date in tech_data_new.index.get_level_values('Exam Date').unique():
    for name in tech_data_new.loc[date].index.get_level_values('Technologist Name').unique():
        model += pulp.lpSum(capacity[examdate, techname, region] * tech_data_new.loc[(examdate, techname, region), 'Time taken'] for examdate, techname, region in tech_data_new.index if (date == examdate and name == techname)) <= 28800
The last constraint seems to be the problem: if I remove it, the model converges. However, I am not able to understand why.
Please let me know what I am missing in my understanding. Thanks.
I've set up a simulation example below.
Setup:
I have weekly data, say 6 years of data, with around 1000 stocks each week (some weeks more, some weeks fewer than 1000). I randomly choose 75 stocks at time t0. At t1 some stocks die (with probability p, they go out of fashion) or leave the index (structural reasons such as a merger). I need to simulate stocks so that every week I have exactly 75 stocks. Every week some stocks die (between 0 and 75) and I pick new ones, not from the existing 75. I also check whether a stock leaves due to structural reasons. Every week I calculate the returns of the 75 stocks.
Questions: Is there an obvious way to improve the speed? I started with Pandas objects (group, sort), which was too slow. I haven't tried to parallelize the loop. I'm more interested to hear whether I should use numba (but it doesn't have the np.in1d function) or whether there is a faster way to shuffle (I actually only need to shuffle the ones; see the sketch after the code below for the kind of replacement I mean). I've also thought about creating a fixed array with all stock ids using NaN, but the problem there is that I need 75 names, so I would still need to filter out these NaNs every week.
Maybe this is too detailed a problem for this forum; I apologize if that's the case.
Code:
from timeit import default_timer
import numpy as np
# Create dataset
n_weeks = 312 # Approximately 6 years of weekly data
n_stocks = np.random.normal(1000, 5, n_weeks).astype(dtype=np.uint16) # Around 1000 stocks every week but not fixed
idx_new_week = np.cumsum(np.hstack((0, n_stocks)))
# We give each stock a stock id
n_obs = n_stocks.sum()
stock_id = np.ones([n_obs], dtype=np.uint16)
for j in range(1, n_weeks+1):
    stock_id[idx_new_week[j-1]:idx_new_week[j]] = np.cumsum(np.ones(n_stocks[j-1]))
stock_rtn = np.random.normal(0, 0.25/np.sqrt(52), n_obs) # Simulated forward (one week ahead) return for each stock
# Simulation part
# Week 0: pick 75 stocks at random
# Week n >= 1: a stock dies for two reasons
#   1) randomness (probability 'p')
#   2) structural event (could be a merger, falling out of the index).
#      We cannot assume that it is always the high stock id which dies for structural reasons (as it looks like here)
# If a stock dies we randomly pick a new stock from the week's stock dataset (not including the ones which died this week)
n_sim = 100 # I want this to be 1 mill
n_stock_cand = 75 # For this example we pick 75 stocks
p_survial = 0.90
# The weekly periodic returns
pf_rtn = np.zeros([n_weeks, n_sim])
start = default_timer()
for k in range(0, n_sim):
    # Randomly choose n_stock_cand stocks at time zero
    boolean_list = np.array([False] * (n_stocks[0] - n_stock_cand) + [True] * n_stock_cand)
    np.random.shuffle(boolean_list)  # Shuffle the list
    stock_id_this_week = stock_id[idx_new_week[0]:idx_new_week[1]][boolean_list]
    stock_rtn_this_week = stock_rtn[idx_new_week[0]:idx_new_week[1]][boolean_list]

    # This part only simulates the Buzz portfolio names - later we simulate returns from the specific holdings of the 75 names
    for j in range(1, n_weeks):
        pf_rtn[j-1, k] = stock_rtn_this_week.mean()

        # Find the number of stocks to keep
        boolean_keep_stocks = np.random.rand(n_stock_cand) < p_survial

        # Next we need to check if a stock is still part of the universe next period
        stock_cand_temp = stock_id[idx_new_week[j-1]:idx_new_week[j]]
        stock_rtn_temp = stock_rtn[idx_new_week[j-1]:idx_new_week[j]]
        boolean_keep_stocks = (boolean_keep_stocks) & (np.in1d(stock_id_this_week, stock_cand_temp, assume_unique=True))

        n_stocks_to_replace = n_stock_cand - boolean_keep_stocks.sum()  # Number of new stocks to pick this week
        if n_stocks_to_replace > 0:
            # We have to pick from stocks which are not already part of the portfolio
            boolean_cand = np.in1d(stock_cand_temp, stock_id_this_week, assume_unique=True, invert=True)
            n_stocks_to_pick_from = boolean_cand.sum()
            boolean_list = np.array([False] * (n_stocks_to_pick_from - n_stocks_to_replace) + [True] * n_stocks_to_replace)
            np.random.shuffle(boolean_list)  # Shuffle the list

            # First avoid picking the same stock twice, then pick from the unique candidate list
            stock_id_new = stock_cand_temp[boolean_cand][boolean_list]  # The new stocks
            stock_rtn_new = stock_rtn_temp[boolean_cand][boolean_list]  # and their returns

            stock_id_this_week = np.hstack((stock_id_this_week[boolean_keep_stocks], stock_id_new))
            stock_rtn_this_week = np.hstack((stock_rtn_this_week[boolean_keep_stocks], stock_rtn_new))
        else:
            # No replacement of stocks / all survive but order might differ
            boolean_cand = np.in1d(stock_cand_temp, stock_id_this_week, assume_unique=True, invert=False)
            stock_id_this_week = stock_cand_temp[boolean_cand]
            stock_rtn_this_week = stock_rtn_temp[boolean_cand]

    # PnL last period
    pf_rtn[n_weeks-1, k] = stock_rtn_this_week.mean()
print(default_timer() - start)
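For reference, the shuffle replacement I have in mind (shown here for the week-0 pick; the weekly replacement picks would be analogous) is something like this sketch, drawing index positions with np.random.choice instead of shuffling a boolean mask:
# sketch: pick n_stock_cand distinct row positions directly
idx_pick = np.random.choice(int(n_stocks[0]), size=n_stock_cand, replace=False)
stock_id_this_week = stock_id[idx_new_week[0]:idx_new_week[1]][idx_pick]
stock_rtn_this_week = stock_rtn[idx_new_week[0]:idx_new_week[1]][idx_pick]
I don't know whether this is actually faster than shuffling the boolean list, which is part of what I'm asking.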