Kernel keeps dying in Jupyter notebook with pulp solver - python

I've created an LP solver in a Jupyter notebook that is giving me some issues. Specifically, when I run the last line of code in the script below, I get the error message "The kernel appears to have died. It will restart automatically."
Edit: the final dataframe, dfs_proj, is a 240-row, 5-column dataframe.
import pandas as pd
from pulp import *
from pulp import LpMaximize

dfs_proj = pd.read_csv("4for4_dfs_projections_120321.csv")
dfs_proj['count'] = 1
cols = ['Player', 'Pos', 'FFPts', 'DK ($)', 'count']
dfs_proj = dfs_proj[cols]
dfs_proj = dfs_proj[(dfs_proj['DK ($)'] >= 4000) | (dfs_proj['Pos'] == "DEF") | (dfs_proj['Pos'] == "TE")]
player_dict = dict(zip(dfs_proj['Player'], dfs_proj['count']))

# create a helper function to return the number of players assigned each position
def get_position_sum(player_vars, df, position):
    return pulp.lpSum([player_vars[i] * (position in df['Pos'].iloc[i]) for i in range(len(df))])

def get_optimals(site, data, num_lineups, optimize_on='FFPts'):
    """
    Generates x number of optimal lineups, based on the column to
    designate as the one to optimize on.

    :param str site: DK or FD. Used for salary constraints
    :param pd.DataFrame data: Pandas dataframe containing projections.
    :param int num_lineups: Number of lineups to generate.
    :param str optimize_on: Name of column in dataframe to use when optimizing
    """
    #global lineups
    lineups = []
    player_dict = dict(zip(data['Player'], data['count']))
    for i in range(1, num_lineups+1):
        prob = pulp.LpProblem('DK_NFL_weekly', pulp.const.LpMaximize)
        player_vars = []
        for row in data.itertuples():
            var = pulp.LpVariable(f'{row.Player}', cat='Binary')
            player_vars.append((row.Player, var))
        # total assigned players constraint
        prob += pulp.lpSum(player_var for player_var in player_vars) == 9
        # total salary constraint
        prob += pulp.lpSum(data['DK ($)'].iloc[i] * player_vars[i][1] for i in range(len(data))) <= 50000
        # for QB and DST, require 1 of each in the lineup
        prob += get_position_sum(player_vars, df, 'QB') == 1
        prob += get_position_sum(player_vars, df, 'DEF') == 1
        # to account for the FLEX position, we allow additional selections of the 3 FLEX-eligible positions: RB, WR, TE
        prob += get_position_sum(player_vars, df, 'RB') >= 2
        prob += get_position_sum(player_vars, df, 'WR') >= 3
        prob += get_position_sum(player_vars, df, 'TE') >= 1
        if i > 1:
            if optimize_on == 'Optimal Frequency':
                prob += pulp.lpSum([data['FFPts'].iloc[i] * player_vars[i][1] for i in range(len(data))]) <= (optimal - 0.001)
            else:
                prob += pulp.lpSum([data['FFPts'].iloc[i] * player_vars[i][1] for i in range(len(data))]) <= (optimal - 0.01)
        prob += pulp.lpSum([data['FFPts'].iloc[i] * player_vars[i][1] for i in range(len(data))])
        # solve and print the status
        prob.solve(PULP_CBC_CMD(msg=False))
        optimal = prob.objective.value()
        count = 1
        lineup = {}
        for i in range(len(data)):
            if player_vars[i][1].value() == 1:
                row = data.iloc[i]
                lineup[f'G{count}'] = row['Player']
                count += 1
        lineup['Total Points'] = optimal
        lineups.append(lineup)
        players = list(lineup.values())
        for i in range(0, len(players)):
            if type(players[i]) == str:
                player_dict[players[i]] += 1
                if player_dict[players[i]] == 45:
                    data = data[data['Player'] != players[i]]
    return lineups

lineups = get_optimals(dfs_proj, 20, 'FFPts')
I have tried reinstalling all the libraries that are used in the script and still get the same issue. Even running it in a normal Python script gives me the same error message. I think this might have to do with memory, but I'm not sure how to check for that or adjust for that, either.
Thanks in advance for any help!

You had a handful of typos here... Not sure if/how you got this running.
A couple of issues you had:
You commingled the df and data variable names inside your function, so who knows what that was pulling in. (One of the hazards of working in a notebook.)
In several locations where you used player_vars, you were not indexing the tuple to get the variable piece. I'd suggest using LpVariable.dicts() for these; it is easier to manage (see the sketch after this list).
Your function call doesn't pass anything for the site parameter in the function signature.
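A minimal sketch of the dicts approach (assumed names, not the full model; data is the projections frame from the question). Keying the variables by player name means there is no tuple to index:
import pulp

players = list(data['Player'])
salary = dict(zip(data['Player'], data['DK ($)']))
points = dict(zip(data['Player'], data['FFPts']))

# one binary variable per player, looked up by name
x = pulp.LpVariable.dicts('pick', players, cat='Binary')

prob = pulp.LpProblem('DK_NFL_weekly', pulp.LpMaximize)
prob += pulp.lpSum(points[p] * x[p] for p in players)            # objective: projected points
prob += pulp.lpSum(x[p] for p in players) == 9                   # roster size
prob += pulp.lpSum(salary[p] * x[p] for p in players) <= 50000   # salary cap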
Other advice:
Do NOT turn off the messaging. You must check the solver output to see the status. First attempts came back as "Infeasible", which is how I discovered the player_vars problem. If you do decide to turn off the message, figure out a way to assert that the status is optimal, or risk junk results. I think it is doable in pulp, I just forgot how. Edit: here's how. This works when using the default CBC solver, after solving (obviously). Other solvers, I'm not sure:
status = LpStatus[prob.status]
assert(status=='Optimal')
Print out the problem a couple of times while building it to see if it passes the giggle test. If you had done this, you would have seen some of the construction problems.
Anyhow, this is working fine for fake data and handles 1000+ players in a couple seconds for 20 lineups.
Buyer beware: I did not review all of the constraints too closely or the conditional constraint, so you should.
import pandas as pd
from pulp import *
# from pulp import LpMaximize
from random import randint, choice

num_players = 1000
positions = ['RB', 'WR', 'TE', 'DEF', 'QB']
players = [(i, choice(positions), randint(1,100), randint(3000,5000), 1) for i in range(num_players)]
cols = ['Player', 'Pos', 'FFPts', 'DK ($)', 'count']
dfs_proj = pd.DataFrame.from_records(players, columns = cols)
print(dfs_proj.head())

# dfs_proj = pd.read_csv("4for4_dfs_projections_120321.csv")
# dfs_proj['count'] = 1
# cols = ['Player', 'Pos', 'FFPts', 'DK ($)', 'count']
# dfs_proj = dfs_proj[cols]
dfs_proj = dfs_proj[(dfs_proj['DK ($)'] >= 4000) | (dfs_proj['Pos'] == "DEF") | (dfs_proj['Pos'] == "TE")]
# player_dict = dict(zip(dfs_proj['Player'], dfs_proj['count']))
print(dfs_proj.head())

# create a helper function to return the number of players assigned each position
def get_position_sum(player_vars, df, position):
    return pulp.lpSum([player_vars[i][1] * (position in df['Pos'].iloc[i]) for i in range(len(df))])  # player vars not indexed

#def get_optimals(site, data, num_lineups, optimize_on='FFPts'): # site??? # data vs df ???
def get_optimals(data, num_lineups, optimize_on='FFPts'):
    """
    Generates x number of optimal lineups, based on the column to
    designate as the one to optimize on.

    :param str site: DK or FD. Used for salary constraints
    :param pd.DataFrame data: Pandas dataframe containing projections.
    :param int num_lineups: Number of lineups to generate.
    :param str optimize_on: Name of column in dataframe to use when optimizing
    """
    #global lineups
    lineups = []
    player_dict = dict(zip(data['Player'], data['count']))
    for i in range(1, num_lineups+1):
        prob = pulp.LpProblem('DK_NFL_weekly', pulp.const.LpMaximize)
        player_vars = []
        for row in data.itertuples():
            var = pulp.LpVariable(f'P{row.Player}', cat='Binary')  # added 'P' to player name for clarity
            player_vars.append((row.Player, var))
        # total assigned players constraint
        prob += pulp.lpSum(player_var[1] for player_var in player_vars) == 9  # player var not indexed
        # total salary constraint
        prob += pulp.lpSum(data['DK ($)'].iloc[i] * player_vars[i][1] for i in range(len(data))) <= 50000
        # for QB and DST, require 1 of each in the lineup
        # !!!! you had 'df' here which who knows what you were pulling in.... changed to data
        prob += get_position_sum(player_vars, data, 'QB') == 1
        prob += get_position_sum(player_vars, data, 'DEF') == 1
        # to account for the FLEX position, we allow additional selections of the 3 FLEX-eligible positions: RB, WR, TE
        prob += get_position_sum(player_vars, data, 'RB') >= 2
        prob += get_position_sum(player_vars, data, 'WR') >= 3
        prob += get_position_sum(player_vars, data, 'TE') >= 1
        if i > 1:
            if optimize_on == 'Optimal Frequency':
                prob += pulp.lpSum([data['FFPts'].iloc[i] * player_vars[i][1] for i in range(len(data))]) <= (optimal - 0.001)
            else:
                prob += pulp.lpSum([data['FFPts'].iloc[i] * player_vars[i][1] for i in range(len(data))]) <= (optimal - 0.01)
        prob += pulp.lpSum([data['FFPts'].iloc[i] * player_vars[i][1] for i in range(len(data))])
        print(prob)
        # solve and print the status
        prob.solve(PULP_CBC_CMD())
        optimal = prob.objective.value()
        count = 1
        lineup = {}
        for i in range(len(data)):
            if player_vars[i][1].value() == 1:
                row = data.iloc[i]
                lineup[f'G{count}'] = row['Player']
                count += 1
        lineup['Total Points'] = optimal
        lineups.append(lineup)
        players = list(lineup.values())
        for i in range(0, len(players)):
            if type(players[i]) == str:
                player_dict[players[i]] += 1
                if player_dict[players[i]] == 45:
                    data = data[data['Player'] != players[i]]
    return lineups

lineups = get_optimals(dfs_proj, 10, 'FFPts')

for lineup in lineups:
    print(lineup)

Related

I'd like to cross-calculate the two formulas over and over again in Python

Q_optimal = ((np.array(demand_lt_std)**2 + np.array(Q_safety_stock)**2))**(1/2) + ((np.array(demand_lt_std)**2 + np.array(Q_safety_stock)**2 + (2*np.array(order_cost)*np.array(demand_lt_avg)/np.array(carrying_cost))))
#get optimal value
while 1:
    #new safety stock
    new_safety_stock = ((np.array(demand_lt_std))**2/(4*beta*np.array(Q_optimal))) - (beta*np.array(Q_optimal))
    new_safety_stock[np.isnan(new_safety_stock)] = 0
    new_safety_stock = new_safety_stock.tolist()
    #delete 0
    for i in range(len(order_cost)):
        if new_safety_stock[i] < 0:
            pos_1 = np.where(np.array(new_safety_stock) < 0)[0]
            for i in pos_1:
                new_safety_stock[i] = 0
    new_safety_stock = np.array(new_safety_stock)
    Q_a_result = (np.array(demand_lt_std)**2 + np.array(Q_safety_stock)**2)**0.5
    Q_b_result = (2*np.array(order_cost)*np.array(demand_lt_avg))/np.array(carrying_cost)
    Q_c_result = (np.array(demand_lt_std)**2 + np.array(Q_safety_stock)**2 + Q_b_result)**0.5
    Q_d_result = Q_a_result + Q_c_result
    #new Q
    new_Q = Q_d_result
    loop += 1
Above is all the code.
Please look at the code below. First I calculate new_safety_stock using Q_optimal, which gives me new_Q.
After that, I want to repeat the cycle: use new_Q to get a new safety stock, then use that safety stock to get the next new_Q, 10,000 times over. I have to use Q_optimal the first time I compute new_safety_stock, but I don't know how to make the loop use new_Q instead on every iteration after that.
'''
while 1:
    #new safety stock
    new_safety_stock = ((np.array(demand_lt_std))**2/(4*beta*np.array(Q_optimal))) - (beta*np.array(Q_optimal))
    new_safety_stock[np.isnan(new_safety_stock)] = 0
    new_safety_stock = new_safety_stock.tolist()
    #delete 0
    for i in range(len(order_cost)):
        if new_safety_stock[i] < 0:
            pos_1 = np.where(np.array(new_safety_stock) < 0)[0]
            for i in pos_1:
                new_safety_stock[i] = 0
    new_safety_stock = np.array(new_safety_stock)
    Q_a_result = (np.array(demand_lt_std)**2 + np.array(Q_safety_stock)**2)**0.5
    Q_b_result = (2*np.array(order_cost)*np.array(demand_lt_avg))/np.array(carrying_cost)
    Q_c_result = (np.array(demand_lt_std)**2 + np.array(Q_safety_stock)**2 + Q_b_result)**0.5
    Q_d_result = Q_a_result + Q_c_result
    #new Q
    new_Q = Q_d_result
    loop += 1
'''
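A minimal sketch of the alternating update described above (variable names follow the question; demand_lt_std, order_cost, demand_lt_avg, carrying_cost, beta and the initial Q_optimal are assumed to already exist, and the freshly computed safety stock is fed into the Q formulas, as the question asks):
import numpy as np

Q_current = np.array(Q_optimal, dtype=float)          # seed the very first pass with Q_optimal
for loop in range(10000):
    # new safety stock from the current Q
    safety = (np.array(demand_lt_std)**2 / (4 * beta * Q_current)) - beta * Q_current
    safety[np.isnan(safety)] = 0
    safety[safety < 0] = 0                             # floor negative safety stock at zero
    # new Q from that safety stock
    Q_a = (np.array(demand_lt_std)**2 + safety**2)**0.5
    Q_b = (2 * np.array(order_cost) * np.array(demand_lt_avg)) / np.array(carrying_cost)
    Q_c = (np.array(demand_lt_std)**2 + safety**2 + Q_b)**0.5
    Q_current = Q_a + Q_c                              # every later pass uses the new Q, not Q_optimal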

Speeding Up Datafile Reading Program For School Project

I am in a lower-level coding class (Python) and have a major project due in three days. One of our grading criteria is program speed. My program runs in about 30 seconds; ideally it would execute in 15 or less. Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import time

start_time = time.time()  # for printing execution time

# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns = {y: y})
        data = data.append(dpath)
        x += 1
    return data

data = read_data_files("Data_", 5, 163)  # start, end, prefix...

# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time)  # had issues here for some reason, so i created another array to replace the time column in the dataframe
data[' Time'] = human_timen

hours = []     # for use as x-axis in plot
stdlist = []   # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour

def magfind(row):  # separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5

def filterfunction(intro1, intro2, first, last):  # two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)  # creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()  # finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p)  # need this so that the hours with no data still get graphed
        if m > meanmax:  # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist

filterfunction(' 0', ' ', 0, 23)
The slow part is the filterfunction function. The program reads data from over 100 files, and this function sorts the data into a dataframe and analyzes it hour by hour, computing statistics over all rows for that hour. I believe it could be sped up by changing the way the data is filtered when searching by hour, but I am not sure how. The reason I have branches that exclude certain k values is that some hours have no data to work with, which would otherwise mess up the list of standard deviation calculations as well as the plot this data will feed. Any tips or ideas for speeding this up would be greatly appreciated!
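One way to sketch that filtering-by-hour idea is to group on the hour instead of scanning the string column 24 times (this assumes data is the combined dataframe with the ' Time' column already converted to 'YYYY-MM-DD HH:MM:SS' strings, as above):
import numpy as np
import pandas as pd

hour = pd.to_datetime(data[' Time']).dt.hour
magnitude = np.sqrt(data[' Acc X']**2 + data[' Acc Y']**2 + data[' Acc Z']**2)
per_hour = magnitude.groupby(hour).agg(['mean', 'std'])   # one row per hour that actually has data
print(per_hour)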
One suggestion to speed it up a bit is to remove this line since it is not being used anywhere in the program:
import matplotlib.pyplot as plt
matplotlib is a big library, so dropping the unused import saves a bit of startup time.
Also, I think you could get rid of numpy since it is only used once... consider using a tuple instead.
I was not able to test this because I am on mobile right now. My main idea is not to make the code better or leaner; I changed how the processing itself runs.
I integrated the multiprocessing library into your code, calculated the number of CPU cores on the system, and divided the work between them.
Multiprocessing library detailed documentation: https://docs.python.org/2/library/multiprocessing.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import psutil
from datetime import datetime
from multiprocessing import Pool

cores = psutil.cpu_count()
start_time = time.time()  # for printing execution time

# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns = {y: y})
        data = data.append(dpath)
        x += 1
    return data

data = read_data_files("Data_", 5, 163)  # start, end, prefix...

# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time)  # had issues here for some reason, so i created another array to replace the time column in the dataframe
data[' Time'] = human_timen

hours = []     # for use as x-axis in plot
stdlist = []   # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour

def magfind(row):  # separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5

def filterfunction(intro1, intro2, first, last):  # two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)  # creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()  # finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p)  # need this so that the hours with no data still get graphed
        if m > meanmax:  # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist

# Run this with a pool of 5 agents having a chunksize of 3 until finished
agents = cores
chunksize = (len(data) / cores)
with Pool(processes=agents) as pool:
    pool.map(filterfunction, (' 0', ' ', 0, 23))
Don't use apply; it isn't vectorized. Instead, use vectorized operations whenever you can. In this case, instead of df.apply(magfind, axis=1), do:
def add_magnitude(df):
    df['magnitude'] = (df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2) ** .5
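For instance, on the acc frame from the question, this one call replaces the row-by-row apply (same column names assumed):
# instead of: acc['magnitude'] = acc.apply(magfind, axis=1)
add_magnitude(acc)                 # whole-column arithmetic, no per-row Python calls
p = acc['magnitude'].std()
m = acc['magnitude'].mean()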

How to add a stopping condition for the Jacobi method?

import numpy as np
import matplotlib.pyplot as plt

def jacobi(m, numiter=100):
    # number of rows determines the number of variables
    numvars = m.shape[0]
    # construct array for the iterations
    history = np.zeros((numvars, numiter))
    i = 1
    while(i < numiter):  # loop for numiter
        for v in range(numvars):  # loop over all variables
            current = m[v, numvars]  # start with the right-hand side (augmented column of the matrix)
            for col in range(numvars):  # loop over columns
                if v != col:  # don't count the column for the current variable
                    current = current - (m[v, col]*history[col, i-1])  # subtract other guesses from the previous timestep
            current = current/m[v, v]  # divide by the current variable's coefficient
            history[v, i] = current  # add this answer to the rest
        i = i + 1  # iterate
    # plot each variable
    for v in range(numvars):
        plt.plot(history[v, :i])
    return history[:, i-1]
I have this code that implements the Jacobi method. How do I add a stopping condition for when the solution converges, i.e. when the values for the current iteration have changed by less than some threshold e from the values of the previous iteration?
The threshold e should be an input to the function, with a default value of 0.00001.
You could add another condition to your while loop, so when it reaches your error threshold it stops.
import numpy as np
import matplotlib.pyplot as plt

def jacobi(m, numiter=100, error_threshold=1e-4):
    # number of rows determines the number of variables
    numvars = m.shape[0]
    # construct array for the iterations
    history = np.zeros((numvars, numiter))
    i = 1
    err = 10*error_threshold
    while(i < numiter and err > error_threshold):  # loop for numiter and error threshold
        for v in range(numvars):  # loop over all variables
            current = m[v, numvars]  # start with the right-hand side (augmented column of the matrix)
            for col in range(numvars):  # loop over columns
                if v != col:  # don't count the column for the current variable
                    current = current - (m[v, col]*history[col, i-1])  # subtract other guesses from the previous timestep
            current = current/m[v, v]  # divide by the current variable's coefficient
            history[v, i] = current  # add this answer to the rest
        # check the error here; in this case the maximum relative change since the previous iteration
        if i > 1:
            err = np.max(np.abs((history[:, i] - history[:, i-1])/history[:, i-1]))
        i = i + 1  # iterate
    # plot each variable
    for v in range(numvars):
        plt.plot(history[v, :i])
    return history[:, i-1]
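For example, on a small diagonally dominant augmented system [A | b] (values made up here for illustration), the iteration stops well before numiter once the relative change drops below the threshold:
import numpy as np

# 3-variable system whose exact solution is x = [1, 1, 1]
m = np.array([[10.0,  1.0,  2.0, 13.0],
              [ 1.0, 12.0,  1.0, 14.0],
              [ 2.0,  1.0, 11.0, 14.0]])
print(jacobi(m, numiter=200, error_threshold=1e-8))   # approximately [1. 1. 1.]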

Comparing values in Python data frame efficiently

I'm trading daily on Cryptocurrencies and would like to find which are the most desirable Cryptos for trading.
I have CSV file for every Crypto with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
How many quotes will end with profit:
for Buy positions - if one of the following Sells > current Buy.
for Sell positions - if one of the following Buys < current Sell.
How much time it would take for a theoretical position to become profitable.
What can be the profit potential.
I'm using the following code:
import datetime as dt
import pandas as pd

#converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)
def ole(oledt):
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))

#variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0

df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")

#comparing to max
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if df_slice["Sell"].max() - row["Buy"] > 0:
        max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
        max_profit_buy_counter += 1
    for index1, row1 in df_slice.iterrows():
        if row["Buy"] < row1["Sell"]:
            buy_time += ole(row1["Date"]) - ole(row["Date"])
            profit_buy_counter += 1
            break
    else:
        no_profit_buy_counter += 1

#comparing to sell
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if row["Sell"] - df_slice["Buy"].min() > 0:
        max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
        max_profit_sell_counter += 1
    for index2, row2 in df_slice.iterrows():
        if row["Sell"] > row2["Buy"]:
            sell_time += ole(row2["Date"]) - ole(row["Date"])
            profit_sell_counter += 1
            break
    else:
        no_profit_sell_counter += 1

num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows

if max_profit_buy_counter == 0:
    avg_max_profit_buy = "There is no profitable buy positions"
else:
    avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
    avg_max_profit_sell = "There is no profitable sell positions"
else:
    avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines, but for a larger amount (276K) it takes a long time (more than 10 hours).
What can I do in order to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
Note: the dates in the CSV are in OLE format, so I need to convert them to datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
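From there, the first question (how many quotes end with profit) is just a boolean sum; a sketch:
profitable_buys = (df['Buy Profit'] > 0).sum()    # some Sell at or after this row exceeds this row's Buy
profitable_sells = (df['Sell Profit'] > 0).sum()  # some Buy at or after this row is below this row's Sell
print(profitable_buys, profitable_sells)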
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
After comparing your function and mine, there is a slight difference: your a is offset by one from the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly shorter time:
for index, row in df.iterrows():
    a = index
    df_slice = df[a:]
    assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
    print("All assertions passed!")
Note this will still take the very long time required by your function. The offset itself can be fixed with shift, but I don't want to run your function for long enough to figure out which way to shift it.

Python Script slowing down as it progresses?

I have a simulation running that has this basic structure:
from time import time

def CSV(*args):
    #write *args to .CSV file
    return

def timeleft(a, L, period):
    print(#details on how long last period took, ETA#)

for L in range(0, 6, 2):
    for a in range(1, 100):
        timeA = time()
        for t in range(1, 1000):

            ## Manufacturer in Supply Chain ##
            inventory_accounting_lists.append(#simple calculations#)
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    ## simple inventory accounting operations ##

            ## Distributor in Supply Chain ##
            inventory_accounting_lists.append(#simple calculations#)
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    ## simple inventory accounting operations ##

            ## Wholesaler in Supply Chain ##
            inventory_accounting_lists.append(#simple calculations#)
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    ## simple inventory accounting operations ##

            ## Retailer in Supply Chain ##
            inventory_accounting_lists.append(#simple calculations#)
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    ## simple inventory accounting operations ##

        CSV(Simulation_Results)
        timeB = time()
        timeleft(a, L, timeB - timeA)
As the script continues, it seems to be getting slower and slower. Here is how long a single iteration takes for various values (and it increases linearly as a increases):
L = 0, a = 1: 1.15 minutes
L = 0, a = 99: 1.7 minutes
L = 2, a = 1: 2.7 minutes
L = 2, a = 99: 5.15 minutes
L = 4, a = 1: 4.5 minutes
L = 4, a = 15: 4.95 minutes (this is the latest value it has reached)
Why would each iteration take longer? Each iteration of the loop essentially resets everything except for a master global list, which is being added to each time. However, loops inside each "period" aren't accessing this master list -- they are accessing the same local list every time.
EDIT 1: I will post the simulation code here, in case anyone wants to wade through it, but I warn you, it is rather long, and the variable names are probably unnecessarily confusing.
#########
a = 0.01
L = 0
total = 1000
sim = 500
inv_cost = 1
bl_cost = 4
#########
# Functions
import random
from time import time
time0 = time()
# function to report ETA etc.
def timeleft(a,L,period_time):
if L==0:
periods_left = ((1-a)*100)-1+2*99
if L==2:
periods_left = ((1-a)*100)-1+99
if L==4:
periods_left = ((1-a)*100)-1+0*99
minute_time = period_time/60
minutes_left = (periods_left*period_time)/60
hours_left = (periods_left*period_time)/3600
percentage_complete = 100*((297-periods_left)/297)
print("Time for last period = ","%.2f" % minute_time," minutes")
print("%.2f" % percentage_complete,"% complete")
if hours_left<1:
print("%.2f" % minutes_left," minutes left")
else:
print("%.2f" % hours_left," hours left")
print("")
return
def dcopy(inList):
if isinstance(inList, list):
return list( map(dcopy, inList) )
return inList
# Save values to .CSV file
def CSV(a,L,I_STD_1,I_STD_2,I_STD_3,I_STD_4,O_STD_0,
O_STD_1,O_STD_2,O_STD_3,O_STD_4):
pass
# Initialization
# These are the global, master lists of data
I_STD_1 = [[0],[0],[0]]
I_STD_2 = [[0],[0],[0]]
I_STD_3 = [[0],[0],[0]]
I_STD_4 = [[0],[0],[0]]
O_STD_0 = [[0],[0],[0]]
O_STD_1 = [[0],[0],[0]]
O_STD_2 = [[0],[0],[0]]
O_STD_3 = [[0],[0],[0]]
O_STD_4 = [[0],[0],[0]]
for L in range(0,6,2):
# These are local lists that are appended to at the end of every period
I_STD_1_L = []
I_STD_2_L = []
I_STD_3_L = []
I_STD_4_L = []
O_STD_0_L = []
O_STD_1_L = []
O_STD_2_L = []
O_STD_3_L = []
O_STD_4_L = []
test = []
for n in range(1,100): # THIS is the start of the 99 value loop
a = n/100
print ("L=",L,", alpha=",a)
# Initialization for each Period
F_1 = [0,10] # Forecast
F_2 = [0,10]
F_3 = [0,10]
F_4 = [0,10]
R_0 = [10] # Items Received
R_1 = [10]
R_2 = [10]
R_3 = [10]
R_4 = [10]
for i in range(L):
R_1.append(10)
R_2.append(10)
R_3.append(10)
R_4.append(10)
I_1 = [10] # Final Inventory
I_2 = [10]
I_3 = [10]
I_4 = [10]
IP_1 = [10+10*L] # Inventory Position
IP_2 = [10+10*L]
IP_3 = [10+10*L]
IP_4 = [10+10*L]
O_1 = [10] # Items Ordered
O_2 = [10]
O_3 = [10]
O_4 = [10]
BL_1 = [0] # Backlog
BL_2 = [0]
BL_3 = [0]
BL_4 = [0]
OH_1 = [20] # Items on Hand
OH_2 = [20]
OH_3 = [20]
OH_4 = [20]
OR_1 = [10] # Order received from customer
OR_2 = [10]
OR_3 = [10]
OR_4 = [10]
Db_1 = [10] # Running Average Demand
Db_2 = [10]
Db_3 = [10]
Db_4 = [10]
var_1 = [0] # Running Variance in Demand
var_2 = [0]
var_3 = [0]
var_4 = [0]
B_1 = [IP_1[0]+10] # Optimal Basestock
B_2 = [IP_2[0]+10]
B_3 = [IP_3[0]+10]
B_4 = [IP_4[0]+10]
D = [0,10] # End constomer demand
for i in range(total+1):
D.append(9)
D.append(12)
D.append(8)
D.append(11)
period = [0]
from time import time
timeA = time()
# 1000 time periods t
for t in range(1,total+1):
period.append(t)
#### MANUFACTURER ####
# Manufacturing order from previous time period put into production
R_4.append(O_4[t-1])
#recieve shipment from supplier, calculate items OH HAND
if I_4[t-1]<0:
OH_4.append(R_4[t])
else:
OH_4.append(I_4[t-1]+R_4[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (O_3[t-1] + BL_4[t-1]) <= OH_4[t]: # No Backlog
I_4.append(OH_4[t] - (O_3[t-1] + BL_4[t-1]))
BL_4.append(0)
R_3.append(O_3[t-1]+BL_4[t-1])
else:
I_4.append(OH_4[t] - (O_3[t-1] + BL_4[t-1])) # Backlogged
BL_4.append(-I_4[t])
R_3.append(OH_4[t])
# Update Inventory Position
IP_4.append(IP_4[t-1] + O_4[t-1] - O_3[t-1])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_4[t] + a*O_3[t-1]
F_4.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_4.append((1/t)*sum(O_3[0:t]))
s = 0
for i in range(0,t):
s+=(O_3[i]-Db_4[t])**2
if t==1:
var_4.append(0) # var(1) = 0
else:
var_4.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_4 = [10000000000]*10
Run_4 = [0]*10
for B in range(10,500):
S_OH_4 = OH_4[:]
S_I_4 = I_4[:]
S_R_4 = R_4[:]
S_BL_4 = BL_4[:]
S_IP_4 = IP_4[:]
S_O_4 = O_4[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_4[t] > 0:
S_O_4.append(B - S_IP_4[t])
else:
S_O_4.append(0)
c = 0
for i in range(t+1,t+sim+1):
S_R_4.append(S_O_4[i-1])
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_4[t+1],(var_4[t])**(.5))
# Receive simulated shipment, calculate simulated items on hand
if S_I_4[i-1]<0:
S_OH_4.append(S_R_4[i])
else:
S_OH_4.append(S_I_4[i-1]+S_R_4[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_4[i-1])
S_I_4.append(S_OH_4[i] - owed)
if owed <= S_OH_4[i]: # No Backlog
S_BL_4.append(0)
c += inv_cost*S_I_4[i]
else:
S_BL_4.append(-S_I_4[i]) # Backlogged
c += bl_cost*S_BL_4[i]
# Update Inventory Position
S_IP_4.append(S_IP_4[i-1] + S_O_4[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_4[i]) > 0:
S_O_4.append(B - S_IP_4[i])
else:
S_O_4.append(0)
# Log Simulation costs for that B-value
S_BC_4.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_4[B-i]-S_BC_4[B-i-1])
Run_4.append(sum(dummy)/float(len(dummy)))
if Run_4[B-3] > 0 and B>20:
break
else:
Run_4.append(0)
# Use minimum cost as new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_4))
optimal_B = var[1]
B_4.append(optimal_B)
# Calculate O(t)
if B_4[t] - IP_4[t] > 0:
O_4.append(B_4[t] - IP_4[t])
else:
O_4.append(0)
#### DISTRIBUTOR ####
#recieve shipment from supplier, calculate items OH HAND
if I_3[t-1]<0:
OH_3.append(R_3[t])
else:
OH_3.append(I_3[t-1]+R_3[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (O_2[t-1] + BL_3[t-1]) <= OH_3[t]: # No Backlog
I_3.append(OH_3[t] - (O_2[t-1] + BL_3[t-1]))
BL_3.append(0)
R_2.append(O_2[t-1]+BL_3[t-1])
else:
I_3.append(OH_3[t] - (O_2[t-1] + BL_3[t-1])) # Backlogged
BL_3.append(-I_3[t])
R_2.append(OH_3[t])
# Update Inventory Position
IP_3.append(IP_3[t-1] + O_3[t-1] - O_2[t-1])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_3[t] + a*O_2[t-1]
F_3.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_3.append((1/t)*sum(O_2[0:t]))
s = 0
for i in range(0,t):
s+=(O_2[i]-Db_3[t])**2
if t==1:
var_3.append(0) # var(1) = 0
else:
var_3.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_3 = [10000000000]*10
Run_3 = [0]*10
for B in range(10,500):
S_OH_3 = OH_3[:]
S_I_3 = I_3[:]
S_R_3 = R_3[:]
S_BL_3 = BL_3[:]
S_IP_3 = IP_3[:]
S_O_3 = O_3[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_3[t] > 0:
S_O_3.append(B - S_IP_3[t])
else:
S_O_3.append(0)
c = 0
for i in range(t+1,t+sim+1):
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_3[t+1],(var_3[t])**(.5))
S_R_3.append(S_O_3[i-1])
# Receive simulated shipment, calculate simulated items on hand
if S_I_3[i-1]<0:
S_OH_3.append(S_R_3[i])
else:
S_OH_3.append(S_I_3[i-1]+S_R_3[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_3[i-1])
S_I_3.append(S_OH_3[i] - owed)
if owed <= S_OH_3[i]: # No Backlog
S_BL_3.append(0)
c += inv_cost*S_I_3[i]
else:
S_BL_3.append(-S_I_3[i]) # Backlogged
c += bl_cost*S_BL_3[i]
# Update Inventory Position
S_IP_3.append(S_IP_3[i-1] + S_O_3[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_3[i]) > 0:
S_O_3.append(B - S_IP_3[i])
else:
S_O_3.append(0)
# Log Simulation costs for that B-value
S_BC_3.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_3[B-i]-S_BC_3[B-i-1])
Run_3.append(sum(dummy)/float(len(dummy)))
if Run_3[B-3] > 0 and B>20:
break
else:
Run_3.append(0)
# Use minimum cost as new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_3))
optimal_B = var[1]
B_3.append(optimal_B)
# Calculate O(t)
if B_3[t] - IP_3[t] > 0:
O_3.append(B_3[t] - IP_3[t])
else:
O_3.append(0)
#### WHOLESALER ####
#recieve shipment from supplier, calculate items OH HAND
if I_2[t-1]<0:
OH_2.append(R_2[t])
else:
OH_2.append(I_2[t-1]+R_2[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (O_1[t-1] + BL_2[t-1]) <= OH_2[t]: # No Backlog
I_2.append(OH_2[t] - (O_1[t-1] + BL_2[t-1]))
BL_2.append(0)
R_1.append(O_1[t-1]+BL_2[t-1])
else:
I_2.append(OH_2[t] - (O_1[t-1] + BL_2[t-1])) # Backlogged
BL_2.append(-I_2[t])
R_1.append(OH_2[t])
# Update Inventory Position
IP_2.append(IP_2[t-1] + O_2[t-1] - O_1[t-1])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_2[t] + a*O_1[t-1]
F_2.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_2.append((1/t)*sum(O_1[0:t]))
s = 0
for i in range(0,t):
s+=(O_1[i]-Db_2[t])**2
if t==1:
var_2.append(0) # var(1) = 0
else:
var_2.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_2 = [10000000000]*10
Run_2 = [0]*10
for B in range(10,500):
S_OH_2 = OH_2[:]
S_I_2 = I_2[:]
S_R_2 = R_2[:]
S_BL_2 = BL_2[:]
S_IP_2 = IP_2[:]
S_O_2 = O_2[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_2[t] > 0:
S_O_2.append(B - S_IP_2[t])
else:
S_O_2.append(0)
c = 0
for i in range(t+1,t+sim+1):
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_2[t+1],(var_2[t])**(.5))
# Receive simulated shipment, calculate simulated items on hand
S_R_2.append(S_O_2[i-1])
if S_I_2[i-1]<0:
S_OH_2.append(S_R_2[i])
else:
S_OH_2.append(S_I_2[i-1]+S_R_2[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_2[i-1])
S_I_2.append(S_OH_2[i] - owed)
if owed <= S_OH_2[i]: # No Backlog
S_BL_2.append(0)
c += inv_cost*S_I_2[i]
else:
S_BL_2.append(-S_I_2[i]) # Backlogged
c += bl_cost*S_BL_2[i]
# Update Inventory Position
S_IP_2.append(S_IP_2[i-1] + S_O_2[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_2[i]) > 0:
S_O_2.append(B - S_IP_2[i])
else:
S_O_2.append(0)
# Log Simulation costs for that B-value
S_BC_2.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_2[B-i]-S_BC_2[B-i-1])
Run_2.append(sum(dummy)/float(len(dummy)))
if Run_2[B-3] > 0 and B>20:
break
else:
Run_2.append(0)
# Use minimum cost as new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_2))
optimal_B = var[1]
B_2.append(optimal_B)
# Calculate O(t)
if B_2[t] - IP_2[t] > 0:
O_2.append(B_2[t] - IP_2[t])
else:
O_2.append(0)
#### RETAILER ####
#recieve shipment from supplier, calculate items OH HAND
if I_1[t-1]<0:
OH_1.append(R_1[t])
else:
OH_1.append(I_1[t-1]+R_1[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (D[t] +BL_1[t-1]) <= OH_1[t]: # No Backlog
I_1.append(OH_1[t] - (D[t] + BL_1[t-1]))
BL_1.append(0)
R_0.append(D[t]+BL_1[t-1])
else:
I_1.append(OH_1[t] - (D[t] + BL_1[t-1])) # Backlogged
BL_1.append(-I_1[t])
R_0.append(OH_1[t])
# Update Inventory Position
IP_1.append(IP_1[t-1] + O_1[t-1] - D[t])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_1[t] + a*D[t]
F_1.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_1.append((1/t)*sum(D[1:t+1]))
s = 0
for i in range(1,t+1):
s+=(D[i]-Db_1[t])**2
if t==1: # Var(1) = 0
var_1.append(0)
else:
var_1.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_1 = [10000000000]*10
Run_1 = [0]*10
for B in range(10,500):
S_OH_1 = OH_1[:]
S_I_1 = I_1[:]
S_R_1 = R_1[:]
S_BL_1 = BL_1[:]
S_IP_1 = IP_1[:]
S_O_1 = O_1[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_1[t] > 0:
S_O_1.append(B - S_IP_1[t])
else:
S_O_1.append(0)
c=0
for i in range(t+1,t+sim+1):
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_1[t+1],(var_1[t])**(.5))
S_R_1.append(S_O_1[i-1])
# Receive simulated shipment, calculate simulated items on hand
if S_I_1[i-1]<0:
S_OH_1.append(S_R_1[i])
else:
S_OH_1.append(S_I_1[i-1]+S_R_1[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_1[i-1])
S_I_1.append(S_OH_1[i] - owed)
if owed <= S_OH_1[i]: # No Backlog
S_BL_1.append(0)
c += inv_cost*S_I_1[i]
else:
S_BL_1.append(-S_I_1[i]) # Backlogged
c += bl_cost*S_BL_1[i]
# Update Inventory Position
S_IP_1.append(S_IP_1[i-1] + S_O_1[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_1[i]) > 0:
S_O_1.append(B - S_IP_1[i])
else:
S_O_1.append(0)
# Log Simulation costs for that B-value
S_BC_1.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_1[B-i]-S_BC_1[B-i-1])
Run_1.append(sum(dummy)/float(len(dummy)))
if Run_1[B-3] > 0 and B>20:
break
else:
Run_1.append(0)
# Use minimum as your new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_1))
optimal_B = var[1]
B_1.append(optimal_B)
# Calculate O(t)
if B_1[t] - IP_1[t] > 0:
O_1.append(B_1[t] - IP_1[t])
else:
O_1.append(0)
### Calculate the Standard Devation of the last half of time periods ###
def STD(numbers):
k = len(numbers)
mean = sum(numbers) / k
SD = (sum([dev*dev for dev in [x-mean for x in numbers]])/(k-1))**.5
return SD
start = (total//2)+1
# Only use the last half of the time periods to calculate the standard deviation
I_STD_1_L.append(STD(I_1[start:]))
I_STD_2_L.append(STD(I_2[start:]))
I_STD_3_L.append(STD(I_3[start:]))
I_STD_4_L.append(STD(I_4[start:]))
O_STD_0_L.append(STD(D[start:]))
O_STD_1_L.append(STD(O_1[start:]))
O_STD_2_L.append(STD(O_2[start:]))
O_STD_3_L.append(STD(O_3[start:]))
O_STD_4_L.append(STD(O_4[start:]))
from time import time
timeB = time()
timeleft(a,L,timeB-timeA)
I_STD_1[L//2] = I_STD_1_L[:]
I_STD_2[L//2] = I_STD_2_L[:]
I_STD_3[L//2] = I_STD_3_L[:]
I_STD_4[L//2] = I_STD_4_L[:]
O_STD_0[L//2] = O_STD_0_L[:]
O_STD_1[L//2] = O_STD_1_L[:]
O_STD_2[L//2] = O_STD_2_L[:]
O_STD_3[L//2] = O_STD_3_L[:]
O_STD_4[L//2] = O_STD_4_L[:]
CSV(a,L,I_STD_1,I_STD_2,I_STD_3,I_STD_4,O_STD_0,
O_STD_1,O_STD_2,O_STD_3,O_STD_4)
from time import time
timeE = time()
print("Run Time: ",(timeE-time0)/3600," hours")
This would be a good time to look at a profiler. You can profile the code to determine where time is being spent. It would appear likely that your issue is in the simulation code, but without being able to see that code, the best help you're likely to get is going to be vague.
Edit in light of added code:
You're doing a fair amount of copying of lists, which while not terribly expensive can consume a lot of time.
I agree that your code is probably unnecessarily confusing and would advise you to clean it up. Changing the confusing names to meaningful ones may help you find where you're having the problem.
Finally, it may be the case that your simulation is simply computationally expensive. You might want to consider looking into SciPy, pandas, or some other Python math package to get better performance and perhaps better tools for expressing the model you're simulating.
I experienced a similar problem with a Python 3.x script I wrote. The script randomly generated 1,000,000 (one million) JSON objects, writing them out to a file.
My problem was that the program was growing progressively slower as time proceeded. Here is a timestamp trace every 10,000 objects:
So far: Mar23-17:56:46: 0
So far: Mar23-17:56:48: 10000 ( 2 seconds)
So far: Mar23-17:56:50: 20000 ( 2 seconds)
So far: Mar23-17:56:55: 30000 ( 5 seconds)
So far: Mar23-17:57:01: 40000 ( 6 seconds)
So far: Mar23-17:57:09: 50000 ( 8 seconds)
So far: Mar23-17:57:18: 60000 ( 8 seconds)
So far: Mar23-17:57:29: 70000 (11 seconds)
So far: Mar23-17:57:42: 80000 (13 seconds)
So far: Mar23-17:57:56: 90000 (14 seconds)
So far: Mar23-17:58:13: 100000 (17 seconds)
So far: Mar23-17:58:30: 110000 (17 seconds)
So far: Mar23-17:58:51: 120000 (21 seconds)
So far: Mar23-17:59:12: 130000 (21 seconds)
So far: Mar23-17:59:35: 140000 (23 seconds)
As can be seen, the script takes progressively longer to generate groups of 10,000 records.
In my case it turned out to be the way I was generating unique ID numbers, each in the range of 10250000000000-10350000000000. To avoid regenerating the same ID twice, I stored a newly generated ID in a list, checking later that the ID does not exist in the list:
trekIdList = []

def GenerateRandomTrek():
    global trekIdList
    while True:
        r = random.randint(10250000000000, 10350000000000)
        if not r in trekIdList:
            trekIdList.append(r)
            return r
The problem is that an unsorted list takes O(n) to search. As newly generated IDs are appended to the list, the time needed to traverse/search the list grows.
The solution was to switch to a dictionary (or map):
trekIdList = {}
. . .
def GenerateRandomTrek():
    global trekIdList
    while True:
        r = random.randint(10250000000000, 10350000000000)
        if not r in trekIdList:
            trekIdList[r] = 1
            return r
The improvement was immediate:
So far: Mar23-18:11:30: 0
So far: Mar23-18:11:30: 10000
So far: Mar23-18:11:31: 20000
So far: Mar23-18:11:31: 30000
So far: Mar23-18:11:31: 40000
So far: Mar23-18:11:32: 50000
So far: Mar23-18:11:32: 60000
So far: Mar23-18:11:32: 70000
So far: Mar23-18:11:33: 80000
So far: Mar23-18:11:33: 90000
So far: Mar23-18:11:33: 100000
So far: Mar23-18:11:34: 110000
So far: Mar23-18:11:34: 120000
So far: Mar23-18:11:34: 130000
So far: Mar23-18:11:35: 140000
The reason is that accessing a value in a dictionary/map/hash is O(1).
Moral: When dealing with large numbers of items, use a dictionary/map, or binary search on a sorted list, rather than an unordered list.
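A set gives the same O(1) membership test and reads a little more directly; a minimal sketch of the same generator:
import random

trekIdSet = set()

def GenerateRandomTrek():
    # keep drawing until we hit an ID we have not handed out before; set lookup is O(1)
    while True:
        r = random.randint(10250000000000, 10350000000000)
        if r not in trekIdSet:
            trekIdSet.add(r)
            return r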
You can use cProfile and the like, but many times it will still be hard to spot the issue. However, knowing that the slowdown progresses linearly is a huge benefit: you already roughly know what the problem is, just not exactly where it is.
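For reference, the quickest way to profile is python -m cProfile -s cumulative your_script.py from the shell; a minimal in-code version (run_one_period() is a hypothetical wrapper around one iteration of the outer loops):
import cProfile
import pstats

cProfile.run('run_one_period()', 'period.prof')                        # profile one period's worth of work
pstats.Stats('period.prof').sort_stats('cumulative').print_stats(20)   # show the 20 most expensive calls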
I'd start by elimination and simplifying:
Make a small fast example that demonstrates the sluggishness as a separate file.
Run the above and keep removing/commenting out huge portions of the code.
Once you have narrowed it down enough, look for the usual suspects: values(), items(), in, for, and deepcopy are good examples.
By continuously simplifying the example and re-running the test script you will eventually get down to the core issue.
Once you have resolved one bottleneck, you might find that the sluggishness is still there when you bring back the old code. Most probably there is more than one bottleneck in that case.
