I am in a lower-level coding class (Python) and have a major project due in three days. One of our grading criteria is program speed. My program runs in about 30 seconds; ideally it would execute in 15 or less. Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import time
start_time = time.time()  # for printing execution time
# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns = {y: y})
        data = data.append(dpath)
        x += 1
    return data
data = read_data_files("Data_", 5, 163)  # prefix, start, end
# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time)  # had issues here for some reason, so I created another array to replace the time column in the dataframe
data[' Time'] = human_timen
hours = []  # for use as x-axis in plot
stdlist = []  # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour
def magfind(row):  # separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5
def filterfunction(intro1, intro2, first, last):  # two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)  # creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()  # finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p)  # need this so that the hours with no data still get graphed
        if m > meanmax:  # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist
filterfunction(' 0', ' ', 0, 23)
The slow speed stems from the "filterfunction" function. The program reads data from over 100 files; this function filters the combined dataframe by time (one hour at a time) and computes statistics over all rows for that hour. I believe it could be sped up by changing the way the data is filtered by hour, but I am not sure. The reason I have statements to exclude certain k-values is that some hours have no data to manipulate, which would mess up the list of standard deviations as well as the plot built from this data. Any tips or ideas for speeding this up would be greatly appreciated!
One suggestion to speed it up a bit is to remove this line since it is not being used anywhere in the program:
import matplotlib.pyplot as plt
matplotlib is a big library, so removing the import should reduce startup time.
Also, I think you could get rid of numpy since it is used only once; consider using a tuple instead.
I could not test this because I am on mobile right now. My main idea is not to make the code cleaner or leaner; instead, I changed how the processing is run.
I integrated the multiprocessing library into your code: it counts the system's CPU cores and divides the work between them.
Multiprocessing library documentation: https://docs.python.org/2/library/multiprocessing.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import psutil
from datetime import datetime
from multiprocessing import Pool
cores = psutil.cpu_count()
start_time = time.time()  # for printing execution time
# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns = {y: y})
        data = data.append(dpath)
        x += 1
    return data
data = read_data_files("Data_", 5, 163)  # prefix, start, end
# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time)  # had issues here for some reason, so I created another array to replace the time column in the dataframe
data[' Time'] = human_timen
hours = []  # for use as x-axis in plot
stdlist = []  # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour
def magfind(row):  # separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5
def filterfunction(intro1, intro2, first, last):  # two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)  # creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()  # finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p)  # need this so that the hours with no data still get graphed
        if m > meanmax:  # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist
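# Run filterfunction in a pool with one worker process per CPU core.
# pool.map passes a single argument per call, so the four arguments are
# sent as one tuple through starmap. One tuple is one task; to spread the
# work, pass one (intro1, intro2, first, last) tuple per hour range.
if __name__ == "__main__":  # needed on Windows, where workers re-import this module
    agents = cores
    with Pool(processes=agents) as pool:
        pool.starmap(filterfunction, [(' 0', ' ', 0, 23)])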
Don't use apply; it's not vectorized. Instead, use vectorized operations whenever you can. In this case, instead of doing df.apply(magfind, 1), do:
def add_magnitude(df):
    df['magnitude'] = (df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2) ** .5
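Going one step further, the per-hour string filtering could probably be replaced by a single grouped pass as well. The following is only a sketch, assuming the ' Time' column holds Unix timestamps in seconds (which the question's int()/utcfromtimestamp conversion suggests). The function name hourly_magnitude_stats and the sample values are made up for illustration.

```python
import pandas as pd

def hourly_magnitude_stats(df):
    """Return per-hour std and mean of the acceleration magnitude for hours 0-23."""
    # Vectorized magnitude: whole-column arithmetic, no row-wise apply.
    magnitude = (df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2) ** .5
    # Assumption: ' Time' is a Unix timestamp in seconds.
    hour = pd.to_datetime(df[' Time'], unit='s').dt.hour
    stats = magnitude.groupby(hour).agg(['std', 'mean'])
    # Hours with no rows at all are filled with 0, like the question's else branch.
    return stats.reindex(range(24), fill_value=0)

# Hypothetical sample: two readings in hour 0, one reading in hour 12.
sample = pd.DataFrame({
    ' Time': [0, 60, 12 * 3600],
    ' Acc X': [3.0, 0.0, 1.0],
    ' Acc Y': [4.0, 0.0, 2.0],
    ' Acc Z': [0.0, 5.0, 2.0],
})
stats = hourly_magnitude_stats(sample)
most_active_hour = stats['mean'].idxmax()
```

Note that an hour with a single reading still has an undefined (NaN) standard deviation, so that case may need separate handling before plotting.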
import numpy as np
import matplotlib.pyplot as plt
def jacobi(m, numiter=100):
    # Number of rows determines the number of variables
    numvars = m.shape[0]
    # Construct array holding every iteration
    history = np.zeros((numvars, numiter))
    i = 1
    while i < numiter:  # Loop for numiter
        for v in range(numvars):  # Loop over all variables
            current = m[v, numvars]  # Start with right-hand side (augmented column of matrix)
            for col in range(numvars):  # Loop over columns
                if v != col:  # Don't count column for current variable
                    current = current - (m[v, col] * history[col, i-1])  # Subtract other guesses from previous timestep
            current = current / m[v, v]  # Divide by current variable coefficient
            history[v, i] = current  # Add this answer to the rest
        i = i + 1  # Iterate
    # Plot each variable
    for v in range(numvars):
        plt.plot(history[v, :i])
    return history[:, i-1]
I have this code that implements the Jacobi method. How do I add a stopping condition for when the solution converges, i.e. when the values for the current iteration have changed by less than some threshold e from the values of the previous iteration?
The threshold e should be an input to the function, with a default value of 0.00001.
You could add another condition to your while loop, so when it reaches your error threshold it stops.
def jacobi(m, numiter=100, error_threshold=1e-5):
    # Number of rows determines the number of variables
    numvars = m.shape[0]
    # Construct array holding every iteration
    history = np.zeros((numvars, numiter))
    i = 1
    err = 10 * error_threshold
    while i < numiter and err > error_threshold:  # Loop until numiter or the error threshold is reached
        for v in range(numvars):  # Loop over all variables
            current = m[v, numvars]  # Start with right-hand side (augmented column of matrix)
            for col in range(numvars):  # Loop over columns
                if v != col:  # Don't count column for current variable
                    current = current - (m[v, col] * history[col, i-1])  # Subtract other guesses from previous timestep
            current = current / m[v, v]  # Divide by current variable coefficient
            history[v, i] = current  # Add this answer to the rest
        # Check error here. In this case the maximum absolute relative change
        if i > 1:
            err = np.max(np.abs((history[:, i] - history[:, i-1]) / history[:, i-1]))
        i = i + 1  # Iterate
    # Plot each variable
    for v in range(numvars):
        plt.plot(history[v, :i])
    return history[:, i-1]
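For comparison, here is a compact sketch of the same stopping rule without the history array or plotting. It tests the absolute change between iterations rather than the relative change, which avoids dividing by zero when a component of the solution is 0. The name jacobi_converged and the 2x2 example system are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def jacobi_converged(m, numiter=100, e=1e-5):
    """Jacobi iteration on an augmented matrix [A | b] with an early stop."""
    a = m[:, :-1].astype(float)      # coefficient matrix A
    b = m[:, -1].astype(float)       # right-hand side b
    d = np.diag(a)                   # diagonal entries of A
    r = a - np.diagflat(d)           # off-diagonal part of A
    x = np.zeros_like(b)             # initial guess of all zeros
    for iteration in range(1, numiter + 1):
        x_new = (b - r @ x) / d
        # Stop once every component changed by less than e.
        if np.max(np.abs(x_new - x)) < e:
            return x_new, iteration
        x = x_new
    return x, numiter

# Hypothetical diagonally dominant system: 4x + y = 9 and x + 3y = 7.
m = np.array([[4.0, 1.0, 9.0],
              [1.0, 3.0, 7.0]])
solution, iterations = jacobi_converged(m)
```

The function returns the iteration count alongside the solution, so the caller can tell whether it stopped early or ran out of iterations.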
I trade cryptocurrencies daily and would like to find out which ones are the most desirable for trading.
I have a CSV file for every cryptocurrency with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
How many quotes would end with a profit:
for Buy positions - if any of the following Sells > current Buy.
for Sell positions - if any of the following Buys < current Sell.
How long it would take a theoretical position to become profitable.
What the profit potential is.
I'm using the following code:
import datetime as dt
import pandas as pd
# converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)
def ole(oledt):
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))
# variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")
# comparing to max (buy positions)
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if df_slice["Sell"].max() - row["Buy"] > 0:
        max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
        max_profit_buy_counter += 1
    for index1, row1 in df_slice.iterrows():
        if row["Buy"] < row1["Sell"]:
            buy_time += ole(row1["Date"]) - ole(row["Date"])
            profit_buy_counter += 1
            break
    else:  # no break: no following Sell was profitable
        no_profit_buy_counter += 1
# comparing to sell (sell positions)
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if row["Sell"] - df_slice["Buy"].min() > 0:
        max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
        max_profit_sell_counter += 1
    for index2, row2 in df_slice.iterrows():
        if row["Sell"] > row2["Buy"]:
            sell_time += ole(row2["Date"]) - ole(row["Date"])
            profit_sell_counter += 1
            break
    else:  # no break: no following Buy was profitable
        no_profit_sell_counter += 1
num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows
if max_profit_buy_counter == 0:
    avg_max_profit_buy = "There is no profitable buy positions"
else:
    avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
    avg_max_profit_sell = "There is no profitable sell positions"
else:
    avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines, but for a larger amount (276K) it takes a long time (more than 10 hours).
What can I do in order to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
Note: the dates in the CSV are in OLE format, so I need to convert them to datetime.
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
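The OLE dates mentioned in the question can likely be converted in one vectorized call as well, instead of calling ole() on every row. This is a sketch under the assumption that the Date column contains OLE day counts measured from 1899-12-30, as in the question's ole() helper. The three sample values come from the question's excerpt, and the Datetime and Elapsed column names are made up.

```python
import datetime as dt
import pandas as pd

OLE_TIME_ZERO = dt.datetime(1899, 12, 30)

def ole(oledt):
    """Row-by-row conversion, as in the question (kept for comparison)."""
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))

# Sample OLE dates taken from the question's data excerpt.
df = pd.DataFrame({'Date': [43051.23918, 43051.23919, 43051.23922]})

# Vectorized: interpret the numbers as days counted from 1899-12-30.
df['Datetime'] = pd.to_datetime(df['Date'], unit='D', origin='1899-12-30')

# Time differences then become plain column arithmetic (Timedelta values).
df['Elapsed'] = df['Datetime'] - df['Datetime'].iloc[0]
```

Both approaches should agree up to floating-point rounding; the vectorized one simply avoids a Python-level function call per row.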
After comparing your function and mine, there is a slight difference: your a is offset by one from the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly less time:
for index, row in df.iterrows():
    a = index
    df_slice = df[a:]
    assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
    print("All assertions passed!")
Note that this check will still take the very long time required by your function. The offset can be fixed with shift, but I don't want to run your function for long enough to figure out which way to shift it.
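As for the direction of the shift: the reverse cumulative max/min at a row includes that row itself, so moving every value up by one row with shift(-1) should leave only the rows that follow, which matches the question's a = index + 1. A small sketch with made-up quotes:

```python
import pandas as pd

# Hypothetical quotes, shaped like the question's CSV (values made up).
df = pd.DataFrame({
    'Sell': [10.0, 12.0, 9.0, 11.0],
    'Buy':  [10.5, 12.5, 9.5, 11.5],
})

# Reverse cumulative max/min covers the current row and everything after it...
max_sell_from_here = df['Sell'][::-1].cummax()[::-1]
min_buy_from_here = df['Buy'][::-1].cummin()[::-1]
# ...so shift(-1) moves each value up one row, leaving only the following rows.
df['Max Sell'] = max_sell_from_here.shift(-1)
df['Min Buy'] = min_buy_from_here.shift(-1)

# The last row has no following quotes, so its profit columns are NaN.
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']

profitable_buys = int((df['Buy Profit'] > 0).sum())
profitable_sells = int((df['Sell Profit'] > 0).sum())
```

Comparisons against NaN are False, so the last row is never counted as profitable, which is the same outcome as an empty df_slice in the original loop.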
I have a simulation running that has this basic structure:
from time import time
def CSV(*args):
    # write *args to a .CSV file
    return
def timeleft(a, L, period):
    print()  # details on how long the last period took, ETA
for L in range(0, 6, 2):
    for a in range(1, 100):
        timeA = time()
        for t in range(1, 1000):
            ## Manufacturer in Supply Chain ##
            inventory_accounting_lists.append(...)  # simple calculations
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    pass  # simple inventory accounting operations
            ## Distributor in Supply Chain ##
            inventory_accounting_lists.append(...)  # simple calculations
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    pass  # simple inventory accounting operations
            ## Wholesaler in Supply Chain ##
            inventory_accounting_lists.append(...)  # simple calculations
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    pass  # simple inventory accounting operations
            ## Retailer in Supply Chain ##
            inventory_accounting_lists.append(...)  # simple calculations
            # Simulation to determine the optimal B-value (Basestock level)
            for B in range(1, 100):
                for tau in range(1, 1000):
                    pass  # simple inventory accounting operations
        CSV(Simulation_Results)
        timeB = time()
        timeleft(a, L, timeB - timeA)
As the script continues, it seems to get slower and slower. Here is the time per period for these values (it increases linearly as a increases):
L = 0, a = 1: 1.15 minutes
L = 0, a = 99: 1.7 minutes
L = 2, a = 1: 2.7 minutes
L = 2, a = 99: 5.15 minutes
L = 4, a = 1: 4.5 minutes
L = 4, a = 15: 4.95 minutes (this is the latest value it has reached)
Why would each iteration take longer? Each iteration of the loop essentially resets everything except for a master global list, which is being added to each time. However, loops inside each "period" aren't accessing this master list -- they are accessing the same local list every time.
EDIT 1: I will post the simulation code here, in case anyone wants to wade through it, but I warn you, it is rather long, and the variable names are probably unnecessarily confusing.
#########
a = 0.01
L = 0
total = 1000
sim = 500
inv_cost = 1
bl_cost = 4
#########
# Functions
import random
from time import time
time0 = time()
# function to report ETA etc.
def timeleft(a,L,period_time):
if L==0:
periods_left = ((1-a)*100)-1+2*99
if L==2:
periods_left = ((1-a)*100)-1+99
if L==4:
periods_left = ((1-a)*100)-1+0*99
minute_time = period_time/60
minutes_left = (periods_left*period_time)/60
hours_left = (periods_left*period_time)/3600
percentage_complete = 100*((297-periods_left)/297)
print("Time for last period = ","%.2f" % minute_time," minutes")
print("%.2f" % percentage_complete,"% complete")
if hours_left<1:
print("%.2f" % minutes_left," minutes left")
else:
print("%.2f" % hours_left," hours left")
print("")
return
def dcopy(inList):
if isinstance(inList, list):
return list( map(dcopy, inList) )
return inList
# Save values to .CSV file
def CSV(a,L,I_STD_1,I_STD_2,I_STD_3,I_STD_4,O_STD_0,
O_STD_1,O_STD_2,O_STD_3,O_STD_4):
pass
# Initialization
# These are the global, master lists of data
I_STD_1 = [[0],[0],[0]]
I_STD_2 = [[0],[0],[0]]
I_STD_3 = [[0],[0],[0]]
I_STD_4 = [[0],[0],[0]]
O_STD_0 = [[0],[0],[0]]
O_STD_1 = [[0],[0],[0]]
O_STD_2 = [[0],[0],[0]]
O_STD_3 = [[0],[0],[0]]
O_STD_4 = [[0],[0],[0]]
for L in range(0,6,2):
# These are local lists that are appended to at the end of every period
I_STD_1_L = []
I_STD_2_L = []
I_STD_3_L = []
I_STD_4_L = []
O_STD_0_L = []
O_STD_1_L = []
O_STD_2_L = []
O_STD_3_L = []
O_STD_4_L = []
test = []
for n in range(1,100): # THIS is the start of the 99 value loop
a = n/100
print ("L=",L,", alpha=",a)
# Initialization for each Period
F_1 = [0,10] # Forecast
F_2 = [0,10]
F_3 = [0,10]
F_4 = [0,10]
R_0 = [10] # Items Received
R_1 = [10]
R_2 = [10]
R_3 = [10]
R_4 = [10]
for i in range(L):
R_1.append(10)
R_2.append(10)
R_3.append(10)
R_4.append(10)
I_1 = [10] # Final Inventory
I_2 = [10]
I_3 = [10]
I_4 = [10]
IP_1 = [10+10*L] # Inventory Position
IP_2 = [10+10*L]
IP_3 = [10+10*L]
IP_4 = [10+10*L]
O_1 = [10] # Items Ordered
O_2 = [10]
O_3 = [10]
O_4 = [10]
BL_1 = [0] # Backlog
BL_2 = [0]
BL_3 = [0]
BL_4 = [0]
OH_1 = [20] # Items on Hand
OH_2 = [20]
OH_3 = [20]
OH_4 = [20]
OR_1 = [10] # Order received from customer
OR_2 = [10]
OR_3 = [10]
OR_4 = [10]
Db_1 = [10] # Running Average Demand
Db_2 = [10]
Db_3 = [10]
Db_4 = [10]
var_1 = [0] # Running Variance in Demand
var_2 = [0]
var_3 = [0]
var_4 = [0]
B_1 = [IP_1[0]+10] # Optimal Basestock
B_2 = [IP_2[0]+10]
B_3 = [IP_3[0]+10]
B_4 = [IP_4[0]+10]
D = [0,10] # End constomer demand
for i in range(total+1):
D.append(9)
D.append(12)
D.append(8)
D.append(11)
period = [0]
from time import time
timeA = time()
# 1000 time periods t
for t in range(1,total+1):
period.append(t)
#### MANUFACTURER ####
# Manufacturing order from previous time period put into production
R_4.append(O_4[t-1])
#recieve shipment from supplier, calculate items OH HAND
if I_4[t-1]<0:
OH_4.append(R_4[t])
else:
OH_4.append(I_4[t-1]+R_4[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (O_3[t-1] + BL_4[t-1]) <= OH_4[t]: # No Backlog
I_4.append(OH_4[t] - (O_3[t-1] + BL_4[t-1]))
BL_4.append(0)
R_3.append(O_3[t-1]+BL_4[t-1])
else:
I_4.append(OH_4[t] - (O_3[t-1] + BL_4[t-1])) # Backlogged
BL_4.append(-I_4[t])
R_3.append(OH_4[t])
# Update Inventory Position
IP_4.append(IP_4[t-1] + O_4[t-1] - O_3[t-1])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_4[t] + a*O_3[t-1]
F_4.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_4.append((1/t)*sum(O_3[0:t]))
s = 0
for i in range(0,t):
s+=(O_3[i]-Db_4[t])**2
if t==1:
var_4.append(0) # var(1) = 0
else:
var_4.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_4 = [10000000000]*10
Run_4 = [0]*10
for B in range(10,500):
S_OH_4 = OH_4[:]
S_I_4 = I_4[:]
S_R_4 = R_4[:]
S_BL_4 = BL_4[:]
S_IP_4 = IP_4[:]
S_O_4 = O_4[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_4[t] > 0:
S_O_4.append(B - S_IP_4[t])
else:
S_O_4.append(0)
c = 0
for i in range(t+1,t+sim+1):
S_R_4.append(S_O_4[i-1])
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_4[t+1],(var_4[t])**(.5))
# Receive simulated shipment, calculate simulated items on hand
if S_I_4[i-1]<0:
S_OH_4.append(S_R_4[i])
else:
S_OH_4.append(S_I_4[i-1]+S_R_4[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_4[i-1])
S_I_4.append(S_OH_4[i] - owed)
if owed <= S_OH_4[i]: # No Backlog
S_BL_4.append(0)
c += inv_cost*S_I_4[i]
else:
S_BL_4.append(-S_I_4[i]) # Backlogged
c += bl_cost*S_BL_4[i]
# Update Inventory Position
S_IP_4.append(S_IP_4[i-1] + S_O_4[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_4[i]) > 0:
S_O_4.append(B - S_IP_4[i])
else:
S_O_4.append(0)
# Log Simulation costs for that B-value
S_BC_4.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_4[B-i]-S_BC_4[B-i-1])
Run_4.append(sum(dummy)/float(len(dummy)))
if Run_4[B-3] > 0 and B>20:
break
else:
Run_4.append(0)
# Use minimum cost as new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_4))
optimal_B = var[1]
B_4.append(optimal_B)
# Calculate O(t)
if B_4[t] - IP_4[t] > 0:
O_4.append(B_4[t] - IP_4[t])
else:
O_4.append(0)
#### DISTRIBUTOR ####
#recieve shipment from supplier, calculate items OH HAND
if I_3[t-1]<0:
OH_3.append(R_3[t])
else:
OH_3.append(I_3[t-1]+R_3[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (O_2[t-1] + BL_3[t-1]) <= OH_3[t]: # No Backlog
I_3.append(OH_3[t] - (O_2[t-1] + BL_3[t-1]))
BL_3.append(0)
R_2.append(O_2[t-1]+BL_3[t-1])
else:
I_3.append(OH_3[t] - (O_2[t-1] + BL_3[t-1])) # Backlogged
BL_3.append(-I_3[t])
R_2.append(OH_3[t])
# Update Inventory Position
IP_3.append(IP_3[t-1] + O_3[t-1] - O_2[t-1])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_3[t] + a*O_2[t-1]
F_3.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_3.append((1/t)*sum(O_2[0:t]))
s = 0
for i in range(0,t):
s+=(O_2[i]-Db_3[t])**2
if t==1:
var_3.append(0) # var(1) = 0
else:
var_3.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_3 = [10000000000]*10
Run_3 = [0]*10
for B in range(10,500):
S_OH_3 = OH_3[:]
S_I_3 = I_3[:]
S_R_3 = R_3[:]
S_BL_3 = BL_3[:]
S_IP_3 = IP_3[:]
S_O_3 = O_3[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_3[t] > 0:
S_O_3.append(B - S_IP_3[t])
else:
S_O_3.append(0)
c = 0
for i in range(t+1,t+sim+1):
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_3[t+1],(var_3[t])**(.5))
S_R_3.append(S_O_3[i-1])
# Receive simulated shipment, calculate simulated items on hand
if S_I_3[i-1]<0:
S_OH_3.append(S_R_3[i])
else:
S_OH_3.append(S_I_3[i-1]+S_R_3[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_3[i-1])
S_I_3.append(S_OH_3[i] - owed)
if owed <= S_OH_3[i]: # No Backlog
S_BL_3.append(0)
c += inv_cost*S_I_3[i]
else:
S_BL_3.append(-S_I_3[i]) # Backlogged
c += bl_cost*S_BL_3[i]
# Update Inventory Position
S_IP_3.append(S_IP_3[i-1] + S_O_3[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_3[i]) > 0:
S_O_3.append(B - S_IP_3[i])
else:
S_O_3.append(0)
# Log Simulation costs for that B-value
S_BC_3.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_3[B-i]-S_BC_3[B-i-1])
Run_3.append(sum(dummy)/float(len(dummy)))
if Run_3[B-3] > 0 and B>20:
break
else:
Run_3.append(0)
# Use minimum cost as new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_3))
optimal_B = var[1]
B_3.append(optimal_B)
# Calculate O(t)
if B_3[t] - IP_3[t] > 0:
O_3.append(B_3[t] - IP_3[t])
else:
O_3.append(0)
#### WHOLESALER ####
#recieve shipment from supplier, calculate items OH HAND
if I_2[t-1]<0:
OH_2.append(R_2[t])
else:
OH_2.append(I_2[t-1]+R_2[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (O_1[t-1] + BL_2[t-1]) <= OH_2[t]: # No Backlog
I_2.append(OH_2[t] - (O_1[t-1] + BL_2[t-1]))
BL_2.append(0)
R_1.append(O_1[t-1]+BL_2[t-1])
else:
I_2.append(OH_2[t] - (O_1[t-1] + BL_2[t-1])) # Backlogged
BL_2.append(-I_2[t])
R_1.append(OH_2[t])
# Update Inventory Position
IP_2.append(IP_2[t-1] + O_2[t-1] - O_1[t-1])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_2[t] + a*O_1[t-1]
F_2.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_2.append((1/t)*sum(O_1[0:t]))
s = 0
for i in range(0,t):
s+=(O_1[i]-Db_2[t])**2
if t==1:
var_2.append(0) # var(1) = 0
else:
var_2.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_2 = [10000000000]*10
Run_2 = [0]*10
for B in range(10,500):
S_OH_2 = OH_2[:]
S_I_2 = I_2[:]
S_R_2 = R_2[:]
S_BL_2 = BL_2[:]
S_IP_2 = IP_2[:]
S_O_2 = O_2[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_2[t] > 0:
S_O_2.append(B - S_IP_2[t])
else:
S_O_2.append(0)
c = 0
for i in range(t+1,t+sim+1):
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_2[t+1],(var_2[t])**(.5))
# Receive simulated shipment, calculate simulated items on hand
S_R_2.append(S_O_2[i-1])
if S_I_2[i-1]<0:
S_OH_2.append(S_R_2[i])
else:
S_OH_2.append(S_I_2[i-1]+S_R_2[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_2[i-1])
S_I_2.append(S_OH_2[i] - owed)
if owed <= S_OH_2[i]: # No Backlog
S_BL_2.append(0)
c += inv_cost*S_I_2[i]
else:
S_BL_2.append(-S_I_2[i]) # Backlogged
c += bl_cost*S_BL_2[i]
# Update Inventory Position
S_IP_2.append(S_IP_2[i-1] + S_O_2[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_2[i]) > 0:
S_O_2.append(B - S_IP_2[i])
else:
S_O_2.append(0)
# Log Simulation costs for that B-value
S_BC_2.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_2[B-i]-S_BC_2[B-i-1])
Run_2.append(sum(dummy)/float(len(dummy)))
if Run_2[B-3] > 0 and B>20:
break
else:
Run_2.append(0)
# Use minimum cost as new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_2))
optimal_B = var[1]
B_2.append(optimal_B)
# Calculate O(t)
if B_2[t] - IP_2[t] > 0:
O_2.append(B_2[t] - IP_2[t])
else:
O_2.append(0)
#### RETAILER ####
#recieve shipment from supplier, calculate items OH HAND
if I_1[t-1]<0:
OH_1.append(R_1[t])
else:
OH_1.append(I_1[t-1]+R_1[t])
# Recieve and dispatch order, update Inventory and Backlog for time t
if (D[t] +BL_1[t-1]) <= OH_1[t]: # No Backlog
I_1.append(OH_1[t] - (D[t] + BL_1[t-1]))
BL_1.append(0)
R_0.append(D[t]+BL_1[t-1])
else:
I_1.append(OH_1[t] - (D[t] + BL_1[t-1])) # Backlogged
BL_1.append(-I_1[t])
R_0.append(OH_1[t])
# Update Inventory Position
IP_1.append(IP_1[t-1] + O_1[t-1] - D[t])
# Use exponential smoothing to forecast future demand
future_demand = (1-a)*F_1[t] + a*D[t]
F_1.append(future_demand)
# Calculate D_bar(t) and Var(t)
Db_1.append((1/t)*sum(D[1:t+1]))
s = 0
for i in range(1,t+1):
s+=(D[i]-Db_1[t])**2
if t==1: # Var(1) = 0
var_1.append(0)
else:
var_1.append((1/(t-1))*s)
# Simulation to determine B(t)
S_BC_1 = [10000000000]*10
Run_1 = [0]*10
for B in range(10,500):
S_OH_1 = OH_1[:]
S_I_1 = I_1[:]
S_R_1 = R_1[:]
S_BL_1 = BL_1[:]
S_IP_1 = IP_1[:]
S_O_1 = O_1[:]
# Update O(t)(the period just before the simulation begins)
# using the B value for the simulation
if B - S_IP_1[t] > 0:
S_O_1.append(B - S_IP_1[t])
else:
S_O_1.append(0)
c=0
for i in range(t+1,t+sim+1):
#simulate demand
demand = -1
while demand <0:
demand = random.normalvariate(F_1[t+1],(var_1[t])**(.5))
S_R_1.append(S_O_1[i-1])
# Receive simulated shipment, calculate simulated items on hand
if S_I_1[i-1]<0:
S_OH_1.append(S_R_1[i])
else:
S_OH_1.append(S_I_1[i-1]+S_R_1[i])
# Receive and send order, update Inventory and Backlog (simulated)
owed = (demand + S_BL_1[i-1])
S_I_1.append(S_OH_1[i] - owed)
if owed <= S_OH_1[i]: # No Backlog
S_BL_1.append(0)
c += inv_cost*S_I_1[i]
else:
S_BL_1.append(-S_I_1[i]) # Backlogged
c += bl_cost*S_BL_1[i]
# Update Inventory Position
S_IP_1.append(S_IP_1[i-1] + S_O_1[i-1] - demand)
# Update Order, Upstream member dispatches goods
if (B-S_IP_1[i]) > 0:
S_O_1.append(B - S_IP_1[i])
else:
S_O_1.append(0)
# Log Simulation costs for that B-value
S_BC_1.append(c)
# If the simulated costs are increasing, stop
if B>11:
dummy = []
for i in range(0,10):
dummy.append(S_BC_1[B-i]-S_BC_1[B-i-1])
Run_1.append(sum(dummy)/float(len(dummy)))
if Run_1[B-3] > 0 and B>20:
break
else:
Run_1.append(0)
# Use minimum as your new B(t)
var = min((val, idx) for (idx, val) in enumerate(S_BC_1))
optimal_B = var[1]
B_1.append(optimal_B)
# Calculate O(t)
if B_1[t] - IP_1[t] > 0:
O_1.append(B_1[t] - IP_1[t])
else:
O_1.append(0)
### Calculate the Standard Devation of the last half of time periods ###
def STD(numbers):
k = len(numbers)
mean = sum(numbers) / k
SD = (sum([dev*dev for dev in [x-mean for x in numbers]])/(k-1))**.5
return SD
start = (total//2)+1
# Only use the last half of the time periods to calculate the standard deviation
I_STD_1_L.append(STD(I_1[start:]))
I_STD_2_L.append(STD(I_2[start:]))
I_STD_3_L.append(STD(I_3[start:]))
I_STD_4_L.append(STD(I_4[start:]))
O_STD_0_L.append(STD(D[start:]))
O_STD_1_L.append(STD(O_1[start:]))
O_STD_2_L.append(STD(O_2[start:]))
O_STD_3_L.append(STD(O_3[start:]))
O_STD_4_L.append(STD(O_4[start:]))
from time import time
timeB = time()
timeleft(a,L,timeB-timeA)
I_STD_1[L//2] = I_STD_1_L[:]
I_STD_2[L//2] = I_STD_2_L[:]
I_STD_3[L//2] = I_STD_3_L[:]
I_STD_4[L//2] = I_STD_4_L[:]
O_STD_0[L//2] = O_STD_0_L[:]
O_STD_1[L//2] = O_STD_1_L[:]
O_STD_2[L//2] = O_STD_2_L[:]
O_STD_3[L//2] = O_STD_3_L[:]
O_STD_4[L//2] = O_STD_4_L[:]
CSV(a,L,I_STD_1,I_STD_2,I_STD_3,I_STD_4,O_STD_0,
O_STD_1,O_STD_2,O_STD_3,O_STD_4)
from time import time
timeE = time()
print("Run Time: ",(timeE-time0)/3600," hours")
This would be a good time to look at a profiler. You can profile the code to determine where time is being spent. It appears likely that your issue is in the simulation code, but without being able to see that code, the best help you're likely to get is going to be vague.
Edit in light of added code:
You're doing a fair amount of list copying, which, while not terribly expensive per copy, can consume a lot of time when repeated.
I agree that your code is probably unnecessarily confusing and would advise you to clean it up. Changing the confusing names to meaningful ones may help you find where you're having a problem.
Finally, it may be the case that your simulation is simply computationally expensive. You might want to consider looking into SciPy, Pandas, or some other Python mathematics package to get better performance and perhaps better tools for expressing the model you're simulating.
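The profiling suggestion above might look roughly like this with the standard library's cProfile and pstats modules. The functions run_simulation and slow_membership are hypothetical stand-ins for the real simulation, used only so that the sketch runs on its own.

```python
import cProfile
import io
import pstats

def slow_membership(n):
    """Hypothetical hot spot: repeated O(n) membership tests on a list."""
    seen = []
    for value in range(n):
        if value not in seen:
            seen.append(value)
    return len(seen)

def run_simulation():
    """Stand-in for the real simulation entry point."""
    return slow_membership(2000)

profiler = cProfile.Profile()
profiler.enable()
result = run_simulation()
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative followed by the script name gives a similar report without changing the code.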
I experienced a similar problem with a Python 3.x script I wrote. The script randomly generated 1,000,000 (one million) JSON objects, writing them out to a file.
My problem was that the program was growing progressively slower as time proceeded. Here is a timestamp trace every 10,000 objects:
So far: Mar23-17:56:46: 0
So far: Mar23-17:56:48: 10000 ( 2 seconds)
So far: Mar23-17:56:50: 20000 ( 2 seconds)
So far: Mar23-17:56:55: 30000 ( 5 seconds)
So far: Mar23-17:57:01: 40000 ( 6 seconds)
So far: Mar23-17:57:09: 50000 ( 8 seconds)
So far: Mar23-17:57:18: 60000 ( 8 seconds)
So far: Mar23-17:57:29: 70000 (11 seconds)
So far: Mar23-17:57:42: 80000 (13 seconds)
So far: Mar23-17:57:56: 90000 (14 seconds)
So far: Mar23-17:58:13: 100000 (17 seconds)
So far: Mar23-17:58:30: 110000 (17 seconds)
So far: Mar23-17:58:51: 120000 (21 seconds)
So far: Mar23-17:59:12: 130000 (21 seconds)
So far: Mar23-17:59:35: 140000 (23 seconds)
As can be seen, the script takes progressively longer to generate groups of 10,000 records.
In my case it turned out to be the way I was generating unique ID numbers, each in the range of 10250000000000-10350000000000. To avoid regenerating the same ID twice, I stored a newly generated ID in a list, checking later that the ID does not exist in the list:
import random
trekIdList = []
def GenerateRandomTrek():
    global trekIdList
    while True:
        r = random.randint(10250000000000, 10350000000000)
        if r not in trekIdList:
            trekIdList.append(r)
            return r
The problem is that an unsorted list takes O(n) to search. As newly generated IDs are appended to the list, the time needed to traverse/search the list grows.
The solution was to switch to a dictionary (or map):
trekIdList = {}
. . .
def GenerateRandomTrek():
    global trekIdList
    while True:
        r = random.randint(10250000000000, 10350000000000)
        if r not in trekIdList:
            trekIdList[r] = 1
            return r
The improvement was immediate:
So far: Mar23-18:11:30: 0
So far: Mar23-18:11:30: 10000
So far: Mar23-18:11:31: 20000
So far: Mar23-18:11:31: 30000
So far: Mar23-18:11:31: 40000
So far: Mar23-18:11:32: 50000
So far: Mar23-18:11:32: 60000
So far: Mar23-18:11:32: 70000
So far: Mar23-18:11:33: 80000
So far: Mar23-18:11:33: 90000
So far: Mar23-18:11:33: 100000
So far: Mar23-18:11:34: 110000
So far: Mar23-18:11:34: 120000
So far: Mar23-18:11:34: 130000
So far: Mar23-18:11:35: 140000
The reason is that looking up a key in a dictionary/map/hash is O(1) on average.
Moral: when dealing with large numbers of items, use a dictionary/map, or binary search on a sorted list, rather than an unordered list.
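Since only membership matters here, a set gives the same average O(1) lookup without storing a dummy value. Below is a minimal sketch of that variation; the factory function make_unique_id_generator and the seed parameter are illustrative additions, not part of the original script.

```python
import random

def make_unique_id_generator(low, high, seed=None):
    """Return a function that yields random IDs in [low, high] without repeats."""
    rng = random.Random(seed)   # private RNG; the seed is only for reproducibility
    seen = set()                # hash-based membership test, O(1) on average

    def generate():
        while True:
            r = rng.randint(low, high)
            if r not in seen:
                seen.add(r)
                return r

    return generate

generate_trek_id = make_unique_id_generator(10250000000000, 10350000000000, seed=0)
ids = [generate_trek_id() for _ in range(10000)]
```

Keeping the set inside a closure also avoids the global statement, though a module-level set would work just as well.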
You can use cProfile and the like, but many times it will still be hard to spot the issue. However, knowing that the slowdown grows linearly is a huge benefit, since you already roughly know what the problem is, just not exactly where it is.
I'd start by eliminating and simplifying:
Make a small, fast example that demonstrates the sluggishness as a separate file.
Run it and keep removing or commenting out large portions of the code.
Once you have narrowed it down enough, look for constructs such as values(), items(), in, for, and deepcopy.
By continuously simplifying the example and re-running the test script, you will eventually get down to the core issue.
Once you have resolved one bottleneck, you might find that the sluggishness is still there when you bring back the old code. Most probably there is more than one bottleneck.
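One way to support this narrowing-down is to time individual sections per period and see which one grows. The sketch below uses only the standard library; the timed helper, the section names, and the toy workload are all assumptions made for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

section_times = defaultdict(list)   # section name -> list of elapsed seconds

@contextmanager
def timed(name):
    """Record how long the enclosed block took under the given section name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        section_times[name].append(time.perf_counter() - start)

# Hypothetical workload: one section does constant work, the other grows per period.
master = []
for period in range(5):
    with timed('constant'):
        sum(range(1000))
    with timed('growing'):
        master.extend(range(1000))
        sorted(master)              # cost grows as master grows

for name, samples in section_times.items():
    print(name, ['%.6f' % s for s in samples])
```

A section whose samples climb from period to period is a good candidate for the growing bottleneck; sections with flat timings can be commented out of the reduced example.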