I have very limited coding ability. I am trying to re-create code found in the Canadian Journal of Surgery (DOI: https://doi.org/10.1503/cjs.016520). The code came from the appendix, where it lost some formatting and had to be manually edited. The Python code is supposed to read data from an input Excel file and optimize operating room times.
When I run the code as is, it says:
AttributeError: 'DataFrame' object has no attribute 'get_values'. Did you mean: '_get_value'?
I have searched Stack Overflow for help and believe 'get_values' has been deprecated since version 0.25.0, and that I should use np.asarray(..) or DataFrame.values instead. However, DataFrame.values is also discouraged, and I should use DataFrame.to_numpy(dtype=None, copy=False).
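For reference, this is the kind of one-line swap I tried (dfD is one of the single-column DataFrames from the full code below):
# old call, deprecated in pandas 0.25+:
# dfD = dfD.get_values()
# replacement I tried:
dfD = dfD.to_numpy()
dfD = [x for y in dfD for x in y]  # flatten the column into a plain list, as before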
When I change the code to DataFrame.to_numpy(), I get the following error:
File "/Users/stefonirvine/Desktop/code.py", line 110, in <module>
actualTotalTime[day] += time
TypeError: unsupported operand type(s) for +=: 'int' and 'datetime.time'
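From the error it looks like the duration columns are coming back as datetime.time objects rather than plain numbers, so I assume they would need to be converted to minutes before they can be summed. Something like this is my guess (just a sketch; I haven't confirmed how the durations are stored in the sheet):
import datetime

def to_minutes(value):
    # durations read from Excel may come back as datetime.time objects;
    # turn them into plain integer minutes so they can be added up
    if isinstance(value, datetime.time):
        return value.hour * 60 + value.minute
    return int(value)

dfA = [to_minutes(x) for x in dfA]  # applied to the flattened duration lists below
dfE = [to_minutes(x) for x in dfE]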
I've also tried changing it to .iloc[:, :].values.tolist(), since that is what the other dataframes use:
# Input data from excel sheet columns
dfP = pd.DataFrame(data, columns=["Procedure"])
dfP = dfP.iloc[:, :].values.tolist()
dfP = [x for y in dfP for x in y]
dfD = pd.DataFrame(data, columns=["Date"])
dfD = dfD.iloc[:, :].values.tolist()
dfD = [x for y in dfD for x in y]
dfE = pd.DataFrame(data, columns=["Book Dur"])
dfE = dfE.iloc[:, :].values.tolist()
dfE = [x for y in dfE for x in y]
dfA = pd.DataFrame(data, columns=["Actual Dur"])
dfA = dfA.iloc[:, :].values.tolist()
dfA = [x for y in dfA for x in y]
And then I get the following error:
Traceback (most recent call last):
File "/Users/stefonirvine/Desktop/code.py", line 215, in <module>
dashboard.at["Original overtime frequency", "B"] = round(currentOvertimeCount / count, 2)
ZeroDivisionError: division by zero
However, with the change to .iloc[:, :].values.tolist(), the code runs and starts to solve the problem, but then gets caught on the division-by-zero error:
Stefons-MacBook-Pro:desktop stefonirvine$ python3 code.py
Starting model optimisation
Starting to solve
Success!
number of days: 0
number of cases: 0
Traceback (most recent call last):
File "/Users/stefonirvine/Desktop/code.py", line 215, in <module>
dashboard.at["Original overtime frequency", "B"] = round(currentOvertimeCount / count, 2)
ZeroDivisionError: division by zero
I just need help updating the .get_values() call since it has been deprecated. Any advice on how to get this code to work would be appreciated. I believe the error is in the first half of the code.
Here is the current code in full:
from __future__ import print_function
from ortools.sat.python import cp_model
import pandas as pd
import sys
import math
#
# Parameters
targetOvertimeFrequency = 0.2
targetUndertimeFrequency = 0
undertimeCostWeight = 1
overtimeCostWeight = 1
arguments = sys.argv
excelInputFile = 'input.xlsx'
excelOutputFile = 'output.xlsx'
overtimeBlockLength = 15 # length of overtime blocks in minutes
noOfNursesPerBlock = 2.5 # number of nurses per overtime block
overtimeSalary = 75 # pay by block for overtime nurses
# Optional: get excel filenames from command line argument
if (len(arguments) > 1):
    excelInputFile = str(arguments[1])
    excelOutputFile = str(arguments[2])
# Read data from an excel file
data = pd.read_excel(r'' + excelInputFile, header=4)
#
# Input data from excel sheet columns
dfP = pd.DataFrame(data, columns=["Procedure"])
dfP = dfP.iloc[:, :].values.tolist()
dfP = [x for y in dfP for x in y]
dfD = pd.DataFrame(data, columns=["Date"])
dfD = dfD.get_values()
dfD = [x for y in dfD for x in y]
dfE = pd.DataFrame(data, columns=["Book Dur"])
dfE = dfE.iloc[:, :].values.tolist()
dfE = [x for y in dfE for x in y]
dfA = pd.DataFrame(data, columns=["Actual Dur"])
dfA = dfA.iloc[:, :].values.tolist()
dfA = [x for y in dfA for x in y]
rawProcedures = [] # a list of each procedure performed
procedureTypes = [] # a list of all the procedure codes
rawDays = [] # a list of dates for each procedure performed
days = [] # a list of the dates in the data
rawExpectedTimes = [] # a list of scheduled times for each procedure performed
rawActualTimes = [] # a list of actual times for each procedure performed
# Input data into arrays
for i in range(len(dfE)):
    if str(dfE[i]) != 'nan':
        if dfD[i] not in days:
            days.append(dfD[i])
        rawExpectedTimes.append(dfE[i])
        rawDays.append(dfD[i])
        rawActualTimes.append(dfA[i])
        rawProcedures.append(dfP[i])
        if dfP[i] not in procedureTypes:
            procedureTypes.append(dfP[i])
procedureTypes.sort() # sort procedure types alphabetically
proceduresPerDay = {w: [] for w in days} # a map from a day to a list of procedures performed on that day
for i in range(len(rawDays)):
    proceduresPerDay[rawDays[i]].append(rawProcedures[i])
expectedTimes = {w: [] for w in days} # a map from a day to a list of booking times for that day
for i in range(len(rawDays)):
    expectedTimes[rawDays[i]].append(rawExpectedTimes[i])
actualTimes = {w: [] for w in days} # a map from a day to a list of actual procedure times for that day
for i in range(len(rawDays)):
    actualTimes[rawDays[i]].append(rawActualTimes[i])
#
# Pre-processing
originalBookTimes = {p: 0 for p in procedureTypes} # a map with average booking times for each procedure
for p in procedureTypes:
    counter = 0
    for i in range(len(rawDays)):
        if p == rawProcedures[i]:
            counter += 1
            originalBookTimes[p] += rawExpectedTimes[i]
    originalBookTimes[p] = int(originalBookTimes[p]/counter)
expectedTotalTime = {} # a map from a day to the sum of booking times for that day
for day in days:
    expectedTotalTime[day] = 0
    for time in expectedTimes[day]:
        expectedTotalTime[day] += time
actualTotalTime = {} # a map from a day to the sum of actual procedure times for that day
for day in days:
    actualTotalTime[day] = 0
    for time in actualTimes[day]:
        actualTotalTime[day] += time
currentOvertimeCount = 0 # counts how many days the room went overtime
currentUndertimeCount = 0 # counts how many days the room went undertime
for day in days:
    if expectedTotalTime[day] + 0 < actualTotalTime[day]: currentOvertimeCount += 1
    elif actualTotalTime[day] + 15 < expectedTotalTime[day]: currentUndertimeCount += 1
minProcedureTime = {p: 1440 for p in procedureTypes} # a map from a procedure to its minimum case time
maxProcedureTime = {p: 0 for p in procedureTypes} # a map from a procedure to its maximum case time
for i in range(len(rawDays)):
    if rawActualTimes[i] < minProcedureTime[rawProcedures[i]]: minProcedureTime[rawProcedures[i]] = rawActualTimes[i]
    if rawActualTimes[i] > maxProcedureTime[rawProcedures[i]]:
        maxProcedureTime[rawProcedures[i]] = rawActualTimes[i]
count = len(days) # number of days
#
# Model
print("Starting model optimisation", file=sys.stderr)
# Decision variables
model = cp_model.CpModel() # Create the model
procedureSchedulingTimes = {} # Create a decision variable for the scheduling time of each procedure type
for p in procedureTypes:
    procedureSchedulingTimes[p] = model.NewIntVar(int(minProcedureTime[p]),
                                                  int(maxProcedureTime[p]),
                                                  "procedureSchedulingTime" + str(p))
totalScheduledTime = {} # The total time scheduled for each day
overtimeTriggers={} # Set to 1 if overtime for each day
undertimeTriggers = {} # Set to 1 if undertime for each day
for day in days:
    totalScheduledTime[day] = model.NewIntVar(0, 1000, "day" + str(day))
    overtimeTriggers[day] = model.NewIntVar(0, 1, "overtimeTrigger" + str(day))
    undertimeTriggers[day] = model.NewIntVar(0, 1, "undertimeTrigger" + str(day))
overtimeCount = model.NewIntVar(0, 1000, "overtimeCount") # Number of days that went overtime
undertimeCount = model.NewIntVar(0, 1000, "undertimeCount") # Number of days that went undertime
overtimeCost = model.NewIntVar(-1000, 1000, "overtimeCost") # Overtime cost
undertimeCost = model.NewIntVar(-1000, 1000, "undertimeCost") # Undertime cost
absOvertimeCost = model.NewIntVar(0, 100, "absOvertimeCost") # |overtime cost|
absUndertimeCost = model.NewIntVar(0, 100, "absUndertimeCost") # |undertime cost|
finalCost = model.NewIntVar(0, 100, "finalCost") # final cost is sum of overtime and undertime costs
#-------------------
# Constraints
for day in days:
    # set totalScheduledTime to the sum of scheduled time in a day
    model.Add(totalScheduledTime[day] == (sum(procedureSchedulingTimes[i] for i in proceduresPerDay[day])))
    # set overtime trigger to 1 if overtime
    model.Add(1000 * overtimeTriggers[day] >= int(actualTotalTime[day]) - totalScheduledTime[day] - 0 - 1)
    # set overtime trigger to 0 if not overtime
    model.Add(1500 * (1 - overtimeTriggers[day]) >= totalScheduledTime[day] - int(actualTotalTime[day]) + 0)
    # set undertime trigger to 1 if undertime
    model.Add(1000 * undertimeTriggers[day] >= totalScheduledTime[day] - 15 - int(actualTotalTime[day]) - 1)
    # set undertime trigger to 0 if not undertime
    model.Add(1000 * (1 - undertimeTriggers[day]) >= int(actualTotalTime[day]) - totalScheduledTime[day] + 15)
model.Add(overtimeCount == (sum(overtimeTriggers[day] for day in days))) # count how many days went overtime
model.Add(undertimeCount == (sum(undertimeTriggers[day] for day in days))) # count how many days went undertime
model.Add((overtimeCost == overtimeCount - int(count * targetOvertimeFrequency))) #calculate overtime cost
model.Add((undertimeCost == undertimeCount - int(count * targetUndertimeFrequency))) #calculate undertime cost
# calculate absolute value of overtime cost
model.AddAbsEquality(absOvertimeCost, int(overtimeCostWeight) * overtimeCost)
# calculate absolute value of undertime cost
model.AddAbsEquality(absUndertimeCost, int(undertimeCostWeight) * undertimeCost)
model.Add((finalCost == overtimeCost + undertimeCost)) # calculate final cost
model.Minimize(finalCost) # Objective is to minimize final cost
solver= cp_model.CpSolver()
print("Starting to solve", file=sys.stderr)
status = solver.Solve(model)
#
# Print output to console and excel sheet
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
    print("Success!")
    rows = ["Number of days","Number of cases","Original overtime frequency","Original undertime frequency","Model overtime frequency",
            "Model undertime frequency","Model cases achieved", "Original Overtime Minutes Used","Model Overtime Minutes Used",
            "Original Overtime Cost","Model Overtime Cost", "Original OR minutes used", "Model OR minutes used", "","Procedure Type"] + procedureTypes
    dashboard = pd.DataFrame(index=rows, columns=["A", "B", "C"])
    for r in rows:
        dashboard.at[r, "A"] = r
    #-------------------
    # Output basic stats
    dashboard.at["", ""] = ""
    dashboard.at["Number of days", "B"] = count
    print("number of days: ", count)
    dashboard.at["Number of cases", "B"] = len(rawExpectedTimes)
    print("number of cases: ", len(rawExpectedTimes))
    dashboard.at["Original overtime frequency", "B"] = round(currentOvertimeCount / count, 2)
    print("current overtime frequency: %.0f%%" % (100 * currentOvertimeCount / count))
    dashboard.at["Original undertime frequency", "B"] = round(currentUndertimeCount / count, 2)
    print("current undertime frequency: %.0f%%" % (100 * currentUndertimeCount / count))
    dashboard.at["Model overtime frequency", "B"] = round(solver.Value(overtimeCount) / count, 2)
    print("model overtime frequency: %.0f%%" % (100 * solver.Value(overtimeCount) / count))
    dashboard.at["Model undertime frequency", "B"] = round(solver.Value(undertimeCount) / count, 2)
    print("model undertime frequency: %.0f%%" % (100 * solver.Value(undertimeCount) / count))
    #-------------------
    # Output overtime minutes and overtime cost
    originalOverMinutes = 0
    modelOverMinutes = 0
    originalCost = 0
    modelCost = 0
    for day in days:
        if solver.Value(totalScheduledTime[day]) < actualTotalTime[day]:
            modelOverMinutes += actualTotalTime[day] - solver.Value(totalScheduledTime[day])
            modelCost += math.ceil((actualTotalTime[day] - solver.Value(totalScheduledTime[day])) / overtimeBlockLength) * noOfNursesPerBlock * overtimeSalary
        if expectedTotalTime[day] < actualTotalTime[day]:
            originalOverMinutes += actualTotalTime[day] - expectedTotalTime[day]
            originalCost += math.ceil((actualTotalTime[day] - expectedTotalTime[day]) / overtimeBlockLength) * noOfNursesPerBlock * overtimeSalary
    print("Original overtime minutes used: ", originalOverMinutes)
    dashboard.at["Original Overtime Minutes Used", "B"] = originalOverMinutes
    print("Model overtime minutes used: ", modelOverMinutes)
    dashboard.at["Model Overtime Minutes Used", "B"] = modelOverMinutes
    print("Original Overtime cost: ", originalCost)
    dashboard.at["Original Overtime Cost", "B"] = originalCost
    print("Model overtime cost: ", modelCost)
    dashboard.at["Model Overtime Cost", "B"] = modelCost
    print("\n")
    #-------------------
    # Output original and machine learning scheduling times for each procedure
    dashboard.at["Procedure Type", "B"] = "Original Time"
    dashboard.at["Procedure Type", "C"] = "Machine Learning Time"
    for p in procedureTypes:
        print(p + ": " + "%d" % solver.Value(procedureSchedulingTimes[p]))
        dashboard.at[p, "B"] = originalBookTimes[p]
        dashboard.at[p, "C"] = solver.Value(procedureSchedulingTimes[p])
    #-------------------
    # Output cases completed with model and OR minutes used by original and machine learning methods
    totalNoCases = 0
    originalCases = 0
    modelCases = 0
    originalMinutes = 0
    modelMinutes = 0
    totalMinutes = 0
    print("\n")
    for day in days:
        noOfCases = len(actualTimes[day])
        totalNoCases += noOfCases
        originalMinutes += actualTotalTime[day]
        modelMinutes += solver.Value(totalScheduledTime[day])
        ORTime = 450 - (noOfCases - 1) * 15
        totalMinutes += ORTime
        originalTime = (ORTime - actualTotalTime[day])
        modelTime = (ORTime - solver.Value(totalScheduledTime[day]))
        originalCases += noOfCases
        if originalTime < 0:
            for i in reversed(range(len(actualTimes[day]))):
                originalTime += actualTimes[day][i]
                originalCases -= 1
                if originalTime > 0:
                    break
        modelCases += noOfCases
        if modelTime < 0:
            for i in reversed(range(len(actualTimes[day]))):
                modelTime += actualTimes[day][i]
                modelCases -= 1
                if modelTime > 0:
                    break
    print("Cases achieved with machine learning model: %.0f%%" % (100 * modelCases / totalNoCases))
    dashboard.at["Model cases achieved", "B"] = round(modelCases / totalNoCases, 2)
    print("OR minutes used with original model: %.0f%%" % (100 * originalMinutes / totalMinutes))
    dashboard.at["Original OR minutes used", "B"] = round(originalMinutes / totalMinutes, 2)
    print("OR minutes used with machine learning model: %.0f%%" % (100 * modelMinutes / totalMinutes))
    dashboard.at["Model OR minutes used", "B"] = round(modelMinutes / totalMinutes, 2)
    dashboard.to_excel(r'' + excelOutputFile, index=False)
else:
    print("No feasible solution found.")
Related
y = []
n = 0
days = 1
for i in btc['Adj Close']:
averagePrice = (i + n) / days
n += i
days += 1
y.append(averagePrice)
btc['TopAverage'] = y
If btc is a pandas data frame (as it appears to be), then:
btc.loc[:, 'Days'] = list(range(1, btc.shape[0] + 1))
btc.loc[:, 'n'] = btc['Adj Close'].cumsum().shift(periods=1, fill_value=0.)
btc.loc[:, 'TopAverage'] = (btc['Adj Close'] + btc['n']) / btc['Days']
reflects the logic in your code. This will add the columns 'Days' and 'n' to the data frame as well.
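For example, with a small made-up price series (values chosen only for illustration):
import pandas as pd

btc = pd.DataFrame({'Adj Close': [10.0, 12.0, 11.0, 13.0]})  # toy data
btc.loc[:, 'Days'] = list(range(1, btc.shape[0] + 1))
btc.loc[:, 'n'] = btc['Adj Close'].cumsum().shift(periods=1, fill_value=0.)
btc.loc[:, 'TopAverage'] = (btc['Adj Close'] + btc['n']) / btc['Days']
print(btc['TopAverage'].tolist())  # [10.0, 11.0, 11.0, 11.5]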
Probably an over-simplified model for a stock price: on each day, the price goes up by a factor of 1.05 with probability 0.6 or goes down by a factor of 1/1.05 with probability 0.4, so this is a non-symmetrical binary tree. How can I analytically calculate the expectation and variance of this stock price on a future date, say day 100? Also, is there any module in Python to handle a binary tree model like this? Code to implement this would be appreciated.
Best regards
import random as r
s = 100 # starting value
The lines above are the initial conditions. Simulating one day on the stock market:
def day(stock_value): #One day in the stock market
    k = r.uniform(0,1)
    if k < 0.6:
        output = 1.05*stock_value
    else:
        output = stock_value/1.05
    return(output)
Simulating 100 days on the stock market:
for j in range(100): #simulates 100 days in the stock market
    s = day(s)
print(s)
Simulating 100 days 1000 times:
data = []
for i in range(1000):
    s = [100]
    for j in range(100):
        s.append(day(s[j]))
    data.append(s)
Converting the data to only consider the last day:
def mnnm(mat): #Makes an mxn matrix into an nxm matrix
    out = []
    for j in range(len(mat[0])):
        out.append([])
    for j in range(len(mat[0])):
        for m in range(len(mat)):
            out[j].append(mat[m][j])
    return(out)
data = mnnm(data)
data = data[-1]
Taking a mean average:
def lst_avg(lst): #Returns the average of a list
    output = 0
    for j in range(len(lst)):
        output += lst[j]/len(lst)
    return(output)
mean = lst_avg(data)
Variance:
import numpy as np
for h in range(len(data)):
    data[h] = data[h]**2
mean_square = lst_avg(data)
variance = np.fabs(mean_square - mean**2)
The theoretical value after 1 day is (assuming value on day 0 is A)
A * 0.6 * 1.05 + A * 0.4/1.05
And after 100 days it's
A * (0.63 + 0.380952...)**100 so...
(in the following I use 1 as stock price on day 0.)
p1 = 0.6
p2 = 0.4
x1 = 1.05
x2 = 1/1.05
initial_value = 1
no_of_days = 100
# 1 day
expected_value_after_1_day = initial_value * ( p1*x1 + p2*x2)
print (expected_value_after_1_day, 'is the expected value of price after 1 day')
ex_squared_value_1_day = initial_value * (p1*x1**2 + p2*x2**2)
# variance can be calculated as follows
variance_day_1 = ex_squared_value_1_day - expected_value_after_1_day**2
# or an alternative calculation, summing the squares of the differences from the mean
alt_variance_day_1 = p1 * (x1 - expected_value_after_1_day) ** 2 + p2 * (x2 - expected_value_after_1_day) ** 2
print ('Variance after one day is', variance_day_1)
# 100 days
expected_value_n_days = initial_value * (p1*x1 + p2*x2) ** no_of_days
ex_squared_value_n_days = initial_value * (p1*x1**2 + p2*x2**2) ** no_of_days
ex_value_n_days_squared = expected_value_n_days ** 2
variance_n_days = ex_squared_value_n_days - ex_value_n_days_squared
print(expected_value_n_days, 'is the expected value of price after {} days'.format(no_of_days))
print(ex_squared_value_n_days, 'is the expected value of the square of the price after {} days'.format(no_of_days))
print(ex_value_n_days_squared, 'is the square of the expected value of the price after {} days'.format(no_of_days))
print(variance_n_days, 'is the variance after {} days'.format(no_of_days))
It probably looks a bit old-school, hope you don't mind!
Output
1.0109523809523808 is the expected value of price after 1 day
Variance after one day is 0.0022870748299321786
2.972144657651404 is the expected value of price after 100 days
11.046309656223373 is the expected value of the square of the price after 100 days
8.833643866005783 is the square of the expected value of the price after 100 days
2.2126657902175904 is the variance after 100 days
So I'm trying to work out at what time of day the man will finish running. In the last line of code, when I try to use the addition operator, this error pops up. What am I doing wrong?
TypeError: unsupported operand type(s) for +: 'datetime.time' and 'datetime.timedelta'
import datetime
# first run, 1 mile
pace_minutes1 = 8
pace_seconds1 = 15
distance = 1 #1 mile
b = ((pace_minutes1 * 60) + pace_seconds1) * distance
print(b)
# second run, 3 mile
pace_minutes2 = 7
pace_seconds2 = 12
distance = 3 # 3 miles
a = ((pace_minutes2 * 60) + pace_seconds2) * distance
print(a)
# third run, 1 mile run
pace_minutes3 = 8
pace_seconds3 = 15
distance = 1 # 1 mile
c = ((pace_minutes3 * 60) + pace_seconds3) * distance
print(c)
x = (a + b + c)
print(x, "seconds")
time = datetime.timedelta(seconds=x) # converse value "x" seconds to time (h,m,s)
print("All running time will gonna take",time,"seconds") # total time spent running
start_time = datetime.time(6,52,0) # time from where he started running
print("The running started in",start_time)
end_time = (start_time + time) # time when he ended running
print(end_time)
You can only add the timedelta to a datetime.datetime object, not a datetime.time object (probably in order to avoid overflows).
So a simple fix: use start_time = datetime.datetime(2021, 4, 21, 6, 52, 0) (note that leading zeros such as 04 are not valid integer literals in Python 3).
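Putting it together, this might look as follows (the date is arbitrary since only the time of day matters; x is the total number of seconds computed in the question):
import datetime

x = 2286                                                # total running time in seconds (495 + 1296 + 495)
time = datetime.timedelta(seconds=x)
start_time = datetime.datetime(2021, 4, 21, 6, 52, 0)   # arbitrary date, 06:52 start
end_time = start_time + time                            # datetime + timedelta is allowed
print(end_time.time())                                  # 07:30:06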
Please find the answer below and check whether it is what you wanted. Maybe you can also check the related Stack Overflow answer.
import datetime
# first run, 1 mile
pace_minutes1 = 8
pace_seconds1 = 15
distance = 1 #1 mile
b = ((pace_minutes1 * 60) + pace_seconds1) * distance
print(b)
# second run, 3 mile
pace_minutes2 = 7
pace_seconds2 = 12
distance = 3 # 3 miles
a = ((pace_minutes2 * 60) + pace_seconds2) * distance
print(a)
# third run, 1 mile run
pace_minutes3 = 8
pace_seconds3 = 15
distance = 1 # 1 mile
c = ((pace_minutes3 * 60) + pace_seconds3) * distance
print(c)
x = (a + b + c)
print(x, "seconds")
time = datetime.timedelta(seconds=x) # converse value "x" seconds to time (h,m,s)
print("All running time will gonna take",time,"seconds") # total time spent running
start_time = datetime.time(6,52,0) # time from where he started running
# print(type(start_time))
# print(type(time))
print("The running started in",start_time)
end_time = (datetime.datetime.combine(datetime.date(1, 1, 1), start_time) + time).time()
print(end_time)
I am in a lower-level coding class (Python) and have a major project due in three days. One of our grading criteria is program speed. My program runs in about 30 seconds, ideally it would execute in 15 or less. Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import time
start_time = time.time()#for printing execution time
#function for appending any number of files to a dataframe
def read_data_files(pre, start, end): #reading in the data
    data = pd.DataFrame()#dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv" #string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns = {y: y})
        data = data.append(dpath)
        x += 1
    return data
data = read_data_files("Data_", 5, 163) #start, end, prefix...
#converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time) #had issues here for some reason, so i created another array to replace the time column in the dataframe
data[' Time'] = human_timen
hours = [] #for use as x-axis in plot
stdlist = [] #for use as y-axis in plot
histlist = [] #for storing magnitudes of most active hour
def magfind(row): #separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5
def filterfunction(intro1, intro2, first, last): #two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7: #data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1) #creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()#finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last: #data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else: #in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p) #need this so that the hours with no data still get graphed
        if m > meanmax: # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k #most active hour
            for i in acc['magnitude']:
                histlist.append(i) #adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist
filterfunction(' 0', ' ', 0, 23)
The slow speed stems from the filterfunction function. What this program does is read data from over 100 files, and this function sorts the data into a dataframe and analyzes it by time (each individual hour) in order to process all the rows for that hour. I believe it might be sped up by changing the way the data is filtered when searching by hour, but I am not sure. The reason I have statements to exclude certain k-values is that there are hours with no data, which would mess up the list of standard deviation calculations as well as the plot this data will feed. Any tips or ideas for speeding this up would be greatly appreciated!
One suggestion to speed it up a bit is to remove this line since it is not being used anywhere in the program:
import matplotlib.pyplot as plt
matplotlib is a big library, so removing the import should improve performance slightly.
Also, I think you could get rid of numpy since it is only used once... consider using a plain tuple instead.
I was not able to test this because I am on mobile right now. My main idea is not to make the code better or leaner; I changed how the processing itself is executed. I integrated the multiprocessing library into your code, calculated the number of CPU cores on the system, and divided the work between them.
Detailed multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import psutil
from datetime import datetime
from multiprocessing import Pool
cores = psutil.cpu_count()
start_time = time.time()#for printing execution time
#function for appending any number of files to a dataframe
def read_data_files(pre, start, end): #reading in the data
    data = pd.DataFrame()#dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv" #string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns = {y: y})
        data = data.append(dpath)
        x += 1
    return data
data = read_data_files("Data_", 5, 163) #start, end, prefix...
#converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time) #had issues here for some reason, so i created another array to replace the time column in the dataframe
data[' Time'] = human_timen
hours = [] #for use as x-axis in plot
stdlist = [] #for use as y-axis in plot
histlist = [] #for storing magnitudes of most active hour
def magfind(row): #separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5
def filterfunction(intro1, intro2, first, last): #two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7: #data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1) #creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()#finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last: #data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis = 1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else: #in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p) #need this so that the hours with no data still get graphed
        if m > meanmax: # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k #most active hour
            for i in acc['magnitude']:
                histlist.append(i) #adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist
# Run this with a pool of 5 agents having a chunksize of 3 until finished
agents = cores
chunksize = (len(data) / cores)
with Pool(processes=agents) as pool:
    pool.map(filterfunction, (' 0', ' ', 0, 23))
Don't use apply, it's not vectorized. Instead, use vectorized operations whenever you can. In this case, instead of doing df.apply(magfind, 1), do:
def add_magnitude(df):
    df['magnitude'] = (df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2) ** .5
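Going one step further, the whole per-hour filtering loop could probably be collapsed into a single groupby. This is only a sketch, under the assumption that the ' Time' column is kept as datetimes (parsed from the raw Unix timestamps) instead of formatted strings, with data being the frame built by read_data_files above:
# parse the raw Unix timestamps once, instead of building formatted strings
data[' Time'] = pd.to_datetime(data[' Time'], unit='s')
# vectorized magnitude for every row
data['magnitude'] = (data[' Acc X'] ** 2 + data[' Acc Y'] ** 2 + data[' Acc Z'] ** 2) ** .5
# std dev and mean of magnitude for each hour of the day in one pass
by_hour = data.groupby(data[' Time'].dt.hour)['magnitude'].agg(['std', 'mean'])
stdlist = by_hour['std'].reindex(range(24), fill_value=0).tolist()  # hours with no data become 0
active = by_hour['mean'].idxmax()                                   # most active hour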
I'm querying an MS Access db to retrieve a set of leases. My task is to calculate monthly totals for base rent for the next 60 months. The leases have start and end dates in order to calculate the correct periods in the event a lease terminates before 60 periods. My current challenge comes in when I attempt to increase the base rent by a certain amount whenever it's time to increment for that specific lease. I'm at a beginner level with Python/pandas, so my approach is likely not optimal and the code is rough looking. A vectorized approach is probably better suited, however I'm not yet able to write such code.
Data:
Lease input & output
Code:
try:
    sql = 'SELECT * FROM [tbl_Leases]'
    #sql = 'SELECT * FROM [Copy Of tbl_Leases]'
    df = pd.read_sql(sql, conn)
    #print df
    #df.to_csv('lease_output.csv', index_label='IndexNo')
    df_fcst_periods = pd.DataFrame()
    # init increments
    periods = 0
    i = 0
    # create empty lists to store looped info from original df
    fcst_months = []
    fcst_lease_num = []
    fcst_base_rent = []
    fcst_method = []
    fcst_payment_int = []
    fcst_rate_inc_amt = []
    fcst_rate_inc_int = []
    fcst_rent_start = []
    # create array for period deltas, rent interval calc, pmt interval calc
    fcst_period_delta = []
    fcst_rate_int_bool = []
    fcst_pmt_int_bool = []
    for row in df.itertuples():
        # get min of forecast period or lease ending date
        min_period = min(fcst_periods, df.Lease_End_Date[i])
        # count periods to loop for future periods in new df_fcst
        periods = (min_period.year - currentMonth.year) * 12 + (min_period.month - currentMonth.month)
        for period in range(periods):
            nextMonth = (currentMonth + monthdelta(period))
            period_delta = (nextMonth.year - df.Rent_Start_Date[i].year) * 12 + (nextMonth.month - df.Rent_Start_Date[i].month)
            period_delta = float(period_delta)
            # period delta values allow us to divide by the payment & rent intervals looking for integers
            rate_int_calc = period_delta/df['Rate_Increase_Interval'][i]
            pmt_int_calc = period_delta/df['Payment_Interval'][i]
            # float.is_integer() method - returns bool
            rate_int_bool = rate_int_calc.is_integer()
            pmt_int_bool = pmt_int_calc.is_integer()
            # conditional logic to handle base rent increases
            if df['Forecast_Method'][i] == "Percentage" and rate_int_bool:
                rate_increase = df['Base_Rent'][i] * (1 + df['Rate_Increase_Amt'][i]/100)
                df.loc[df.index, "Base_Rent"] = rate_increase
                fcst_base_rent.append(df['Base_Rent'][i])
                print "Both True"
            else:
                fcst_base_rent.append(df['Base_Rent'][i])
                print rate_int_bool
            fcst_rate_int_bool.append(rate_int_bool)
            fcst_pmt_int_bool.append(pmt_int_bool)
            fcst_months.append(nextMonth)
            fcst_period_delta.append(period_delta)
            fcst_rent_start.append(df['Rent_Start_Date'][i])
            fcst_lease_num.append(df['Lease_Number'][i])
            #fcst_base_rent.append(df['Base_Rent'][i])
            fcst_method.append(df['Forecast_Method'][i])
            fcst_payment_int.append(df['Payment_Interval'][i])
            fcst_rate_inc_amt.append(df['Rate_Increase_Amt'][i])
            fcst_rate_inc_int.append(df['Rate_Increase_Interval'][i])
        i += 1
    df_fcst_periods['Month'] = fcst_months
    df_fcst_periods['Rent_Start_Date'] = fcst_rent_start
    df_fcst_periods['Lease_Number'] = fcst_lease_num
    df_fcst_periods['Base_Rent'] = fcst_base_rent
    df_fcst_periods['Forecast_Method'] = fcst_method
    df_fcst_periods['Payment_Interval'] = fcst_payment_int
    df_fcst_periods['Rate_Increase_Amt'] = fcst_rate_inc_amt
    df_fcst_periods['Rate_Increase_Interval'] = fcst_rate_inc_int
    df_fcst_periods['Period_Delta'] = fcst_period_delta
    df_fcst_periods['Rate_Increase_Interval_bool'] = fcst_rate_int_bool
    df_fcst_periods['Payment_Interval_bool'] = fcst_pmt_int_bool
except Exception, e:
    print str(e)
conn.close()
I ended up initializing a variable before the periods loop which allowed me to perform a calculation when looping to obtain the correct base rents for subsequent periods.
# init base rent, rate increase amount, new rate for leases
base_rent = df['Base_Rent'][i]
rate_inc_amt = float(df['Rate_Increase_Amt'][i])
new_rate = 0
for period in range(periods):
    nextMonth = (currentMonth + monthdelta(period))
    period_delta = (nextMonth.year - df.Rent_Start_Date[i].year) * 12 + (nextMonth.month - df.Rent_Start_Date[i].month)
    period_delta = float(period_delta)
    # period delta values allow us to divide by the payment & rent intervals looking for integers
    rate_int_calc = period_delta/df['Rate_Increase_Interval'][i]
    pmt_int_calc = period_delta/df['Payment_Interval'][i]
    # float.is_integer() method - returns bool
    rate_int_bool = rate_int_calc.is_integer()
    pmt_int_bool = pmt_int_calc.is_integer()
    # conditional logic to handle base rent increases
    if df['Forecast_Method'][i] == "Percentage" and rate_int_bool:
        new_rate = base_rent * (1 + rate_inc_amt/100)
        base_rent = new_rate
        fcst_base_rent.append(new_rate)
    elif df['Forecast_Method'][i] == "Manual" and rate_int_bool:
        new_rate = base_rent + rate_inc_amt
        base_rent = new_rate
        fcst_base_rent.append(new_rate)
    else:
        fcst_base_rent.append(base_rent)
Still open for any alternative approaches though!