Source container constraints for assignment optimization in CVXPY - python

I am writing an assignment optimization (multiple knapsack) in CVXPY that assigns multiple items from multiple source containers to multiple knapsacks. Each item has a value, and there is an additional cost for assigning items from each container to each knapsack (e.g. shipping cost from sources to destinations). Each item can only be assigned to one knapsack.
My question is whether it's possible to write a few constraints such that:
items assigned to knapsacks are in multiples of 88
each bundle of 88 items has to come from the same source container
For example, multiple bundles can be assigned to a knapsack from different source containers, and the capacities of the knapsacks are also multiples of 88.
Example code:
import cvxpy as cp
import pandas as pd
import numpy as np

knapsack_df = pd.read_csv('knapsacks.csv')
item_df = pd.read_csv('items.csv')
assign_cost_data_df = pd.read_csv('assignment_costs.csv')

capacities = knapsack_df['capacities'].values  ## capacities of each knapsack
B = len(item_df)        ## num_items
C = len(knapsack_df)    ## num_knapsacks
tc = np.sum(knapsack_df['capacities'])  ## total capacity across all the knapsacks

x = cp.Variable((B, C), boolean=True)   ## x[b, k] = 1 if item b is assigned to knapsack k

obj_item = cp.sum(cp.multiply(item_df['value'].values, cp.sum(x, axis=1)))
obj_assign = cp.sum(cp.multiply(assign_cost_data_df.values, x))  ## assign_cost_data_df has shape (B, C) as well
obj_supply = cp.sum(x)
objective = cp.Minimize(obj_item + obj_assign - 100*obj_supply)  ## heavily reward assigning items

constraints = []
constraints += [cp.sum(x, axis=1) <= np.ones(B)]   ## each item can only be assigned once
constraints += [cp.sum(x, axis=0) <= capacities]   ## cannot exceed capacity of knapsacks
## todo: constrain assignment to multiples of 88 from same container

problem = cp.Problem(objective, constraints)
problem.solve(solver='CBC')
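One formulation I have been considering for the multiples-of-88 requirement (a rough, untested sketch; the membership matrix M and the integer bundle variable y are new helper objects on top of the code above, and would go before problem = cp.Problem(...)):
## force the number of items taken from each container into each knapsack
## to be an exact multiple of 88 via an auxiliary integer "bundle count" variable
containers = item_df['container'].unique()
S = len(containers)
M = np.array([(item_df['container'] == c).values for c in containers]).astype(int)  ## M[s, b] = 1 if item b is in container s

y = cp.Variable((S, C), integer=True)  ## y[s, k] = number of 88-item bundles from container s in knapsack k
constraints += [y >= 0]
constraints += [M @ x == 88 * y]       ## per-container, per-knapsack item counts must be multiples of 88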
If the constraints are not possible, is there a way to at least minimize the number of containers being used and heavily weight that in the cost function?
something like:
obj_container = 0
for i in range(x.shape[1]):
    obj_container += len(set(cp.multiply(x[:,i], item_df['container'].values)))
objective = cp.Minimize(obj_item + obj_assign - 100*obj_supply + 8000*obj_container)
## container weighted more heavily than each individual item, so the optimization would need to source multiple items from another container for it to be "worth it".
Essentially, is there a way to count the unique/distinct values in cp.multiply(x[:,i], item_df['container'].values)?
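If that is not possible either, I was imagining something like the indicator-variable sketch below to penalize the number of (container, knapsack) pairings used (again a rough, untested sketch; z, bigM, M and S are new helper objects, defined as in the sketch above):
## binary z[s, k] = 1 if knapsack k uses any item from container s
containers = item_df['container'].unique()
S = len(containers)
M = np.array([(item_df['container'] == c).values for c in containers]).astype(int)
items_per_container = M.sum(axis=1)              ## big-M bound per container

z = cp.Variable((S, C), boolean=True)
bigM = np.repeat(items_per_container[:, None], C, axis=1)
constraints += [M @ x <= cp.multiply(bigM, z)]   ## no items from container s in knapsack k unless z[s, k] = 1
obj_container = cp.sum(z)                        ## number of (container, knapsack) pairings used
objective = cp.Minimize(obj_item + obj_assign - 100*obj_supply + 8000*obj_container)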


Minimal cost flow network when using multiple demand/supply sets (networkx)

This is my first stackoverflow submission, so please let me know whether I am asking this question according to community guidelines.
Question in short: Is there a way to have networkx find the min cost flow network given multiple demand/supply sets?
I am currently working on a network optimization problem in which I use the nx.min_cost_flow() function to calculate the minimum cost network layout given a supply and demand set for the nodes and maximum capacities for certain edges. The code works, but now I am trying to find the optimal network layout when given multiple supply and demand sets over time. I have tried a method that works for optimal trees (iterating through all the demand/supply sets and keeping the maximum capacity for each edge over the time steps), but sadly this does not give the optimal solution when using nx.min_cost_flow(), as I want the model to be able to create loops as well.
I have added the relevant code below, but it might be a bit confusing as there are many elements referring to other parts of the model unrelated to the problem I am currently facing. The complete model is too large to share.
I have tried looking for a solution online without success. Hopefully you know a solution. Thanks in advance!
def full_existing(T,G,folder,rpc,routing):
    fullG = G.copy()
    for (i,j) in fullG.edges():
        fullG[i][j]['capacity'] = 0
    for (i,j) in T.edges():
        if 'current' in T[i][j]:
            fullG[i][j]['weight'] = T[i][j]['weight']*rpc
    f_demand = open(folder+'/demand.txt','r')
    for X in f_demand:
        print('Demand set is called')
        Y = X.split(',')
        demand = {i: float(Y[i+1]) for i in range(len(Y)-1)}
        G1 = fullG.copy()
        for i in fullG.nodes():
            if i in demand.keys():
                G1.nodes[i]['demand'] = demand[i]
            else:
                G1.nodes[i]['demand'] = 0
        for (i,j) in G1.edges():
            G1[i][j]['capacity'] = inf
        for (i,j) in T.edges():
            if 'current' in T[i][j]:
                G1[i][j]['capacity'] = T[i][j]['current']
                G1[i][j]['weight'] = T[i][j]['weight']*rpc
            else:
                G1[i][j]['capacity'] = inf
        G1 = simplex(G1)
        for i,j in G1.edges():
            fullG[i][j]['capacity'] = max(fullG[i][j]['capacity'],G1[i][j]['capacity'])
    for (i,j) in T.edges():
        if 'current' in T[i][j]:
            fullG[i][j]['weight'] = T[i][j]['weight']
    nx.set_edge_attributes(fullG,nx.get_edge_attributes(T,'current'),'current')
    return fullG
The simplex procedure is called because the input graph is undirected.
def simplex(G):
    GX = G.copy()
    G1 = GX.to_directed()
    ns = nx.min_cost_flow(G1)
    for i,j in G1.edges():
        if ns[i][j] > 0 or ns[j][i] > 0:
            GX[i][j]['capacity'] = max(ns[i][j],ns[j][i])
        else:
            GX[i][j]['capacity'] = 0
    return GX

Pyomo accessing/retrieving dual variables - shadow price with binary variables

I am pretty new to optimization in general and pyomo in particular, so I apologize in advance for any rookie mistakes.
I have defined a simple unit commitment exercise (example 3.1 from [1]) using [2] as a starting point. I got the correct result and my code runs, but I have a few questions regarding how to access some of the results.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import sys
import os.path
import pyomo.environ as pyo
import pyomo.gdp as gdp  # necessary if you use booleans to select active and inactive units

def bounds_rule(m, n, param='Cap_MW'):
    # m because it passes the model
    # n because it needs a variable from each set, in this case there was only m.N
    return (0, Gen[n][param])  # returns lower and upper bounds

def unit_commitment():
    m = pyo.ConcreteModel()
    m.dual = pyo.Suffix(direction=pyo.Suffix.IMPORT_EXPORT)
    N = Gen.keys()
    m.N = pyo.Set(initialize=N)
    m.Pgen = pyo.Var(m.N, bounds=bounds_rule)  # amount of generation
    m.Rgen = pyo.Var(m.N, bounds=bounds_rule)  # amount of reserve
    # m.OnOff = pyo.Var(m.N, domain=pyo.Binary)  # boolean on/off marker
    # objective
    m.cost = pyo.Objective(expr=sum(m.Pgen[n]*Gen[n]['energy_$MWh'] + m.Rgen[n]*Gen[n]['res_$MW'] for n in m.N), sense=pyo.minimize)
    # demand
    m.demandP = pyo.Constraint(rule=lambda m: sum(m.Pgen[n] for n in N) == Demand['ener_MWh'])
    m.demandR = pyo.Constraint(rule=lambda m: sum(m.Rgen[n] for n in N) == Demand['res_MW'])
    # machine production limits
    # m.lb = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_min']*m.OnOff[n] <= m.Pgen[n]+m.Rgen[n])
    # m.ub = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_MW']*m.OnOff[n] >= m.Pgen[n]+m.Rgen[n])
    m.lb = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_min'] <= m.Pgen[n]+m.Rgen[n])
    m.ub = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_MW'] >= m.Pgen[n]+m.Rgen[n])
    m.rc = pyo.Suffix(direction=pyo.Suffix.IMPORT)
    return m
Gen = {
    'GenA': {'Cap_MW': 100, 'energy_$MWh': 10, 'res_$MW': 0, 'Cap_min': 0},
    'GenB': {'Cap_MW': 100, 'energy_$MWh': 30, 'res_$MW': 25, 'Cap_min': 0},
}  # starting data
Demand = {
    'ener_MWh': 130, 'res_MW': 20,
}  # starting data
m = unit_commitment()
pyo.SolverFactory('glpk').solve(m).write()
m.display()
df = pd.DataFrame.from_dict([m.Pgen.extract_values(), m.Rgen.extract_values()]).T.rename(columns={0: "P", 1: "R"})
print(df)
print("Cost Function result: " + str(m.cost.expr()) + "$.")
print(m.rc.display())
print(m.dual.display())
print(m.dual[m.demandR])
da = {'duals': m.dual[m.demandP],
      'uslack': m.demandP.uslack(),
      'lslack': m.demandP.lslack(),
      }
db = {'duals': m.dual[m.demandR],
      'uslack': m.demandR.uslack(),
      'lslack': m.demandR.lslack(),
      }
duals = pd.DataFrame.from_dict([da, db]).T.rename(columns={0: "demandP", 1: "demandR"})
print(duals)
Here are my questions.
1 - Duals/shadow prices: by definition, the shadow prices are the dual variables of the constraints (m.demandP and m.demandR). Is there a way to access these values and put them into a dataframe without the clunky thing I did, i.e. manually defining da and db and then creating the dataframe by joining both dictionaries? I would like something cleaner, like the df that holds the P and R results for each generator in the system.
2 - Usually in the unit commitment problem, binary variables are used to "mark" or "select" active and inactive units, hence the m.OnOff variable (commented line). From what I found in [3], duals don't exist in models containing binary variables, so I rewrote the problem without the binaries. This is not a problem in this simplistic exercise in which all units run, but it will be for larger ones: I need to let the optimization decide which units will and won't run, and I still need the shadow prices. Is there a way to obtain the shadow prices/duals in a problem containing binary variables?
I left the constraint definitions based on binary variables in (commented out) in case someone finds them useful.
Note: the code also runs with the binary variables and gets the correct result; however, I couldn't figure out how to get the shadow prices. Hence my question.
[1] Morales, J. M., Conejo, A. J., Madsen, H., Pinson, P., & Zugno, M. (2013). Integrating renewables in electricity markets: operational problems (Vol. 205). Springer Science & Business Media.
[2] https://jckantor.github.io/ND-Pyomo-Cookbook/04.06-Unit-Commitment.html
[3] Dual Variable Returns Nothing in Pyomo
To answer 1, you can dynamically get the constraint objects from your model using model.component_objects(pyo.Constraint), which returns an iterator over your constraints and keeps you from having to hard-code the constraint names. It gets tricky for indexed constraints because you have to do an extra step to get the slacks for each index, not just the constraint object. For the duals, you can iterate over the keys attribute to retrieve those values.
duals_dict = {str(key):m.dual[key] for key in m.dual.keys()}
u_slack_dict = {
    # uslacks for non-indexed constraints
    **{str(con): con.uslack() for con in m.component_objects(pyo.Constraint)
       if not con.is_indexed()},
    # indexed constraint uslack
    # loop through the indexed constraints
    # get all the indices then retrieve the slacks for each index of constraint
    **{k: v for con in m.component_objects(pyo.Constraint) if con.is_indexed()
       for k, v in {'{}[{}]'.format(str(con), key): con[key].uslack()
                    for key in con.keys()}.items()}
}
l_slack_dict = {
    # lslacks for non-indexed constraints
    **{str(con): con.lslack() for con in m.component_objects(pyo.Constraint)
       if not con.is_indexed()},
    # indexed constraint lslack
    # loop through the indexed constraints
    # get all the indices then retrieve the slacks for each index of constraint
    **{k: v for con in m.component_objects(pyo.Constraint) if con.is_indexed()
       for k, v in {'{}[{}]'.format(str(con), key): con[key].lslack()
                    for key in con.keys()}.items()}
}
# combine into a single df
df = pd.concat([pd.Series(d, name=name)
                for name, d in {'duals': duals_dict,
                                'uslack': u_slack_dict,
                                'lslack': l_slack_dict}.items()],
               axis='columns')
Regarding 2, I agree with @Erwin's comment about solving with the binary variables to get the optimal solution, then removing the binary restriction but fixing the variables at their optimal values to get meaningful dual values.
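A minimal sketch of that fix-and-resolve idea (untested; it assumes the model is built with the question's commented-out m.OnOff binary variable enabled, and solver/writer handling of fixed variables can vary between Pyomo versions):
m = unit_commitment()                 # built with m.OnOff as a Binary variable
solver = pyo.SolverFactory('glpk')
solver.solve(m)                       # first solve: the MIP decides the commitment

for n in m.N:
    m.OnOff[n].fix()                          # freeze each unit's on/off decision at its optimal value
    m.OnOff[n].domain = pyo.NonNegativeReals  # drop integrality so the re-solve is a plain LP

solver.solve(m)                       # second solve: an LP, so m.dual is populated
print(m.dual[m.demandP], m.dual[m.demandR])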

I'm having trouble with my program running too long. I'm not sure if it's running infinitely or if it's just really slow

I'm writing code to analyze a (8477960, 1) column vector. I am not sure whether the while loops in my code are running infinitely, or if the way I've written things is just really slow.
This is a section of my code up to the first while loop, which I cannot get to run to completion.
import numpy as np
import pandas as pd

data = pd.read_csv(r'C:\Users\willo\Desktop\TF_60nm_2_2.txt')

def recursive_low_pass(rawsignal, startcoeff, endcoeff, filtercoeff):
    # The current signal length
    ni = len(rawsignal)  # signal size
    rougheventlocations = np.zeros(shape=(100000, 3))
    # The algorithm parameters
    # filter coefficient
    a = filtercoeff
    raw = np.array(rawsignal).astype(np.float)
    # thresholds
    s = startcoeff
    e = endcoeff  # for event start and end thresholds
    # The recursive algorithm
    # loop init
    ml = np.zeros(ni)
    vl = np.zeros(ni)
    s = np.zeros(ni)
    ml[0] = np.mean(raw)  # local mean init
    vl[0] = np.var(raw)   # local variance init
    i = 0  # sample counter
    numberofevents = 0  # number of detected events
    # main loop
    while i < (ni - 1):
        i = i + 1
        # local mean low pass filtering
        ml[i] = a * ml[i - 1] + (1 - a) * raw[i]
        # local variance low pass filtering
        vl[i] = a * vl[i - 1] + (1 - a) * np.power([raw[i] - ml[i]], 2)
        # local threshold to detect event start
        sl = ml[i] - s * np.sqrt(vl[i])
I'm not getting any error messages, but I've let the program run for more than 10 minutes without any results, so I assume I'm doing something incorrectly.
You should try to vectorize this process rather than processing it index by index (otherwise, why use numpy?).
The other thing is that you seem to be doing unnecessary work (unless we're not seeing the whole function).
the line:
sl = ml[i] - s * np.sqrt(vl[i])
assigns the variable sl, which you're not using inside the loop (or anywhere else). This assignment performs a whole-vector multiplication by s, which is all zeros. If you do need the sl variable, you should calculate it outside of the loop, using the last values of ml[i] and vl[i] (which you can store in temporary variables) instead of computing it on every iteration.
If ni is in the millions, this unnecessary vector multiplication (of millions of zeros) is going to be very costly.
You probably didn't mean to override the value of s = startcoeff with s = np.zeros(ni) in the first place.
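For example (a sketch based only on the truncated code shown, assuming the threshold should use the scalar startcoeff rather than the zero vector):
while i < (ni - 1):
    i = i + 1
    ml[i] = a * ml[i - 1] + (1 - a) * raw[i]
    vl[i] = a * vl[i - 1] + (1 - a) * (raw[i] - ml[i]) ** 2
# computed once, after the loop, from the last filtered values
sl = ml[i] - startcoeff * np.sqrt(vl[i])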
In order to vectorize these calculations you would need an accumulate-style operation with a custom function.
The non-numpy equivalent would be as follows (using itertools.accumulate instead):
from itertools import accumulate
ml = [np.mean(raw)]+[0]*(ni-1)
mlSums = accumulate(zip(ml,raw),lambda r,d:(a*r[0] + (1-a)*d[1],0))
ml = [v for v,_ in mlSums]
vl = [np.var(raw)]+[0]*(ni-1)
vlSums = accumulate(zip(vl,raw,ml),lambda r,d:(a*r[0] + (1-a)*(d[1]-d[2])**2,0,0))
vl = [v for v,_,_ in vlSums]
In each case, the ml / vl vectors are initialized with the base value at index zero and the rest filled with zeroes.
The accumulate(zip(... function calls go through the array and call the lambda function with the current sum in r and the paired elements in d. For the ml calculation, this corresponds to r = (ml[i-1],_) and d = (0,raw[i]).
Because accumulate outputs the same data type as it is given as input (which are zipped tuples), the actual result is only the first value of the tuples in the mlSums/vlSums lists.
This took 9.7 seconds to process for 8,477,960 items in the lists.
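Alternatively, both recursions are first-order IIR filters, so something like scipy.signal.lfilter could compute them in fully vectorized form. A sketch, assuming the a, raw, and np.mean/np.var initialization from the question:
import numpy as np
from scipy.signal import lfilter

# y[i] = a*y[i-1] + (1-a)*x[i], written as an IIR filter with numerator [1-a] and denominator [1, -a]
b, a_coeffs = [1 - a], [1, -a]

m0, v0 = np.mean(raw), np.var(raw)
# choose the initial filter state so that ml[0] == m0 and vl[0] == v0
ml, _ = lfilter(b, a_coeffs, raw, zi=[m0 - (1 - a) * raw[0]])
vl, _ = lfilter(b, a_coeffs, (raw - ml) ** 2, zi=[v0 - (1 - a) * (raw[0] - ml[0]) ** 2])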

Vectorizing a monte carlo simulation in python

I've recently been working on some code in python to simulate a 2-dimensional U(1) gauge theory using Monte Carlo methods. Essentially I have an n by n by 2 array (call it Link) of unitary complex numbers (their magnitude is one). I randomly select an element of my Link array and propose a random change to the number at that site. I then compute the resulting change in the action that would occur due to that change, and accept the change with probability min(1, exp(-dS)), where dS is the change in the action. The code for the iterator is as follows:
def iteration(j1,B0):
    global Link
    Staple = np.zeros((2),dtype=complex)
    for i0 in range(0,j1):
        x1 = np.random.randint(0,n)
        y1 = np.random.randint(0,n)
        u1 = np.random.randint(0,1)
        Linkrxp1 = np.roll(Link,-1, axis = 0)
        Linkrxn1 = np.roll(Link, 1, axis = 0)
        Linkrtp1 = np.roll(Link, -1, axis = 1)
        Linkrtn1 = np.roll(Link, 1, axis = 1)
        Linkrxp1tn1 = np.roll(np.roll(Link, -1, axis = 0),1, axis = 1)
        Linkrxn1tp1 = np.roll(np.roll(Link, 1, axis = 0),-1, axis = 1)
        Staple[0] = Linkrxp1[x1,y1,1]*Linkrtp1[x1,y1,0].conj()*Link[x1,y1,1].conj() + Linkrxp1tn1[x1,y1,1].conj()*Linkrtn1[x1,y1,0].conj()*Linkrtn1[x1,y1,1]
        Staple[1] = Linkrtp1[x1,y1,0]*Linkrxp1[x1,y1,1].conj()*Link[x1,y1,0].conj() + Linkrxn1tp1[x1,y1,0].conj()*Linkrxn1[x1,y1,1].conj()*Linkrxn1[x1,y1,0]
        uni = unitary()
        Linkprop = uni*Link[x1,y1,u1]
        dE3 = (Linkprop - Link[x1,y1,u1])*Staple[u1]
        dE1 = B0*np.real(dE3)
        d1 = np.random.binomial(1, np.minimum(np.exp(dE1),1))
        d = np.random.uniform(low=0,high=1)
        if d1 >= d:
            Link[x1,y1,u1] = Linkprop
        else:
            Link[x1,y1,u1] = Link[x1,y1,u1]
At the beginning of the program I call a routine called "randommatrix" to generate random unitary complex numbers which have small imaginary parts and store them in an array called Cnum of length 2*K. In the same routine I also go through my Link array and set each element to a random unitary complex number. The code is listed below.
def randommatrix():
    global Cnum
    global Link
    for i1 in range(0,K):
        C1 = np.random.normal(0,1)
        Cnum[i1] = np.cos(C1) + 1j*np.sin(C1)
        Cnum[i1+K] = np.cos(C1) - 1j*np.sin(C1)
    for i3,i4 in itertools.product(range(0,n),range(0,n)):
        C2 = np.random.uniform(low=0, high = 2*np.pi)
        Link[i3,i4,0] = np.cos(C2) + 1j*np.sin(C2)
        C2 = np.random.uniform(low=0, high = 2*np.pi)
        Link[i3,i4,1] = np.cos(C2) + 1j*np.sin(C2)
The following routine is used during the iteration routine to get a random complex number with a small imaginary part (by retrieving a random element of the Cnum array we generated earlier).
def unitary():
    I1 = np.random.randint((0),(2*K-1))
    mat = Cnum[I1]
    return mat
Here is an example of what the iteration routine would be used for. I've written a routine called plaquette, which calculates the mean plaquette (real part of a 1 by 1 closed loop of link variables) for a given B0. The iteration routine is being used to generate new field configurations which are independent of previous configurations. After we get a new field configuration we calculate the plaquette for said configuration. We then repeat this process j1 times using a while loop, and at the end we end up with the mean plaquette.
def Plq(j1,B0):
    i5 = 0
    Lboot = np.zeros(j1)
    while i5<j1:
        iteration(25000,B0)
        Linkrxp1 = np.roll(Link,-1, axis = 0)
        Linkrtp1 = np.roll(Link, -1, axis = 1)
        c0 = np.real(Link[:,:,0]*Linkrxp1[:,:,1]*Linkrtp1[:,:,0].conj()*Link[:,:,1].conj())
        i5 = i5 + 1
We need to define some variables before we run anything, so here are the initial variables, which I define before defining any routines:
K = 20000
n = 50
a = 1.0
Link = np.zeros((n,n,2),dtype = complex)
Cnum = np.zeros((2*K), dtype = complex)
This code works, but it is painfully slow. Is there a way that I can use multiprocessing or something to speed this up?
You should use Cython and C data types; Cython is built for fast computation.
You could use multiprocessing, potentially, in one of two cases.
If you have one object that multiple processes need to share, you would need to use a Manager (see the multiprocessing docs), a Lock, and an Array to share the object between processes. However, there is no guarantee this will result in any speedup, since each process needs to lock the Link array to guarantee a correct update, assuming the updates are affected by all elements in the array (if one process modifies an element at the same time another process is computing an update, that update wouldn't be based on the most current information).
If your updates do not take into account the state of the other elements, i.e. each update only cares about the one element, then you could break your Link array into segments and divvy the chunks out to several processes in a process pool, and when done combine the segments back into one array. This would certainly save time, and you wouldn't have to use any additional multiprocessing mechanisms.
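A minimal sketch of that second approach (illustrative only: update_chunk is a placeholder for your per-site updates, and this only applies if an update never needs to read sites from another chunk, which is not quite true for the staple calculation above):
import numpy as np
from multiprocessing import Pool

def update_chunk(chunk):
    # placeholder for whatever per-site updates you would run on this slice
    return chunk * np.exp(0.01j)

if __name__ == '__main__':
    Link = np.ones((50, 50, 2), dtype=complex)
    chunks = np.array_split(Link, 4, axis=0)        # split along the first lattice axis
    with Pool(processes=4) as pool:
        updated = pool.map(update_chunk, chunks)    # each worker processes one segment
    Link = np.concatenate(updated, axis=0)          # stitch the segments back together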

Python memory management with list comprehensions

I am trying to do some analytics against a large dictionary created by reading a file from disk. The read operation results in a stable memory footprint. I then have a method which performs some calculations based on data I copy out of that dictionary into a temporary dictionary. I do this so that all the copying and data use is scoped in the method, and would, I had hoped, disappear at the end of the method call.
Sadly, I am doing something wrong. The customerdict definition is as follows (defined at the top of the .py file):
customerdict = collections.defaultdict(dict)
The format of the object is {customerid: dictionary{id: 0||1}}
There is also a similarly defined dictionary called allids.
I have a method for calculating the sim_pearson distance (modified code from Programming Collective Intelligence book), which is below.
def sim_pearson(custID1, custID2):
    si = []
    smallcustdict = {}
    smallcustdict[custID1] = customerdict[custID1]
    smallcustdict[custID2] = customerdict[custID2]
    # a loop to round out the remaining allids object to fill in 0 values
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][asin] = 0.0
    # get the list of mutually rated items
    for id in smallcustdict[custID1]:
        if id in smallcustdict[custID2]:
            si.append(id)  # = 1
    # return 0 if there are no matches
    if len(si) == 0: return 0
    # add up all the preferences
    sum1 = sum([smallcustdict[custID1][id] for id in si])
    sum2 = sum([smallcustdict[custID2][id] for id in si])
    # sum up the squares
    sum1sq = sum([pow(smallcustdict[custID1][id],2) for id in si])
    sum2sq = sum([pow(smallcustdict[custID2][id],2) for id in si])
    # sum up the products
    psum = sum([smallcustdict[custID1][id] * smallcustdict[custID2][id] for id in si])
    # calc Pearson score
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))
    del smallcustdict
    del si
    del sum1
    del sum2
    del sum1sq
    del sum2sq
    del psum
    if den == 0:
        return 0
    return num/den
Every call to the sim_pearson method grows the memory footprint of python.exe without bound. I tried using "del" to explicitly delete the locally scoped variables.
Looking at Task Manager, the memory jumps up in 6-10 MB increments. Once the initial customerdict is set up, the footprint is 137 MB.
Any ideas why I am running out of memory doing it this way?
I presume the issue is here:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
# a loop to round out the remaining allids object to fill in 0 values
for customerID, catalog in smallcustdict.iteritems():
    for id in allids:
        if id not in catalog:
            smallcustdict[customerID][asin] = 0.0
The dictionaries from customerdict are being referenced in smallcustdict - so when you add to them, the additions persist in customerdict. This is the only point I can see where you do anything that will persist out of scope, so I would imagine this is the problem.
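A tiny standalone illustration of that aliasing (toy data, not your actual dictionaries):
customerdict = {1: {'a': 1}}
smallcustdict = {}
smallcustdict[1] = customerdict[1]        # same dict object, not a copy
smallcustdict[1]['b'] = 0.0               # "temporary" fill-in value...
print(customerdict)                        # {1: {'a': 1, 'b': 0.0}} -- it leaked back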
Note that you are making a lot of work for yourself in many places by not using comprehensions, by doing the same thing repeatedly, and by not writing things generically; a better version might be as follows:
import collections
import functools
import operator
customerdict = collections.defaultdict(dict)
def sim_pearson(custID1, custID2):
    # Declaring as a dict literal is nicer.
    smallcustdict = {
        custID1: customerdict[custID1],
        custID2: customerdict[custID2],
    }
    # Unchanged, as I'm not sure what the intent is here.
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][asin] = 0.0
    # dict views are set-like, so the easier way to do what you want is the intersection of the two.
    si = smallcustdict[custID1].viewkeys() & smallcustdict[custID2].viewkeys()
    # if not is a cleaner way of checking for no values.
    if not si:
        return 0
    # Made more generic to avoid repetition and wastefully looping repeatedly.
    parts = [list(part) for part in zip(*((value[id] for value in smallcustdict.values()) for id in si))]
    sums = [sum(part) for part in parts]
    sumsqs = [sum(pow(i, 2) for i in part) for part in parts]
    psum = sum(functools.reduce(operator.mul, part) for part in zip(*parts))
    sum1, sum2 = sums
    sum1sq, sum2sq = sumsqs
    # Unchanged.
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))
    # Again using if not.
    if not den:
        return 0
    else:
        return num/den
Note that this is entirely untested, as the code you gave isn't a complete example. However, it should be easy enough to use as a basis for improvement.
Try changing the following two lines:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
to
smallcustdict[custID1] = customerdict[custID1].copy()
smallcustdict[custID2] = customerdict[custID2].copy()
That way the changes you make to the two dictionaries do not persist in customerdict when the sim_pearson() function returns.
