Optimizing Python script with multiple for loops - python

I'm super new at Python and super new at trying to optimize a script for speed. I've got a problem that I've been using to teach myself how to write code, and here it is:
I have a dataset with a list of products, their values, and their costs. There are three different types of products (A, B, C), with anywhere from 30-100 products for each product type. Each product has a value and a cost. I have to select 1 product from product type A, 2 from product type B, and 3 from product type C -- once I use a product, I cannot use it again (no replacement).
My goal is to maximize the total value of the selected products given my budget constraint.
Given that I'm basically trying to create a list of combinations, I started there and wrote a few "for loops" to achieve that. I initially tried to do too much inside the loops, then changed the data type to lists because my research suggested that would speed things up -- and it did, immensely.
The problem is that I am still processing 350k records a second at best, which puts me at about 7 hours to complete if I have 30 items in list_a, 50 in list_b, and 50 in list_c.
I have created 3 lists of lists (list_a, list_b, and list_c) that each look like my example below for list_a. Then, inside the for loops, I evaluate each combination to see whether it has a higher value than the current best and whether its cost is below the constraint. If it meets both conditions, I append it to the master list of combinations (combo_list).
list_a = [['product1', 'product1_cost', 'product1_value'], ['product2', 'product2_cost', 'product2_value'], ...]

num_a = len(list_a)
num_b = len(list_b)
num_c = len(list_c)
combo_list = [[0]*20]  # this is to create the list of lists that I will populate
l = 0  # number of iterations
max_value = 0
for a in range(num_a):
    for b1 in range(num_b):
        for b2 in range(b1 + 1, num_b):  # second b
            for c1 in range(num_c):
                for c2 in range(c1 + 1, num_c):  # second c
                    for c3 in range(c2 + 1, num_c):  # third c
                        data = [list_a[a][0], list_a[a][1], list_a[a][2],
                                list_b[b1][0], list_b[b1][1], list_b[b1][2],
                                list_b[b2][0], list_b[b2][1], list_b[b2][2],
                                list_c[c1][0], list_c[c1][1], list_c[c1][2],
                                list_c[c2][0], list_c[c2][1], list_c[c2][2],
                                list_c[c3][0], list_c[c3][1], list_c[c3][2]]
                        total_cost = data[1] + data[4] + data[7] + data[10] + data[13] + data[16]
                        total_value = data[2] + data[5] + data[8] + data[11] + data[14] + data[17]
                        data.append(total_cost)   # data[18]
                        data.append(total_value)  # data[19]
                        if total_value >= max_value and total_cost <= constraint:
                            combo_list.append(data)
                            max_value = total_value
                        l += 1  # count iterations
Then I turn combo_list into a dataframe or CSV.
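For reference, a minimal sketch of that last step (the file name is just a placeholder):

import pandas as pd

df = pd.DataFrame(combo_list)
df.to_csv('combos.csv', index=False)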
Thank you for any help.

So, I was able to figure this out using itertools combinations:
tup_b = itertools.combinations(list_b, 2)
list_b = map(list, tup_b)
df_b = pd.DataFrame(list_b)
# Extending the list
df_b['B'] = df_b[0] + df_b[1]
df_b = df_b[['B']]
# flatten list
b = df_b.values.tolist()
b = list(itertools.chain(*b))
# adding values and costs
r = len(b)
for x in range(r):
    cost = [b[x][1] + b[x][4]]
    value = [b[x][2] + b[x][5]]
    b[x] = b[x] + cost + value
# shortening list
df_b = pd.DataFrame(b)
df_b = df_b[[0, 3, 6, 7]]
df_b.columns = ['B1', 'B2', 'cost', 'value']
Then, I did the same for list_c with similar structure as above, using this:
tup_c = itertools.combinations(list_c, 3)
Using this dropped my time to process from ~5 hours to 8 minutes...
Thanks all for the help.
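For completeness, here is a rough sketch of how the precomputed pieces can be tied together -- this is just an illustration assuming each inner list is [name, cost, value] with numeric cost and value, not the exact code I ran:

import itertools

def best_combo(list_a, list_b, list_c, constraint):
    # pre-compute every pair from B and every triple from C with their summed cost/value
    b_pairs = [(b1, b2, b1[1] + b2[1], b1[2] + b2[2])
               for b1, b2 in itertools.combinations(list_b, 2)]
    c_triples = [(c1, c2, c3, c1[1] + c2[1] + c3[1], c1[2] + c2[2] + c3[2])
                 for c1, c2, c3 in itertools.combinations(list_c, 3)]
    best, best_value = None, float('-inf')
    for a in list_a:
        for b1, b2, b_cost, b_value in b_pairs:
            for c1, c2, c3, c_cost, c_value in c_triples:
                cost = a[1] + b_cost + c_cost
                value = a[2] + b_value + c_value
                if cost <= constraint and value > best_value:
                    best, best_value = [a, b1, b2, c1, c2, c3, cost, value], value
    return best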

Related

filling the gaps between list of angles (numbers)

I'll explain with a simple example first, then go into the details.
Say I have a list of numbers:
t_original = [180,174,168,166,162,94,70,80,128,131,160,180]
If we graph this, it goes down from 180 to 70 and then back up to 180.
But if the fourth value (166) is suddenly corrupted to something like 700, the list becomes
t = [180,174,168,700,162,94,70,80,128,131,160,180]
which does not make sense on the graph.
I want to treat the fourth value (700) as a wrong value
and replace it with a value that is at least consistent with the previous two elements (168, 174), even if it is not the original value.
I want to do the same across the whole list whenever another wrong value appears.
We can call that "filling gaps between a list of numbers".
I'm trying to apply the same idea to a bigger example.
The method I have tried:
I'll share my code and output below ("filtered" means the filling-gap function has been applied).
My code:
def preprocFN(*U):
    prePlst = []  # after preprocessing list
    # preprocessing Fc = | 2*LF1 prev by 1 - LF2 prev by 2 |
    c0 = -2  # (previous) by 2
    c1 = -1  # (previous)
    c2 = 0   # (current)
    c3 = 1   # (next)
    preP = U[0]  # original list
    if c2 == 0:
        prePlst.append(preP[0])
        prePlst.append(preP[1])
        c1 += 2
        c2 += 2
        c0 += 2
    oldlen = len(preP)
    while oldlen > c2:
        Equ = abs(2*preP[c1] - preP[c0])  # fn of preprocessing
        formatted_float = "{:.2f}".format(Equ)  # keep 2 decimals only
        equu = float(formatted_float)  # from string back to float
        prePlst.insert(c2, equu)  # insert the preprocessed value into the list
        c1 += 1
        c2 += 1
        c0 += 1
    return prePlst
With my input: https://textuploader.com/t1py9
the output is: https://textuploader.com/t1pyk
When I print the values higher than 180 (wrong values):
result_list = [item for item in list if item > 180]
(which doesn't make sense, since no human joint can go past an angle of 180)
the output was [183.6, 213.85, 221.62, 192.05, 203.39, 197.22, 188.45, 182.48, 180.41, 200.09, 200.67, 198.14, 199.44, 198.45, 200.55, 193.25, 204.19, 204.35, 200.59, 211.4, 180.51, 183.4, 217.91, 218.94, 213.79, 205.62, 221.35, 182.39, 180.62, 183.06, 180.78, 231.09, 227.33, 224.49, 237.02, 212.53, 207.0, 212.92, 182.28, 254.02, 232.49, 224.78, 193.92, 216.0, 184.82, 214.68, 182.04, 181.07, 234.68, 233.63, 182.84, 193.94, 226.8, 223.69, 222.77, 180.67, 184.72, 180.39, 183.99, 186.44, 233.35, 228.02, 195.31, 183.97, 185.26, 182.13, 207.09, 213.21, 238.41, 229.38, 181.57, 211.19, 180.05, 181.47, 199.69, 213.59, 191.99, 194.65, 190.75, 199.93, 221.43, 181.51, 181.42, 180.22]
So the filling-gaps function from the proposed method doesn't do its job.
Any suggestions for applying the same concept in a different way?
Extra info that may help:
The filtered graph is produced by the filling-gap function followed by a normalize function.
I don't think the problem is in the normalizing function, since in my opinion the output of the filling-gaps function is already incorrect. Maybe I'm wrong, but I provide the normalize steps anyway so you can see how the final filtered graph was made.
fn: (formula image omitted)
My code:
def outLiersFN(*U):
    outliers = []  # after preprocessing list
    # preprocessing Fc = | 2*LF1 prev by 1 - LF2 prev by 2 |
    c0 = -2  # (previous) by 2  # from original
    c1 = -1  # (previous)  # from original
    c2 = 0   # (current)   # from original
    c3 = 1   # (next)      # from original
    preP = U[0]  # original list
    if c2 == 0:
        outliers.append(preP[0])
        c1 += 1
        c2 += 1
        c0 += 1
        c3 += 1
    oldlen = len(preP)
    M_RangeOfMotion = 90
    while oldlen > c2:
        if c3 == oldlen:
            outliers.insert(c2, preP[c2])  # preP[c2] >> last element in old list
            break
        if (preP[c2] > M_RangeOfMotion and preP[c2] < (preP[c1] + preP[c3])/2) or \
           (preP[c2] < M_RangeOfMotion and preP[c2] > (preP[c1] + preP[c3])/2):  # Check Paper 3.3.1
            Equ = (preP[c1] + preP[c3])/2  # fn of preprocessing: insert average of neighbours for the current frame
            formatted_float = "{:.2f}".format(Equ)  # keep 2 decimals only
            equu = float(formatted_float)  # from string back to float
            outliers.insert(c2, equu)  # insert the preprocessed value into the list
            c1 += 1
            c2 += 1
            c0 += 1
            c3 += 1
        else:
            Equ = preP[c2]  # keep the same element (do nothing)
            formatted_float = "{:.2f}".format(Equ)  # keep 2 decimals only
            equu = float(formatted_float)  # from string back to float
            outliers.insert(c2, equu)  # insert the preprocessed value into the list
            c1 += 1
            c2 += 1
            c0 += 1
            c3 += 1
    return outliers
I suggest the following algorithm:
data point t[i] is considered an outlier if it deviates from the average of t[i-2], t[i-1], t[i], t[i+1], t[i+2] by more than the standard deviation of these 5 elements.
outliers are replaced by the average of the two elements around them.
import matplotlib.pyplot as plt
from statistics import mean, stdev

t = [180,174,168,700,162,94,70,80,128,131,160,180]

def smooth(t):
    new_t = []
    for i, x in enumerate(t):
        neighbourhood = t[max(i-2, 0): i+3]
        m = mean(neighbourhood)
        s = stdev(neighbourhood, xbar=m)
        if abs(x - m) > s:
            x = (t[i - 1 + (i == 0)*2] + t[i + 1 - (i+1 == len(t))*2]) / 2
        new_t.append(x)
    return new_t

new_t = smooth(t)

plt.plot(t)
plt.plot(new_t)
plt.show()

How to check if 2 different values are from the same list and obtaining the list name

** I modified the entire question **
I have example lists specified below, and I want to find out whether 2 values are from the same list, and which list both values come from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
I used a for loop to check whether the values exist in the lists, but I'm not sure how to obtain which list a value actually comes from.
for x, y in zip(list1, list2):
    if c and d in x or y:
        print(True)
Please advise if there is any workaround.
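A minimal sketch of one straightforward approach (the dictionary mapping names to lists is my own addition for illustration):

lists = {'list1': list1, 'list2': list2}

def find_common_list(a, b, lists):
    # return the name of the first list that contains both values, else None
    for name, values in lists.items():
        if a in values and b in values:
            return name
    return None

print(find_common_list(c, d, lists))  # prints 'list1' for c = 'b', d = 'e'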
First you might want to inspect the distribution of values and sizes to see where you can improve the result with the least effort, like this:
df_inspect = df.copy()
df_inspect["size.value"] = df_inspect["size.value"].map(lambda x: ''.join(y.upper() for y in x if y.isalpha() and y != ' '))
df_inspect = df_inspect.groupby("size.value").size().sort_values(ascending=False)
Then create a solution for the most occurring size category, here "Wide":
long = "adasda, 9.5 W US"
short = "9.5 Wide"
def get_intersection(s1, s2):
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        for j in range(i + 1, l_s1):
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res
print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6)
Basically, get_intersection searches for the longest common substring between the item name and the size. It then takes the length of that intersection and marks the row as not defective if at least 60% of the size value is also found in the item name.
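For the example above, the longest common substring of "adasda, 9.5 W US" and "9.5 Wide" is "9.5 W" (5 characters), and 5 / 8 = 0.625 >= 0.6, so the check prints True.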

How to make this for loop faster?

I know that Python loops themselves are relatively slow compared to other languages, but when the correct functions are used they become much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
timestamp c0 rowIndex
0 2016-01-01T00:00:12.000Z 13931.500000 8158791
1 2016-01-01T00:00:30.000Z 14084.099609 8158792
2 2016-01-01T00:00:48.000Z 13603.400391 8158793
3 2016-01-01T00:01:06.000Z 13977.299805 8158794
4 2016-01-01T00:01:24.000Z 13611.000000 8158795
5 2016-01-01T00:02:18.000Z 13695.000000 8158796
6 2016-01-01T00:02:36.000Z 13809.400391 8158797
7 2016-01-01T00:02:54.000Z 13756.000000 8158798
and here is the code I wrote:
acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])

weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)

deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0'] - acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time, how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1, such as:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as with your code:
print (deltaAc)
timestamp summation
5 2016-01-01T00:02:18.000Z -41.799986
6 2016-01-01T00:02:36.000Z 51.418728
7 2016-01-01T00:02:54.000Z -3.111184
First optimization: weights[c]/sumWeights could be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you could extract your columns as 1D numpy arrays, it would be great for you. It might look something like:
# next lines to be tested, or find the correct way of extracting the columns
c0_column = acoustics['c0'].values
time_column = acoustics['timestamp'].values
...
sum = numpy.zeros(shape=(len(acoustics)-5,))
delta_ac = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:] - c0_column[5-c:len(acoustics)-c])
for i in range(len(acoustics)-5):
    delta_ac.append([time_column[5+i], sum[i]])
Dataframes have a great method rolling for constructing and applying windowing transformations, so you don't need loops at all:
# df is your data frame
window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()

df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*weights).sum())
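Depending on your pandas version, you may need to pass raw=True so the window is handed to the lambda as a plain numpy array and x[-1] indexes positionally:

df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(
    lambda x: ((x[-1] - x) * weights).sum(), raw=True)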

Generating a list of numbers based on a constraint-step procedure

I'm trying to generate a list of 8 numbers with the following code:
import numpy as np
import pandas as pd

n2 = np.random.uniform(0.1, 1.5)
c2 = np.random.uniform(4, 14)
c3 = np.random.uniform(0.1, 2.9)
ic4 = np.random.uniform(0.01, 1)
nc4 = np.random.uniform(0.01, 1)
ic5 = np.random.uniform(0, 0.2)
nc5 = np.random.uniform(0, 0.01)

comp_list = []
for i in range(1):
    if n2/c2 <= 0.11:
        comp_list.append(c2)
        comp_list.append(n2)
    if c3/c2 <= 0.26:
        comp_list.append(c3)
    if ic4/c3 <= 0.27:
        comp_list.append(ic4)
    if nc4/ic4 <= 1:
        comp_list.append(nc4)
    if ic5/nc4 <= 0.06:
        comp_list.append(ic5)
    if nc5/ic5 <= 0.25:
        comp_list.append(nc5)
    sum = n2 + c2 + c3 + ic4 + nc4 + ic5 + nc5
    c1 = 100 - sum
    comp_list.append(c1)

df = pd.Series(comp_list)
print(df)
However, when I run the code, the number of values output is not consistent and can range from 3 to 5. For example, one run gives me:
0 1.560
1 0.251
2 0.008
3 86.665
a second run would give me:
0 12.929
1 1.015
2 2.126
3 0.093
4 0.0025
5 83.376
I have no idea why the output is not consistent.
Maybe I need to iterate through a random distribution until all the if statements are satisfied? Or am I missing something obvious?
You want to enclose all that in a while loop. Roll for all your numbers. If they all satisfy your conditions, create the necessary list. As long as one of the conditions isn't satisfied, all numbers will be rerolled.
The reason why you want to reroll all numbers is so that they are all independent of each other. For example, c2 is involved in both n2/c2 and c3/c2. If n2/c2 is satisfied and you keep those two numbers, the values of c3 you can have in order to satisfy the condition c3/c2 are constrained by what you have already chosen for c2.
import numpy as np

while True:
    n2 = np.random.uniform(0.1, 1.5)
    c2 = np.random.uniform(4, 14)
    c3 = np.random.uniform(0.1, 2.9)
    ic4 = np.random.uniform(0.01, 1)
    nc4 = np.random.uniform(0.01, 1)
    ic5 = np.random.uniform(0, 0.2)
    nc5 = np.random.uniform(0, 0.01)
    if (n2/c2 <= 0.11 and
        c3/c2 <= 0.26 and
        ic4/c3 <= 0.27 and
        nc4/ic4 <= 1 and
        ic5/nc4 <= 0.06 and
        nc5/ic5 <= 0.25):
        comp_list = [n2, c2, c3, ic4, nc4, ic5, nc5]
        comp_list.append(100 - sum(comp_list))
        break
Edit: In order to generate a list of such lists, iterate this code as many times as necessary and append the result of comp_list each time.
big_list = []
for _ in range(10):
    # while stuff
    big_list.append(comp_list)
If this is something you expect to run in various locations in your code, you may as well put that in a function.
def generate_numbers():
    while True:
        ...
    return comp_list
And then you can do big_list = [generate_numbers() for _ in range(10)].

Numpy Vs nested dictionaries, which one is more efficient in terms of runtime and memory?

I am new to numpy. I have referred to the following SO question:
Why NumPy instead of Python lists?
The final comment in the above question seems to indicate that numpy is probably slower on a particular dataset.
I am working on a 1650*1650*1650 data set. These are essentially similarity values for each movie in the MovieLens data set along with the movie id.
My options are to either use a 3D numpy array or a nested dictionary. On a reduced data set of 100*100*100, the run times were not too different.
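For scale, a quick back-of-the-envelope estimate: a dense 1650*1650*1650 array of float64 values needs roughly 1650^3 * 8 bytes, which is about 36 GB (~33 GiB) before any computation happens, whereas a nested dictionary only stores the entries you actually populate, at a much higher per-entry overhead.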
Please find the IPython code snippet below:
for id1 in range(1, count+1):
    data1 = df[df.movie_id == id1].set_index('user_id')[cols]
    sim_score = {}
    for id2 in range(1, count+1):
        if id1 != id2:
            data2 = df[df.movie_id == id2].set_index('user_id')[cols]
            sim = calculatePearsonCorrUnified(data1, data2)
        else:
            sim = 1
        sim_matrix_panel[id1]['Sim'][id2] = sim
import pdb
from math import sqrt

def calculatePearsonCorrUnified(df1, df2):
    sim_score = 0
    common_movies_or_users = []
    for temp_id in df1.index:
        if temp_id in df2.index:
            common_movies_or_users.append(temp_id)
    #pdb.set_trace()
    n = len(common_movies_or_users)
    #print('No. of common movies: ' + str(n))
    if n == 0:
        return sim_score
    # Ratings corresponding to user_1 / movie_1, present in the common list
    rating1 = df1.loc[df1.index.isin(common_movies_or_users)]['rating'].values
    # Ratings corresponding to user_2 / movie_2, present in the common list
    rating2 = df2.loc[df2.index.isin(common_movies_or_users)]['rating'].values
    sum1 = sum(rating1)
    sum2 = sum(rating2)
    # Sum up the squares
    sum1Sq = sum(np.square(rating1))
    sum2Sq = sum(np.square(rating2))
    # Sum up the products
    pSum = sum(np.multiply(rating1, rating2))
    # Calculate Pearson score
    num = pSum - (sum1*sum2/n)
    den = sqrt(float(sum1Sq - pow(sum1, 2)/n) * float(sum2Sq - pow(sum2, 2)/n))
    if den == 0:
        return 0
    sim_score = (num/den)
    return sim_score
What would be the best way to most precisely time the runtime with either of these options?
Any pointers would be greatly appreciated.
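For the timing part specifically, a minimal sketch using the standard library timeit module (or %timeit in IPython); build_numpy_version and build_dict_version are placeholders for the two implementations being compared:

import timeit

def build_numpy_version():
    ...  # placeholder: fill the 3D numpy array

def build_dict_version():
    ...  # placeholder: fill the nested dictionary

numpy_time = timeit.timeit(build_numpy_version, number=3)
dict_time = timeit.timeit(build_dict_version, number=3)
print(numpy_time / 3, dict_time / 3)  # average seconds per run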
