Optimizing a for loop for faster performance - Python

I have a 100x100 dataframe containing similarity scores for 100 products against 100 products, and another dataframe with data at the user and product level (1000x100). For each user and each product I want to get the top 10 similar products from data_neighbours, together with their corresponding similarity scores, and compute a function getScore as below:
def getScore(history, similarities):
    return sum(history * similarities) / sum(similarities)

for i in range(0, len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
        if data.iloc[i, j] == 1:
            # the user already has this product, so its score is set to 0
            data_sims.iloc[i, j] = 0
        else:
            # names and scores of the most similar products, skipping the
            # product itself at position 0 (note: [1:10] takes 9 items)
            product_top_names = data_neighbours.loc[product][1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False)[1:10]
            user_purchases = data_germany.loc[user, product_top_names]
            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)
How can I optimize this loop for faster processing? The example is taken from here: http://www.salemmarafi.com/code/collaborative-filtering-with-python/
Sample data:
Data (1000x101), user is the 101st column:
Index user song1 song2.....
0 1 0 0
1 33 0 1
2 42 1 0
3 51 0 0
data_ibs (similarity scores) -- (100x100):
song1 song2 song3 song4
song1 1.00 0.00 0.02 0.05
song2 0.00 1.00 0.05 0.03
song3 0.02 0.05 1.00 0.11
song4 0.05 0.03 0.11 1.00
data_neighbours (top 10 similar songs for each song, based on sorted scores from data_ibs) -- (100x10):
1 2 3......... 10
song1 song5 song10 song4
song2 song8 song11 song5
song3 song9 song12 song10
data_germany (user-level data for each song as a column, except userid) -- (1000x100):
index song1 song2 song3
1 0 0 0
2 1 0 0
3 0 0 1
Expected dataset (data_sims) -- (1000x101):
user song1 song2 song3
1 0.00 0.00 0.22
33 0.09 0.00 0.11
42 0.00 0.10 0.00
51 0.09 0.09 0.00
Where the value in data is 1 for a song, its score is simply set to 0. Otherwise, the top 10 similar songs are fetched from data_neighbours and their corresponding scores from data_ibs. The user_purchases dataset is then checked to see whether each of those songs is already present for the user (1/0). Finally, the similarity score for position i x j is computed as user_purchases (the 1/0 values for each of the top 10 songs) multiplied by the similarity scores from data_ibs, divided by the sum of the top 10 similarity scores. The same is repeated for every user x song combination.
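A minimal sketch of how the double loop could be vectorized, assuming data_germany is the 1000x100 user/song matrix and data_ibs the 100x100 similarity matrix (the top-10 masking below is my reading of the intended behaviour, not code from the linked article):

import numpy as np
import pandas as pd

k = 10

# Keep only each song's top-k neighbour similarities, zero out the rest.
sims = data_ibs.values.astype(float)
np.fill_diagonal(sims, 0)                 # a song is not its own neighbour
drop = np.argsort(sims, axis=1)[:, :-k]   # indices of everything but the top k
np.put_along_axis(sims, drop, 0, axis=1)

purchases = data_germany.values           # 1000x100 matrix of 0/1

# For each (user, song): sum of similarities of the user's purchased
# neighbours, divided by the total top-k similarity for that song.
scores = purchases @ sims.T / sims.sum(axis=1)
scores[purchases == 1] = 0                # songs the user already has score 0

data_sims = pd.DataFrame(scores, index=data_germany.index,
                         columns=data_germany.columns)

This replaces the 1000x100 Python-level iterations with a single matrix product, which is typically orders of magnitude faster.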

Python find the maximum total score from a matrix

I have a 6 x 14 matrix where each element represents a score; my goal is to find the maximum total score as well as which element is picked from each row.
Only one element can be selected from each row, and at most one element can be selected from each column.
If the element in column 14 (the last column) is selected, we stop and take the score accumulated up to that element as the total score.
If, say, the element in the second column is selected, the element in the next row can only be selected from the third column to the last column.
We have to start from the first row; we cannot skip it and go straight to the next row.
For example, if x1,1 (element of first row and first column) is selected, then we go to the second row, and pick x2,3 (can be picked from 2nd to the last column), then we go to the third row and pick x3,6 (can be picked from 4th to the last column), then we go to the fourth row and pick x4,9 and the fifth row to pick x5,14. We will pause here and not go to the last row since we have chosen a value from the last column. And the total score will be x1,1 + x2,3 + x3,6 + x4,9 + x5,14 = 0.73 according to the matrix below.
appr_0:
          1     2     3     4     5     6     7     8     9    10    11    12    13    14
row 1  0.21  0.22  0.31  0.13  0.14  0.05  0.09  0.11  0.12  0.33  0.42  0.10  0.08  0.12
row 2  0.11  0.10  0.13  0.14  0.12  0.15  0.19  0.21  0.22  0.13  0.12  0.07  0.08  0.07
row 3  0.22  0.21  0.12  0.14  0.15  0.08  0.10  0.12  0.15  0.30  0.22  0.11  0.09  0.13
row 4  0.17  0.12  0.18  0.19  0.17  0.15  0.19  0.21  0.22  0.13  0.14  0.15  0.18  0.10
row 5  0.16  0.18  0.19  0.20  0.21  0.18  0.19  0.20  0.21  0.17  0.18  0.17  0.10  0.09
row 6  0.23  0.20  0.11  0.16  0.18  0.09  0.09  0.13  0.16  0.20  0.21  0.17  0.11  0.14
I have tried an iterative approach to find the maximum score, but it was very time-consuming and the Python script wasn't able to run through it in a reasonable time. I'm wondering if there is a way to rewrite it and optimize it.
df = pd.DataFrame(columns=['days_1', 'days_2', 'days_3', 'days_4', 'days_5', 'days_6', 'score'])
max_score = 0
curr_score = 0
curr_max = 0
for j0 in range(14):
    curr_score = appr_0[0, j0]
    curr_max = curr_max + curr_score
    max_score = max(curr_max, max_score)
    for j1 in range(j0, 14):
        curr_score = appr_0[1, j1]
        curr_max = curr_max + curr_score
        max_score = max(curr_max, max_score)
        for j2 in range(j1, 14):
            curr_score = appr_0[2, j2]
            curr_max = curr_max + curr_score
            max_score = max(curr_max, max_score)
            for j3 in range(j2, 14):
                curr_score = appr_0[3, j3]
                curr_max = curr_max + curr_score
                max_score = max(curr_max, max_score)
                for j4 in range(j3, 14):
                    curr_score = appr_0[4, j4]
                    curr_max = curr_max + curr_score
                    max_score = max(curr_max, max_score)
                    for j5 in range(j4, 14):
                        curr_score = appr_0[5, j5]
                        curr_max = curr_max + curr_score
                        max_score = max(curr_max, max_score)
                        df = df.append(pd.DataFrame([[j0, j1, j2, j3, j4, j5, max_score]], columns=df.columns))
df_max_record = df.loc[df['score'] == df['score'].max()]
Expected output df_max_record will look like (faked data):
days_1  days_2  days_3  days_4  days_5  days_6  score
2       3       7       9       10      13      0.95
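A dynamic-programming sketch of one way to avoid the nested loops (this is my reading of the rules above: column picks strictly increase down the rows, and a path may end early only by picking the last column; appr_0 is the 6x14 numpy array from the question):

import numpy as np

n_rows, n_cols = appr_0.shape
NEG = -np.inf

# best[r, c] = best total score of a partial path whose pick in row r
# is column c; prev[r, c] remembers the chosen column in row r-1.
best = np.full((n_rows, n_cols), NEG)
best[0] = appr_0[0]
prev = np.full((n_rows, n_cols), -1, dtype=int)

for r in range(1, n_rows):
    run_max, run_arg = NEG, -1      # best predecessor among columns < c
    for c in range(n_cols):
        if c > 0 and best[r - 1, c - 1] > run_max:
            run_max, run_arg = best[r - 1, c - 1], c - 1
        if run_max > NEG:
            best[r, c] = run_max + appr_0[r, c]
            prev[r, c] = run_arg

# A path ends either in the last row (any column) or in an earlier row
# by picking the last column.
candidates = [(best[r, n_cols - 1], r, n_cols - 1) for r in range(n_rows)]
candidates += [(best[n_rows - 1, c], n_rows - 1, c) for c in range(n_cols)]
score, r, c = max(candidates)

picks = []
while r >= 0:                       # walk the backpointers up the rows
    picks.append(c)
    c = prev[r, c]
    r -= 1
picks.reverse()
print("max score:", score, "columns picked (0-based):", picks)

This runs in O(rows x cols) time instead of enumerating all 14^6 combinations.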

Returning function calls and other info when the optimization solves

I am benchmarking multiple problems for multiple systems using Gekko, and I would like my code to return the function calls, iterations, and time it takes to solve. I know that the solver automatically prints all of this data, but is there an object or attribute that can be returned so my function can report the numeric values?
Here is an example of how the code is set up.
from gekko import GEKKO
import numpy as np

def model(plot=False):
    t = np.linspace(0, 1, 101)
    m = GEKKO(remote=False); m.time = t
    fe = m.Param(np.cos(2*np.pi*t)+3)
    de = m.Var(fe[0])
    e = m.CV(0); e.STATUS = 1; e.SPHI = e.SPLO = 0; e.WSPHI = 1000; e.WSPLO = 1
    der = m.MV(0, lb=-1, ub=1); der.STATUS = 1
    m.Equations([de.dt() == der, e == fe - de])
    m.options.IMODE = 6; m.solve()
    if plot:
        import matplotlib.pyplot as plt
        plt.plot(t, fe)
        plt.plot(t, de)
        plt.plot(t, der)
        plt.show()
    return m.fcalls  # placeholder: is there an attribute like this?

if __name__ == "__main__":
    model(plot=True)
The objective function, iterations, solve time, and solution status are available in Gekko with:
m.options.OBJFCNVAL
m.options.ITERATIONS
m.options.SOLVETIME
m.options.APPSTATUS
You could return these as a list as I've done with summary.
from gekko import GEKKO
import numpy as np

def model(plot=False):
    t = np.linspace(0, 1, 101)
    m = GEKKO(remote=False); m.time = t
    fe = m.Param(np.cos(2*np.pi*t)+3)
    de = m.Var(fe[0])
    e = m.CV(0); e.STATUS = 1; e.SPHI = e.SPLO = 0; e.WSPHI = 1000; e.WSPLO = 1
    der = m.MV(0, lb=-1, ub=1); der.STATUS = 1
    m.Equations([de.dt() == der, e == fe - de])
    m.options.DIAGLEVEL = 1
    m.options.SOLVER = 1
    m.options.IMODE = 6; m.solve()
    if plot:
        import matplotlib.pyplot as plt
        plt.plot(t, fe)
        plt.plot(t, de)
        plt.plot(t, der)
        plt.savefig('result.png')
    return [m.options.OBJFCNVAL,
            m.options.ITERATIONS,
            m.options.SOLVETIME,
            m.options.APPSTATUS]

if __name__ == "__main__":
    summary = model(plot=True)
    print(summary)
If you want function calls, it is a little more complicated because there are different types of function calls: calls for the objective function and constraints, calls for 1st derivatives, and calls for 2nd derivatives. You can get a complete report of all the subroutine calls, with individual and cumulative times for each, by setting m.options.DIAGLEVEL=1 or higher. Here is the solver output for this problem:
Number of state variables: 1900
Number of total equations: - 1800
Number of slack variables: - 0
---------------------------------------
Degrees of freedom : 100
----------------------------------------------
Dynamic Control with APOPT Solver
----------------------------------------------
Iter Objective Convergence
0 9.81590E+01 1.00000E+00
1 7.62224E+01 4.00000E-10
2 7.62078E+01 1.10674E-02
3 7.62078E+01 1.00000E-10
4 7.62078E+01 8.32667E-17
5 7.62078E+01 8.32667E-17
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.5382 sec
Objective : 76.20778997271815
Successful solution
---------------------------------------------------
Some solvers, like IPOPT, don't have the iterations readily available from the API so they are always reported as zero. With APOPT, the summary list is [76.207789973, 5, 0.5253, 1]. The timing and function call report is after the solver summary.
Timer # 1 0.70/ 1 = 0.70 Total system time
Timer # 2 0.54/ 1 = 0.54 Total solve time
Timer # 3 0.05/ 9 = 0.01 Objective Calc: apm_p
Timer # 4 0.00/ 5 = 0.00 Objective Grad: apm_g
Timer # 5 0.02/ 9 = 0.00 Constraint Calc: apm_c
Timer # 6 0.00/ 0 = 0.00 Sparsity: apm_s
Timer # 7 0.00/ 0 = 0.00 1st Deriv #1: apm_a1
Timer # 8 0.00/ 5 = 0.00 1st Deriv #2: apm_a2
Timer # 9 0.02/ 200 = 0.00 Custom Init: apm_custom_init
Timer # 10 0.00/ 200 = 0.00 Mode: apm_node_res::case 0
Timer # 11 0.00/ 600 = 0.00 Mode: apm_node_res::case 1
Timer # 12 0.00/ 200 = 0.00 Mode: apm_node_res::case 2
Timer # 13 0.00/ 400 = 0.00 Mode: apm_node_res::case 3
Timer # 14 0.00/ 4800 = 0.00 Mode: apm_node_res::case 4
Timer # 15 0.00/ 2000 = 0.00 Mode: apm_node_res::case 5
Timer # 16 0.00/ 0 = 0.00 Mode: apm_node_res::case 6
Timer # 17 0.00/ 5 = 0.00 Base 1st Deriv: apm_jacobian
Timer # 18 0.02/ 5 = 0.00 Base 1st Deriv: apm_condensed_jacobian
Timer # 19 0.00/ 1 = 0.00 Non-zeros: apm_nnz
Timer # 20 0.00/ 0 = 0.00 Count: Division by zero
Timer # 21 0.00/ 0 = 0.00 Count: Argument of LOG10 negative
Timer # 22 0.00/ 0 = 0.00 Count: Argument of LOG negative
Timer # 23 0.00/ 0 = 0.00 Count: Argument of SQRT negative
Timer # 24 0.00/ 0 = 0.00 Count: Argument of ASIN illegal
Timer # 25 0.00/ 0 = 0.00 Count: Argument of ACOS illegal
Timer # 26 0.00/ 1 = 0.00 Extract sparsity: apm_sparsity
Timer # 27 0.00/ 17 = 0.00 Variable ordering: apm_var_order
Timer # 28 0.00/ 1 = 0.00 Condensed sparsity
Timer # 29 0.00/ 0 = 0.00 Hessian Non-zeros
Timer # 30 0.00/ 3 = 0.00 Differentials
Timer # 31 0.00/ 0 = 0.00 Hessian Calculation
Timer # 32 0.00/ 0 = 0.00 Extract Hessian
Timer # 33 0.00/ 1 = 0.00 Base 1st Deriv: apm_jac_order
Timer # 34 0.06/ 1 = 0.06 Solver Setup
Timer # 35 0.40/ 1 = 0.40 Solver Solution
Timer # 36 0.00/ 23 = 0.00 Number of Variables
Timer # 37 0.00/ 12 = 0.00 Number of Equations
Timer # 38 0.05/ 17 = 0.00 File Read/Write
Timer # 39 0.00/ 1 = 0.00 Dynamic Init A
Timer # 40 0.02/ 1 = 0.02 Dynamic Init B
Timer # 41 0.02/ 1 = 0.02 Dynamic Init C
Timer # 42 0.00/ 1 = 0.00 Init: Read APM File
Timer # 43 0.00/ 1 = 0.00 Init: Parse Constants
Timer # 44 0.00/ 1 = 0.00 Init: Model Sizing
Timer # 45 0.00/ 1 = 0.00 Init: Allocate Memory
Timer # 46 0.00/ 1 = 0.00 Init: Parse Model
Timer # 47 0.00/ 1 = 0.00 Init: Check for Duplicates
Timer # 48 0.00/ 1 = 0.00 Init: Compile Equations
Timer # 49 0.00/ 1 = 0.00 Init: Check Uninitialized
Timer # 50 0.00/ 205 = 0.00 Evaluate Expression Once
Timer # 51 0.00/ 0 = 0.00 Sensitivity Analysis: LU Factorization
Timer # 52 0.00/ 0 = 0.00 Sensitivity Analysis: Gauss Elimination
Timer # 53 0.00/ 0 = 0.00 Sensitivity Analysis: Total Time
Timers 3, 4, and 5 are probably most relevant to your question. They are objective function requests, 1st derivative requests, and constraint evaluation requests.

Swap and group column names in a pandas DataFrame

I have a data frame with several quantitative columns and one qualitative column. I would like to use describe to compute stats grouped by the qualitative column, but I do not obtain the order I want for the levels. Here is an example:
df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]
The DataFrame looks like:
A B C qual
0 0.298217 0.675818 0.076533 init
1 0.015442 0.264924 0.624483 init
2 0.096961 0.702419 0.027134 init
3 0.481312 0.910477 0.796395 init
4 0.166774 0.319054 0.645250 init
5 0.609148 0.697818 0.151092 final
6 0.715744 0.067429 0.761562 final
7 0.748201 0.803647 0.482738 final
8 0.098323 0.614257 0.232904 final
9 0.033003 0.590819 0.943126 final
Now I would like to group by the qual column and compute statistical descriptors using describe. I did the following:
ddf = df.groupby("qual").describe().transpose()
ddf.unstack(level=0)
And I got
qual final init
A B C A B C
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.440884 0.554794 0.514284 0.211741 0.574539 0.433959
std 0.347138 0.284931 0.338057 0.182946 0.274135 0.355515
min 0.033003 0.067429 0.151092 0.015442 0.264924 0.027134
25% 0.098323 0.590819 0.232904 0.096961 0.319054 0.076533
50% 0.609148 0.614257 0.482738 0.166774 0.675818 0.624483
75% 0.715744 0.697818 0.761562 0.298217 0.702419 0.645250
max 0.748201 0.803647 0.943126 0.481312 0.910477 0.796395
I am close to what I want, but I would like to swap and group the column index like this:
A B C
qual initial final initial final initial final
Is there a way to do it?
Use columns.swaplevel and then sort_index by level=0 and axis='columns':
ddf = df.groupby('qual').describe().T.unstack(level=0)
ddf.columns = ddf.columns.swaplevel(0,1)
ddf = ddf.sort_index(level=0, axis='columns')
Or in one line using DataFrame.swaplevel instead of index.swaplevel:
ddf = ddf.swaplevel(0,1, axis=1).sort_index(level=0, axis='columns')
A B C
qual final init final init final init
count 5.00 5.00 5.00 5.00 5.00 5.00
mean 0.44 0.21 0.55 0.57 0.51 0.43
std 0.35 0.18 0.28 0.27 0.34 0.36
min 0.03 0.02 0.07 0.26 0.15 0.03
25% 0.10 0.10 0.59 0.32 0.23 0.08
50% 0.61 0.17 0.61 0.68 0.48 0.62
75% 0.72 0.30 0.70 0.70 0.76 0.65
max 0.75 0.48 0.80 0.91 0.94 0.80
Alternatively, try ddf.stack().unstack(level=[0,2]) in place of ddf.unstack(level=0).
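A minimal self-contained sketch of that alternative, assuming ddf is the transposed describe() result from the question (before the unstack):

import numpy as np
import pandas as pd

df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]

ddf = df.groupby("qual").describe().transpose()

# Stack qual into the row index, then unstack both the variable names
# (level 0) and qual (level 2): columns come out grouped per variable.
out = ddf.stack().unstack(level=[0, 2])
print(out)

The columns then form a MultiIndex grouped as (A, final), (A, init), (B, final), and so on.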

Finding nearest neighbors of PDB models using a k-d tree

I am stuck trying to query the nearest neighbors of models from a PDB file using scipy's kd-tree. I currently have a brute-force approach where I compare each model's RMSD value against every other model. I would like to speed up finding each model's nearest neighbors by using a kd-tree.
For reference, a sample of the pdb file I am working with has multiple models in a single file:
MODEL 5
HETATM 1 C1 SIN A 0 13.542 -2.290 0.745 1.00 0.00 C
HETATM 2 O1 SIN A 0 14.446 -2.652 0.010 1.00 0.00 O
HETATM 3 O2 SIN A 0 12.378 -2.189 0.395 1.00 0.00 O
...
TER 627 NH2 A 39
ENDMDL
MODEL 6
HETATM 1 C1 SIN A 0 11.762 2.281 -7.835 1.00 0.00 C
ATOM 26 C TRP A 2 11.341 6.316 -0.847 1.00 0.00 C
ATOM 27 O TRP A 2 11.074 6.179 0.330 1.00 0.00 O
ATOM 28 CB TRP A 2 13.182 7.844 -1.538 1.00 0.00 C
ATOM 29 CG TRP A 2 12.069 8.524 -2.259 1.00 0.00 C
...
HETATM 626 HN2 NH2 A 39 3.093 9.404 -6.782 1.00 0.00 H
TER 627 NH2 A 39
ENDMDL
MODEL 7
HETATM 1 C1 SIN A 0 -16.074 -1.515 -4.262 1.00 0.00 C
HETATM 2 O1 SIN A 0 -16.968 -1.910 -4.992 1.00 0.00 O
...
ATOM 18 OD1 ASP A 1 -12.877 3.426 -8.525 1.00 0.00 O
ATOM 19 OD2 ASP A 1 -13.484 1.785 -9.782 1.00 0.00 O
TER 627 NH2 A 39
ENDMDL
My initial attempt was to represent each model as a list of atom coordinates, where each 3D atom coordinate is itself a list:
print(model_coord)
[
[[1.4579, 0.0, 0.0],... ,[-5.5, 21.5529, 23.7390]],
[[16.5450, 3.3699, 10.1888], ... ,[-0.0963, 24.510883331298828, 20.2952]],
[[17.6256, 2.5858, 12.4808],... ,[-11.6052, 13.1031, 23.8958]]
]
I then received the following error when creating the kdtree object:
kdtree = scipy.spatial.KDTree(model_coord)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 235, in __init__
self.n, self.m = np.shape(self.data)
ValueError: too many values to unpack
However, converting model_coord into a pandas DataFrame gave me the n-by-m shape required to create the kdtree object, where each row represents a model and each column a 3D atom coordinate:
model_df = pd.DataFrame(model_coord)
print(model_df.to_string())
0 1 2 ...
0 [1.45799, 0.0, 0.0] [3.9140, 2.8670, 0.4530] [7.590, 3.7990, 0.1850] ...
1 [16.5450, 3.3699, 10.1888] [15.9148, 1.9402, 13.6552] [14.4702, 2.6485, 17.0995] ...
2 [17.6256, 2.5858, 12.4808] [16.4266, 2.2781, 16.0749] [12.6480, 2.6846, 16.0066] ...
Here is my attempt to query the nearest neighbors of a model within a radius, where epsilon is the radius:
kdtree = scipy.spatial.KDTree(model_df)
for index, model in model_df.iterrows():
model_nn_dist, model_nn_ids = kdtree.query(model,distance_upper_bound=epsilon)
I received the following error, due to the coordinates being list objects:
model_nn_dist, model_nn_ids=kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
hits = self.__query(x, k=k, eps=eps, p=p,distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
TypeError: unsupported operand type(s) for -: 'list' and 'list'
I attempted to resolve this by converting the atom coordinates into numpy arrays; however, this is the error I received:
model_nn_dist, model_nn_ids = kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
hits = self.__query(x, k=k, eps=eps, p=p, distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am wondering if there is a better approach or a more suitable data structure to query nearest neighbors of models or sets of coordinates, using kd-trees.
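A hedged sketch of one such approach (an assumption about what fits here, not a tested recipe): flatten each model into a single 1-D coordinate vector, so the tree indexes n models as points in 3N-dimensional space. This requires every model to have the same atoms in the same order:

import numpy as np
from scipy.spatial import cKDTree

# model_coord: n_models lists of [x, y, z] atom coordinates, all models
# having the same number of atoms in the same order.
coords = np.asarray(model_coord, dtype=float)   # (n_models, n_atoms, 3)
n_models, n_atoms, _ = coords.shape
flat = coords.reshape(n_models, n_atoms * 3)    # one 3N-dim point per model

tree = cKDTree(flat)

# Euclidean distance between flattened models = sqrt(n_atoms) * RMSD,
# so scale the RMSD cutoff (epsilon) into the tree's distance units.
epsilon = 2.0                                   # example RMSD cutoff
radius = epsilon * np.sqrt(n_atoms)

# All neighbours within the radius, for every model at once.
neighbors = tree.query_ball_point(flat, r=radius)
for i, nbrs in enumerate(neighbors):
    nbrs = [j for j in nbrs if j != i]          # drop the self-match
    print("model", i, "neighbours:", nbrs)

Note this distance is only meaningful if the models are already superimposed; if they are not, each pair would need alignment first, which a kd-tree cannot express directly.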

Pairwise Elements Using Python - Calculating Average of individual elements of array

So I have a query; I am accessing an API that gives the following response:
[["22014",201939,"0021401229","APR 15 2015",Team1 vs. Team2","W",
19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],["22014",201939,"0021401","APR
13 2015",Team1 vs. Team3","W",
15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],["22014",201939,"0021401192","APR
11 2015",Team1 vs. Team4","W",
22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]
I could just as easily have 16 different variables that I assign zero to, then print them out like the following example:
sum_pts = 0
for n in range(0, len(shot_data)):  # range of games; these lengths vary per player
    sum_pts = sum_pts + float(json.dumps(shots_array[n][24]))
print sum_pts / float(len(shots_array))
Output:
>>>
23.75
But I'd rather not create 16 different variables to calculate the average of each individual element in this list. I'm looking for an easier way to get the averages for Team1.
I would like the output to eventually look like the following, so that I can apply this to any number of players or individual stats:
Team1 AVGPTS AVGAST AVGSTL AVGREB...
23.75 5.3 2.1 3.2
Or it could be:
Player1 AVGPTS AVGAST AVGSTL AVGREB ...
23.75 5.3 2.1 3.2 ...
To get the averages of the trailing numeric entries in each record, you could use the following approach, which avoids the need to define a separate variable for each column:
data = [
    ["22014", 201939, "0021401229", "APR 15 2015", "Team1 vs. Team2", "W", 19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],
    ["22014", 201939, "0021401", "APR 13 2015", "Team1 vs. Team3", "W", 15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],
    ["22014", 201939, "0021401192", "APR 11 2015", "Team1 vs. Team4", "W", 22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]
length = float(len(data))
values = []
for entry in data:
    values.append(entry[6:])        # keep only the numeric stats
values = zip(*values)               # transpose: one tuple per stat column
averages = [sum(v) / length for v in values]
for col in averages:
    print "{:.2f} ".format(col),
This would display:
18.67 4.33 11.00 0.40 2.00 6.00 0.50 0.00 0.00 0.00 2.00 2.00 4.00 7.00 5.00 0.00 4.00 1.00 10.00 14.00 1.00
Note: your data is missing an opening quote before each "Team1 vs. ..." string.
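For labelled output like the table in the question, here is a hedged sketch using pandas; the stat names below are hypothetical placeholders, so map them to the real field order of the API:

import pandas as pd

# Hypothetical labels for the 21 trailing numeric columns; adjust these
# to match the actual API field order.
stats = ["MIN", "FGM", "FGA", "FG_PCT", "FG3M", "FG3A", "FG3_PCT",
         "FTM", "FTA", "FT_PCT", "OREB", "DREB", "REB", "AST",
         "STL", "BLK", "TOV", "PF", "PTS", "PLUS_MINUS", "W_FLAG"]

df = pd.DataFrame([entry[6:] for entry in data], columns=stats)
averages = df.mean()
print(averages[["PTS", "AST", "STL", "REB"]])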
