Optimization for loop in a huge list of tuples - python

I have a list of tuples called permuted_trucks with 38,320,568 entries, and each tuple in the list holds 7 values; I am trying to insert into another list the sum of the values for each tuple.
In the code below, cargo_list is a list containing the names of the cargos (an np array of size 7) and distances_df is a (44, 7) pandas dataframe.
truck_list is an np array of size 49 holding the same values that appear in the tuples. Each tuple represents one combination of the 49 trucks picking 7 different products.
I am running this loop:
for i in range(0, len(permuted_trucks)):
    total_distance = 0.0
    for j in range(0, len(cargo_list)):
        truck_index = np.where(truck_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df.iloc[truck_index][j]
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index.append(i)
    all_distances.append(total_distance)
The problem is that it's extremely slow, and I am looking for a way to optimize it.
Can someone help?
Example of the tuples:
[('Hartford', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Macomb'),
 ('Home', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Robert'),
 ('Horse', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Viking')]
The line below accumulates the distances contained in the dataframe distances_df:
total_distance += distances_df.iloc[truck_index][j]
The output all_distances would be an array of size 38,320,568 with the total distances as values, like:
[34125, 21252, 13232, 512313, ..., 31231]

One immediate improvement in speed can be achieved by using a numpy array instead of the dataframe; element access on dataframes is slow.
Edit: added code to show the timing differences between the different executions.
Importing the required modules:
import time
import pandas as pd
import numpy as np
Since we don't have the complete data, this is what I am using to show the difference:
## All arrays
permuted_trucks = np.random.randint(7, size=(100000, 7))
cargo_list = np.random.randint(7, size=(1, 7))
truck_list = np.random.randint(7, size=(49, 1))
## Converting the arrays to lists to show the difference between lists and arrays
permuted_trucks_list = permuted_trucks.tolist()
cargo_list_list = cargo_list.tolist()
truck_list_list = truck_list.tolist()
## Array
distances_df_array = np.random.randint(7, size=(44, 7))
## Dataframe
distances_df = pd.DataFrame(distances_df_array)
Now let's time your original approach: lists and a dataframe.
# Time taken for lists and dataframe
start_time = time.time()
all_distances = []
all_distances_index = []
best_distance = 10
for i in range(0, len(permuted_trucks_list)):
    total_distance = 0.0
    for j in range(0, len(cargo_list_list)):
        truck_index = np.where(truck_list_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df.iloc[truck_index][j]  # using the dataframe
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index.append(i)
    all_distances.append(total_distance)
end_time = time.time()
print("time taken for list and data frame : {}".format(end_time - start_time))
Output: time taken for list and data frame : 20.7517249584198
Let's see what happens when we use lists and an array (avoiding the dataframe).
# Time taken for lists and array
start_time = time.time()
all_distances = []
all_distances_index = []
best_distance = 10
for i in range(0, len(permuted_trucks_list)):
    total_distance = 0.0
    for j in range(0, len(cargo_list_list)):
        truck_index = np.where(truck_list_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df_array[truck_index, j]  # using the array here
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index.append(i)
    all_distances.append(total_distance)
end_time = time.time()
print("time taken for list and array : {}".format(end_time - start_time))
Output: time taken for list and array : 3.075411319732666
You can see a drastic improvement in the execution time.
Finally, let's also check with arrays everywhere.
# Time taken for numpy arrays without vectorization
start_time = time.time()
all_distances_array = np.zeros((100000, 1))
all_distances_index_array = np.zeros((100000, 1))
best_distance = 10
for i in range(0, len(permuted_trucks)):
    total_distance = 0.0
    for j in range(0, len(cargo_list)):
        truck_index = np.where(truck_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df_array[truck_index, j]  # using the array here
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index_array[i] = i
    all_distances_array[i] = total_distance
end_time = time.time()
print("time taken for numpy arrays : {}".format(end_time - start_time))
Output: time taken for numpy arrays : 1.1893165111541748
Now you can see the difference: how slow dataframes are. NumPy can be much faster still if you can vectorize the computation, but that can only be verified with the original data.
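For reference, here is a minimal fully vectorized sketch. It assumes the truck names have already been translated to integer row indices of the distance matrix; the mapping and the names below (row_of, permuted_trucks_idx) are hypothetical, not from the question:
import numpy as np

rng = np.random.default_rng(0)
distances = rng.integers(0, 7, size=(44, 7))                 # stands in for distances_df
# assume every tuple entry was mapped to a row index beforehand, e.g. via
# row_of = {name: i for i, name in enumerate(truck_list)}
permuted_trucks_idx = rng.integers(0, 44, size=(100000, 7))

# element [n, j] picks the distance of tuple n's truck in cargo slot j;
# summing over axis 1 adds the 7 slots of every tuple in one shot
all_distances = distances[permuted_trucks_idx, np.arange(7)].sum(axis=1)

best_distance_index = int(all_distances.argmin())
best_distance = all_distances[best_distance_index]
This removes the Python-level loops entirely. For the full 38,320,568 tuples the index array alone is about 2 GB as int64, so processing it in chunks may be necessary.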

Related

Cannot combine lists from output

I have the following program. It seems that amp and period at the end print out as a list of lists (see below), and I am unable to plot them (I want to plot period against amp).
I have tried the methods in "How to make a flat list out of list of lists?" to combine the output of amp and period so that they are plottable, but nothing worked.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

def derivatives(t, y, q, F):
    return [y[1], -np.sin(y[0]) - q*y[1] + F*np.sin((2/3)*t)]

t = np.linspace(0.0, 100, 10000)
# initial conditions
theta0 = np.linspace(0.0, np.pi, 100)
q = 0.0  # alpha / (mass*g), resistive term
F = 0.0  # G*np.sin(2*t/3)

for i in range(0, 100):
    sol = solve_ivp(derivatives, (0.0, 100.0), (theta0[i], 0.0), method='RK45', t_eval=t, args=(q, F))
    velocity = sol.y[1]
    time = sol.t
    zero_cross = 0
    value = []
    amp = []
    period = []
    for k in range(len(velocity) - 1):
        if (velocity[k+1] * velocity[k]) < 0:
            zero_cross += 1
            value.append(k)
        else:
            zero_cross += 0
    zero_cross = zero_cross - zero_cross % 2  # makes the total number of zero-crossings even
    if zero_cross != 0:
        amp.append(theta0[i])
        # period calculated using the time evolved between the first and last zero-crossing detected
        period.append((2 * (time[value[zero_cross - 1]] - time[value[0]])) / (zero_cross - 1))
If I print out amp inside the loop, it displays as follows:
[0.03173325912716963]
[0.06346651825433926]
[0.0951997773815089]
[0.12693303650867852]
[0.15866629563584814]
[0.1903995547630178]
[0.2221328138901874]
[0.25386607301735703]
[0.28559933214452665]
[0.3173325912716963]
[0.3490658503988659]
[0.3807991095260356]
[0.4125323686532052]
[0.4442656277803748]
[0.47599888690754444]
[0.5077321460347141]
[0.5394654051618837]
[0.5711986642890533]
[0.6029319234162229]
[0.6346651825433925]
[0.6663984416705622]
[0.6981317007977318]
[0.7298649599249014]
[0.7615982190520711]
[0.7933314781792408]
[0.8250647373064104]
[0.85679799643358]
[0.8885312555607496]
[0.9202645146879193]
[0.9519977738150889]
[0.9837310329422585]
[1.0154642920694281]
[1.0471975511965979]
[1.0789308103237674]
[1.110664069450937]
[1.1423973285781066]
[1.1741305877052763]
[1.2058638468324459]
[1.2375971059596156]
[1.269330365086785]
[1.3010636242139548]
[1.3327968833411243]
[1.364530142468294]
[1.3962634015954636]
[1.4279966607226333]
[1.4597299198498028]
[1.4914631789769726]
[1.5231964381041423]
[1.5549296972313118]
[1.5866629563584815]
[1.618396215485651]
[1.6501294746128208]
[1.6818627337399903]
[1.71359599286716]
[1.7453292519943295]
[1.7770625111214993]
[1.8087957702486688]
[1.8405290293758385]
[1.872262288503008]
[1.9039955476301778]
[1.9357288067573473]
[1.967462065884517]
[1.9991953250116865]
[2.0309285841388562]
[2.0626618432660258]
[2.0943951023931957]
[2.126128361520365]
[2.1578616206475347]
[2.1895948797747042]
[2.221328138901874]
[2.2530613980290437]
[2.284794657156213]
[2.3165279162833827]
[2.3482611754105527]
[2.379994434537722]
[2.4117276936648917]
[2.443460952792061]
[2.475194211919231]
[2.5069274710464007]
[2.53866073017357]
[2.57039398930074]
[2.6021272484279097]
[2.633860507555079]
[2.6655937666822487]
[2.6973270258094186]
[2.729060284936588]
[2.7607935440637577]
[2.792526803190927]
[2.824260062318097]
[2.8559933214452666]
[2.887726580572436]
[2.9194598396996057]
[2.9511930988267756]
[2.982926357953945]
[3.0146596170811146]
[3.141592653589793]
[Finished in 3.822s]
I am not sure what type of output that is or how to handle it; any help would be appreciated!
You are declaring the lists inside the loop, which means they are reset to empty on every iteration. Declare amp, period, and any other list that should start empty only once (as initial state) before the loop, like so:
# initialize the lists, executed only once before the loop
amp = []
period = []
for i in range(0, 100):
    # your logic here, plus appending values to `amp` and `period`
# now `amp` and `period` should contain all the desired values
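With the lists surviving the loop, plotting period against amp is then straightforward (a minimal sketch, assuming the loop above has filled both lists):
import matplotlib.pyplot as plt

plt.plot(amp, period, 'o')   # period against amp, one marker per initial condition
plt.xlabel('initial amplitude theta0')
plt.ylabel('period')
plt.show()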

Append a 2D array while looping through it

I want to store certain values in a 2D array. In the code below I want sT to be stored in total: when the inner loop runs, the values should fill the rows, and the outer loop increment should move on to the next column.
import numpy as np

class pricing_lookback:
    def __init__(self, spot, rate, sigma, time, sims, steps):
        self.spot = spot
        self.rate = rate
        self.sigma = sigma
        self.time = time
        self.sims = sims
        self.steps = steps
        self.dt = self.time / self.steps

    def call_floatingstrike(self):
        simulationS = np.array([])
        simulationSt = np.array([])
        call2 = np.array([])
        total = np.empty(shape=[self.steps, self.sims])
        for j in range(self.sims):
            sT = self.spot
            pathwiseminS = np.array([])
            for i in range(self.steps):
                phi = np.random.normal()
                sT *= np.exp((self.rate - 0.5*self.sigma*self.sigma)*self.dt + self.sigma*phi*np.sqrt(self.dt))
                pathwiseminS = np.append(pathwiseminS, sT)
                np.append(total, [[j, sT]])  # this should store values in the rows of column j
            # print(pathwiseminS)
            call2 = np.append(call2, max(pathwiseminS[self.steps-1] - self.spot, 0))
            # print(pathwiseminS[self.steps-1])
            # print(call2)
            simulationSt = np.append(simulationSt, pathwiseminS[self.steps-1])
            simulationS = np.append(simulationS, min(pathwiseminS))
        call = max(np.average(simulationSt) - np.average(simulationS), 0)
        return call, total  # , call2,
Here is a simple example of what I think you are trying to do:
for i in range(5):
    row = np.random.rand(5,)
    if i == 0:
        my_array = row
    else:
        my_array = np.vstack((my_array, row))
    print(row)
However, this is not very efficient with memory, especially if you are dealing with large arrays, as this has to allocate new memory on every loop. It would be much better to preallocate an empty array and then populate it if possible.
To answer the question of how to append a column, it would be something like this:
import numpy as np

x = np.random.rand(5, 4)
column_to_append = np.random.rand(5,)
x = np.insert(x, x.shape[1], column_to_append, axis=1)  # np.insert returns a new array
Again, this is not memory efficient and should be avoided whenever possible. Preallocation is much better.
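For completeness, a minimal preallocation sketch (the dimensions are made up for the example):
import numpy as np

n_rows, n_cols = 5, 4
out = np.empty((n_rows, n_cols))        # allocate the full array once
for i in range(n_rows):
    out[i] = np.random.rand(n_cols)     # write each row in place, no reallocation
Applied to the question's code, total could be filled the same way, e.g. total[i, j] = sT inside the loops, instead of calling np.append (which returns a new array, so the result is discarded as written).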

Speeding a numpy correlation program using the fact that lists are sorted

I am currently using Python and numpy to calculate correlations between two lists: data_0 and data_1. Each list contains respectively sorted times t0 and t1.
I want to find all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Doing so, as the lists are sorted, I am selecting a subarray of data_1 of the form data_1[index_min:index_max], so in fact I need to find two indices to get what I want.
What's interesting is that when I move to the next time_0, as data_0 is also sorted, I only need to find the new index_min / index_max such that new_index_min >= index_min and new_index_max >= index_max, meaning that I don't need to scan all of data_1 again from scratch.
I have implemented such a solution without the numpy methods (just with a while loop) and it gives me the same results as before, but it is not as fast (15 times longer!).
Since it normally requires fewer operations, I think there should be a way to make it faster using numpy methods, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am being clear, so if you have any questions, do not hesitate.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np

def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
    split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1

def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span - data[::-1], incl[1])[::-1]
    return index_min, index_max

ref = np.sort(np.random.randint(0, 20000, (10000,)))
data = np.sort(np.random.randint(0, 20000, (10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))

print('checking')
for d, mn, mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d + span
    assert mx == 0 or ref[mx-1] <= d + span
print('ok')
It works by:
- indirectly sorting both sets together
- finding, for each time in one set, the preceding time in the other (this is done using maximum.accumulate)
- applying the preceding steps twice; the second time, the times in one set are shifted by span
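For already-sorted inputs, np.searchsorted can also produce the index pairs directly; this is a simpler alternative sketch under the question's 0 <= t1 - t0 < time_max filter, not the method above:
import numpy as np

data_0 = np.sort(np.random.rand(10000) * 20000)
data_1 = np.sort(np.random.rand(10000) * 20000)
time_max = 2.0

# for each t0 = data_0[i], data_1[index_min[i]:index_max[i]] holds the
# t1 with 0 <= t1 - t0 < time_max, matching the filters in the question
index_min = np.searchsorted(data_1, data_0, side='left')
index_max = np.searchsorted(data_1, data_0 + time_max, side='left')
counts = index_max - index_min    # number of matching events per t0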

optimizing indexing and retrieval of elements in numpy arrays in Python?

I'm trying to optimize the following code, potentially by rewriting it in Cython: it simply takes a low-dimensional but relatively long numpy array, looks in one of its columns for 0 values, and marks those rows as -1 in a result array. The code is:
import time
import numpy as np

def get_data():
    data = np.array([[1, 5, 1]] * 5000 + [[1, 0, 5]] * 5000 + [[0, 0, 0]] * 5000)
    return data

def get_cols(K):
    cols = np.array([2] * K)
    return cols

def test_nonzero(data):
    K = len(data)
    result = np.array([1] * K)
    # index into the columns of data
    cols = get_cols(K)
    # mark zero points with -1
    idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
    result[idx] = -1

t_start = time.time()
data = get_data()
for n in range(5000):
    test_nonzero(data)
t_end = time.time()
print(t_end - t_start)
data is the data. cols is the array of columns of data in which to look for zero values (for simplicity, I made them all the same column). The goal is to compute a numpy array, result, which has a 1 for each row where the column of interest is non-zero, and a -1 for the rows where the corresponding column of interest has a zero.
Running this function 5000 times on a not-so-large array of 15,000 rows by 3 columns takes about 20 seconds. Is there a way it can be sped up? Most of the work appears to go into finding the non-zero elements and retrieving them with indices (the call to nonzero and the subsequent use of its index). Can this be optimized, or is this the best that can be done?
How could a Cython implementation gain speed over this?
cols = np.array([2] * K)
That's going to be really slow: it creates a very large Python list and then converts it into a numpy array. Instead, do something like:
cols = np.ones(K, int) * 2
That will be much faster.
result = np.array([1] * K)
Here you should do:
result = np.ones(K, int)
That will produce the numpy array directly.
idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
result[idx] = -1
cols is an array here, but you can just pass a 2. Furthermore, using nonzero adds an extra step; a boolean mask indexes directly:
idx = data[np.arange(K), 2] == 0
result[idx] = -1
This should have the same effect.
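Putting those suggestions together (a minimal sketch; since every entry of cols is 2 in the question's example, a plain column slice replaces the fancy indexing):
import numpy as np

def test_nonzero_fast(data):
    K = len(data)
    result = np.ones(K, int)         # build the array directly, no Python list
    result[data[:, 2] == 0] = -1     # boolean mask instead of np.nonzero + indexing
    return result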

comparing large vectors in python

I have two large vectors (~133,000 values) of different lengths. They are each sorted from small to large values. I want to find values that are similar within a given tolerance. This is my solution, but it is very slow. Is there a way to speed this up?
import numpy as np

for lv in range(np.size(vector1)):
    for lv_2 in range(np.size(vector2)):
        if np.abs(vector1[lv_2] - vector2[lv]) < .02:
            print(vector1[lv_2], vector2[lv], lv, lv_2)
            break
Your algorithm is far from optimal: you compare far too many values. Suppose you are at a certain position in vector1 and the current value in vector2 is already more than 0.02 bigger. Why would you compare against the rest of vector2?
Start with something like:
pos1 = 0
pos2 = 0
Now compare the values at those positions in your vectors. If the difference is too big, move the position of the smaller one forward and check again. Continue until you reach the end of one vector.
I haven't tested it, but the following should work. The idea is to exploit the fact that the vectors are sorted:
lv_1, lv_2 = 0, 0
while lv_1 < len(vector1) and lv_2 < len(vector2):
    if np.abs(vector1[lv_2] - vector2[lv_1]) < .02:
        print(vector1[lv_2], vector2[lv_1], lv_1, lv_2)
        lv_1 += 1
        lv_2 += 1
    elif vector1[lv_1] < vector2[lv_2]:
        lv_1 += 1
    else:
        lv_2 += 1
The following code gives a nice increase in performance that depends on how dense the numbers are. Using a set of 1000 random numbers sampled uniformly between 0 and 40 (as in the script below), it runs about 30 times faster than your implementation.
pos1_start = 0
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
The timing:
time new method: 0.112464904785
time old method: 3.59720897675
Which is produced by the following script:
import random
import time
import numpy as np

# initialize the vectors to be compared
vector1 = [random.uniform(0, 40) for i in range(1000)]
vector2 = [random.uniform(0, 40) for i in range(1000)]
vector1.sort()
vector2.sort()

# the lists that will contain the results for the first method
results1 = []
# the lists that will contain the results for the second method
results2 = []

pos1_start = 0
t_start = time.time()
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
t1 = time.time() - t_start
print("time new method:", t1)

t = time.time()
for lv1 in range(np.size(vector1)):
    for lv2 in range(np.size(vector2)):
        if np.abs(vector1[lv1] - vector2[lv2]) < .02:
            results2 += [(vector1[lv1], vector2[lv2], lv1, lv2)]
t2 = time.time() - t
print("time old method:", t2)

# sort the results
results1.sort()
results2.sort()
print(np.allclose(results1, results2))
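Since both vectors are sorted, np.searchsorted can also vectorize the search entirely; a minimal sketch (assuming both vectors are numpy arrays, not part of the answers above):
import numpy as np

v1 = np.sort(np.random.uniform(0, 40, 1000))
v2 = np.sort(np.random.uniform(0, 40, 1000))
tol = 0.02

# for each v1[i], the values of v2 within tol of it are v2[lo[i]:hi[i]]
# (boundary-inclusive, which for random floats is effectively the same test)
lo = np.searchsorted(v2, v1 - tol, side='left')
hi = np.searchsorted(v2, v1 + tol, side='right')
pairs = [(v1[i], v2[j], i, j)
         for i in np.nonzero(hi > lo)[0]
         for j in range(lo[i], hi[i])]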
