I have two large vectors (~133,000 values each) of different lengths. Each is sorted from small to large values. I want to find values that are similar within a given tolerance. This is my solution, but it is very slow. Is there a way to speed it up?
import numpy as np

for lv in range(np.size(vector1)):
    for lv_2 in range(np.size(vector2)):
        if np.abs(vector1[lv] - vector2[lv_2]) < .02:
            print(vector1[lv], vector2[lv_2], lv, lv_2)
            break
Your algorithm is far from optimal. You compare far too many values. Assume you are at a certain position in vector1 and the current value in vector2 is already more than 0.02 larger: why would you compare against the rest of vector2?
Start with something like
pos1 = 0
pos2 = 0
Now compare the values at those positions in your vectors. If the difference is too big, move the position of the smaller one forward and check again. Continue until you reach the end of one vector.
I haven't tested it, but the following should work. The idea is to exploit the fact that the vectors are sorted:
lv_1, lv_2 = 0, 0
while lv_1 < len(vector1) and lv_2 < len(vector2):
    if np.abs(vector1[lv_1] - vector2[lv_2]) < .02:
        print(vector1[lv_1], vector2[lv_2], lv_1, lv_2)
        lv_1 += 1
        lv_2 += 1
    elif vector1[lv_1] < vector2[lv_2]:
        lv_1 += 1
    else:
        lv_2 += 1
The following code gives a nice increase in performance that depends upon how dense the numbers are. Using a set of 1000 random numbers, sampled uniformly between 0 and 100, it runs about 30 times faster than your implementation.
pos1_start = 0
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
The timing:
time new method: 0.112464904785
time old method: 3.59720897675
Which is produced by the following script:
import random
import numpy as np
import time

# initialize the vectors to be compared
vector1 = [random.uniform(0, 40) for i in range(1000)]
vector2 = [random.uniform(0, 40) for i in range(1000)]
vector1.sort()
vector2.sort()

# the arrays that will contain the results for the first method
results1 = []
# the arrays that will contain the results for the second method
results2 = []

# new method
pos1_start = 0
t_start = time.time()
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
t1 = time.time() - t_start
print("time new method:", t1)

# old method
t = time.time()
for lv1 in range(np.size(vector1)):
    for lv2 in range(np.size(vector2)):
        if np.abs(vector1[lv1] - vector2[lv2]) < .02:
            results2 += [(vector1[lv1], vector2[lv2], lv1, lv2)]
t2 = time.time() - t
print("time old method:", t2)

# sort the results and check that both methods found the same pairs
results1.sort()
results2.sort()
print(np.allclose(results1, results2))
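If you want to push this further with NumPy itself, np.searchsorted can locate, for every entry of one sorted vector, the whole range of candidate matches in the other in a single vectorized call. The following is only a sketch of that idea, assuming both inputs are sorted 1-D NumPy arrays (it reports pairs with |difference| <= tol, so the boundary case differs slightly from a strict <):

import numpy as np

def find_matches(vector1, vector2, tol=0.02):
    # for each vector1[i], vector2[lo[i]:hi[i]] are the values within +/- tol of it
    lo = np.searchsorted(vector2, vector1 - tol, side='left')
    hi = np.searchsorted(vector2, vector1 + tol, side='right')
    return lo, hi

v1 = np.sort(np.random.uniform(0, 40, 1000))
v2 = np.sort(np.random.uniform(0, 40, 1000))
lo, hi = find_matches(v1, v2)
# expand the ranges into explicit (value1, value2, i, j) pairs, as in the loops above
pairs = [(v1[i], v2[j], i, j)
         for i in np.nonzero(hi > lo)[0]
         for j in range(lo[i], hi[i])]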
I have an extremely large list of coordinates in the form of a list of tuples.
data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]
The list of tuples is actually being built by a for loop with an append, like so:
data = []
for i in source:  # where i is a tuple of the form (x, y)
    data.append(i)
Is there an approach to ensure that the Euclidean distance between all tuples is above a certain threshold? In this example there is a very small distance between (1,1), (1,2), and (2,1), and I would like to keep only one of those three tuples. The result would be any one of these new lists of tuples:
data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21)]
data = [(2,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21)]
data = [(1,2),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21)]
I have a brute-force algorithm that iterates through the list, but there should be a more elegant or quicker way to do this. Are there any other methods to speed up this operation? I am expecting lists of ~70k up to 500k tuples.
My method:
from scipy.spatial.distance import euclidean

data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]

new_data = []
while len(data) > 0:
    check = data.pop()
    flag = True
    for i in data:
        if euclidean(check, i) < 5:
            flag = False
            break
        else:
            pass
    if flag == True:
        new_data.append(check)
    else:
        flag = True
Additional points:
Although the list of tuples is coming from some iterative function, the order of tuples is uncertain.
Actual number of tuples is unknown until end of for loop.
I would rather avoid multiprocessing/multithreading for speed up in this scenario.
If necessary I can put up some timings, but I don't think it's necessary.
The solution I have right now has O(n(n-1)/2) time complexity and, I think, O(n) space complexity, so any improvement would be welcome.
You can organize your 2D data/tuples using a Quadtree.
Quadtrees are the two-dimensional analog of octrees and are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions.
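If you would rather not implement a quadtree yourself, a k-d tree from SciPy gives the same kind of spatial pruning out of the box. This is only a sketch of that idea, assuming the distance threshold of 5 from the question; the greedy rule of keeping the first point seen in each cluster is just one possible choice:

import numpy as np
from scipy.spatial import cKDTree

data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]
points = np.asarray(data, dtype=float)

tree = cKDTree(points)
keep = np.ones(len(points), dtype=bool)
for i, p in enumerate(points):
    if not keep[i]:
        continue
    # indices of all points within distance 5 of p (includes i itself)
    for j in tree.query_ball_point(p, r=5):
        if j != i:
            keep[j] = False  # discard close neighbours of a point we keep

new_data = [tuple(pt) for pt in points[keep]]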
You can use NumPy; try this:
import time
import numpy as np

data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]

start_time = time.time()
# transform to a numpy array
a = np.array(data)
subs = a[:, None] - a
# calculate the Euclidean distance between all elements
dist = np.sqrt(np.einsum('ijk,ijk->ij', subs, subs))
# replace 0 with 5, because the distance between an element and itself is 0
dist = np.where(dist == 0, 5, dist)
# flag elements whose distance to the first element is below 5
dist_bool = [dist[:, 0] < 5]
# keep only the elements for which that flag is False
a = a[dist_bool[0] == False]
print("--- %s seconds ---" % (time.time() - start_time))  # got --- 0.00020575523376464844 seconds ---
When we compare with your solution:
from scipy.spatial.distance import euclidean

start_time = time.time()
new_data = []
while len(data) > 0:
    check = data.pop()
    flag = True
    for i in data:
        if euclidean(check, i) < 5:
            flag = False
            break
        else:
            pass
    if flag == True:
        new_data.append(check)
    else:
        flag = True
print("--- %s seconds ---" % (time.time() - start_time))  # got --- 0.001013040542602539 seconds ---
I have the following program. It seems that amp and period at the end print out as a list of lists (see below), and I am unable to plot them (I want to plot period against amp).
I have tried the methods in How to make a flat list out of list of lists? to combine the output of amp and period so that they are plottable, but nothing worked.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

def derivatives(t, y, q, F):
    return [y[1], -np.sin(y[0]) - q*y[1] + F*np.sin((2/3)*t)]

t = np.linspace(0.0, 100, 10000)

# initial conditions
theta0 = np.linspace(0.0, np.pi, 100)
q = 0.0  # alpha / (mass*g), resistive term
F = 0.0  # G*np.sin(2*t/3)

for i in range(0, 100):
    sol = solve_ivp(derivatives, (0.0, 100.0), (theta0[i], 0.0), method='RK45', t_eval=t, args=(q, F))
    velocity = sol.y[1]
    time = sol.t
    zero_cross = 0
    value = []
    amp = []
    period = []
    for k in range(len(velocity) - 1):
        if (velocity[k+1] * velocity[k]) < 0:
            zero_cross += 1
            value.append(k)
        else:
            zero_cross += 0
    zero_cross = zero_cross - zero_cross % 2  # makes the total number of zero-crossings even
    if zero_cross != 0:
        amp.append(theta0[i])
        # period calculated using the time evolved between the first and last zero-crossing detected
        period.append((2*(time[value[zero_cross - 1]] - time[value[0]])) / (zero_cross - 1))
If I print out amp inside the loop, it displays as follows:
[0.03173325912716963]
[0.06346651825433926]
[0.0951997773815089]
[0.12693303650867852]
[0.15866629563584814]
[0.1903995547630178]
[0.2221328138901874]
[0.25386607301735703]
[0.28559933214452665]
[0.3173325912716963]
[0.3490658503988659]
[0.3807991095260356]
[0.4125323686532052]
[0.4442656277803748]
[0.47599888690754444]
[0.5077321460347141]
[0.5394654051618837]
[0.5711986642890533]
[0.6029319234162229]
[0.6346651825433925]
[0.6663984416705622]
[0.6981317007977318]
[0.7298649599249014]
[0.7615982190520711]
[0.7933314781792408]
[0.8250647373064104]
[0.85679799643358]
[0.8885312555607496]
[0.9202645146879193]
[0.9519977738150889]
[0.9837310329422585]
[1.0154642920694281]
[1.0471975511965979]
[1.0789308103237674]
[1.110664069450937]
[1.1423973285781066]
[1.1741305877052763]
[1.2058638468324459]
[1.2375971059596156]
[1.269330365086785]
[1.3010636242139548]
[1.3327968833411243]
[1.364530142468294]
[1.3962634015954636]
[1.4279966607226333]
[1.4597299198498028]
[1.4914631789769726]
[1.5231964381041423]
[1.5549296972313118]
[1.5866629563584815]
[1.618396215485651]
[1.6501294746128208]
[1.6818627337399903]
[1.71359599286716]
[1.7453292519943295]
[1.7770625111214993]
[1.8087957702486688]
[1.8405290293758385]
[1.872262288503008]
[1.9039955476301778]
[1.9357288067573473]
[1.967462065884517]
[1.9991953250116865]
[2.0309285841388562]
[2.0626618432660258]
[2.0943951023931957]
[2.126128361520365]
[2.1578616206475347]
[2.1895948797747042]
[2.221328138901874]
[2.2530613980290437]
[2.284794657156213]
[2.3165279162833827]
[2.3482611754105527]
[2.379994434537722]
[2.4117276936648917]
[2.443460952792061]
[2.475194211919231]
[2.5069274710464007]
[2.53866073017357]
[2.57039398930074]
[2.6021272484279097]
[2.633860507555079]
[2.6655937666822487]
[2.6973270258094186]
[2.729060284936588]
[2.7607935440637577]
[2.792526803190927]
[2.824260062318097]
[2.8559933214452666]
[2.887726580572436]
[2.9194598396996057]
[2.9511930988267756]
[2.982926357953945]
[3.0146596170811146]
[3.141592653589793]
[Finished in 3.822s]
I am not sure what type of output that is or how to handle it; any help would be appreciated!
You are declaring the lists inside the loop, which means they will be reset to empty at every iteration. Consider declaring amp, period, and any array that should be set to empty only once (as initial state) before the loop, like so:
# initialize the arrays; this runs only once, before the loop
amp = []
period = []
for i in range(0, 100):
    ...  # your logic here, appending values to `amp` and `period`
# now `amp` and `period` contain all the desired values
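Putting that together with the code from the question, a minimal sketch of the corrected program could look as follows (the zero-crossing and period logic is copied from the question; the final plt.plot call is just one way to display period against amp):

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

def derivatives(t, y, q, F):
    return [y[1], -np.sin(y[0]) - q*y[1] + F*np.sin((2/3)*t)]

t = np.linspace(0.0, 100, 10000)
theta0 = np.linspace(0.0, np.pi, 100)
q, F = 0.0, 0.0

amp = []     # declared once, before the loop
period = []

for i in range(100):
    sol = solve_ivp(derivatives, (0.0, 100.0), (theta0[i], 0.0),
                    method='RK45', t_eval=t, args=(q, F))
    velocity = sol.y[1]
    time = sol.t

    # indices just before each sign change of the velocity
    value = [k for k in range(len(velocity) - 1) if velocity[k+1] * velocity[k] < 0]
    zero_cross = len(value) - len(value) % 2  # keep an even number of crossings

    if zero_cross != 0:
        amp.append(theta0[i])
        period.append(2*(time[value[zero_cross - 1]] - time[value[0]]) / (zero_cross - 1))

# amp and period are now flat lists of equal length and can be plotted directly
plt.plot(amp, period, '.')
plt.xlabel('amplitude (rad)')
plt.ylabel('period')
plt.show()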
I have a list of tuples called permuted_trucks with 38320568 entries, where each tuple has 7 values, and I am trying to insert into another list the sum of the distances associated with each tuple.
In the code below, cargo_list is a list containing the names of the cargos (a NumPy array of size 7) and distances_df is a (44, 7) pandas DataFrame.
truck_list is a NumPy array of size 49 containing the same values as the tuples. Each tuple represents a combination of the 49 trucks picking 7 different products.
I am running this loop:
for i in range(0, len(permuted_trucks)):
    total_distance = 0.0
    for j in range(0, len(cargo_list)):
        truck_index = np.where(truck_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df.iloc[truck_index][j]
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index.append(i)
    all_distances.append(total_distance)
The problem is that it's extremely slow, and I am looking for a way to optimize it. Can someone help?
Example of the tuples:
[('Hartford', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Macomb'),
 ('Home', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Robert'),
 ('Horse', 'Bey', 'Empire', 'James', 'Ibrahim', 'John', 'Viking')]
The line below accumulates the distances contained in the DataFrame distances_df:
total_distance += distances_df.iloc[truck_index][j]
The output for all_distances would be an array of size 38320568 with the total distances as values, like:
[ 34125, 21252, 13232, 512313, ..... 31231]
One straightforward improvement in speed can be achieved by using a NumPy array instead of a DataFrame. DataFrames are slow.
Edit: Added code to show the differences of different executions.
Importing required modules
import time as time
import pandas as pd
import numpy as np
Since we don't have complete information, this is what I am using to show the difference:
##All arrays
permuted_trucks = np.random.randint(7,size=(100000,7))
cargo_list = np.random.randint(7,size=(1,7))
truck_list = np.random.randint(7,size=(49,1))
##converting arrays to lists to show difference between list and arrays
permuted_trucks_list = permuted_trucks.tolist()
cargo_list_list = cargo_list.tolist()
truck_list_list = truck_list.tolist()
##array
distances_df_array= np.random.randint(7,size=(44,7))
##dataframe
distances_df = pd.DataFrame(distances_df_array)
Now let's look at your original execution with lists and a DataFrame.
# Time taken for lists and dataframe
start_time = time.time()
all_distances = []
all_distances_index = []
best_distance = 10
for i in range(0, len(permuted_trucks_list)):
    total_distance = 0.0
    for j in range(0, len(cargo_list_list)):
        truck_index = np.where(truck_list_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df.iloc[truck_index][j]  # using dataframe
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index.append(i)
    all_distances.append(total_distance)
end_time = time.time()
print("time taken for list and data frame : {}".format(end_time-start_time))
Output:time taken for list and data frame : 20.7517249584198
Let's see what happens when we use lists and an array (avoiding the DataFrame).
# Time taken for lists and array
start_time = time.time()
all_distances = []
all_distances_index = []
best_distance = 10
for i in range(0, len(permuted_trucks_list)):
    total_distance = 0.0
    for j in range(0, len(cargo_list_list)):
        truck_index = np.where(truck_list_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df_array[truck_index, j]  # using array here
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index.append(i)
    all_distances.append(total_distance)
end_time = time.time()
print("time taken for list and array : {}".format(end_time-start_time))
Output: time taken for list and array : 3.075411319732666
You can see a drastic improvement in the execution time
Finally, let's also check the version using only arrays.
# Time taken for numpy arrays without vectorization
start_time = time.time()
all_distances_array = np.zeros((100000, 1))
all_distances_index_array = np.zeros((100000, 1))
best_distance = 10
for i in range(0, len(permuted_trucks)):
    total_distance = 0.0
    for j in range(0, len(cargo_list)):
        truck_index = np.where(truck_list == permuted_trucks[i][j])[0][0]
        total_distance += distances_df_array[truck_index, j]  # using array here
    if (total_distance < 0 or total_distance < best_distance):
        best_distance = total_distance
        best_distance_index = i
    all_distances_index_array[i] = i
    all_distances_array[i] = total_distance
end_time = time.time()
print("time taken for numpy arrays : {}".format(end_time-start_time))
Output: time taken for numpy arrays : 1.1893165111541748
Now you can see the difference and how slow DataFrames are. NumPy can be much faster still if you can implement vectorization, but that can only be checked with the original data.
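To give an idea of what that vectorization could look like, here is a sketch (my addition, not part of the timings above) that uses the synthetic arrays defined earlier and replaces both Python loops with array indexing. It assumes every label appearing in permuted_trucks also appears in truck_list; with real string labels the label-to-row dictionary would be built from the truck names instead of integers.

# Time taken for a vectorized numpy version (sketch)
start_time = time.time()

# map each truck label to the first row where it occurs, mirroring np.where(...)[0][0]
label_to_row = {}
for row, label in enumerate(truck_list[:, 0]):
    label_to_row.setdefault(int(label), row)

# shape (100000, 7): the distance-table row for every entry of every tuple
rows = np.vectorize(label_to_row.__getitem__)(permuted_trucks)

# pick distances_df_array[rows[i, j], j] for all i and j at once, then sum per tuple
all_distances_vec = distances_df_array[rows, np.arange(permuted_trucks.shape[1])].sum(axis=1)
best_distance_index = int(np.argmin(all_distances_vec))
best_distance = all_distances_vec[best_distance_index]

end_time = time.time()
print("time taken for vectorized numpy : {}".format(end_time - start_time))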
I am currently working on a data mining project that creates an 18000x18000 similarity matrix.
Here are the two methods which build the matrix:
def CreateSimilarityMatrix(dbSubsetData, distancePairsList):
    global matrix
    matrix = [[0.0 for y in range(dbSubsetData.shape[0])] for x in range(dbSubsetData.shape[0])]
    for i in range(len(dbSubsetData)):  # record1
        SimilarityArray = []
        start = time.time()
        for j in range(i+1, len(dbSubsetData)):  # record2
            Similarity = GetDistanceBetweenTwoRecords(dbSubsetData, i, j, distancePairsList)
            # The similarities are all very small numbers which might be why the preference value needs to be so precise.
            # Let's multiply the value by a scalar 10 to give the values more range.
            matrix[i][j] = Similarity * 10.0
            matrix[j][i] = Similarity * 10.0
        end = time.time()
    return matrix
def GetDistanceBetweenTwoRecords(dbSubsetData, i, j, distancePairsList):
    Record1 = dbSubsetData.iloc[i]
    Record2 = dbSubsetData.iloc[j]
    columns = dbSubsetData.columns
    distancer = 0.0
    distancec = 0.0
    for i in range(len(Record1)):
        columnName = columns[i]
        Record1Value = Record1[i]
        Record2Value = Record2[i]
        if(Record1Value != Record2Value):
            ob = distancePairsList[distancePairsDict[columnName]-1]
            if(ob.attributeType == "String"):
                strValue = Record1Value+":"+Record2Value
                strValue2 = Record2Value+":"+Record1Value
                if strValue in ob.distancePairs:
                    val = ((ob.distancePairs[strValue])**2)
                    val = val * -1
                    distancec = distancec + val
                elif strValue2 in ob.distancePairs:
                    val = ((ob.distancePairs[strValue2])**2)
                    val = val * -1
                    distancec = distancec + val
            elif(ob.attributeType == "Number"):
                val = ((Record1Value - Record2Value)*ob.getSignificance())**2
                val = val * -1
                distancer = distancer + val
    distance = distancer + distancec
    return distance
Each iteration is looping 18000x19 times (18000 for each row and 19 times for each attribute). The total number of iterations is (18000x18000x19)/2 since it is symmetric and therefore I only have to do one half of the matrix. This will take around 36 hours to complete, which is a timeframe I obviously want to shave down.
I figured multiprocessing was the trick. Since each row independently generates its numbers and fits them into the matrix, I could run CreateSimilarityMatrix in multiple processes. So I created this code to spawn my processes:
matrix = [[0.0 for y in range(SubsetDBNormalizedAttributes.shape[0])] for x in range(SubsetDBNormalizedAttributes.shape[0])]

if __name__ == '__main__':
    procs = []
    for i in range(4):
        proc = Process(target=CreateSimilarityMatrix, args=(SubsetDBNormalizedAttributes, distancePairsList, i, 4))
        procs.append(proc)
        proc.start()
        proc.join()
CreateSimilarityMatrix is now changed to
def CreateSimilarityMatrix(dbSubsetData, distancePairsList, counter=0, iteration=1):
    global Matrix
    for i in range(counter, len(dbSubsetData), iteration):  # record1
        SimilarityArray = []
        start = time.time()
        for j in range(i+1, len(dbSubsetData)):  # record2
            Similarity = GetDistanceBetweenTwoRecords(dbSubsetData, i, j, distancePairsList)
            # print("Similarity Between Records",i,":",j," is ", Similarity)
            # The similarities are all very small numbers which might be why the preference value needs to be so precise.
            # Let's multiply the value by a scalar 10 to give the values more range.
            Matrix[i][j] = Similarity * 10.0
            Matrix[j][i] = Similarity * 10.0
        end = time.time()
        print("Iteration", i, "took", end-start, "(s)")
Currently this goes s-l-o-w. It takes minutes to start one process, and then minutes more to start the next one. I thought these were supposed to run concurrently? Is my use of processes incorrect?
If you are using CPython, there is something called the global interpreter lock (GIL) that makes it difficult to actually get a speedup from multithreading and can instead slow things down substantially.
If you are dealing with matrices, use numpy, which is definitely a lot faster than regular Python.
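To give a concrete picture of the NumPy suggestion: if all of the attributes were numeric, the whole pairwise matrix could be computed without any Python-level pair loop. This is only a sketch under that simplifying assumption (the string-valued attributes from the question would still need their dictionary lookups), using SciPy's pdist:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def numeric_similarity_matrix(X, weights):
    # X: one row per record, one column per numeric attribute (n x k)
    # weights: the per-column significance factors from the question
    # weighted squared Euclidean distance for every pair, negated and scaled by 10,
    # mirroring CreateSimilarityMatrix / GetDistanceBetweenTwoRecords for numeric columns
    d = pdist(X * weights, metric='sqeuclidean')  # condensed, length n*(n-1)/2
    return -10.0 * squareform(d)                  # full symmetric n x n matrix

# toy usage; note an 18000 x 18000 float64 matrix needs roughly 2.6 GB of memory
X = np.random.rand(1000, 19)
w = np.ones(19)
M = numeric_similarity_matrix(X, w)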
I am currently using Python and NumPy to calculate correlations between two lists, data_0 and data_1. Each list contains sorted times, t0 and t1 respectively.
I want to calculate all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Doing so, since the lists are sorted, I am selecting a subarray of data_1 of the form data_1[index_min:index_max].
So in fact I only need to find two indices to get what I want.
What's interesting is that when I go to the next time_0, since data_0 is also sorted, I just need to find the new index_min / index_max such that new_index_min >= index_min and new_index_max >= index_max.
This means I don't need to scan all of data_1 again from scratch.
I have implemented such a solution without NumPy methods (just with while loops), and it gives the same results as before, but it is not as fast; it takes about 15 times longer!
Since it should normally require less computation, I think there should be a way to make it faster using NumPy methods, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am being perfectly clear, so if you have any questions, do not hesitate to ask.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np
def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
    split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1

def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
    return index_min, index_max

ref = np.sort(np.random.randint(0, 20000, (10000,)))
data = np.sort(np.random.randint(0, 20000, (10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))

print('checking')
for d, mn, mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d+span
    assert mx == 0 or ref[mx-1] <= d+span
print('ok')
It works by:
- indirectly sorting both sets together,
- finding, for each time in one set, the preceding time in the other set (this is done using np.maximum.accumulate),
- applying the preceding steps twice; the second time, the times in one set are shifted by span.
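For comparison, since ref and data in the test above are already sorted, the inclusive index_min / index_max arrays can also be obtained directly with np.searchsorted. This alternative is my addition, not part of the answer:

import numpy as np

def find_intervals_searchsorted(ref, data, span):
    # first index with ref[idx] >= d, and first index with ref[idx] > d + span
    index_min = np.searchsorted(ref, data, side='left')
    index_max = np.searchsorted(ref, data + span, side='right')
    return index_min, index_max

ref = np.sort(np.random.randint(0, 20000, 10000))
data = np.sort(np.random.randint(0, 20000, 10000))
idmn, idmx = find_intervals_searchsorted(ref, data, span=2)
# these arrays satisfy the same assertions as in the check above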