I have a very large ndarray A, and a sorted list of points k (a small list, about 30 points).
For every element of A, I want to determine the closest element in the list of points k, together with the index. So something like:
>>> A = np.asarray([3, 4, 5, 6])
>>> k = np.asarray([4.1, 3])
>>> values, indices
[3, 4.1, 4.1, 4.1], [1, 0, 0, 0]
Now, the problem is that A is very large, so I can't do something memory-inefficient like adding one dimension to A, taking the absolute difference with k, and then taking the minimum along the new axis.
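For reference, this is the broadcasting one-liner being ruled out; it materializes a len(A) x len(k) difference matrix, which is exactly what doesn't fit in memory here:
# brute force via broadcasting: O(len(A) * len(k)) memory
diffs = np.abs(A[:, None] - k)   # shape (len(A), len(k))
indices = diffs.argmin(axis=1)
values = k[indices]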
For now I have been using np.searchsorted, as shown in the second answer to Find nearest value in numpy array, but even this is too slow. This is the code I used (modified to work with multiple values):
def find_nearest(A, k):
    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest == k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1
    flagToReduce = np.logical_or(flagToReduce,
                                 np.abs(A - k[indicesClosest - 1]) <
                                 np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1
    valuesClosest = k[indicesClosest]
    return valuesClosest, indicesClosest
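Note that np.searchsorted assumes k is in ascending order, so it has to be sorted first; the returned indices then refer to the sorted copy of k:
ks = np.sort(k)                    # searchsorted needs ascending k
values, idx = find_nearest(A, ks)  # idx indexes into ks, not the original k
print(values)                      # [ 3.   4.1  4.1  4.1]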
I then thought of using scipy.spatial.KDTree (note that KDTree expects 2-D data, hence the reshapes):
>>> d = scipy.spatial.KDTree(k[:, None])
>>> d.query(A[:, None])
This turns out to be much slower than the searchsorted solution.
On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like an "inverse KDTree") on A, and then query the results on the small array k.
Is there something like that?
Edit
At the moment I am using a variant of np.searchsorted that requires the array A to be sorted. We can do this in advance as a pre-processing step, but we still have to restore the original order after computing the indices. This variant is about twice as fast as the one above.
A = np.random.random(3000000)
k = np.random.random(30)
indices_sort = np.argsort(A)
sortedA = A[indices_sort]
inv_indices_sort = np.argsort(indices_sort)
k.sort()
def find_nearest(sortedA, k):
    midpoints = k[:-1] + np.diff(k)/2
    idx_aux = np.searchsorted(sortedA, midpoints)
    idx = []
    count = 0
    final_indices = np.zeros(sortedA.shape, dtype=int)
    old_obj = None
    for obj in idx_aux:
        if obj != old_obj:
            idx.append((obj, count))
            old_obj = obj
        count += 1
    old_idx = 0
    for idx_A, idx_k in idx:
        final_indices[old_idx:idx_A] = idx_k
        old_idx = idx_A
    final_indices[old_idx:] = len(k) - 1
    indicesClosest = final_indices[inv_indices_sort]  #<- this takes 90% of the time
    return k[indicesClosest], indicesClosest
The line that takes so much time is the line that brings the indices back to their original order.
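As an aside, the second argsort used to build inv_indices_sort can be replaced by a direct scatter, which runs in linear time instead of O(n log n); a small sketch with the names from above:
# linear-time inverse permutation, equivalent to np.argsort(indices_sort)
inv_indices_sort = np.empty_like(indices_sort)
inv_indices_sort[indices_sort] = np.arange(indices_sort.size)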
Update:
The builtin function numpy.digitize can actually do exactly what you need. Only a small trick is required: digitize assigns values to bins. We can convert k to bins by sorting the array and setting the bin borders exactly in the middle between adjacent elements.
import numpy as np
A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3, 1]) # added another value to show that sorting/binning works
ki = np.argsort(k)
ks = k[ki]
i = np.digitize(A, (ks[:-1] + ks[1:]) / 2)
indices = ki[i]
values = ks[i]
print(values, indices)
# [ 3. 4.1 4.1 4.1] [1 0 0 0]
Old answer:
I would take a brute-force approach to perform one vectorized pass over A for each element in k and update those locations where the current element improves the approximation.
import numpy as np
A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3])
err = np.zeros_like(A) + np.inf # keep track of error over passes
values = np.empty_like(A, dtype=k.dtype)
indices = np.empty_like(A, dtype=int)
for i, v in enumerate(k):
    d = np.abs(A - v)
    mask = d < err  # only update where v is closer to A
    values[mask] = v
    indices[mask] = i
    err[mask] = d[mask]
print(values, indices)
# [ 3. 4.1 4.1 4.1] [1 0 0 0]
This approach requires three temporary variables of same size as A, so it will fail if not enough memory is available.
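If even those temporaries do not fit, the same pass can be run over A in fixed-size chunks so that peak memory stays bounded; a sketch of that variant (chunk is a tuning knob, not part of the original code):
chunk = 100000                            # tune to the available memory
indices = np.empty(A.shape, dtype=int)
for start in range(0, A.size, chunk):
    sl = slice(start, start + chunk)
    d = np.abs(A[sl, None] - k)           # only a (chunk, len(k)) temporary
    indices[sl] = d.argmin(axis=1)
values = k[indices]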
So, after some work and an idea from the scipy mailing list, I think that in my case (with a constant A and slowly varying k), the best way to do this is to use the following implementation.
class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):
        '''
        use_k_optimization requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made
        with similar k. When this happens, we can cut the running time significantly by storing
        additional variables. If it won't be called with successive k, set the flag to False,
        as it would otherwise just consume more memory for no good reason.
        '''
        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization
        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0
        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices
The idea is to sort the array A beforehand, then use searchsorted on the sorted A with the midpoints of k. This gives the same information as before, in that it tells us exactly which points of A are closer to which points of k. The method _create_indices_matrix creates the full indices array from this information, and then we unsort it to recover the original order of A. To take advantage of slowly varying k, we save the last indices and determine which indices have to change; we then change only those. For slowly varying k, this gives superior performance (at a considerably higher memory cost, however).
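To make the midpoint idea concrete, a tiny worked example:
sortedA = np.array([1., 2., 3., 5., 8.])
k = np.array([2., 6.])                  # already sorted
midpoints = k[:-1] + np.diff(k) / 2     # [4.0]
print(np.searchsorted(sortedA, midpoints))
# [3]: sortedA[:3] are closest to k[0], sortedA[3:] to k[1]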
For random matrix A of 5 million elements and k of about 30 elements, and repeating the experiments 60 times, we get
Function search_sorted1; 15.72285795211792s
Function search_sorted2; 13.030786037445068s
Function query; 2.3306031227111816s <- the one with use_k_optimization = True
Function query; 4.81286096572876s <- with use_k_optimization = False
scipy.spatial.KDTree.query is too slow, so I don't time it (it takes more than a minute). This is the code used for the timing; it also contains the implementations of search_sorted1 and search_sorted2, while the SearchSorted class is the one defined above.
import numpy as np
import scipy
import scipy.spatial
import time
A = np.random.rand(10000*500) #5 million elements
k = np.random.rand(32)
k.sort()
#first attempt, detailed in the answer, too
def search_sorted1(A, k):
    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest == k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1
    flagToReduce = np.logical_or(flagToReduce,
                                 np.abs(A - k[indicesClosest - 1]) <
                                 np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1
    return indicesClosest
#taken from @Divakar's answer linked in the comments under the question
def search_sorted2(A, k):
    indicesClosest = np.searchsorted(k, A, side="left").clip(max=k.size - 1)
    mask = (indicesClosest > 0) & \
           ((indicesClosest == len(k)) | (np.fabs(A - k[indicesClosest - 1]) < np.fabs(A - k[indicesClosest])))
    indicesClosest = indicesClosest - mask
    return indicesClosest
def kdquery1(A, k):
    #cKDTree expects 2-D data, hence the reshapes
    d = scipy.spatial.cKDTree(k[:, None], compact_nodes=False, balanced_tree=False)
    _, indices = d.query(A[:, None])
    return indices
#After an idea from the scipy mailing list
#SearchSorted class exactly as defined above
mySearchSorted = SearchSorted(A, use_k_optimization=True)
mySearchSorted2 = SearchSorted(A, use_k_optimization=False)
allFunctions = [search_sorted1, search_sorted2,
                mySearchSorted.query,
                mySearchSorted2.query]

#sanity check: all implementations must return the same indices
print(np.array_equal(mySearchSorted.query(k), kdquery1(A, k)))
print(np.array_equal(mySearchSorted.query(k), search_sorted2(A, k)))
print(np.array_equal(mySearchSorted2.query(k), search_sorted2(A, k)))
if __name__ == '__main__':
    num_to_average = 3
    for func in allFunctions:
        sA = A.copy()
        if func.__name__ != 'query':
            func_to_use = lambda x: func(sA, x)
        else:
            func_to_use = func
        k_to_use = k.copy()  #copy so the benchmark does not mutate the shared k
        start_time = time.time()
        for idx_average in range(num_to_average):
            for idx_repeat in range(10):
                k_to_use += (2*np.random.rand(*k.shape) - 1)/100  #uniform in (-1/100, 1/100)
                k_to_use.sort()
                indices = func_to_use(k_to_use)
                val = k_to_use[indices]
        end_time = time.time()
        total_time = end_time - start_time
        print('Function {}; {}s'.format(func.__name__, total_time))
I'm sure it is still possible to do better (I use a lot of space for the SearchSorted class, so we could probably save something). If you have any ideas for an improvement, please let me know!
This is a continuation from here.
I have a class B which holds some data (a and b fields).
I load these data into the Criteria class in order to pass them through the criteria and change them accordingly by calling the functions avg, func and calcs.
Then I call the function P to start running the program with the data.
The code:
import numpy as np

class B():
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return 'B(%s, %s)' % (self.a, self.b)

class Criteria():
    def __init__(self, method, minimum, maximum, measures=None):
        self.method = method
        self.minimum = minimum
        self.maximum = maximum
        self.measures = measures
    def __repr__(self):
        if self.measures is None:
            measures = 'measures: None'
        else:
            measures = [' measures:[']
            for m in self.measures:
                measures.append('   {}'.format(m))
            measures.append(' ]')
            measures = '\n' + '\n'.join(measures)
        return 'C({0.method},{0.minimum},{0.maximum}, {1})'.format(self, measures)
    def calcs(self):
        """Update the `a` attribute of the B class objects according to
        the conditions."""
        if self.measures is not None:
            for x in self.measures:
                if (x.a > self.minimum and x.a < self.maximum):
                    x.a = 999
            return self.measures
    def avg(self, calcs=None):
        """Return the average of values."""
        if calcs is None:
            calcs = self.measures
        if calcs is None:
            return 'none'
        elif len(calcs) == 0:
            return '[]'
        else:
            return np.average([x.a for x in calcs])
    def func(self, calcs=None):
        """Return the minimum if the input is an array, or multiply
        by 1000 if the input is a number."""
        if calcs is None:
            calcs = self.measures
        if calcs is None:
            return 'none'
        #elif len(calcs) == 0:
        #    return '[]'
        else:
            if isinstance(calcs, np.ndarray):
                return np.amin([x.a for x in calcs])
            else:
                return calcs * 1000
def P(alist):
    # use these variables to hold the result of the corresponding execution
    # use them in the list loop in order to be able to obtain a result from a `Criteria` evaluation
    # and use it as input to the next
    last_calcs_values = None
    last_calcs_avg = None
    last_calcs_m = None
    for c in alist:
        if c.method == 'V':
            last_calcs_values = c.calcs()
            #print('calcs', last_calcs_values)
            yield last_calcs_values
        if c.method == 'AVG':
            if c.measures is None:
                last_calcs_avg = c.avg(last_calcs_values)
            else:
                last_calcs_avg = c.avg()
            #print('AVG', last_calcs_avg)
            yield last_calcs_avg
        if c.method == 'M':
            if c.measures is None:
                last_calcs_m = c.func(last_calcs_avg)
            else:
                last_calcs_m = c.func()
            #print('M', last_calcs_m)
            yield last_calcs_m
If I use the data:
b1 = np.array([B(10, 0.1), B(200, .5)])
c1 = Criteria('V', 1, 100, b1)
c2 = Criteria('AVG', 22, 220, None)
c3 = Criteria('M', 22, 220, None)
alist = [c1, c2, c3]
for i in P(alist):
    print(i)
I receive:
[B(999, 0.1) B(200, 0.5)]
599.5
599500.0
which is correct. But it works only because the code in the P function is hardcoded.
So,
1) My data c1, c2, c3 use the methods V, AVG, M in that order.
So, in the P function, I used:
yield last_calcs_values
last_calcs_avg = c.avg(last_calcs_values)
yield last_calcs_avg
last_calcs_m = c.func(last_calcs_avg)
yield last_calcs_m
in the same order (hardcoded).
My question is: how can I use this code for any order? I must somehow check what the previous method's value is and pass that as the argument (instead of hardcoding c.func(last_calcs_avg)).
2) Inside the func I have commented out the lines:
#elif len(calcs) == 0:
# return '[]'
because if I run the code, it gives: object of type 'numpy.float64' has no len().
I tried a check similar to the one I have later in func:
if isinstance(calcs, np.ndarray):
but with no success.
3) Is there a way to obtain only the last result?
So, instead of :
[B(999, 0.1) B(200, 0.5)]
599.5
599500.0
obtain :
599500.0
If I read P correctly, it maintains 3 'state' variables. The 3 if clauses can be called in any order (and I think could have been written with an if/elif/elif/else syntax, since c.method only matches one of them for each c).
But the order of the objects in the list will determine the values. An AVG object will use whatever values were set by the last V object. Similarly, M will use the last AVG. If you pass an AVG first, it will use the initial None value in its calculation.
So the sequence V1, AVG, M will use B values that were set by V1.
In the sequence V1, AVG, V2, M, M uses values from the last AVG, which depended on V1.
avg should not be returning strings if its values are being used by func, or at least the tests must match. [] is an empty list with len() zero, but '[]' is a 2-character string, with len 2.
Similarly, None is a unique value that you test with is None, while 'none' is a 4-character string. I used strings like that in earlier questions simply because we were printing the results of avg; at the time we weren't using them for further calculations.
If you want to make sure that AVG and M use values from the last V, you need to add some logic:
lastV, lastA, lastM = None, None, None
if c.method == 'V':
    lastV = <newV>
    lastA, lastM = None, None
elif c.method == 'AVG':
    if lastV is None:
        error
    else:
        lastA = <new A based on lastV>
elif c.method == 'M':
    if lastA is None:
        error
        <or update lastA>
    else:
        lastM = <new M based on lastA>
else:
    error unknown c.method
So I am using None to indicate that the values are not valid, in which case the code should either raise an error or calculate new values. Done right, this ensures that both AVG and M produce values based on the latest V.
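A minimal concrete version of that skeleton, reusing the Criteria methods from the question (the ValueError messages are just one possible choice of error handling):
def P(alist):
    lastV, lastA, lastM = None, None, None
    for c in alist:
        if c.method == 'V':
            lastV = c.calcs()
            lastA, lastM = None, None     # invalidate stale downstream results
        elif c.method == 'AVG':
            if c.measures is None and lastV is None:
                raise ValueError('AVG before any V')
            lastA = c.avg(lastV) if c.measures is None else c.avg()
        elif c.method == 'M':
            if c.measures is None and lastA is None:
                raise ValueError('M before any AVG')
            lastM = c.func(lastA) if c.measures is None else c.func()
        else:
            raise ValueError('unknown method %r' % c.method)
    return lastV, lastA, lastM
Because it returns the final state instead of yielding every step, the last element of the returned tuple also answers question 3 directly.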
From your pastebin:
def avg(self, calcs=None):
    """Return the average of values"""
    if calcs is None:          # fun called without argument
        calcs = self.measures  # get value stored in self
    if calcs is None:          # in case that too was None
        return '[]'            # I would return None or []
                               # '[]' is a useless string
    else:
        if hasattr(calcs, '__len__'):
            return np.average([x.a for x in calcs])
        else:
            return np.average(calcs)
What works in np.average() that doesn't have a __len__? len(np.arange(10)) runs, but its elements don't have the a attribute:
In [603]: avg(None,calcs=np.arange(10))
....
<ipython-input-602-48c9f6b255e1> in <listcomp>(.0)
8 else:
9 if hasattr(calcs,'__len__'):
---> 10 return np.average([x.a for x in calcs])
11 else:
12 return np.average(calcs)
AttributeError: 'numpy.int32' object has no attribute 'a'
That __len__ test doesn't distinguish between an array or list of B objects and other lists or arrays. Maybe refine it to test the dtype? Or use a try/except?
def avg(self, calcs=None):
    """Return the average of values"""
    ....
    else:
        try:
            return np.average([x.a for x in calcs])
        except AttributeError:
            return np.average(calcs)
In [606]: avg(None,calcs=np.arange(10))
Out[606]: 4.5
And list or array of B objects works:
In [609]: alist = [B(1,2),B(2,4),B(3,3)]
In [610]: avg(None, alist)
Out[610]: 2.0
In [611]: avg(None, np.array(alist))
Out[611]: 2.0
I don't understand what you mean by not hardcoding P.
Is there a way to obtain only the last result?
Yes. Obtain all the results, but only print the last one:
for i in P(alist): pass
print(i)
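Equivalently, with Python 3 extended unpacking you can exhaust the generator and keep only the final yield:
*_, last = P(alist)
print(last)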
I have put together some code from this link, which is nicely colour coded, with 4 minor changes to fix some errors. I also used some code from 2 previous forums.
What the code is supposed to do is calculate the semantic similarity between consecutive sentences across a whole text, then display all the similarity values obtained, like this:
'the yellow door.', 'The red hammer' 0.65
'pink fox in the woods.', 'commander fox is blue.' 0.32
Here is the code:
import math
import sys
import nltk
import numpy as np
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

ALPHA = 0.2
BETA = 0.45
ETA = 0.4
PHI = 0.2
DELTA = 0.85

brown_freqs = dict()
N = 0
######################### word similarity ##########################

def get_best_synset_pair(word_1, word_2):
    """
    Choose the pair with highest path similarity among all pairs.
    Mimics pattern-seeking behavior of humans.
    """
    synsets_1 = wn.synsets(word_1)
    synsets_2 = wn.synsets(word_2)
    if len(synsets_1) == 0 or len(synsets_2) == 0:
        return None, None
    else:
        max_sim = -1.0
        best_pair = None, None
        for synset_1 in synsets_1:
            for synset_2 in synsets_2:
                sim = wn.path_similarity(synset_1, synset_2)
                # path_similarity can return None; guard before comparing
                if sim is not None and sim > max_sim:
                    max_sim = sim
                    best_pair = synset_1, synset_2
        return best_pair
def length_dist(synset_1, synset_2):
    l_dist = sys.maxsize
    if synset_1 is None or synset_2 is None:
        return 0.0
    if synset_1 == synset_2:
        # if synset_1 and synset_2 are the same synset return 0
        l_dist = 0.0
    else:
        wset_1 = set([str(x.name()) for x in synset_1.lemmas()])
        wset_2 = set([str(x.name()) for x in synset_2.lemmas()])
        if len(wset_1.intersection(wset_2)) > 0:
            # if synset_1 != synset_2 but there is word overlap, return 1.0
            l_dist = 1.0
        else:
            # just compute the shortest path between the two
            l_dist = synset_1.shortest_path_distance(synset_2)
            if l_dist is None:
                l_dist = 0.0
    # normalize path length to the range [0,1]
    return math.exp(-ALPHA * l_dist)
def hierarchy_dist(synset_1, synset_2):
    h_dist = sys.maxsize
    if synset_1 is None or synset_2 is None:
        return h_dist
    if synset_1 == synset_2:
        # return the depth of one of synset_1 or synset_2
        h_dist = max([x[1] for x in synset_1.hypernym_distances()])
    else:
        # find the max depth of the least common subsumer
        hypernyms_1 = {x[0]: x[1] for x in synset_1.hypernym_distances()}
        hypernyms_2 = {x[0]: x[1] for x in synset_2.hypernym_distances()}
        lcs_candidates = set(hypernyms_1.keys()).intersection(
            set(hypernyms_2.keys()))
        if len(lcs_candidates) > 0:
            lcs_dists = []
            for lcs_candidate in lcs_candidates:
                lcs_d1 = 0
                if lcs_candidate in hypernyms_1:
                    lcs_d1 = hypernyms_1[lcs_candidate]
                lcs_d2 = 0
                if lcs_candidate in hypernyms_2:
                    lcs_d2 = hypernyms_2[lcs_candidate]
                lcs_dists.append(max([lcs_d1, lcs_d2]))
            h_dist = max(lcs_dists)
        else:
            h_dist = 0
    return ((math.exp(BETA * h_dist) - math.exp(-BETA * h_dist)) /
            (math.exp(BETA * h_dist) + math.exp(-BETA * h_dist)))
def word_similarity(word_1, word_2):
    synset_pair = get_best_synset_pair(word_1, word_2)
    return (length_dist(synset_pair[0], synset_pair[1]) *
            hierarchy_dist(synset_pair[0], synset_pair[1]))
######################### sentence similarity ##########################

def most_similar_word(word, word_set):
    max_sim = -1.0
    sim_word = ""
    for ref_word in word_set:
        sim = word_similarity(word, ref_word)
        if sim > max_sim:
            max_sim = sim
            sim_word = ref_word
    return sim_word, max_sim
def info_content(lookup_word):
    global N
    if N == 0:
        # poor man's lazy evaluation
        for sent in brown.sents():
            for word in sent:
                word = word.lower()
                if word not in brown_freqs:
                    brown_freqs[word] = 0
                brown_freqs[word] = brown_freqs[word] + 1
                N = N + 1
    lookup_word = lookup_word.lower()
    n = 0 if lookup_word not in brown_freqs else brown_freqs[lookup_word]
    return 1.0 - (math.log(n + 1) / math.log(N + 1))
def semantic_vector(words, joint_words, info_content_norm):
    sent_set = set(words)
    semvec = np.zeros(len(joint_words))
    i = 0
    for joint_word in joint_words:
        if joint_word in sent_set:
            # if word in union exists in the sentence, s(i) = 1 (unnormalized)
            semvec[i] = 1.0
            if info_content_norm:
                semvec[i] = semvec[i] * math.pow(info_content(joint_word), 2)
        else:
            # find the most similar word in the joint set and set the sim value
            sim_word, max_sim = most_similar_word(joint_word, sent_set)
            semvec[i] = PHI if max_sim > PHI else 0.0
            if info_content_norm:
                semvec[i] = semvec[i] * info_content(joint_word) * info_content(sim_word)
        i = i + 1
    return semvec
def semantic_similarity(sentence_1, sentence_2, info_content_norm):
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = set(words_1).union(set(words_2))
    vec_1 = semantic_vector(words_1, joint_words, info_content_norm)
    vec_2 = semantic_vector(words_2, joint_words, info_content_norm)
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
######################### word order similarity ##########################

def word_order_vector(words, joint_words, windex):
    wovec = np.zeros(len(joint_words))
    i = 0
    wordset = set(words)
    for joint_word in joint_words:
        if joint_word in wordset:
            # word in joint_words found in sentence, just populate the index
            wovec[i] = windex[joint_word]
        else:
            # word not in joint_words, find most similar word and populate
            # word_vector with the thresholded similarity
            sim_word, max_sim = most_similar_word(joint_word, wordset)
            if max_sim > ETA:
                wovec[i] = windex[sim_word]
            else:
                wovec[i] = 0
        i = i + 1
    return wovec
def word_order_similarity(sentence_1, sentence_2):
    """
    Computes the word-order similarity between two sentences as the normalized
    difference of word order between the two sentences.
    """
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = list(set(words_1).union(set(words_2)))
    windex = {x[1]: x[0] for x in enumerate(joint_words)}
    r1 = word_order_vector(words_1, joint_words, windex)
    r2 = word_order_vector(words_2, joint_words, windex)
    return 1.0 - (np.linalg.norm(r1 - r2) / np.linalg.norm(r1 + r2))
######################### overall similarity ##########################

def similarity(sentence_1, sentence_2, info_content_norm):
    """
    Calculate the semantic similarity between two sentences. The last
    parameter is True or False depending on whether information content
    normalization is desired or not.
    """
    return DELTA * semantic_similarity(sentence_1, sentence_2, info_content_norm) + \
        (1.0 - DELTA) * word_order_similarity(sentence_1, sentence_2)
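Before looping over a whole file, it may help to sanity-check the pipeline on a single pair; this assumes the NLTK data used above has been fetched, e.g. via nltk.download('wordnet'), nltk.download('brown') and nltk.download('punkt'):
print(similarity("the yellow door.", "The red hammer", True))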
This is the looping part:
with open("C:\\Users\\Lenovo2\\Desktop\\Test123.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []
    # Loop until we hit the end of the file
    while True:
        # Read two lines
        x = sentence_file.readline()
        y = sentence_file.readline()
        # Check if we've reached the end of the file, if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        else:
            # The .rstrip('\n') removes the newline character from each line
            x = x.rstrip('\n')
            y = y.rstrip('\n')
            # Calculate your similarity value
            similarity_value = similarity(x, y, True)
            # Add the two lines and similarity value to the results list
            results.append([x, y, similarity_value])
    # Loop through the pairs in the results list and print them
    for pair in results:
        print(pair)
When I run the code on a text file, I get a warning, and instead of obtaining a number for the similarity between sentences I get nan:
Warning (from warnings module):
File "C:\Users\Lenovo2\Desktop\Semantic Analysis (1).py", line 191
return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
RuntimeWarning: invalid value encountered in double_scalars
In a previous forum I learned that this warning probably means I am dividing by zero, so we have a zero vector. I am pretty much stuck there, and with limited Python experience I don't know how to fix the program easily without changing too much.
My guess is that you are passing an empty string. Do you have any blank lines in your text? You don't strip the newline until after checking for empty string, so a string containing only a newline will not be caught.
Since you appear to be on Windows, there also might be '\r\n' style newlines, so your rstrip might not work as expected.
I'd recommend adding the following modification (also do a print for debugging):
# Loop until we hit the end of the file
while True:
    # Read two lines, removing trailing whitespace
    x = sentence_file.readline().rstrip()
    y = sentence_file.readline().rstrip()
    # Check if we've reached the end of the file, if so, we're done
    if not x or not y:
        # Break out of the infinite loop
        break
    else:
        print(x, y)
        # Calculate your similarity value
        similarity_value = similarity(x, y, True)
        # Add the two lines and similarity value to the results list
        results.append([x, y, similarity_value])
Note that the code appears to have a bug, because you are not comparing sentences pairwise. That is, if you had sentences [a, b, c, d], you are only comparing (a, b) and (c, d), but you really want to compare (a, b), (b, c), (c, d).
You can clean this up a bit by using the itertools library (see the note after the code for older Python versions):
from itertools import pairwise

lines = open("C:\\Users\\Lenovo2\\Desktop\\Test123.txt", "r")
for a, b in pairwise(lines):
    x = a.rstrip()
    y = b.rstrip()
    # ... rest unchanged
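Note that itertools.pairwise only exists on Python 3.10+. On older versions, the classic recipe from the itertools documentation gives the same behavior:
from itertools import tee

def pairwise(iterable):
    # pairwise('ABCD') --> ('A','B'), ('B','C'), ('C','D')
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)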
Edit: I removed my explanation part since it was wrong, but I still could not convert it.
I was studying lists and dictionaries in Python and I came across this code:
x = min(minValue, key=lambda b: min([a(myFunction(5, b), c) for c in something]))
What is the logical equivalent of this? It seems simple, but I do not get the same thing when I try to write it with different code. How can I write this differently, without the whole key and lambda thing?
Seems like my explanation was wrong. Here is the updated code I tried:
for b in minValue:
    for c in something:
        minimum = min(myFunction(5, b), c)
result = min(minimum)
return result
Note: by logical equivalent I do not mean that the provided code should calculate it exactly like the code I gave, but it should produce the same output.
I'm not sure, but it can be something like this:
def example(minValue):
    data = []
    for b in minValue:
        values = []
        for c in something:
            values.append(a(myFunction(5, b), c))
        result = min(values)
        # keep `result` and `b` which gives this `result`
        data.append([result, b])
    # find minimal `result` and `b` which gives this `result`
    x = min(data)  # x = [result, b]
    # return `b`
    return x[1]

#---

x = example(minValue)
EDIT: there can be a problem with min(data), because min will compare result and b, while the original version compares only result. It may need a version without min() but with:
if result < min_result:
    min_result = result
    min_b = b
EDIT:
def key(val):
    min_c = something[0]
    min_result = a(myFunction(5, val), min_c)
    for c in something[1:]:
        result = a(myFunction(5, val), c)
        if result < min_result:
            min_result = result
            #min_c = c
    return min_result

def example(minValue):
    min_b = minValue[0]
    min_result = key(min_b)
    for b in minValue[1:]:
        result = key(b)
        if result < min_result:
            min_result = result
            min_b = b
    return min_b

#---

x = example(minValue)
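A quick check that the rewrite behaves like the original one-liner, with toy stand-ins for a, myFunction and something (all hypothetical, just for the test):
a = lambda u, v: u + v           # hypothetical stand-in
myFunction = lambda n, b: n * b  # hypothetical stand-in
something = [1, 2, 3]
minValue = [4, -2, 7]

x1 = min(minValue, key=lambda b: min([a(myFunction(5, b), c) for c in something]))
x2 = example(minValue)
print(x1, x2)   # -2 -2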
I have a class that takes in lists of 1's and 0's and performs GF(2) finite field arithmetic operations. It used to work until I tried to make it take the input in polynomial format. As for how the finite field arithmetic will be done after fixing the regex issue, I was thinking about overloading the operators.
The actual code in parsePolyToListInput(input) works when outside the class. The problem seems to be in the regex, which errors that it will only take in a string (this makes sense), but does not seem to initialize with self.expr as a parameter (that's a problem). The #staticmethod just before the initialization was an attempt to salvage the unbound error as the polynomial was passed in, but this is apparently completely wrong. Just to save you time if you decide to look at any of the arithmetic operations: modular inverse does not work (it seems to be due to the formatting issue after every iteration of the while loop for division in the function, and to what the return type is):
import re

class gf2poly:
    #binary arithmetic on polynomials
    ##staticmethod
    def __init__(self, expr):
        self.expr = expr
        #self.expr = [int(i) for i in expr]
        self.expr = gf2poly.parsePolyToListInput(self.expr)
    def convert(self): #to clarify the input if necessary
        convertToString = str(self.expr)
        print "expression is %s"%(convertToString)
    def id(self): #returns modulus 2 (1,0,0,1,1,....) for input lists
        return [int(self.expr[i])%2 for i in range(len(self.expr))]
    def listToInt(self): #converts list to integer for later use
        result = gf2poly.id(self)
        return int(''.join(map(str,result)))
    def prepBinary(a,b): #converts to base 2 and orders min and max for use
        a = gf2poly.listToInt(a); b = gf2poly.listToInt(b)
        bina = int(str(a),2); binb = int(str(b),2)
        a = min(bina,binb); b = max(bina,binb);
        return a,b
    #staticmethod
    def outFormat(raw):
        raw = str(raw[::-1]); g = [] #reverse binary string for enumeration
        [g.append(i) for i,c in enumerate(raw) if c == '1']
        processed = "x**"+' + x**'.join(map(str, g[::-1]))
        if len(g) == 0: return 0 #return 0 if list empty
        return processed #returns result in gf(2) polynomial form
    def parsePolyToListInput(poly):
        c = [int(i.group(0)) for i in re.finditer(r'\d+', poly)] #re.finditer returns an iterator
        #m = max(c)
        return [1 if x in c else 0 for x in xrange(max(c), -1, -1)]
        #return d
    def add(self,other): #accepts 2 lists as parameters
        a = gf2poly.listToInt(self); b = gf2poly.listToInt(other)
        bina = int(str(a),2); binb = int(str(b),2)
        m = bina^binb; z = "{0:b}".format(m)
        return z #returns binary string
    def subtract(self,other): #basically same as add() but built differently
        result = [self.expr[i] ^ other.expr[i] for i in range(len(max(self.expr,other.expr)))]
        return int(''.join(map(str,result)))
    def multiply(a,b): #a,b are lists like (1,0,1,0,0,1,....)
        a,b = gf2poly.prepBinary(a,b)
        g = []; bitsa = "{0:b}".format(a)
        [g.append((b<<i)*int(bit)) for i,bit in enumerate(bitsa)]
        m = reduce(lambda x,y: x^y,g); z = "{0:b}".format(m)
        return z #returns product of 2 polynomials in gf2
    def divide(a,b): #a,b are lists like (1,0,1,0,0,1,....)
        a,b = gf2poly.prepBinary(a,b)
        bitsa = "{0:b}".format(a); bitsb = "{0:b}".format(b)
        difflen = len(str(bitsb)) - len(str(bitsa))
        c = a<<difflen; q = 0
        while difflen >= 0 and b != 0: #a is divisor, b is dividend, b/a
            q += 1<<difflen; b = b^c # b/a because of sorting in prep
            lendif = abs(len(str(bin(b))) - len(str(bin(c))))
            c = c>>lendif; difflen -= lendif
        r = "{0:b}".format(b); q = "{0:b}".format(q)
        return r,q #returns r remainder and q quotient in gf2 division
    def remainder(a,b): #separate function for clarity when calling
        r = gf2poly.divide(a,b)[0]; r = int(str(r),2)
        return "{0:b}".format(r)
    def quotient(a,b): #separate function for clarity when calling
        q = gf2poly.divide(a,b)[1]; q = int(str(q),2)
        return "{0:b}".format(q)
    def extendedEuclideanGF2(a,b): # extended euclidean. a,b are GF(2) polynomials in list form
        inita,initb = a,b; x,prevx = 0,1; y,prevy = 1,0
        while sum(b) != 0:
            q = gf2poly.quotient(a,b);
            q = list(q); q = [int(x) for x in q]
            #q = tuple([int(i) for i in q])
            q = gf2poly(q)
            a,b = b,gf2poly.remainder(a,b);
            #a = map(list, a);
            #b = [list(x) for x in a];
            #a = [int(x) for x in a]; b = [int(x) for x in b];
            b = list(b); b = [int(x) for x in b]
            #b = tuple([int(i) for i in b])
            b = gf2poly(b)
            #x,prevx = (prevx-q*x, x);
            #y,prevy = (prevy-q*y, y)
            print "types ",type(q),type(a),type(b)
            #q=a//b; a,b = b,a%b; x,prevx = (prevx-q*x, x); y,prevy=(prevy-q*y, y)
        #print("%d * %d + %d * %d = %d" % (inita,prevx,initb,prevy,a))
        return a,prevx,prevy # returns gcd of (a,b), and factors s and t
    def modular_inverse(a,mod): # where a,mod are GF(2) polynomials in list form
        gcd,s,t = gf2poly.extendedEuclideanGF2(a,mod); mi = gf2poly.remainder(s,mod)
        #gcd,s,t = ext_euc_alg_i(a,mod); mi = s%mod
        if gcd != 1: return False
        #print ("%d * %d mod %d = 1"%(a,mi,mod))
        return mi # returns modular inverse of a,mod
I usually test it with this input:
a = x**14 + x**1 + x**0
p1 = gf2poly(a)
b = x**6 + x**2 + x**1
p2 = gf2poly(b)
The first thing you might notice about my code is that it's not very good. There are 2 reasons for that:
1) I wrote it so that the 1st version could do work in the finite field GF(2), and output in polynomial format. Then the next versions were supposed to be able to take polynomial inputs, and also perform the crucial 'modular inverse' function which is not working as planned (this means it's actually not working at all).
2) I'm teaching myself Python (I'm actually teaching myself programming overall), so any constructive criticism from pro Python programmers is welcome as I'm trying to break myself of beginner habits as quickly as possible.
EDIT:
Maybe some more of the code I've been testing with will help clarify what works and what doesn't:
t1 = [1,1,1]; t2 = [1,0,1]; t3 = [1,1]; t4 = [1, 0, 1, 1, 1, 1, 1]
t5 = [1,1,1,1]; t6 = [1,1,0,1]; t7 = [1,0,1,1,0]
f1 = gf2poly(t1); f2 = gf2poly(t2); f3 = gf2poly(t3); f4 = gf2poly(t4)
f5 = gf2poly(t5);f6 = gf2poly(t6);f7 = gf2poly(t7)
##print "subtract: ",a.subtract(b)
##print "add: ",a.add(b)
##print "multiply: ",gf2poly.multiply(f1,f3)
##print "multiply: ",gf2poly.multiply(f1,f2)
##print "multiply: ",gf2poly.multiply(f3,f4)
##print "degree a: ",a.degree()
##print "degree c: ",c.degree()
##print "divide: ",gf2poly.divide(f1,b)
##print "divide: ",gf2poly.divide(f4,a)
##print "divide: ",gf2poly.divide(f4,f2)
##print "divide: ",gf2poly.divide(f2,a)
##print "***********************************"
##print "quotient: ",gf2poly.quotient(f2,f5)
##print "remainder: ",gf2poly.remainder(f2,f5)
##testq = gf2poly.quotient(f4,f2)
##testr = gf2poly.remainder(f4,f2)
##print "quotient: ",testq,type(testq)
##print "remainder: ",testr,type(testr)
##print "***********************************"
##print "outFormat testp: ",gf2poly.outFormat(testq)
##print "outFormat testr: ",gf2poly.outFormat(testr)
##print "***********************************"
#print "gf2poly.modular_inverse(): ",gf2poly.modular_inverse(f2,f3)
print "p1 ",p1 #,type(f2),type(f3)
#print "parsePolyToListInput ",gf2poly.parsePolyToListInput(a)
Part of your problem is that you haven't declared self as an argument for parsePolyToListInput. When you call a method, the instance you call it on is implicitly bound as the first argument. Naming the first argument self is a convention, not a strict requirement - the instance is being bound to poly, which you then try to run a regexp over.
It looks to me like there's some confusion in your design about what's behavior of individual instances of the class and what's class-level or module-level behavior. In Python, it's perfectly acceptable to leave something that doesn't take an instance of a class as a parameter defined as a module-level function, rather than shoehorning it in awkwardly. parsePolyToListInput might be one such function.
Your add implementation, similarly, has a comment saying it "accepts 2 lists as parameters". In fact, it's going to get a gf2poly instance as its first argument - this is probably right if you're planning to do operator overloading, but it means the second argument should also be a gf2poly instance.
EDIT:
Yeah, your example code shows a breakdown between class behavior and instance behavior. Either your multiply call should look something like this:
print "multiply: ",f1.multiply(f3)
Or multiply shouldn't be a method at all:
gfpoly.py:
def multiply(f1, f2):
    a,b = prepBinary(f1,f2)  #use the parameters, not undefined a,b
    g = []; bitsa = "{0:b}".format(a)
    [g.append((b<<i)*int(bit)) for i,bit in enumerate(bitsa)]
    m = reduce(lambda x,y: x^y,g); z = "{0:b}".format(m)
    return z #returns product of 2 polynomials in gf2
That latter approach is, for instance, how the standard math library does things.
The advantage of defining a multiplication method is that you could name it appropriately (http://docs.python.org/2/reference/datamodel.html#special-method-names) and use it with the * operator:
print "multiply: ",f1 *f3