Increasing Speed of Fuzzy Matching words on two lists - python

I have a list of about 500 items. I'd like to replace all fuzzy-matched items in that list with the smallest-length item.
Is there a way to speed up my implementation of fuzzy match?
Note: I posted a similar question before, but I'm reframing it due to lack of response.
My implementation:
def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    """
    #list1 = list(ds1.Title)
    #list2 = list(ds1.Title)
    """
    matchdict = defaultdict(list)
    for i, u in enumerate(list1):
        for i1, u1 in enumerate(list2):
            # Since list orders are the same, this makes sure this isn't the same item.
            if i != i1:
                if fuzz.partial_token_sort_ratio(u, u1) >= cutoff:
                    pair = (u, u1)
                    # Because there are potential duplicates, I have to make the key constant.
                    # Otherwise, putting list1 as the key will result in both duplicate items
                    # serving as the key.
                    """
                    Potential problem:
                    • what if there are different shortstr?
                    """
                    shortstr = min(pair, key=len)
                    longstr = max(pair, key=len)
                    matchdict[shortstr].append(longstr)
    return matchdict

I will assume you have installed python-Levenshtein, which will give you roughly a 4x speed-up.
Optimising the loop and the dictionary access:
import itertools

def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]
        if fuzz.partial_token_sort_ratio(u1, u2) >= cutoff:
            shortstr = min(u1, u2, key=len)
            longstr = max(u1, u2, key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict
This is as fast as it gets besides the fuzz call. If you read the source, you see that some preprocessing is done for each string, in every iteration. We can do it all at once:
def _asciionly(s):
    if PY3:
        return s.translate(translation_table)
    else:
        return s.translate(None, bad_chars)

def full_pre_process(s, force_ascii=False):
    s = _asciionly(s)
    # Keep only letters and numbers (see Unicode docs).
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
    # Force into lowercase.
    string_out = StringProcessor.to_lower_case(string_out)
    # Remove leading and trailing whitespace.
    string_out = StringProcessor.strip(string_out)
    out = ''.join(sorted(string_out))
    return out.strip()
def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    if list1 is not list2:
        list1 = [full_pre_process(each) for each in list1]
        list2 = [full_pre_process(each) for each in list2]
    else:
        # If you are comparing a list to itself, we don't want to preprocess it twice.
        list1 = [full_pre_process(each) for each in list1]
        list2 = list1
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]
        if fuzz.partial_ratio(u1, u2) >= cutoff:
            pair = (u1, u2)
            shortstr = min(pair, key=len)
            longstr = max(pair, key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict

Related

How to remove overlapping item in a nested list?

I am trying to delete the overlapping values in a nested list.
The data looks like this:
[[22, 37, 'foobar'], [301, 306, 'foobar'],[369, 374, 'foobar'], [650, 672, 'foobar'], [1166, 1174, 'foobar'],[1469, 1477, 'foobar'],[2237, 2245, 'foobar'],[2702, 2724, 'foobar'],[3426, 3446, 'foobar'],[3505, 3513, 'foobar'],[3756, 3764, 'foobar'],[69524, 69535, 'foobar'],[3812, 3820, 'foobar'],[4034, 4057, 'foobar'],[4318, 4347, 'foobar'],[58531, 58548, 'foobar'],[4552, 4574, 'foobar'],[4854, 4861, 'foobar'],[5769, 5831, 'foobar'], [5976, 5986, 'foobar'],[6541, 6558, 'foobar'],[6541, 6608, 'foobar'],[7351, 7364, 'foobar'],[7351, 7364, 'foobar'], [7764, 7770, 'foobar'],[58540, 58548, 'foobar'],[69524, 69556, 'foobar']]
There are some overlapping values in indexes 0 and 1 across the list, such as:
[6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar']
[7351, 7364, 'foobar'] overlaps with [7351, 7364, 'foobar']
[58531, 58548, 'foobar'] overlaps with [58540, 58548, 'foobar']
[69524, 69535, 'foobar'] overlaps with [69524, 69556, 'foobar']
I am trying to go through the list and remove the shorter of the overlapping entries. If [6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar'], I want to keep [6541, 6608, 'foobar'] and remove [6541, 6558, 'foobar'] from the list.
So far I tried:
def clean_span(adata):
    data = adata.copy()
    rem_idx = []
    for i in range(len(data)-1):
        if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
            print(" {} overlaps with {}".format(data[i], data[i+1]))
            rem_idx.append(i)
    for i in rem_idx:
        del data[i]
    return data
But this code always leaves some overlapping values behind.
It is the same with this approach as well.
def clean_span(adata):
    data = adata.copy()
    new_data = []
    for i in range(len(data)-1):
        if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
            print(" {} overlaps with {}".format(data[i], data[i+1]))
            new_data.append(data[i+1])
        else:
            new_data.append(data[i])
    return new_data
I would appreciate your help to solve this problem.
def clean_span(adata):
    # To perform O(1) search for index 0 and 1;
    # you can just have one dictionary if indexes don't matter.
    d0 = dict()
    d1 = dict()
    r = []
    for a in adata:
        if a[0] in d0:
            print(str(a) + " overlaps with " + str(d0[a[0]]))
        elif a[1] in d1:
            print(str(a) + " overlaps with " + str(d1[a[1]]))
        else:
            r.append(a)
            d0[a[0]] = a
            d1[a[1]] = a
    return r
Keep 2 dictionaries: one for the first element and one for the second. Then, while iterating over the data, check whether either key exists in the respective dictionary: if the key is found, it is an overlap; otherwise it is not.
Problem in your code:
if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
    print(" {} overlaps with {}".format(data[i], data[i+1]))
    new_data.append(data[i+1])
else:
    new_data.append(data[i])
In the if statement, you add i + 1 to the new_data. So naturally when the loop increments to i + 1, it goes into the else and it adds the overlapped element back to the list.
Side note:
if data[i][0] in data[i+1] or data[i][1] in data[i+1]:
You are searching for the value inside the whole neighbouring sublist here, making your time complexity O(nk).
Overlapping can be found with set.intersection.
import itertools as it

l = # list
# Merge the first two entries of each sublist into a from-to set of values.
m = ((set(range(p[0], p[1]+1)), p[2]) for p in l)
# Combine each element of the list to check overlapping.
new_l = []
for p1, p2 in it.combinations(m, 2):
    s1, l1 = p1
    s2, l2 = p2
    if set.intersection(s1, s2):
        m1, M1 = min(s1), max(s1)
        m2, M2 = min(s2), max(s2)
        # Choose the biggest one.
        if M2-m2 > M1-m1:
            new_l.append((m2, M2, l2))
        else:
            new_l.append((m1, M1, l1))
print(sorted(new_l, key=lambda p: p[0]))
I convert your list to a dict. The keys are foobar1, foobar2, ..., and the first item of each nested list becomes the value in the new dict. The second loop builds rev_dict, and the result variable collects the values that are duplicated in rev_dict. The third loop finds the matching items and removes them from your list.
l_ist = [[22, 37, 'foobar'], ....]
ini_dict = {}
rev_dict = {}
for i in range(len(l_ist)):
    ini_dict[l_ist[i][2]+str(i)] = l_ist[i][0]
for key, value in ini_dict.items():
    rev_dict.setdefault(value, set()).add(key)
result = [key for key, values in rev_dict.items() if len(values) > 1]  # duplicated values in rev_dict
for i in result:
    for j in range(len(l_ist)):
        if i == l_ist[j][0]:
            l_ist[j].remove(i)
print(l_ist)

Find duplicates in a list of strings differing only in upper and lower case writing

I have a list of strings that contains 'literal duplicates' and 'pseudo-duplicates' which differ only in lower- and uppercase writing. I am looking for a function that treats all literal duplicates as one group, returns their indices, and finds all pseudo-duplicates for these elements, again returning their indices.
Here's an example list:
a = ['bar','bar','foo','Bar','Foo','Foo']
And this is the output I am looking for (a list of lists of lists):
dupe_list = [[[0,1],[3]],[[2],[4,5]]]
Explanation: 'bar' appears twice at the indexes 0 and 1 and there is one pseudo-duplicate 'Bar' at index 3. 'foo' appears once at index 2 and there are two pseudo-duplicates 'Foo' at indexes 4 and 5.
Here is one solution (you didn't clarify what the logic of the list items should be; I assumed you want the items in lowercase form as they are met from left to right in the list. Let me know if it must be different):
d = {i: [[], []] for i in set(k.lower() for k in a)}
for i in range(len(a)):
    if a[i] in d.keys():
        d[a[i]][0].append(i)
    else:
        d[a[i].lower()][1].append(i)
result = list(d.values())
Output:
>>> print(result)
[[[0, 1], [3]], [[2], [4, 5]]]
Here's how I would achieve it. But you should consider using a dictionary and not a list of list of list. Dictionaries are excellent data structures for problems like this.
#default argument vars
a = ['bar','bar','foo','Bar','Foo','Foo']

#initialize a dictionary to count occurrences
a_dict = {}
for i in a:
    a_dict[i] = None

#loop through keys in dictionary, which are values from list a
#loop through the items from list a
#if the item is an exact match to the key, add its index to the list of exacts
#if the item is a similar match to the key, add its index to the list of similars
#update the dictionary key's value
for k, v in a_dict.items():
    index_exact = []
    index_similar = []
    for i in range(len(a)):
        if a[i] == str(k):
            index_exact.append(i)
        elif a[i].lower() == str(k):
            index_similar.append(i)
    a_dict[k] = [index_exact, index_similar]

#print out dictionary values to assure answer
print(a_dict.items())

#segregate values from dictionary into their own list
dup_list = []
for v in a_dict.values():
    dup_list.append(v)
print(dup_list)
Here is the solution. I have handled the situations where only pseudo-duplicates are present or only literal duplicates are present.
a = ['bar', 'bar', 'foo', 'Bar', 'Foo', 'Foo', 'ka']

# Dictionaries to store the positions of words
literal_duplicates = dict()
pseudo_duplicates = dict()
for index, item in enumerate(a):
    # Treats words as literal duplicates if the word is in lower case
    if item.islower():
        if item in literal_duplicates:
            literal_duplicates[item].append(index)
        else:
            literal_duplicates[item] = [index]
        # Handle if only literal_duplicates present
        if item not in pseudo_duplicates:
            pseudo_duplicates[item] = []
    # Treats words as pseudo duplicates if the word is not in lower case
    else:
        item_lower = item.lower()
        if item_lower in pseudo_duplicates:
            pseudo_duplicates[item_lower].append(index)
        else:
            pseudo_duplicates[item_lower] = [index]
        # Handle if only pseudo_duplicates present
        if item_lower not in literal_duplicates:
            literal_duplicates[item_lower] = []

# Form the final list from the dictionaries
dupe_list = [[v, pseudo_duplicates[k]] for k, v in literal_duplicates.items()]
Here is a simple and easy-to-understand answer for you:
a = ['bar','bar','foo','Bar','Foo','Foo']
dupe_list = []
ilist = []
ilist2 = []
samecase = -1
dupecase = -1
for i in range(len(a)):
    if a[i] != 'Null':
        ilist = []
        ilist2 = []
        for j in range(i+1, len(a)):
            samecase = -1
            dupecase = -1
            # print(a)
            if i not in ilist:
                ilist.append(i)
            if a[i] == a[j]:
                # print(a[i], a[j])
                samecase = j
                a[j] = 'Null'
            elif a[i] == a[j].casefold():
                # print(a[i], a[j])
                dupecase = j
                a[j] = 'Null'
            # print(samecase)
            # print(ilist, ilist2)
            if samecase != -1:
                ilist.append(samecase)
            if dupecase != -1:
                ilist2.append(dupecase)
        dupe_list.append([ilist, ilist2])
        a[i] = 'Null'
print(dupe_list)

method to return subset of a list of elements with some predefined properties?

I have a list A with elements representing relations of the form ["item1", "relationstype" ,"item2"].
I want to write a function which return a list of all "relationstype" such that if ["item1", "relationstype" ,"item2"] is in A, then ["item2", "relationstype" ,"item1"] is also in A.
For example, if A=[["item1", "relationstype1" ,"item2"],["item3", "relationstype2" ,"item2"],["item2", "relationstype1" ,"item1"],["item2", "relationstype2" ,"item3"],["item3", "relationstype2" ,"item4"]], then the method should return ["relationstype1"].
This is what I tried:
def find_symmetric_realations(A):
    relation_dict = {}
    symmetric_realations = set()
    for elem in A:
        relationstype = elem[1]
        if relationstype not in relation_dict:
            relation_dict[relationstype] = [(elem[0], elem[2])]  # put relation in dict
        else:
            if (elem[2], elem[0]) in relation_dict[relationstype]:
                continue
            else:
                relation_dict[relationstype].append((elem[0], elem[2]))
    # print(relation_dict[list(relation_dict.keys())[0]])
    for elem in relation_dict:
        if all((b, a) in relation_dict[elem] for (a, b) in relation_dict[elem]):
            symmetric_realations.add(elem)
    return list(symmetric_realations)
Update
Based on comments and edits to the question, the original answer was incorrect as it only considered if there was a matching pair for a given relationship, rather than requiring all elements for that relationship to have a matching pair. This function solves that problem:
def symmetric_relationships(A):
    A = set(tuple(e) for e in A)
    rels = set(r for (_, r, _) in A)
    return [r for r in rels if all((i2, rel, i1) in A for (i1, rel, i2) in A if rel == r)]
For example:
A = [
    ["item1", "relationstype1", "item2"],
    ["item3", "relationstype2", "item2"],
    ["item2", "relationstype1", "item1"]
]
print(symmetric_relationships(A))
A.append(["item3", "relationstype1", "item1"])
print(symmetric_relationships(A))
A.append(["item2", "relationstype2", "item3"])
print(symmetric_relationships(A))
A.append(["item1", "relationstype1", "item3"])
print(symmetric_relationships(A))
Output:
['relationstype1']
[]
['relationstype2']
['relationstype1', 'relationstype2']
Original Answer
You can brute-force this with a list comprehension:
r = [r for i1, r, i2 in A if [i2, r, i1] in A]
this will give
['relationstype1', 'relationstype1']
which you can convert to a unique-valued list with
list(set(r))
If item1 is never exactly the same as item2 you can also skip the last step by adding an i1 < i2 test to the list comprehension:
r = [r for i1, r, i2 in A if [i2, r, i1] in A and i1 < i2]
Performance-wise you can probably improve it by converting A to a set (after first converting its elements to tuples):
A = set(tuple(e) for e in A)
r = [r for i1, r, i2 in A if (i2, r, i1) in A and i1 < i2]
The following function finds symmetric relations in your list:
def find_symmetric_realations(A):
    relation_dict = {}
    symmetric_realations = set()
    for elem in A:
        relationstype = elem[1]
        if relationstype not in relation_dict:
            relation_dict[relationstype] = [(elem[0], elem[2])]
        else:
            if (elem[2], elem[0]) in relation_dict[relationstype]:
                symmetric_realations.add(relationstype)
            else:
                relation_dict[relationstype].append((elem[0], elem[2]))
    return symmetric_realations
for index, list1 in enumerate(A):
    for list2 in A[index:]:
        check = all(item in list1 for item in list2)
        if check == True and list1 != list2:
            print(list1[1])
Explanation: on looping over enumerate(A), we get an index and a single list. check stores True if all elements of list2 (from the 2nd loop) are in list1. If check is True but list1 isn't equal to list2, print the middle value, i.e. the relationship type.

Given a linear order completely represented by a list of tuples of strings, output the order as a list of strings

Given pairs of items of form [(a,b),...] where (a,b) means a > b, for example:
[('best','better'),('best','good'),('better','good')]
I would like to output a list of form:
['best','better','good']
This is very hard for some reason. Any thoughts?
======================== code =============================
I know why it doesn't work.
def to_rank(raw):
    rank = []
    for u, v in raw:
        if u in rank and v in rank:
            pass
        elif u not in rank and v not in rank:
            rank = insert_front(u, v, rank)
            rank = insert_behind(v, u, rank)
        elif u in rank and v not in rank:
            rank = insert_behind(v, u, rank)
        elif u not in rank and v in rank:
            rank = insert_front(u, v, rank)
    return [[r] for r in rank]

# Use: insert word u in front of word v in list of words
def insert_front(u, v, words):
    if words == []:
        return [u]
    else:
        head = words[0]
        tail = words[1:]
        if head == v:
            return [u] + words
        else:
            return [head] + insert_front(u, v, tail)

# Use: insert word u behind word v in list of words
def insert_behind(u, v, words):
    words.reverse()
    words = insert_front(u, v, words)
    words.reverse()
    return words
=================== Update ===================
Per the suggestion of many, this is a straightforward topological-sort setting. I ultimately decided to use the code from this source: algocoding.wordpress.com/2015/04/05/topological-sorting-python/
which solved my problem.
def go_topsort(graph):
    in_degree = {u: 0 for u in graph}  # determine in-degree
    for u in graph:                    # of each node
        for v in graph[u]:
            in_degree[v] += 1
    Q = deque()                        # collect nodes with zero in-degree
    for u in in_degree:
        if in_degree[u] == 0:
            Q.appendleft(u)
    L = []                             # list for order of nodes
    while Q:
        u = Q.pop()                    # choose node of zero in-degree
        L.append(u)                    # and 'remove' it from graph
        for v in graph[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                Q.appendleft(v)
    if len(L) == len(graph):
        return L
    else:                              # if there is a cycle,
        return []
RockBilly's solution also works in my case because, in my setting, for every v < u we are guaranteed to have the pair (u,v) in our list. So his answer is not very "computer-sciency", but it gets the job done in this case.
If you have a complete grammar specified then you can simply count up the items:
>>> import itertools as it
>>> from collections import Counter
>>> ranks = [('best','better'),('best','good'),('better','good')]
>>> c = Counter(x for x, y in ranks)
>>> sorted(set(it.chain(*ranks)), key=c.__getitem__, reverse=True)
['best', 'better', 'good']
If you have an incomplete grammar then you can build a graph and DFS all paths to find the longest. This probably isn't very efficient, as I haven't thought hard about that yet :):
def dfs(graph, start, end):
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            yield path
            continue
        for next_state in graph.get(path[-1], []):
            if next_state in path:
                continue
            stack.append(path + [next_state])

def paths(ranks):
    graph = {}
    for n, m in ranks:
        graph.setdefault(n, []).append(m)
    for start, end in it.product(set(it.chain(*ranks)), repeat=2):
        yield from dfs(graph, start, end)
>>> ranks = [('black', 'dark'), ('black', 'dim'), ('black', 'gloomy'), ('dark', 'gloomy'), ('dim', 'dark'), ('dim', 'gloomy')]
>>> max(paths(ranks), key=len)
['black', 'dim', 'dark', 'gloomy']
>>> ranks = [('a','c'), ('b','a'),('b','c'), ('d','a'), ('d','b'), ('d','c')]
>>> max(paths(ranks), key=len)
['d', 'b', 'a', 'c']
What you're looking for is topological sort. You can do this in linear time using depth-first search (pseudocode included in the wiki I linked)
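To make the DFS idea concrete, here is a minimal sketch of a DFS-based topological sort (my own illustration, not the wiki's pseudocode; the function name `topo_sort` and the (higher, lower) pair input format are assumptions matching the question):

```python
def topo_sort(pairs):
    # Build an adjacency list: an edge a -> b means a ranks above b.
    graph = {}
    nodes = set()
    for a, b in pairs:
        graph.setdefault(a, []).append(b)
        nodes.update((a, b))

    seen, order = set(), []

    def visit(node):
        # Post-order DFS: a node is appended only after everything ranked below it.
        if node in seen:
            return
        seen.add(node)
        for nxt in graph.get(node, []):
            visit(nxt)
        order.append(node)

    for node in nodes:
        visit(node)
    return order[::-1]  # reverse post-order is a topological order

print(topo_sort([('best', 'better'), ('best', 'good'), ('better', 'good')]))
# ['best', 'better', 'good']
```

For a completely specified order the result is unique; for a sparse one, any valid topological order may come back. A cycle check is omitted for brevity.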
Here is one way. It is based on using the complete pairwise rankings to make an old-style (early Python 2) cmp function and then using functools.cmp_to_key to convert it to a key suitable for the Python 3 approach to sorting:
import functools

def sortByRankings(rankings):
    def cmp(x, y):
        if x == y:
            return 0
        elif (x, y) in rankings:
            return -1
        else:
            return 1
    items = list({x for y in rankings for x in y})
    items.sort(key=functools.cmp_to_key(cmp))
    return items
Tested like:
ranks = [('a','c'), ('b','a'),('b','c'), ('d','a'), ('d','b'), ('d','c')]
print(sortByRankings(ranks)) #prints ['d', 'b', 'a', 'c']
Note that to work correctly, the parameter rankings must contain an entry for each pair of distinct items. If it doesn't, you would first need to compute the transitive closure of the pairs that you do have before you feed it to this function.
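A hedged sketch of such a transitive-closure step might look like the following (`transitive_closure` is an illustrative helper I am introducing here, not part of the answer above; it assumes the rankings are given as tuples):

```python
def transitive_closure(pairs):
    # Repeatedly add (a, d) whenever (a, b) and (b, d) are both present,
    # until no new pair appears.
    closure = set(pairs)
    while True:
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

print(sorted(transitive_closure([('best', 'better'), ('better', 'good')])))
# [('best', 'better'), ('best', 'good'), ('better', 'good')]
```

The closed set can then be passed to sortByRankings. Each pass is quadratic in the number of pairs, which is fine for small ranking lists like these.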
You can take advantage of the fact that the lowest ranked item in the list will never appear at the start of any tuple. You can extract this lowest item, then remove all elements which contain this lowest item from your list, and repeat to get the next lowest.
This should work even if you have redundant elements, or have a sparser list than some of the examples here. I've broken it up into finding the lowest ranked item, and then the grunt work of using this to create a final ranking.
from copy import copy

def find_lowest_item(s):
    # Iterate over the set of all items
    for item in set([item for sublist in s for item in sublist]):
        # If an item does not appear at the start of any tuple, return it
        if item not in [x[0] for x in s]:
            return item

def sort_by_comparison(s):
    final_list = []
    # Make a copy so we don't mutate the original list
    new_s = copy(s)
    # Get the set of all items
    item_set = set([item for sublist in s for item in sublist])
    for i in range(len(item_set)):
        lowest = find_lowest_item(new_s)
        if lowest is not None:
            final_list.insert(0, lowest)
        # For the highest ranked item, we just compare our current
        # ranked list with the full set of items
        else:
            final_list.insert(0, set(item_set).difference(set(final_list)).pop())
        # Update the list of ranking tuples to remove processed items
        new_s = [x for x in new_s if lowest not in x]
    return final_list
list_to_compare = [('black', 'dark'), ('black', 'dim'), ('black', 'gloomy'), ('dark', 'gloomy'), ('dim', 'dark'), ('dim', 'gloomy')]
sort_by_comparison(list_to_compare)
['black', 'dim', 'dark', 'gloomy']
list2 = [('best','better'),('best','good'),('better','good')]
sort_by_comparison(list2)
['best', 'better', 'good']
list3 = [('best','better'),('better','good')]
sort_by_comparison(list3)
['best', 'better', 'good']
If you do sorting or create a dictionary from the list items, you are going to lose the order, as @Rockybilly mentioned in his answer. I suggest you create a list from the tuples of the original list and then remove duplicates.
def remove_duplicates(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

i = [(5,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
i = remove_duplicates(list(x for s in i for x in s))
print(i) # prints [5, 2, 1, 3, 4]
j = [('excellent','good'),('excellent','great'),('great','good')]
j = remove_duplicates(list(x for s in j for x in s))
print(j) # prints ['excellent', 'good', 'great']
See reference: How do you remove duplicates from a list whilst preserving order?
For explanation on the remove_duplicates() function, see this stackoverflow post.
If the list is complete, meaning it has enough information to do the ranking (and no duplicate or redundant inputs), this will work.
from collections import defaultdict

lst = [('best','better'),('best','good'),('better','good')]
d = defaultdict(int)
for tup in lst:
    d[tup[0]] += 1
    d[tup[1]] += 0  # to create the key in the defaultdict
print(sorted(d, key=lambda x: d[x], reverse=True))
# ['best', 'better', 'good']
Just give them points, increment the left one each time you encounter it in the list.
Edit: I do think the OP has a determined type of input: the tuple count is always nCr(n, 2). That makes this a correct solution. No need to complain about the edge cases, which I already knew about when posting the answer (and mentioned).

One way to get the speed of set.difference() but use fuzzy matching?

I am able to compare two lists with embedded for loops, but the speed of this is quite slow. Is there a way to use set.difference() or some other technique to increase the speed of finding potential fuzzy matches between two lists?
Here's my sample:
matchdict = dicttype
if isinstance(matchdict, collections.defaultdict):
    for i, u in enumerate(list1):
        for i1, u1 in enumerate(list2):
            if func(u, u1) >= cutoff:
                pair = (u, u1)
                #print pair
                #shortstr = min(pair, key=len)
                #longstr = max(pair, key=len)
                #matchdict[shortstr] = longstr
                matchdict[u].append(u1)
elif isinstance(matchdict, dict):
    for i, u in enumerate(list1):
        for i1, u1 in enumerate(list2):
            if func(u, u1) >= cutoff:
                pair = (u, u1)
                #print pair
                shortstr = min(pair, key=len)
                longstr = max(pair, key=len)
                #matchdict[shortstr] = longstr
                matchdict[u] = u1
print('format is: list1 : list2')
To compare two lists you can try this:
sorted(list1) == sorted(list2)
In order to get the difference between the two lists, try this:
list(set(list1) - set(list2)) + list(set(list2) - set(list1))
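As a side note, that last expression is just the symmetric difference, which sets support directly. It only covers exact matches, so it does not address the fuzzy part of the question (the sample lists below are my own, for illustration):

```python
list1 = ["apple", "banana", "cherry"]
list2 = ["banana", "cherry", "date"]

# Elements that appear in exactly one of the two lists.
diff = set(list1) ^ set(list2)  # same as set(list1).symmetric_difference(list2)
print(sorted(diff))
# ['apple', 'date']
```

One practical pattern is to use this fast exact set difference first, then run the expensive fuzzy comparison only on the leftover elements.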
