I have two lists at hand, each consisting of tuples of the form (lemma, PoS, token). List A is generated by a lemmatizer from input text; List B comes from a pre-existing database.
The objective is to check whether each token from List A is found in List B, and then:
if it is found, return the corresponding lemmas from both lists and check whether they are the same, writing YES if so and NO if not;
if it is not found, return the corresponding lemma from List A.
Ideally, the output should be something like:
Token from List A | Lemma from List A | Token found in B? | Lemma from List B | Same lemma?
sé                | ég                | YES               | ég                | YES
skór              | skó               | YES               | skór              | NO
Keplerur          | Kepler            | NO                | n/a               | n/a
So, I tried doing that with for loops because we didn't really learn much else:
comparison = []
for l, pos, t in list_a:
    for seq in list_b:
        if t == seq[2]:
            if pos == seq[1]:
                if l == seq[0]:
                    comparison.append((t, l, "YES", seq[0], "YES"))
            elif pos == seq[1]:
                if l != seq[0]:
                    comparison.append((t, l, "YES", seq[0], "NO"))
            elif pos != seq[1]:
                comparison.append((t, l, "NO", "na", "na"))
        elif t != seq[2]:
            comparison.append((t, l, "NO", "na", "na"))
You see, whilst List A is quite short (~120 tuples from the text I am testing on), List B is pre-existing and has >6M elements. For-looping through it for each item on List A is not going to be efficient, I guess. Apparently, my laptop can't complete executing this code anyway, so I can't even test it.
What could I do? I have a feeling a fundamentally different approach is needed here.
UPD: I have, after about an hour of trial and error, come up with this solution:
for l, pos, t in lemtok:
    ind = next((a for a, b in enumerate(binlistfinal) if b[2] == t), None)
    if ind is None:
        comparison.append((t, l, "NO", "n/a", "n/a"))
    else:
        if pos == binlistfinal[ind][1]:
            if l == binlistfinal[ind][0]:
                comparison.append((t, l, "YES", binlistfinal[ind][0], "YES"))
            else:
                comparison.append((t, l, "YES", binlistfinal[ind][0], "NO"))
        else:
            comparison.append((t, l, "NO", "n/a", "n/a"))
print(comparison)
where lemtok is List A and binlistfinal is List B.
So your current code is O(NM) where N is the size of list A and M is the size of list B.
Assuming list B is not sorted nor does it contain unique tokens, you are going to have to at least iterate through all elements of B once. This makes it O(M).
If we can get the lookup time for list A down to O(1), then the overall runtime stays O(M). I will just assume all of the tokens in A are unique, since the list is quite short; if they are not, you can instead build a dictionary mapping each token to a list of its associated (lemma, PoS) values and adjust from there.
Sorting list B will not actually help, because that is at best O(M log M), which is worse.
Now your code might look like the following:
old_a = ...  # List A, tuples of (lemma, pos, token)
B = ...      # List B

# map each token in A to its (lemma, pos); note the tuple order is (lemma, pos, token)
A = {token: (lemma, pos) for lemma, pos, token in old_a}

B_tokens_in_A = set()  # the tokens of B found in A, which equal the tokens of A found in B
comparison = []
for lemma, pos, token in B:
    if token in A:
        B_tokens_in_A.add(token)
        if A[token][0] == lemma:
            # same lemma: return it and write YES
            comparison.append((token, A[token][0], "YES", lemma, "YES"))
        else:
            # different lemmas: return both and write NO
            comparison.append((token, A[token][0], "YES", lemma, "NO"))

# now we need to return the lemmas from list A whose tokens were not in list B
for token, (lemma, pos) in A.items():
    if token not in B_tokens_in_A:
        comparison.append((token, lemma, "NO", "n/a", "n/a"))
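For a quick sanity check, here is roughly how this behaves on the sample rows from the question's table; the PoS tags ("fn", "no") and the List B rows are made up for illustration:

old_a = [("ég", "fn", "sé"), ("skó", "no", "skór"), ("Kepler", "no", "Keplerur")]
B = [("ég", "fn", "sé"), ("skór", "no", "skór")]
# after the loops above, comparison would be:
# [("sé", "ég", "YES", "ég", "YES"),
#  ("skór", "skó", "YES", "skór", "NO"),
#  ("Keplerur", "Kepler", "NO", "n/a", "n/a")]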
I have a giant list of words (a corpus) and a particular word w. I know the index of every occurrence of w in the corpus. I want to look at an n-sized window around every occurrence of w and create a dictionary of the other words that occur within that window. The dictionary is a mapping from int to list[str], where the key is how many positions away from my target word I am, either to the left (negative) or to the right (positive), and the value is the list of words at that offset.
For example, if I have the corpus: ["I", "made", "burgers", "Jack", "made", "sushi"]; my word is "made" and I am looking at a window of size 1, then I ultimately want to return {-1: ["I", "Jack"], 1: ["burgers", "sushi"]}.
There are two problems that can occur. My window may go out of bounds (if I looked at a window of size 2 in the above example) and I can encounter the same word multiple times in that window, which are cases I want to ignore. I have written the following code which seems to work, but I want to make this cleaner.
def find_neighbor(word: str, corpus: list[str], n: int = 1) -> dict[int, list[str]]:
    mapping = {k: [] for k in range(-n, n + 1) if k != 0}
    idxs = [k for k, v in enumerate(corpus) if v == word]
    for idx in idxs:
        for i in [x for x in range(-n, n + 1) if x != 0]:
            try:
                item = corpus[idx + i]
                if item != word:
                    mapping[i].append(item)
            except IndexError:
                continue
    return mapping
Is there a way to incorporate options and pattern matching so that I can remove the try block and have something like this...
match corpus[idx + i]:
    case None: continue  # if it doesn't exist (out of bounds), continue / I can also break
    case word: continue  # if it is the word itself, continue
    case _: mapping[i].append(item)  # otherwise, add it to the dictionary
Introduce a helper function that returns corpus[i] if i is a legal index and None otherwise:
corpus = ["foo", "bar", "baz"]
def get(i):
    # guard both ends so a negative offset does not wrap around to the end of the list
    return corpus[i] if 0 <= i < len(corpus) else None
print([get(0), get(1), get(2), get(3)])
The result of the above is:
['foo', 'bar', 'baz', None]
Now you can write:
match get(idx + i):
    case None: something            # out of bounds
    case w if w == word: something  # note: a bare "case word:" would capture any value, so use a guard
    case _: something
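Putting the pieces together, a minimal sketch of the whole function along these lines (my own assembly, not code from the original answer) could look like this:

def find_neighbor(word: str, corpus: list[str], n: int = 1) -> dict[int, list[str]]:
    def get(i):
        # legal, non-wrapping indices only
        return corpus[i] if 0 <= i < len(corpus) else None

    mapping = {k: [] for k in range(-n, n + 1) if k != 0}
    for idx, v in enumerate(corpus):
        if v != word:
            continue
        for i in mapping:
            match get(idx + i):
                case None:            # out of bounds
                    continue
                case w if w == word:  # the target word itself
                    continue
                case w:
                    mapping[i].append(w)
    return mapping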
I am trying to solve the LeetCode Word Ladder problem (https://leetcode.com/problems/word-ladder/description/):
Given two words (beginWord and endWord) and a dictionary's word list, find the length of the shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Note:
Return 0 if there is no such transformation sequence.
All words have the same length.
All words contain only lowercase alphabetic characters.
You may assume no duplicates in the word list.
You may assume beginWord and endWord are non-empty and are not the same.
Input:
beginWord = "hit",
endWord = "cog",
wordList = ["hot","dot","dog","lot","log","cog"]
Output:
5
Explanation:
As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" ->
"cog", return its length 5.
import queue

class Solution:
    def isadjacent(self, a, b):
        count = 0
        n = len(a)
        for i in range(n):
            if a[i] != b[i]:
                count += 1
            if count > 1:
                return False
        if count == 1:
            return True

    def ladderLength(self, beginWord, endWord, wordList):
        word_queue = queue.Queue(maxsize=0)
        word_queue.put((beginWord, 1))
        while word_queue.qsize() > 0:
            queue_last = word_queue.get()
            index = 0
            while index != len(wordList):
                if self.isadjacent(queue_last[0], wordList[index]):
                    new_len = queue_last[1] + 1
                    if wordList[index] == endWord:
                        return new_len
                    word_queue.put((wordList[index], new_len))
                    wordList.pop(index)
                    index -= 1
                index += 1
        return 0
Can someone suggest how to optimise it and prevent the error!
The basic idea is to find the adjacent words faster. Instead of considering every word in the list (even one that has already been filtered by word length), construct each possible neighbor string and check whether it is in the dictionary. To make those lookups fast, make sure the word list is stored in something like a set that supports fast membership tests.
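As a rough sketch of that idea (my own illustration, not the poster's code; names like word_set are arbitrary):

from collections import deque
import string

def ladderLength(beginWord, endWord, wordList):
    word_set = set(wordList)            # O(1) membership tests
    if endWord not in word_set:
        return 0
    word_queue = deque([(beginWord, 1)])
    while word_queue:
        word, length = word_queue.popleft()
        for i in range(len(word)):
            for c in string.ascii_lowercase:
                candidate = word[:i] + c + word[i + 1:]
                if candidate == endWord:
                    return length + 1
                if candidate in word_set:
                    word_set.remove(candidate)   # each word is visited at most once
                    word_queue.append((candidate, length + 1))
    return 0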
To go even faster, you could store two sorted word lists, one sorted by the reverse of each word. Then look for possibilities involving changing a letter in the first half in the reversed list and for the latter half in the normal list. All the existing neighbors can then be found without making any non-word strings. This can even be extended to n lists, each sorted by omitting one letter from all the words.
Given a string str and a list of variable-length prefixes p, I want to find all possible prefixes found at the start of str, allowing for up to k mismatches and wildcards (dot character) in str.
I only want to search at the beginning of the string and need to do this efficiently for len(p) <= 1000; k <= 5 and millions of strs.
So for example:
str = 'abc.efghijklmnop'
p = ['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx']
k = 1
result = ['abc', 'xbc', 'abcxx'] #but not 'xxx', 'abcxxx'
Is there an efficient algorithm for this, ideally with a python implementation already available?
My current idea would be to walk through str character by character and keep a running tally of each prefix's mismatch count.
At each step, I would calculate a new list of candidates which is the list of prefixes that do not have too many mismatches.
If I reach the end of a prefix it gets added to the returned list.
So something like this:
def find_prefixes_with_mismatches(str, p, k):
    p_with_end = [prefix + '$' for prefix in p]
    candidates = list(range(len(p)))
    mismatches = [0 for _ in candidates]
    result = []
    for char_ix in range(len(str)):
        # at each iteration we build a new set of candidates
        new_candidates = []
        for prefix_ix in candidates:
            # have we reached the end?
            if p_with_end[prefix_ix][char_ix] == '$':
                # then this is a match
                result.append(p[prefix_ix])
                # do not add to new_candidates
            else:
                # do we have a mismatch?
                if str[char_ix] != p_with_end[prefix_ix][char_ix] and str[char_ix] != '.' and p_with_end[prefix_ix][char_ix] != '.':
                    mismatches[prefix_ix] += 1
                    # only add to new_candidates if the count is still not > k
                    if mismatches[prefix_ix] <= k:
                        new_candidates.append(prefix_ix)
                else:
                    # if not, this remains a candidate
                    new_candidates.append(prefix_ix)
        # update candidates
        candidates = new_candidates
    return result
But I'm not sure if this will be any more efficient than simply searching one prefix after the other, since it requires rebuilding this list of candidates at every step.
I do not know of something that does exactly this.
But if I were to write it, I'd try constructing a trie of all possible decision points, with an attached vector of all states you wound up in. You would then take each string, walk the trie until you hit a final matched node, then return the precompiled vector of results.
If you've got a lot of prefixes and have set k large, that trie may be very big. But if you're amortizing creating it against running it on millions of strings, it may be worthwhile.
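For what it's worth, here is a minimal sketch of a simpler trie-based variant: instead of precompiling result vectors, it builds a trie of the prefixes and walks it against the start of each string while tracking the mismatch budget. All names are illustrative, and the wildcard is only honoured in the string, as described in the question:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = None          # the full prefix ending at this node, if any

def build_trie(prefixes):
    root = TrieNode()
    for prefix in prefixes:
        node = root
        for ch in prefix:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = prefix
    return root

def find_prefixes_with_mismatches_trie(s, root, k):
    result = []
    # each stack entry is (trie node, position in s, mismatches used so far)
    stack = [(root, 0, 0)]
    while stack:
        node, pos, mm = stack.pop()
        if node.terminal is not None:
            result.append(node.terminal)
        if pos >= len(s):
            continue
        for ch, child in node.children.items():
            cost = 0 if (ch == s[pos] or s[pos] == '.') else 1
            if mm + cost <= k:
                stack.append((child, pos + 1, mm + cost))
    return result

# e.g. find_prefixes_with_mismatches_trie('abc.efghijklmnop',
#          build_trie(['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx']), 1)
# returns ['abc', 'xbc', 'abcxx'] (in some order)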
I have "n" number of strings as input, which i separate into possible subsequences into a list like below
If the Input is : aa, b, aa
I create a list like the below(each list having the subsequences of the string):
aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
I would like to find the combinations of palindromes across the lists in aList.
For example, the possible palindromes for this would be 5: aba, aba, aba, aba, aabaa.
This can be achieved by brute force using the code below:
import itertools

d = []
count = 0

def isPalindrome(x):
    if x == x[::-1]: return True
    else: return False

for I in itertools.product(*aList):
    a = ''.join(I)
    if isPalindrome(a):
        if a not in d:
            d.append(a)
            count += 1
But this approach results in a timeout when the number of strings and the lengths of the strings get bigger.
Is there a better approach to the problem ?
Second version
This version uses a set called seen, to avoid testing combinations more than once.
Note that your function isPalindrome() can be simplified to a single expression, so I removed it and just did the test in-line to avoid the overhead of an unnecessary function call.
import itertools

aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]

d = []
seen = set()
for I in itertools.product(*aList):
    if I not in seen:
        seen.add(I)
        a = ''.join(I)
        if a == a[::-1]:
            d.append(a)

print('d: {}'.format(d))
The current approach has the disadvantage that most of the generated combinations are thrown away only at the end, when the joined string is checked for being a palindrome.
One idea is that once you pick a piece from one side, you can immediately check whether there is a corresponding piece in the last group.
For example, let's say that your space is this:
[["a","b","c"], ... , ["b","c","d"]]
We can see that if you pick "a" as the first pick, there is no "a" in the last group, and this excludes all the combinations that would otherwise be tried.
For larger input you could probably get some time gain by grabbing words from the first array and comparing them with the words of the last array, to check that these pairs still allow a palindrome to be formed, or that such a combination can never lead to one no matter which words from the remaining arrays are inserted in between.
This way you probably cancel out a lot of possibilities, and this method can be repeated recursively, once you have decided that a pair is still in the running. You would then save the common part of the two words (when the second word is reversed of course), and keep the remaining letters separate for use in the recursive part.
Depending on which of the two words was longer, you would compare the remaining letters with words from the array that is next from the left or from the right.
This should bring a lot of early pruning in the search tree. You would thus not perform the full Cartesian product of combinations.
I have also written the function to get all substrings from a given word, which you probably already had:
def allsubstr(str):
    return [str[i:j+1] for i in range(len(str)) for j in range(i, len(str))]
def getpalindromes_trincot(aList):
    def collectLeft(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aRevList[j]:
            if seq.startswith(needle):
                results += collectRight(common+needle, seq[len(needle):], i, j-1)
            elif needle.startswith(seq):
                results += collectLeft(common+seq, needle[len(seq):], i, j-1)
        return results

    def collectRight(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aList[i]:
            if seq.startswith(needle):
                results += collectLeft(common+needle, seq[len(needle):], i+1, j)
            elif needle.startswith(seq):
                results += collectRight(common+seq, needle[len(seq):], i+1, j)
        return results

    aRevList = [[seq[::-1] for seq in seqs] for seqs in aList]
    return collectRight('', '', 0, len(aList)-1)

# sample input and call:
input = ['already', 'days', 'every', 'year', 'later']
aList = [allsubstr(word) for word in input]
result = getpalindromes_trincot(aList)
I did a timing comparison with the solution that martineau posted. For the sample data I have used, this solution is about 100 times faster:
See it run on repl.it
Another Optimisation
Some gain could also be found in not repeating the search when the first array has several entries with the same string, like the 'a' in your example data. The results that include the second 'a' will obviously be the same as for the first. I did not code this optimisation, but it might be an idea to improve the performance even more.
I have a list of sublists, each of which consists of one or more strings. I am comparing each string in one sublist to every string in the other sublists, which means writing two for loops. However, my data set has ~5000 sublists, so my program keeps running forever unless I run the code in increments of 500 sublists. How do I change the flow of this program so that I can still look at all j values corresponding to each i, yet be able to run it on all ~5000 sublists? (wn is the WordNet library.)
Here's part of my code:
for i in range(len(somelist)):
    if i == len(somelist) - 1:  # if the last sublist, do not compare
        break
    title_former = somelist[i]
    for word in title_former:
        singular = wn.morphy(word)  # convert to singular
        if singular == None:
            pass
        elif singular != None:
            newWordSyn = getNewWordSyn(word, singular)
            if not newWordSyn:
                uncounted_words.append(word)
            else:
                for j in range(i + 1, len(somelist)):
                    title_latter = somelist[j]
                    for word1 in title_latter:
                        singular1 = wn.morphy(word1)
                        if singular1 == None:
                            uncounted_words.append(word1)
                        elif singular1 != None:
                            newWordSyn1 = getNewWordSyn(word1, singular1)
                            tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
Example:
Input = [['space', 'invaders'], ['draw']]
Output= {('space','draw'):0.5,('invaders','draw'):0.2}
The output is a dictionary with corresponding string pair tuple and their similarity value. The above code snippet is not complete.
How about doing a bit of preprocessing instead of doing a bunch of operations over and over? I did not test this, but you get the idea; you need to take anything you can out of the loop.
# Preprocessing:
unencountered_words = []
preprocessed_somelist = []
for sublist in somelist:
    new_sublist = []
    preprocessed_somelist.append(new_sublist)
    for word in sublist:
        temp = wn.morphy(word)
        if temp:
            new_sublist.append(temp)
        else:
            unencountered_words.append(word)

# Nested loops:
for i in range(len(preprocessed_somelist) - 1):  # equivalent to your logic
    for word in preprocessed_somelist[i]:
        for j in range(i + 1, len(preprocessed_somelist)):
            for word1 in preprocessed_somelist[j]:
                tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
You could try something like this, but I doubt it will be faster (and you will probably need to change the distance function):
def dist(s1, s2):
    return sum([i != j for i, j in zip(s1, s2)]) + abs(len(s1) - len(s2))

dict([((k, v), dist(k, v)) for k, v in itertools.product(Input1, Input2)])
This is always going to have scaling issues, because you're doing n^2 string comparisons. Julius' optimization is certainly a good starting point.
The next thing you can do is store similarity results so you don't have to compare the same words repeatedly.
One other optimisation you can make is to store comparisons of words and reuse them if the same words are encountered again.
key = (newWordSyn, newWordSyn1)
if key in prevCompared:
    tempSimilarity = prevCompared[key]
else:
    tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
    prevCompared[key] = tempSimilarity
    prevCompared[(newWordSyn1, newWordSyn)] = tempSimilarity
This only helps if you will see a lot of the same word combinations, but I think wup_similarity is quite expensive.
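The same caching idea can also be sketched with functools.lru_cache; this assumes the synset objects returned by getNewWordSyn are hashable (NLTK Synsets generally are), and it caches per ordered pair rather than storing both orders as the snippet above does:

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_similarity(syn_a, syn_b):
    # compute once per ordered pair of synsets, then reuse the cached value
    return syn_a.wup_similarity(syn_b)

# later, inside the loops:
# tempSimilarity = cached_similarity(newWordSyn, newWordSyn1)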