I am doing a Python exercise that searches for a word in a sorted word list containing more than 100,000 words.
Using bisect_left from Python's bisect module is very efficient, but the binary search I wrote myself is very slow. Could anyone please clarify why?
This is the search function using the bisect module:
from bisect import bisect_left

def in_bisect(word_list, word):
    """Checks whether a word is in a list using bisection search.

    Precondition: the words in the list are sorted

    word_list: list of strings
    word: string
    """
    i = bisect_left(word_list, word)
    if i != len(word_list) and word_list[i] == word:
        return True
    else:
        return False
My implementation is really very inefficient (don't know why):
def my_bisect(wordlist,word):
    """search the given word in a wordlist using
    bisection search, also known as binary search
    """
    if len(wordlist) == 0:
        return False
    if len(wordlist) == 1:
        if wordlist[0] == word:
            return True
        else:
            return False

    if word in wordlist[len(wordlist)/2:]:
        return True

    return my_bisect(wordlist[len(wordlist)/2:],word)
if word in wordlist[len(wordlist)/2:]
will make Python search linearly through half of your wordlist, which defeats the purpose of writing a binary search in the first place. Also, you are not splitting the list correctly: you only ever recurse into the right half. The strategy for binary search is to cut the search space in half at each step, and then apply the same strategy only to the half that your word could be in. To know which half is the right one to search, it is critical that the wordlist is sorted. Here's a sample implementation that keeps track of the number of calls needed to check whether a word is in wordlist.
import random

numcalls = 0

def bs(wordlist, word):
    # increment numcalls
    print('wordlist',wordlist)
    global numcalls
    numcalls += 1

    # base cases
    if not wordlist:
        return False
    length = len(wordlist)
    if length == 1:
        return wordlist[0] == word

    # split the list in half
    mid = length // 2 # mid index
    leftlist = wordlist[:mid]
    rightlist = wordlist[mid:]
    print('leftlist',leftlist)
    print('rightlist',rightlist)
    print()

    # recursion
    if word < rightlist[0]:
        return bs(leftlist, word) # word can only be in left list
    return bs(rightlist, word) # word can only be in right list

alphabet = 'abcdefghijklmnopqrstuvwxyz'
wl = sorted(random.sample(alphabet, 10))
print(bs(wl, 'm'))
print(numcalls)
I included some print statements so you can see what is going on. Here are two sample outputs. First: word is in the wordlist:
wordlist ['b', 'c', 'g', 'i', 'l', 'm', 'n', 'r', 's', 'v']
leftlist ['b', 'c', 'g', 'i', 'l']
rightlist ['m', 'n', 'r', 's', 'v']
wordlist ['m', 'n', 'r', 's', 'v']
leftlist ['m', 'n']
rightlist ['r', 's', 'v']
wordlist ['m', 'n']
leftlist ['m']
rightlist ['n']
wordlist ['m']
True
4
Second: word is not in the wordlist:
wordlist ['a', 'c', 'd', 'e', 'g', 'l', 'o', 'q', 't', 'x']
leftlist ['a', 'c', 'd', 'e', 'g']
rightlist ['l', 'o', 'q', 't', 'x']
wordlist ['l', 'o', 'q', 't', 'x']
leftlist ['l', 'o']
rightlist ['q', 't', 'x']
wordlist ['l', 'o']
leftlist ['l']
rightlist ['o']
wordlist ['l']
False
4
Note that if you double the size of the wordlist, i.e. use
wl = sorted(random.sample(alphabet, 20))
numcalls on average will be only one higher than for a wordlist of length 10, because wordlist has to be split in half only once more.
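As a rough illustration of that doubling behaviour, here is a minimal sketch of an iterative binary search that tracks lo/hi indices instead of slicing (slicing copies half the list on every call). The names binary_search, small and large are made up for this example, and the step count shown is for one specific lookup, not an average.

```python
def binary_search(words, target):
    """Return (found, steps) for target in the sorted list words."""
    lo, hi = 0, len(words)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if words[mid] < target:
            lo = mid + 1   # target can only be to the right of mid
        else:
            hi = mid       # target is at mid or to the left of it
    found = lo < len(words) and words[lo] == target
    return found, steps

# Zero-padded numbers as stand-in "words" so the lists are already sorted.
small = [str(i).zfill(6) for i in range(1024)]
large = [str(i).zfill(6) for i in range(2048)]

print(binary_search(small, ""))   # → (False, 11)
print(binary_search(large, ""))   # → (False, 12)
print(binary_search(large, "001500")[0])   # → True
```

Doubling the list from 1024 to 2048 entries costs exactly one extra step for this lookup, which is the logarithmic growth described above.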
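To check whether a word is in a wordlist, simply use bisect (Python 2.7):
import bisect

def bisect_fun(listfromfile, wordtosearch):
    bi = bisect.bisect_left(listfromfile, wordtosearch)
    # bi can equal len(listfromfile) when the word sorts after every entry
    if bi < len(listfromfile) and listfromfile[bi] == wordtosearch:
        return listfromfile[bi], bi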
Instead of looping over each separate character of a string, I want to loop over parts of a string (multiple characters). Those parts are defined by the keys of a dictionary.
Example:
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
output = ""
What I've tried (looping over each character separately, indeed):
for i in word:
    letter = my_dict[i]
    output += letter
    word = word.lstrip(letter)
My output:
"KeyError: '1'"
But I want to get key "1000" and its value "i", and then continue with key "0010" and get its value "n", etc...
Expected output:
# Expected output:
output = "internet"
Assuming it's a prefix code (otherwise you'd need to define how to deal with ambiguities), accumulate the bits until you have a match, then output the letter and clear the bits:
output = ""
bits = ""
for bit in word:
    bits += bit
    if bits in my_dict:
        letter = my_dict[bits]
        output += letter
        bits = ""
Slight variation of the lookup inside the loop, prompted by Jnevill's answer:
    if letter := my_dict.get(bits):
        output += letter
        bits = ""
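For reuse, the same accumulate-until-match idea can be wrapped in a function. This is only a sketch: the name decode_prefix_code is hypothetical, and raising ValueError on leftover bits is my own choice, not something the question asked for.

```python
def decode_prefix_code(bits, table):
    """Decode a bit string using a prefix-code table mapping code -> letter."""
    letters = []
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in table:
            letters.append(table[buffer])
            buffer = ""
    if buffer:
        # Bits remain that never completed a code word.
        raise ValueError(f"undecodable trailing bits: {buffer!r}")
    return "".join(letters)

my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i',
           '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l',
           '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"

print(decode_prefix_code(word, my_dict))  # → internet
```

Collecting letters in a list and joining once at the end avoids repeated string concatenation, though for inputs this short the difference is negligible.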
You could use a regular expression to substitute the patterns with the corresponding letters. re.sub accepts a function as the replacement, which can look up the matched bits in the dictionary. The search pattern needs to list the longer keys first so that they are consumed in preference to shorter patterns that start with the same bits:
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
import re
pattern = "|".join(sorted(my_dict.keys(),key=len,reverse=True))
output = re.sub(pattern,lambda m:my_dict[m.group(0)],word)
print(output) # internet
[EDIT]
If there are no conflicts between short and long bit patterns, the sort is not needed (as Kelly pointed out), and the solution can be a single line:
output = re.sub('|'.join(my_dict),lambda m:my_dict[m[0]],word)
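To see why the longest-first ordering matters when keys do conflict, here is a small sketch with a made-up table that is not a prefix code ("0" is a prefix of "01"). The table and the helper name decode_with_regex are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical table that is NOT a prefix code: "0" is a prefix of "01".
table = {"0": "a", "01": "b", "1": "c"}

def decode_with_regex(bits, table, longest_first=True):
    """Replace every key of table found in bits with its letter."""
    keys = sorted(table, key=len, reverse=True) if longest_first else list(table)
    # re.escape is unnecessary for pure bit strings but keeps the sketch general.
    pattern = "|".join(re.escape(k) for k in keys)
    return re.sub(pattern, lambda m: table[m.group(0)], bits)

print(decode_with_regex("011", table))                       # → bc
print(decode_with_regex("011", table, longest_first=False))  # → acc
```

Regex alternation tries the alternatives from left to right, so with "0" listed before "01" the shorter key always wins.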
Issue with your code:
for i in word: # here, i is a single character
    # so you can't get corresponding value since it's multiple character keys
    letter = my_dict[i]
    output += letter # this would work fine
    word = word.lstrip(letter)
You can use a while loop on word and remove the part you found in the dict each time. When word is empty, the loop stops and the program ends.
Inside the loop, iterate over each key in the dict and test whether it matches the beginning of the word. If it does, you have the letter you are looking for. Do whatever you want with it instead of the print, and repeat.
translate_table = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
message = "1000001001100001100000100000110"
while message:
    for code, letter in translate_table.items():
        if message.startswith(code):
            # replace this with whatever you want to do with the letter
            print(letter, end="")
            # "Cut" the word to keep the remaining characters
            message = message[len(code):]
            break # a letter was found, move to next while iteration
While iterating my_dict (as DorianTurba suggests) feels like a more elegant solution, your gut was suggesting that you should iterate word. To do this you can use a while loop and then manage the length of characters you jump in each iteration depending on the size of the my_dict key that matches the first 3, 4, or 5 characters in word.
Consider:
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
i=0
while len(word) > i:
    for size in [3,4,5]:
        if my_dict.get(word[i:i+size]):
            print(my_dict[word[i:i+size]])
            i += size
            break
First time posting, new to programming. When I run my pangram function and turn the input string into a sorted list of its unique characters, the list still contains two 't's, but no other letter is repeated. For example, with the input "The quick brown fox jumps over the lazy dog", the list I build to compare against the alphabet has two t's. As you can see, I'm using print to see what everything is doing.
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Above is what my input string converts to: there are two t's, but all other repeated letters are gone. I also tried manually changing the upper-case T to lower case, and changing other random letters to upper case, and it has no problem with other letters.
import string

def ispangram(str1, alphabet=string.ascii_lowercase):
    str1 = str1.replace(' ','')
    str1 = list(set(str1))
    str1 = [letter.lower() for letter in str1]
    str1.sort()
    print(str1)
    alphabet = list(set(alphabet))
    alphabet.sort()
    print(alphabet)
    if str1 == alphabet:
        return 'Is Pangram!'
    else:
        return 'Is not Pangram!'
You're converting your string to lowercase after building the set. You should make it lowercase before building the set:
str1 = list(set(str1.lower()))
You are lowercasing the characters after building the set, so T and t are considered different.
Instead, lowercase before building the set:
def ispangram(str1, alphabet=string.ascii_lowercase):
    str1 = str1.replace(' ','')
    str1 = [letter.lower() for letter in str1]
    str1 = list(set(str1))
    str1.sort()
    print(str1)
Or in a much shorter way:
def ispangram(str1, alphabet=string.ascii_lowercase):
    str1 = sorted(set(str1.replace(' ','').lower()))
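Going one step further than the answers above, a complete check can skip the sorting entirely and compare sets. This is a sketch of an alternative, not the original poster's approach; the name is_pangram and the boolean return value are my own choices.

```python
import string

def is_pangram(text, alphabet=string.ascii_lowercase):
    """Return True if every letter of alphabet occurs in text (case-insensitive)."""
    # Lowercase first, then build the set, so 'T' and 't' collapse into one entry.
    return set(alphabet) <= set(text.lower())

print(is_pangram("The quick brown fox jumps over the lazy dog"))  # → True
print(is_pangram("Hello world"))                                   # → False
```

The subset test also ignores spaces and punctuation automatically, so the replace(' ', '') step is no longer needed.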
Hey guys, I'm new to Python. I was playing around writing some Python and I'm stuck.
words = ['w','hello.','my','.name.','(is)','james.','whats','your','name?']
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
'''#position of the invalid words i.e the ones that '''
inpos = -1
for word in words:
    inpos = inpos + 1
    #pass
    for letter in word:
        #print(letter)
        if letter in alphabet:
            pass
            #print('Valid')
        elif letter not in alphabet:
            new_word = word.replace(letter,"")
            print(word)
            print(new_word)
            words[inpos] = new_word
print(words)
This code is meant to clean the text (remove all full stops, commas, and other non-letter characters).
The problem is that when I run it, it seems to remove the brackets and then add them back.
Here's the output:
[Image of output]
Can anyone explain why this is happening?
No, it does not add anything. You're printing both old and new word:
print(word)
print(new_word)
so when new_word is "(is", word is still "(is)".
BTW your code has a logical error: when you remove a character you put back new_word in the list, but word is still the old value. So only the last change for every word will be saved in the list words.
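One way to avoid that logical error is to build each cleaned word in a single pass instead of calling replace once per bad character. The following is a minimal sketch; the helper name strip_non_letters is hypothetical, and it assumes only lowercase ASCII letters should be kept, as in the question's alphabet list.

```python
import string

def strip_non_letters(word, allowed=string.ascii_lowercase):
    """Keep only the characters of word that appear in allowed."""
    return "".join(ch for ch in word if ch in allowed)

words = ['w', 'hello.', 'my', '.name.', '(is)', 'james.', 'whats', 'your', 'name?']
cleaned = [strip_non_letters(w) for w in words]
print(cleaned)
# → ['w', 'hello', 'my', 'name', 'is', 'james', 'whats', 'your', 'name']
```

Because every character is filtered in the same pass, "(is)" loses both brackets rather than only the last one processed.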
I need to find words in a list of lists (a sort of matrix), searching in a given direction.
For example, to find all words that read horizontally from right to left, I would manipulate the indexes that run over the columns.
the_matrix = [['a', 'p', 'p', 'l', 'e'],
['a', 'g', 'o', 'd', 'o'],
['n', 'n', 'e', 'r', 't'],
['g', 'a', 'T', 'A', 'C'],
['m', 'i', 'c', 's', 'r'],
['P', 'o', 'P', 'o', 'P']]
the_word_list = ['ert','PoP']
def find_words_in_matrix(directions):
    good_list = []
    for col in range(len(the_matrix[0])):
        for row in range(len(the_matrix)):
            for word in the_word_list:
                for i in range(len(word)):
                    found_word = True
                    #a = word[i]
                    if col + i > len(the_matrix[0])-1:
                        break
                    #b = the_matrix[row][col+i]
                    if word[i] != the_matrix[row][col+i]:
                        found_word=False
                        break
                if found_word is True:
                    good_list.append(word)
    return good_list
I'm getting the output:
['PoP', 'ert', 'PoP', 'ert', 'PoP']
instead of:
['PoP', 'ert', 'PoP']
PoP appears twice in the bottom row, not three times, and ert appears only once.
That is my problem.
Thanks for the help!
You are getting stray matches when the break terminates the loop early before the whole word has matched. To eliminate this, you can keep track of the match length:
def find_words_in_matrix(directions):
    good_list = []
    for col in range(len(the_matrix[0])):
        for row in range(len(the_matrix)):
            for word in the_word_list:
                match_len = 0
                for i in range(len(word)):
                    found_word = True
                    #a = word[i]
                    if col + i > len(the_matrix[0])-1:
                        break
                    #b = the_matrix[row][col+i]
                    if word[i] != the_matrix[row][col+i]:
                        found_word=False
                        break
                    match_len += 1
                if (match_len == len(word)) and (found_word is True):
                    good_list.append(word)
    return good_list
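An alternative that sidesteps the early-break problem altogether is to compare a whole slice of the row against the word. This is only a sketch for the left-to-right direction; the function name find_words_left_to_right is hypothetical, and it takes the matrix and word list as parameters rather than reading the globals.

```python
def find_words_left_to_right(matrix, word_list):
    """Return every word found reading left to right, scanning column by column."""
    found = []
    width = len(matrix[0])
    for col in range(width):
        for row in matrix:
            for word in word_list:
                end = col + len(word)
                # Only compare when the word fits in the row from this column.
                if end <= width and "".join(row[col:end]) == word:
                    found.append(word)
    return found

the_matrix = [['a', 'p', 'p', 'l', 'e'],
              ['a', 'g', 'o', 'd', 'o'],
              ['n', 'n', 'e', 'r', 't'],
              ['g', 'a', 'T', 'A', 'C'],
              ['m', 'i', 'c', 's', 'r'],
              ['P', 'o', 'P', 'o', 'P']]
the_word_list = ['ert', 'PoP']

print(find_words_left_to_right(the_matrix, the_word_list))
# → ['PoP', 'ert', 'PoP']
```

A truncated slice near the right edge can never equal the full word, so partial matches like the stray 'ert' and 'PoP' cannot be reported.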
I am new to NLP and NLTK, and I want to find ambiguous words, meaning words with at least n different tags. I have this method, but the output is more than confusing.
Code:
def MostAmbiguousWords(words, n):
    # wordsUniqeTags holds a list of unique tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        if wordsUniqeTags.has_key(w):
            wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
        else:
            wordsUniqeTags[w] = set([t])
    # Starting to count
    res = []
    for w in wordsUniqeTags:
        if len(wordsUniqeTags[w]) >= n:
            res.append((w, wordsUniqeTags[w]))
    return res

MostAmbiguousWords(brown.tagged_words(), 13)
Output:
[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]
Now I have no idea what B, C, Q, etc. could represent. So, my questions:
What are these?
What do they mean? (In case they are tags)
I think they are not tags, because who's and what's don't have the WH tag indicating "wh question words".
I'd be happy if someone could post a link that includes a mapping of all possible tags and their meanings.
It looks like you have a typo. In this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
you should have set([t]) (not set(t)), like you do in the else case.
This explains the behavior you're seeing because t is a string and set(t) is making a set out of each character in the string. What you want is set([t]) which makes a set that has t as its element.
>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W']) # bad
>>> set([t])
set(['WHQ']) # good
By the way, you can correct the problem and simplify things by just changing that line to:
wordsUniqeTags[w].add(t)
But, really, you should make use of the setdefault method on dict and list comprehension syntax to improve the method overall. So try this instead:
def most_ambiguous_words(words, n):
    # wordsUniqeTags holds a set of unique tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        wordsUniqeTags.setdefault(w, set()).add(t)
    # Starting to count (items() works in both Python 2 and 3)
    return [(word,tags) for word,tags in wordsUniqeTags.items() if len(tags) >= n]
You are splitting your POS tags into single characters in this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
set('AT') results in set(['A', 'T']).
How about making use of the Counter and defaultdict functionality in the collections module?
from collections import defaultdict, Counter

def most_ambiguous_words(words, n):
    counts = defaultdict(Counter)
    for (word, tag) in words:
        counts[word][tag] += 1
    # "at least n different tags": compare with >= and index by w, not the stale loop variable word
    return [(w, counts[w].keys()) for w in counts if len(counts[w]) >= n]
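If the per-tag counts are not needed, a defaultdict of sets is enough. Below is a small Python 3 sketch run on made-up (word, tag) pairs rather than the Brown corpus, so it does not require NLTK; the sample data and the variable names are assumptions for illustration only.

```python
from collections import defaultdict

def most_ambiguous_words(tagged_words, n):
    """Return (word, tags) pairs for words seen with at least n distinct tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_by_word[word].add(tag)   # add the whole tag, not its characters
    return [(w, tags) for w, tags in tags_by_word.items() if len(tags) >= n]

# Hypothetical sample standing in for brown.tagged_words().
sample = [("that", "CS"), ("that", "DT"), ("that", "WPS"),
          ("dog", "NN"), ("dog", "NN")]

result = most_ambiguous_words(sample, 3)
print(result)  # one entry for "that"; the order of tags inside the set may vary
```

With the real corpus, the same function should accept brown.tagged_words() directly, since it only needs an iterable of (word, tag) pairs.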