Loop over string based on keys in dictionary

Loop over string based on keys in dictionary - python

Instead of looping over each separate character of a string, I want to loop over parts of a string (multiple characters). Those parts are defined by the keys of a dictionary.
Example:
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
output = ""
What I've tried (looping over each character separately, indeed):
for i in word:
letter = my_dict[i]
output += letter
word = word.lstrip(letter)
My output:
"KeyError: '1'"
But I want to get key "1000" and its value "i", and then continue with key "0010" and get its value "n", etc...
Expected output:
# Expected output:
output = "internet"

Assuming it's a prefix code (otherwise you'd need to define how to deal with ambiguities), accumulate the bits until you have a match, then output the letter and clear the bits:
output = ""
bits = ""
for bit in word:
bits += bit
if bits in my_dict:
letter = my_dict[bits]
output += letter
bits = ""
Try it online!
Slight variation of it the lookup, reminded by Jnevill's answer:
if letter := my_dict.get(bits):
output += letter

You could use a regular expression to substitutes the patterns with the corresponding letters. re.sub allows use of a function for the replacement which could be access to the dictionary to get the letters. The search pattern would need to have the longer values first so that they are "consumed" in priority over shorter patterns that could start with the same bits:
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
import re
pattern = "|".join(sorted(my_dict.keys(),key=len,reverse=True))
output = re.sub(pattern,lambda m:my_dict[m.group(0)],word)
print(output) # internet
[EDIT]
If there are no conflicts between short and long bit patterns, the sort is not needed (as Kelly pointed out), the solution could be a single line:
output = re.sub('|'.join(my_dict),lambda m:my_dict[m[0]],word)

Issue with your code:
for i in word: # here, i is a single character
# so you can't get corresponding value since it's multiple character keys
letter = my_dict[i]
output += letter # this would work fine
word = word.lstrip(letter)
You can do a while loop on word, and remove the part you found in the dict each time. When words is empty, you will stop looping and the program ends.
You can iterate over each key in the dict and test if it match the beginning of the word. If it does, you have the letter you are looking for. Do what you want instead of the print, and repeat.
translate_table = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
message = "1000001001100001100000100000110"
while message:
for code, letter in translate_table.items():
if message.startswith(code):
# replace this with whatever you want to do with the letter
print(letter, end="")
# "Cut" the word to keep the remaining characters
message = message[len(code):]
break # a letter was found, move to next while iteration

While iterating my_dict (as DorianTurba suggests) feels like a more elegant solution, your gut was suggesting that you should iterate word. To do this you can use a while loop and then manage the length of characters you jump in each iteration depending on the size of the my_dict key that matches the first 3, 4, or 5 characters in word.
Consider:
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
i=0
while len(word) > i:
for size in [3,4,5]:
if my_dict.get(word[i:i+size]):
print(my_dict[word[i:i+size]])
i += size
break

Related

How to make dictionary of alphabet and scrambled alphabet?

I need to create multiple dictionaries for another program using the alphabet as the key and a scrambled alphabet (the first one I have to do is BDFHJLCPRTXVZNYEIWGAKMUSQO) as the value. This is what I have so far but it says "unhashable type: 'list'"
_list = input("Enter list: ")
alpha = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
dict1 = {}
for i in range(0,len(_list)):
dict1[alpha][i] == _list[i]
print(dict1)

There are two errors in your code.
dict1[alpha][i] should be dict1[alpha[i]]
As pointed out by Paul Rooney, you cannot use a list as a dictionary key, but I don't think that is what you are trying to do. Instead, you just want the ith element of the list alpha.
the == should be =.
== is used to compare, = is used to assign a value.
Also, the pythonic way to do your iteration is as follows:
dict1 = {alpha[idx]: character for idx, character in enumerate(_list)}
which is the equivalent of
dict1 = {}
for idx, character in enumerate(_list):
dict1[alpha[idx]] = character

why does the replace character remove both brackets?

Hey guys in new to Python. And I was playing around writing python and I'm stuck.
words = ['w','hello.','my','.name.','(is)','james.','whats','your','name?']
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
'''#position of the invalid words i.e the ones that '''
inpos = -1
for word in words:
inpos = inpos + 1
#pass
for letter in word:
#print(letter)
if letter in alphabet:
pass
#print('Valid')
elif letter not in alphabet:
new_word = word.replace(letter,"")
print(word)
print(new_word)
words[inpos] = new_word
print(words)
This code is meant to clean the text (remove all full stops, commas, and other characters)
The problem is when I run it removes the adds the brackets
Heres the output:
Image of output
Can anyone explain why this is happening?

No, it does not add anything. You're printing both old and new word:
print(word)
print(new_word)
so when the new_word is (is the word is still (is).
BTW your code has a logical error: when you remove a character you put back new_word in the list, but word is still the old value. So only the last change for every word will be saved in the list words.

How can I pop this letter if it exists based on this algorithm?

I'm trying to pop any letters given by the user, for example if they give you a keyword "ROSES" then these letters should be popped out of the list.
Note: I have a lot more explanation after the SOURCE CODE
SOURCE CODE
alphabet = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
encrypted_message = []
key_word = str(input("Please enter a word:"))
key_word = list(key_word)
#print(key_word)
check_for_dup = 0
for letter in key_word:
for character in alphabet:
if letter in character:
check_for_dup +=1
alphabet.pop(check_for_dup)
print(alphabet)
print(encrypted_message)
SAMPLE INPUT
Let's say keyword is "Roses"
this what it gives me a list of the following ['A', 'C', 'E', 'G', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
But that's wrong It should of just removed the characters that existed in the keyword given by the user, like the word "Roses" each letter should of been removed and not in the list being popped. as you can see in the list the letters "B","D","F","H",etc were gone. What I'm trying to do is pop the index of the alphabet letters that the keyword exists.
this is what should of happened.
["A","B","C","D","F","G","H","I","J","K","L","M","N","P","Q","T","U","V","W","X","Y","Z"]
The letters of the keyword "ROSES" were deleted of the list

There is some shortcomings in your code here is an implementation that works:
alphabet = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
key_word = str(input("Please enter a word:"))
for letter in key_word.upper():
if letter in alphabet:
alphabet.remove(letter)
print(alphabet)
Explanations
You can iterate on a string, no need to cast it as a list
Use remove since you can use the str type directly
You need to .upper() the input because you want to remove A if the user input a
Note that I did not handle encrypted_message since it is unused at the moment.
Also, as some comments says you could use a set instead of a list since lookups are faster for sets.
alphabet = {"A","B","C","D",...}
EDIT
alphabet = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
key_word = str(input("Please enter a word:"))
encrypted_message = []
for letter in key_word.upper():
if letter in alphabet:
alphabet.remove(letter)
encrypted_message.append(letter)
encrypted_message.extend(alphabet)
This is a new implementation with the handling of your encrypted_message. This will keep the order of the alphabet after the input of the user. Also, if you're wondering why there's no duplicate, you will be appending only if letter is in alphabet which means the second time it won't be in alphabet and therefore not added to your encrypted_message.

you can directly check with input key
iterate all the letters in data, and check whether or not the letter in the input_key, if yes discard it
data = ['A', 'C', 'E', 'G', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
input_key = 'Roses'
output = [l for l in data if l not in input_key.upper()]
print output
['A', 'C', 'G', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

You can do something like this....
import string
alphabet = list(string.ascii_lowercase)
user_word = 'Roses'
user_word = user_word.lower()
letters_to_remove = set(list(user_word)) # Create a unique set of characters to remove
for letter in letters_to_remove:
alphabet.remove(letter)

alphabet = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
# this can be accepted as a user input as well
userinput = "ROSES"
# creates a list with unique values in uppercase
l = set(list(userinput.upper()))
# iterates over items in l to remove
# the corresponding items in alphabet
for x in l:
alphabet.remove(x)

' '.join(alphabet).translate({ord(c): "" for c in keyword}).split()

Try this:
import string
key_word = set(input("Please enter a word:").lower())
print(sorted(set(string.ascii_lowercase) - key_word))
Explanation:
When checking for (in)existence, it's better to use set. https://docs.python.org/3/tutorial/datastructures.html#sets
Instead of iterating over it N times (N = number of input characters), you hash the character N times and check if there's already something at the result hash or not.
If you check the speed, try with "WXYZZZYW" and you'll see that it'll be a lot slower than if it were "ABCDDDCA" with the list way. With set, it will be always the same time.
The rest is pretty trivial. Casting to lowercase (or uppercase), to make sure it hits a match, case insensitive.
And then, we end by doing a set difference (-). It's all the items that are in the first set but not in the second one.

Why is my implementation of binary search very inefficient?

I am doing a Python exercise to search a word from a given sorted wordlist, containing more than 100,000 words.
When using bisect_left from the Python bisect module, it is very efficient, but using the binary method created by myself is very inefficient. Could anyone please clarify why？
This is the searching method using the Python bisect module:
def in_bisect(word_list, word):
"""Checks whether a word is in a list using bisection search.
Precondition: the words in the list are sorted
word_list: list of strings
word: string
"""
i = bisect_left(word_list, word)
if i != len(word_list) and word_list[i] == word:
return True
else:
return False
My implementation is really very inefficient (don't know why):
def my_bisect(wordlist,word):
"""search the given word in a wordlist using
bisection search, also known as binary search
"""
if len(wordlist) == 0:
return False
if len(wordlist) == 1:
if wordlist[0] == word:
return True
else:
return False
if word in wordlist[len(wordlist)/2:]:
return True
return my_bisect(wordlist[len(wordlist)/2:],word)

if word in wordlist[len(wordlist)/2:]
will make Python search through half of your wordlist, which is kinda defeating the purpose of writing a binary search in the first place. Also, you are not splitting the list in half correctly. The strategy for binary search is to cut the search space in half each step, and then only apply the same strategy to the half which your word could be in. In order to know which half is the right one to search, it is critical that the wordlist is sorted. Here's a sample implementation which keeps track of the number of calls needed to verify whether a word is in wordlist.
import random
numcalls = 0
def bs(wordlist, word):
# increment numcalls
print('wordlist',wordlist)
global numcalls
numcalls += 1
# base cases
if not wordlist:
return False
length = len(wordlist)
if length == 1:
return wordlist[0] == word
# split the list in half
mid = int(length/2) # mid index
leftlist = wordlist[:mid]
rightlist = wordlist[mid:]
print('leftlist',leftlist)
print('rightlist',rightlist)
print()
# recursion
if word < rightlist[0]:
return bs(leftlist, word) # word can only be in left list
return bs(rightlist, word) # word can only be in right list
alphabet = 'abcdefghijklmnopqrstuvwxyz'
wl = sorted(random.sample(alphabet, 10))
print(bs(wl, 'm'))
print(numcalls)
I included some print statements so you can see what is going on. Here are two sample outputs. First: word is in the wordlist:
wordlist ['b', 'c', 'g', 'i', 'l', 'm', 'n', 'r', 's', 'v']
leftlist ['b', 'c', 'g', 'i', 'l']
rightlist ['m', 'n', 'r', 's', 'v']
wordlist ['m', 'n', 'r', 's', 'v']
leftlist ['m', 'n']
rightlist ['r', 's', 'v']
wordlist ['m', 'n']
leftlist ['m']
rightlist ['n']
wordlist ['m']
True
4
Second: word is not in the wordlist:
wordlist ['a', 'c', 'd', 'e', 'g', 'l', 'o', 'q', 't', 'x']
leftlist ['a', 'c', 'd', 'e', 'g']
rightlist ['l', 'o', 'q', 't', 'x']
wordlist ['l', 'o', 'q', 't', 'x']
leftlist ['l', 'o']
rightlist ['q', 't', 'x']
wordlist ['l', 'o']
leftlist ['l']
rightlist ['o']
wordlist ['l']
False
4
Note that if you double the size of the wordlist, i.e. use
wl = sorted(random.sample(alphabet, 20))
numcalls on average will be only one higher than for a wordlist of length 10, because wordlist has to be split in half only once more.

to search if a word is in a wordlist simply (python 2.7):
def bisect_fun(listfromfile, wordtosearch):
bi = bisect.bisect_left(listfromfile, wordtosearch)
if listfromfile[bi] == wordtosearch:
return listfromfile[bi], bi

Is that a tag list or something else?

I am new to NLP and NLTK, and I want to find ambiguous words, meaning words with at least n different tags. I have this method, but the output is more than confusing.
Code:
def MostAmbiguousWords(words, n):
# wordsUniqeTags holds a list of uniqe tags that have been observed for a given word
wordsUniqeTags = {}
for (w,t) in words:
if wordsUniqeTags.has_key(w):
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
else:
wordsUniqeTags[w] = set([t])
# Starting to count
res = []
for w in wordsUniqeTags:
if len(wordsUniqeTags[w]) >= n:
res.append((w, wordsUniqeTags[w]))
return res
MostAmbiguousWords(brown.tagged_words(), 13)
Output:
[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]
Now I have no idea what B,C,Q, ect. could represent. So, my questions:
What are these?
What do they mean? (In case they are tags)
I think they are not tags, because who and whats don't have the WH tag indicating "wh question words".
I'll be happy if someone could post a link that includes a mapping of all possible tags and their meaning.

It looks like you have a typo. In this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
you should have set([t]) (not set(t)), like you do in the else case.
This explains the behavior you're seeing because t is a string and set(t) is making a set out of each character in the string. What you want is set([t]) which makes a set that has t as its element.
>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W']) # bad
>>> set([t])
set(['WHQ']) # good
By the way, you can correct the problem and simplify things by just changing that line to:
wordsUniqeTags[w].add(t)
But, really, you should make use of the setdefault method on dict and list comprehension syntax to improve the method overall. So try this instead:
def most_ambiguous_words(words, n):
# wordsUniqeTags holds a list of uniqe tags that have been observed for a given word
wordsUniqeTags = {}
for (w,t) in words:
wordsUniqeTags.setdefault(w, set()).add(t)
# Starting to count
return [(word,tags) for word,tags in wordsUniqeTags.iteritems() if len(tags) >= n]

You are splitting your POS tags into single characters in this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
set('AT') results in set(['A', 'T']).

How about making use of the Counter and defaultdict functionality in the collections module?
from collection import defaultdict, Counter
def most_ambiguous_words(words, n):
counts = defaultdict(Counter)
for (word,tag) in words:
counts[word][tag] += 1
return [(w, counts[w].keys()) for w in counts if len(counts[word]) > n]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Loop over string based on keys in dictionary - python

Related

How to make dictionary of alphabet and scrambled alphabet?

why does the replace character remove both brackets?

How can I pop this letter if it exists based on this algorithm?

Why is my implementation of binary search very inefficient?

Is that a tag list or something else?

Categories

Resources