Soundex algorithm in Python (homework help request) - python

The US census bureau uses a special encoding called “soundex” to locate information about a person. The soundex is an encoding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.
In this lab you will design, code, and document a program that produces the soundex code when input with a surname. A user will be prompted for a surname, and the program should output the corresponding code.
Basic Soundex Coding Rules
Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.
Soundex Coding Guide
Soundex assigns a number for various consonants. Consonants that sound alike are assigned the same number:
Number  Consonants
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R
Soundex disregards the letters A, E, I, O, U, H, W, and Y.
There are 3 additional Soundex Coding Rules that are followed. A good program design would implement these each as one or more separate functions.
Rule 1. Names With Double Letters
If the surname has any double letters, they should be treated as one letter. For example:
Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
Rule 2. Names with Letters Side-by-Side that have the Same Soundex Code Number
If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:
Pfister is coded as P236 (P, F ignored since it is considered same as P, 2 for the S, 3 for the T, 6 for the R).
Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).
Rule 3. Consonant Separators
3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:
Tymczak is coded as T522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.
3.b. If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right is not coded. Example:
Ashcraft is coded A261 (A, 2 for the S, C ignored since same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.
So far this is my code:
surname = raw_input("Please enter surname:")
outstring = ""
outstring = outstring + surname[0]
for i in range (1, len(surname)):
    nextletter = surname[i]
    if nextletter in ['B','F','P','V']:
        outstring = outstring + '1'
    elif nextletter in ['C','G','J','K','Q','S','X','Z']:
        outstring = outstring + '2'
    elif nextletter in ['D','T']:
        outstring = outstring + '3'
    elif nextletter in ['L']:
        outstring = outstring + '4'
    elif nextletter in ['M','N']:
        outstring = outstring + '5'
    elif nextletter in ['R']:
        outstring = outstring + '6'
print outstring
This sufficiently does what it is asked to; I am just not sure how to code the three rules. That is where I need help, so any help is appreciated.

I would suggest you try the following.
Store a CurrentCoded and LastCoded variable to work with before appending to your output.
Break down the system into useful functions, such as
Boolean IsVowel(Char)
Int Coded(Char)
Boolean IsRule1(Char, Char)
Once you break it down nicely it should become easier to manage.
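A minimal Python sketch of that breakdown, covering the three rules from the question. The names code_of, is_separator and soundex are illustrative (they are not part of the assignment), and treating Y as a vowel-like separator is an assumption, since the rules only say that Y is disregarded. The variable last_code plays the role of LastCoded.

```python
# Build the letter -> digit table from the coding guide.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def code_of(ch):
    """Return the Soundex digit for a consonant, or '' for a disregarded letter."""
    return CODES.get(ch.upper(), "")


def is_separator(ch):
    """Vowels separate equal codes (rule 3.a). Including Y here is an assumption."""
    return ch.upper() in "AEIOUY"


def soundex(surname):
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    # Start from the first letter's code so that e.g. the F in Pfister is skipped.
    last_code = code_of(letters[0])
    for ch in letters[1:]:
        code = code_of(ch)
        if code:
            # Rules 1 and 2: a repeated code is only emitted once.
            if code != last_code:
                result += code
            last_code = code
        elif is_separator(ch):
            # Rule 3.a: after a vowel, the same code may be emitted again.
            last_code = ""
        # Rule 3.b: H and W fall through and leave last_code untouched.
        if len(result) == 4:
            break
    # Pad with zeros to always produce four characters.
    return (result + "000")[:4]


for name in ["Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"]:
    print(name, soundex(name))
```

Keeping the previous code in one variable lets rules 1, 2 and 3 all reduce to the question "does this letter reset, keep, or replace last_code?".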

This is hardly perfect (for instance, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently-testable functions, so it's not really going to serve as an answer to the homework question. But this is how I'd implement it:
import re

def soundex_prepare(s):
    """Prepare string for Soundex encoding.

    Remove non-alpha characters (and the not-of-interest W/H/Y),
    convert to upper case, and remove all runs of repeated letters."""
    p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
    s = re.sub(p, "", s).upper()
    for c in set(s):
        s = re.sub(c + "{2,}", c, s)
    return s

def soundex_encode(s):
    """Encode a name string using the Soundex algorithm."""
    result = s[0].upper()
    s = soundex_prepare(s[1:])
    letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
    codes = '.123.12.22455.12623.122'
    d = dict(zip(letters, codes))
    prev_code = ""
    for c in s:
        code = d[c]
        if code != "." and code != prev_code:
            result += code
            if len(result) >= 4: break
        prev_code = code
    return (result + "0000")[:4]

import itertools  # for the groupby function
import re  # for the sub function

surname = input("Enter surname of the author: ")  # ask the user for the author's surname
while surname != "":  # loop for as long as the input is not an empty line
    str_ini = surname[0]  # initial letter of the surname
    mod_str1 = surname[1:]  # surname without its first letter
    mod_str2 = re.sub(r'[aeiouyhwAEIOUYHW]', '', mod_str1)  # drop the disregarded letters
    # substitute the remaining letters with the digits required by the soundex algorithm
    mod_str21 = re.sub(r'[bfpvBFPV]', '1', mod_str2)
    mod_str22 = re.sub(r'[cgjkqsxzCGJKQSXZ]', '2', mod_str21)
    mod_str23 = re.sub(r'[dtDT]', '3', mod_str22)
    mod_str24 = re.sub(r'[lL]', '4', mod_str23)
    mod_str25 = re.sub(r'[mnMN]', '5', mod_str24)
    mod_str26 = re.sub(r'[rR]', '6', mod_str25)
    mod_str3 = str_ini.upper() + mod_str26  # prepend the initial to the encoded remainder
    # groupby collapses each run of identical characters into a single character,
    # and join turns the result back into a string
    mod_str4 = ''.join(char for char, rep in itertools.groupby(mod_str3))
    mod_str5 = mod_str4[:4]  # keep at most four characters
    # pad with trailing zeros up to four characters
    if len(mod_str5) == 1:
        print(mod_str5 + "000\n")
    elif len(mod_str5) == 2:
        print(mod_str5 + "00\n")
    elif len(mod_str5) == 3:
        print(mod_str5 + "0\n")
    else:
        print(mod_str5 + "\n")
    print("Press enter to exit")  # an empty line ends the while loop
    surname = input("Enter surname of the author: ")  # ask for the next surname
exit(0)  # exit the program once the while loop ends

Related

How to filter a list of words as likely candidates for a secret word, given that I know certain letters

So I've been really struggling with this problem. First of all let's lay-out the basic rules that this Python program must follow:
Lingo is a popular word-guessing game show on television. The number
of letters of a target word to be guessed is given, and often also the
first letter. Players then make guesses subject to these restrictions
(number of letters and possibly also first letter), and the game tells
them which letters are correct and in the correct place, marked by a
red square (X), and which letters are correct but not in the correct
place, marked by a yellow circle (O). We do not use superfluous yellow
circles, i.e. a letter is marked correct at most as often as it
appears in the target word. If not all occurrences of the same letter
can get a yellow circle this way, priority is given from left to right
(but of course red squares have priority over yellow circles).
First I needed to create a function compare that compares a guessed word with a target word. The two inputs are strings of the same length that consist entirely of lowercase ASCII letters. The output is a string of the same length consisting of the symbols X, O and -, where X represents a red square, O represents a yellow circle and - represents nothing.
Examples:
compare("health", "teethe") must return "OX--O-",
compare("rhythm", "teethe") must return "---XX-",
compare("mutate", "teethe") must return "--O-OX",
compare("teethe", "mutate") must return "O--O-X",
Now I've already successfully solved this part, but the next part is where I am stuck.
A function filter_targets(targets, guess_results) must be implemented.
This function must satisfy the following:
● targets is a list (or any other iterable) of possible target words; all words in targets must be of the same length.
● guess_results is a dictionary. Its keys are guessed words (which need not be in the target list) whose associated values are their compare results with a possible target.
● It must return a list of target words in targets that actually do satisfy all
the comparisons appearing in guess_results .
Examples:
Suppose targets contains all English 6-letter words obtained using the aforementioned word list and load function. Let guess_results contain the results of the first two guesses in the example given:
guess_results = {"health": "OX--O-", "rhythm": "---XX-"}
Then based on these results, we can see which words are still possible:
filter_targets(targets, guess_results) will return the following list:
["depths", "peitho", "seethe", "teethe", "tenths"]
Now my current program looks like this, it is very dirty code in my opinion and I would really love to see some implementations that are right. This code also still doesn't do what I want it to.
My reasoning was that if I convert the word or words in guess_results (it must also be able to take a single input, as my code hereunder is taking with the mem_list function) to the "XO-" format, then store all permutations of this format in a list, I can convert the words in wordlist to the format too. I would then check for every permutation in the list if it matches the "XO-" format of any word in the wordlist I am looping through, then store all matches in a separate list. I would then return that list as an answer.
I thought I was getting very close with this logic, but I now can't seem to permute in a way that leaves some characters in place. And my current implementation would only work for a single word input.
from itertools import permutations

def load_words(file):
    result = set()
    with open(file) as f:
        for line in f.readlines():
            word = line.strip().lower()
            if word.isalpha() and word.isascii():
                result.add(word)
    return sorted(result)

english_words = load_words("words.txt")
dutch_words = load_words("wordlist.txt")
english_10 = [word for word in dutch_words if len(word) == 10]

def filter_targets(targets, guess_results):
    mem_list = ["-" for x in range(len("kasgelden"))]
    perm_list = mem_list
    # Checking X
    for guess in guess_results.keys():
        for idx in range(len(guess)):
            if guess_results[guess][idx] == 'X':
                mem_list[idx] = guess[idx]
    o_list = []
    # Checking O
    for guess in guess_results.keys():
        for idx in range(len(guess)):
            if guess_results[guess][idx] == 'O':
                mem_var = guess[idx]
                o_list.append(mem_var)
    add_list = mem_list
    o_min = 0
    o_max = len(o_list)
    y = 0
    # Adding the "O"
    for element in add_list:
        if element == "-" and o_min < o_max:
            add_list[y] = o_list[o_min]
            o_min += 1
            y += 1
        else:
            y += 1
    check_list = list(permutations(add_list))

guess_results = {"kasgelden": "--O--OOX-"}
d6 = [word for word in dutch_words if len(word) == 9]
filter_targets(d6, guess_results)
In this single-word example, filter_targets(dw9, {"kasgelden": "--O--OOX-"}) should return ["doorspoel", "doublures", "hoofdstel", "moordspel",
"schildjes", "slurfdier", "thuisduel", "woordspel"], with dw9 being all the Dutch words with a length of 9 (called d6 in the code above).
This problem is driving me completely crazy and has consumed three days of free time already, I hope someone can nudge me in the right direction here!
English wordlist I am using: https://raw.githubusercontent.com/dwyl/english-words/master/words.txt
Dutch wordlist I am using: https://raw.githubusercontent.com/OpenTaal/opentaal-wordlist/master/wordlist.txt
The guess results tell you two types of things:
For each position, is the guessed letter required or excluded.
For each guessed letter, does it appear in the correct answer.
You can track the first thing by maintaining a regex.
You can track the second thing with two strings, one containing needed characters, one containing forbidden ones.
You can build those as follows:
# wordlen is the length of the target words
regex = ["[^^]"] * wordlen
required = restricted = ''
for guess, result in guess_results.items():  # .items() is needed to get (guess, result) pairs
    needs = never = ''
    for i, score in enumerate(result):
        letter = guess[i]
        if score == 'X':
            regex[i] = letter
            needs += letter
        elif score == 'O':
            regex[i] = regex[i].replace(']', letter + ']')
            needs += letter
        else:
            regex[i] = regex[i].replace(']', letter + ']')
            never += letter
    # don't reject a '-' found on doubled letter if one was needed.
    restricted += ''.join(c for c in never if c not in needs)
    # track the correct number of required letters.
    required = ''.join([c*max(required.count(c), needs.count(c)) for c in set(required)] +
                       [c for c in needs if c not in required])
Then filtering the wordlist becomes
import re

# wordlist is the list of candidate target words
valid = [w for w in wordlist if (
    re.match(''.join(regex), w) and
    all(w.count(c) >= required.count(c) for c in required) and
    not any(c in w for c in restricted))]
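A simpler alternative to the regex bookkeeping, sketched here under the assumption that a working compare is available: a target is still possible exactly when comparing each guess against it reproduces the recorded result. The asker's own compare is not shown in the question, so the two-pass version below is only one possible implementation of the stated rules.

```python
from collections import Counter


def compare(guess, target):
    """Score guess against target: X = right place, O = wrong place, - = nothing."""
    result = ["-"] * len(guess)
    # Target letters that were not matched exactly remain available for O marks.
    remaining = Counter(t for g, t in zip(guess, target) if g != t)
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = "X"
    # Hand out O marks from left to right, at most as often as the letter remains.
    for i, g in enumerate(guess):
        if result[i] == "-" and remaining[g] > 0:
            result[i] = "O"
            remaining[g] -= 1
    return "".join(result)


def filter_targets(targets, guess_results):
    """Keep the targets that reproduce every recorded guess result."""
    return [t for t in targets
            if all(compare(g, t) == r for g, r in guess_results.items())]


# A tiny hand-picked word list stands in for the real 6-letter word list.
sample = ["depths", "health", "peitho", "rhythm", "seethe", "teethe", "tenths"]
print(filter_targets(sample, {"health": "OX--O-", "rhythm": "---XX-"}))
```

This costs one compare call per target and guess, which should be cheap for word lists of this size, and it avoids generating permutations altogether.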

Splitting String Variable Into New Variables

I am creating a script that splits the user input into letters, I have gotten this part down, however, how do I turn these separated letters into individual variables?
message = input("Enter Message) ")
list = list(message)
print(list)
Whilst this does print out the typed string into letters, I want to know how to turn those split letters into their own variables. e.g (h, e, l, l, o) Is there a way that I can, for example, only print the first letter or the second letter? (so that the letters are split into their own variables)
You can treat the list as a set of 'own variables' (accessing them by index).
message = input("Enter Message) ")
l = list(message) # avoid built-in names such as 'list' for variable names
print(l)
print(l[0]) # prints the 1st letter
print(l[1]) # prints the 2nd letter
print(l[-1]) # prints the last letter
print(l[-2]) # prints the letter prior to the last
Just adding some examples:
message = input("Enter Message) ")
message_characters = list(message) # do not use single characters as variable names

def ordinal(n):
    # 11, 12 and 13 take "th"; otherwise the last digit picks st/nd/rd/th
    suffix = "th" if 11 <= n % 100 <= 13 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

for i, char in enumerate(message_characters, start=1):
    print(f'The {ordinal(i)} character is {char}')
# Though note that strings are also iterable:
for i, char in enumerate(message, start=1):
    print(f'The {ordinal(i)} character is {char}')
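If separate names really are wanted, one possible sketch is sequence unpacking, with a dictionary as a fallback when the number of letters is not known in advance. The fixed message and the key pattern letter_1, letter_2, ... are made up for illustration.

```python
message = "hello"  # stands in for input("Enter Message) ")

# Unpack the first two letters into their own names; the starred name
# collects whatever is left (this raises ValueError for fewer than 2 letters).
first, second, *rest = message
print(first, second, rest)

# For an unknown number of letters, a dict keyed by position avoids
# creating variables dynamically.
letters = {f"letter_{i}": ch for i, ch in enumerate(message, start=1)}
print(letters["letter_1"])
```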

checking if string only contains certain letters in Python

I'm trying to write a program that completes the MU game
https://en.wikipedia.org/wiki/MU_puzzle
Basically I'm stuck with ensuring that the user input contains ONLY M, U and I characters.
I've written
alphabet = ('abcdefghijklmnopqrstuvwxyz')
string = input("Enter a combination of M, U and I: ")
if "M" and "U" and "I" in string:
    print("This is correct")
else:
    print("This is invalid")
I only just realised this doesn't work because it's not exclusive to just M, U and I. Can anyone give me a hand?
if all(c in "MIU" for c in string):
Checks to see if every character of the string is one of M, I, or U.
Note that this accepts an empty string, since every character of it is either an M, I, or a U, there just aren't any characters in "every character." If you require that the string actually contain text, try:
if string and all(c in "MIU" for c in string):
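To make the difference concrete, here is a small sketch that wraps both checks in functions. The names is_valid_mu and original_check are made up for illustration; the set comparison should be equivalent to the all(...) test above.

```python
def is_valid_mu(s):
    """True if s is non-empty and consists only of the letters M, I and U."""
    return bool(s) and set(s) <= set("MIU")


def original_check(s):
    # Parses as ("M") and ("U") and ("I" in s). Non-empty string literals are
    # always truthy, so only the final membership test decides the result.
    return bool("M" and "U" and "I" in s)


for candidate in ["MIU", "MIUX", "XYZI", "MU", ""]:
    print(repr(candidate), is_valid_mu(candidate), original_check(candidate))
```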
If you're a fan of regex you can do this, to remove any characters that aren't m, u, or i
import re
starting = "jksahdjkamdhuiadhuiqsad"
fixedString = re.sub(r"[^mui]", "" , starting)
print(fixedString)
#output: muiui
A simple program that achieves your goal with primitive structures:
valid = "IMU"
chaine = input ('enter a combination of letters among ' + valid + ' : ')
test=True
for caracter in chaine:
    if caracter not in valid:
        test = False
if test :
    print ('This is correct')
else:
    print('This is not valid')

How to make a program separate words by the end of the word?

I know the question sounds a little tricky or not really clear but I need a program, which would separate names. Since I am not from an English speaking country, our names either end in s (for males) or in e and a (for girls)
How do I make Python separate words by their last letter?
I guess this would explain more.
Like there are three names: "Jonas", "Giedre", "Anastasija".
And I need the program to print out like this
MALE: Jonas
FEMALE: Anastasija, Giedre
I started up the program and so far I have this:
mname = []
fname = []
name = input("Enter a name: ")
That's really all I can understand. Because I'm not familiar with how to work the if function with the last letter.
You could use negative indexes to access the last element of the string
name = input("Enter a name: ")
if name[-1] in ('a','e'):
    fname.append(name)
elif name[-1] == 's':
    mname.append(name)
As you can see, -1 is the last character of a string.
Quoting from the Python tutorial:
Indices may also be negative numbers, to start counting from the
right:
>>> word[-1]  # last character
'n'
If you're entering all the names at once, you'll need to split them on whitespace before checking the endings:
names = input('Enter names: ') # string of names to split by spaces
fname = [n for n in names.split() if n[-1] in ('a', 'e')] # females
mname = [n for n in names.split() if n[-1] == 's'] # now male names
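Putting the pieces together, a sketch that produces the output format from the question. The function name group_by_ending is made up, and sorting each group alphabetically is an assumption based on the sample output. str.endswith accepts a tuple of suffixes and, unlike name[-1], does not fail on an empty string.

```python
def group_by_ending(names):
    """Split names into (male, female) lists based on their last letter."""
    male = sorted(n for n in names if n.lower().endswith("s"))
    female = sorted(n for n in names if n.lower().endswith(("a", "e")))
    return male, female


male, female = group_by_ending(["Jonas", "Giedre", "Anastasija"])
print("MALE:", ", ".join(male))
print("FEMALE:", ", ".join(female))
```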

String iteration matched to lookup table

A newbie question: I have to iterate a name then associate each letter with a number beginning with a=1, b=2, c=3, etc. and then sum the numbers. I've gotten this far but no farther:
def main():
    name = input("Enter name ")
    sum = 0
    for ch in name:
        # ?
How about this?
def main():
    print(sum(ord(c.lower()) - ord('a') + 1 for c in input("Enter name: ")))
This will work even if you're dealing with both uppercase and lowercase letters. If you'll only be dealing with lowercase, you can change c.lower() to c (it will still work as is, of course, but making that change will make it faster if you are only working with lowercase letters).
Create a dictionary mapping characters to values, then use the get() method with a default of 0 on the current character.
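A short sketch of that dictionary approach. The names LETTER_VALUES and name_value are made up for illustration. Characters that are not letters (spaces, hyphens) contribute 0 because of the get() default.

```python
import string

# Map 'a' -> 1, 'b' -> 2, ..., 'z' -> 26.
LETTER_VALUES = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}


def name_value(name):
    """Sum the letter values of name, ignoring anything that is not a-z."""
    return sum(LETTER_VALUES.get(ch, 0) for ch in name.lower())


print(name_value("abc"))
```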
