Check if a string is a possible abbreviation for a name - python

I'm trying to develop a Python algorithm to check whether a string could be an abbreviation for another word. For example:
fck is a match for fc kopenhavn because it matches the first characters of the words. fhk would not match.
fco should not match fc kopenhavn because no one irl would abbreviate FC Kopenhavn as FCO.
irl is a match for in real life.
ifk is a match for ifk goteborg.
aik is a match for allmanna idrottskluben.
aid is a match for allmanna idrottskluben. This is not a real team name abbreviation, but I guess it is hard to exclude it unless you apply domain-specific knowledge on how Swedish abbreviations are formed.
manu is a match for manchester united.
It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.
Update: I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase, so it is not as easy as just checking which letters are uppercased.

This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:
The first letter of the abbreviation must match the first letter of the text.
The rest of the abbreviation (the abbrev minus the first letter) must be an abbreviation for:
the remaining words, or
the remaining text starting from any position in the first word.
tests=(
    ('fck','fc kopenhavn',True),
    ('fco','fc kopenhavn',False),
    ('irl','in real life',True),
    ('irnl','in real life',False),
    ('ifk','ifk gotebork',True),
    ('ifko','ifk gotebork',False),
    ('aik','allmanna idrottskluben',True),
    ('aid','allmanna idrottskluben',True),
    ('manu','manchester united',True),
    ('fz','faz zoo',True),
    ('fzz','faz zoo',True),
    ('fzzz','faz zoo',False),
)
def is_abbrev(abbrev, text):
    abbrev=abbrev.lower()
    text=text.lower()
    words=text.split()
    if not abbrev:
        return True
    if abbrev and not text:
        return False
    if abbrev[0]!=text[0]:
        return False
    else:
        return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
                any(is_abbrev(abbrev[1:],text[i+1:])
                    for i in range(len(words[0]))))
for abbrev,text,answer in tests:
    result=is_abbrev(abbrev,text)
    print(abbrev,text,result,answer)
    assert result==answer

Here's a way to accomplish what you seem to want to do:
import re
def is_abbrev(abbrev, text):
    pattern = ".*".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None
The caret makes sure that the first character of the abbreviation matches the first character of the text, which should be true for most abbreviations.
Edit:
Your new update changed the rules a bit. By using "(|.*\s)" instead of ".*", the characters in the abbreviation will only match if they are next to each other, or if the next character appears at the start of a new word.
This will correctly match fck with FC Kopenhavn, but fco will not match.
However, matching aik with allmanna idrottskluben will not work, as that requires knowledge of the Swedish language and is not as trivial to do.
Here's the new code with the minor modification:
import re
def is_abbrev(abbrev, text):
    pattern = r"(|.*\s)".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None
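A quick way to see how the word-boundary pattern behaves is to run it over the examples from the question. The sketch below is an illustration rather than the answer author's exact code: the helper name `is_abbrev_regex` is made up for this example, and the `re.escape` call is an added assumption so that a character such as `.` in an abbreviation is treated literally.

```python
import re

def is_abbrev_regex(abbrev, text):
    # Between two abbreviation letters allow either nothing (adjacent letters)
    # or any run of characters ending in whitespace (jump to a word start).
    pattern = r"(|.*\s)".join(re.escape(c) for c in abbrev.lower())
    # re.match already anchors at the start of the text.
    return re.match(pattern, text.lower()) is not None

examples = [
    ("fck", "fc kopenhavn"),
    ("fco", "fc kopenhavn"),
    ("irl", "in real life"),
    ("manu", "manchester united"),
    ("aik", "allmanna idrottskluben"),
]
for abbrev, text in examples:
    print(abbrev, "|", text, "|", is_abbrev_regex(abbrev, text))
```

With these inputs, fck, irl and manu are accepted, while fco and aik are rejected, which matches the limitation described for aik.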

Ocaso Protal asked in a comment how you should decide that aik is valid but aid is not, and he is right.
The algorithm that came to my mind is to work with a word threshold (the number of words separated by spaces).
def make_abbrev(string):
    words = string.strip().split()
    if len(words) > 2:
        # take the first letter of every word
        return ''.join(w[0] for w in words)
    elif len(words) == 2:
        # take two letters from the first word and one letter from the other
        return words[0][:2] + words[1][0]
    else:
        # single word: take the first three letters, or whatever you like
        return string.strip()[:3]
You have to define your own logic; you can't find an abbreviation blindly.

Your algorithm seems simple: the abbreviation is the concatenation of all uppercase letters.
So:
upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"
abbrevation = ""
for letter in word_i_want_to_check:
    if letter in upper_case_letters:
        abbrevation += letter
for abb in _list_of_abbrevations:
    if abb == abbrevation:
        great_success()

This might be good enough.
def is_abbrevation(abbrevation, word):
    lowword = word.lower()
    lowabbr = abbrevation.lower()
    for c in lowabbr:
        if c not in lowword:
            return False
    return True
print(is_abbrevation('fck', 'FC Kopenhavn'))
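Note that a plain membership test accepts the letters in any order, so 'kcf' would also count as an abbreviation of 'FC Kopenhavn'. If order matters, a stricter variant is a subsequence check. The following is a minimal sketch of that idea, not the answer author's method; the name `is_ordered_abbreviation` is made up for illustration.

```python
def is_ordered_abbreviation(abbreviation, word):
    # Consume the word left to right; each abbreviation letter must be found
    # after the position where the previous one was found.
    remaining = iter(word.lower())
    return all(c in remaining for c in abbreviation.lower())

print(is_ordered_abbreviation('fck', 'FC Kopenhavn'))  # True
print(is_ordered_abbreviation('kcf', 'FC Kopenhavn'))  # False
```

This is still looser than the rules in the question (it does not care about word starts), but it rules out reordered letters.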

Related

How can I check if a string contains only English characters, exclamation marks at the end?

For example, "hello!!" should return true, whereas "45!!","!!ok" should return false. The only case where it should return true is when the string has English characters (a-z) with 0 or more exclamation marks in the end.
The following is my solution using an iterative method. However, I want to know some clean method having fewer lines of code (maybe by using some Python library).
def fun(str):
    i=-1
    for i in range(0,len(str)):
        if str[i]=='!':
            break
        elif (str[i]>='a' and str[i]<='z'):
            continue
        else:
            return 0
    while i<len(str):
        if(str[i]!='!'):
            return 0
        i+=1
    return 1
print(fun("hello!!"))
Regex can help you out here.
The regular expression you're looking for here is:
^[a-z]+!*$
This allows one or more lowercase English letters (use ^[a-zA-Z]+!*$ to accept uppercase as well, or add any other characters you like inside the square brackets),
followed by zero or more exclamation marks at the end of the word.
Wrapping it up in Python code:
import re
pattern = re.compile(r'^[a-z]+!*$')
word = pattern.search("hello!!")
print(f"Found word: {word.group()}")
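To get the True/False result the question asks for, the same pattern can be wrapped in a small predicate. This is a sketch under the assumption that only lowercase a-z counts as "English characters"; the names `LETTERS_THEN_BANGS` and `is_letters_then_bangs` are made up for the example. It uses `fullmatch`, which also avoids the quirk that `$` can match just before a trailing newline.

```python
import re

LETTERS_THEN_BANGS = re.compile(r'[a-z]+!*')

def is_letters_then_bangs(s):
    # fullmatch requires the whole string to fit the pattern,
    # so no explicit ^ and $ anchors are needed.
    return LETTERS_THEN_BANGS.fullmatch(s) is not None

for s in ["hello!!", "hello", "45!!", "!!ok"]:
    print(repr(s), is_letters_then_bangs(s))
```

Only the first two strings are accepted here.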

Extract words from a string

Sample Input:
'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
Expected Output:
['D3H6', 'X30G', 'Y2A', '12H89']
My code:
split_note = re.split(r'[.;,\s]\s*', note)
pattern = re.compile("^[a-zA-Z0-9]+$")
#if pattern.match(ini_str):
for a in n2:
    if pattern.match(a):
        alphaList.append(a)
I need to extract all the alpha numeric words from a split string and store them in a list.
The above code is unable to give expected output.
Maybe this can solve the problem:
import re
# input string
stri = "Part model D3H6 with specifications X30 and Y2 is having features 12H89"
# word tokenization
split = re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", stri)
# this statement returns words containing both numbers and letters
print([word for word in split if re.match(r'^(?=.*[a-zA-Z])(?=.*[0-9])', word)])
# output: ['D3H6', 'X30', 'Y2', '12H89']
^ and $ anchor the beginning and end of a line, not of a word.
Besides, your example words don't include lowercase letters, so why add a-z?
Considering your example, if what you need is to fetch a word that always ends with a digit, this is the pattern (note that it does not by itself require a letter, so a purely numeric word of two or more digits would also match):
\b[0-9A-Z]+\d+\b
If it may end with a letter rather than a digit, but still requires at least one digit and one letter, then it gets more complex; one lookahead per requirement does it:
\b(?=[0-9A-Z]*\d)(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b
\b stands for a word boundary.
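As a check against the sample input from the question, here is a small sketch of the "at least one letter and at least one digit" idea written with two lookaheads. The names `note` and `MIXED_TOKEN` are chosen for this example, and the pattern assumes that the codes consist only of uppercase letters and digits.

```python
import re

note = ('note - Part model D3H6 with specifications X30G and Y2A '
        'is having features 12H89.')

# Whole word of uppercase letters/digits that contains at least one
# letter and at least one digit (each lookahead checks one requirement).
MIXED_TOKEN = re.compile(r'\b(?=[0-9A-Z]*[A-Z])(?=[0-9A-Z]*[0-9])[0-9A-Z]+\b')

codes = MIXED_TOKEN.findall(note)
print(codes)
# ['D3H6', 'X30G', 'Y2A', '12H89']
```

Because `findall` is applied to the whole sentence, no prior splitting step is needed.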

Unable to properly parse a string via character modification

I'm running into an issue where my Python function, which is meant to add an underscore before each capital letter, does not return the right result, and I'm not sure where I'm going wrong. In the output, only the "courseID" word in the string is touched, whereas the other two words are not.
I thought cycling through the letters in a word and looking for capitalized letters would work, but it doesn't appear to. Could someone let me know where my code might be going wrong?
def parse_variables(string):
    new_string=''
    for letter in string:
        if letter.isupper():
            pos=string.index(letter)
            parsed_string=string[:pos] + '_' + string[pos:]
            new_string=''.join(parsed_string+letter)
        else:
            new_string=''.join(letter)
            # new_string=''.join(letter)
    return new_string.lower()
parse_variables("courseID pathID apiID")
Current output is a single letter lowercase d and the expected output should be course_id path_id api_id.
The issue with your revised code is that index only finds the first occurrence of the capital letter in the string. Since you have repeated instances of the same capital letters, the function never finds the subsequent instances. You could simplify your approach and avoid this issue by simply concatenating the letters with or without underscores depending on whether they are uppercase as you iterate.
For example:
def underscore_caps(s):
    result = ''
    for c in s:
        if c.isupper():
            result += f'_{c.lower()}'
        else:
            result += c
    return result
print(underscore_caps('courseID pathID apiID'))
# course_i_d path_i_d api_i_d
Or a bit more concisely using a list comprehension and join:
def underscore_caps(s):
    return ''.join([f'_{c.lower()}' if c.isupper() else c for c in s])
print(underscore_caps('courseID pathID apiID'))
# course_i_d path_i_d api_i_d
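The `index` pitfall described in this answer can be seen directly on the question's input. This is only an illustrative sketch; the variable names `first_hits` and `true_positions` are made up for it.

```python
s = 'courseID pathID apiID'

# str.index always reports the first occurrence, whichever 'I' we are on.
first_hits = [s.index(c) for c in s if c == 'I']
print(first_hits)
# [6, 6, 6]

# enumerate yields the true position of each character.
true_positions = [pos for pos, c in enumerate(s) if c == 'I']
print(true_positions)
# [6, 13, 19]
```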
I think a regex solution would be easier to understand here. This takes words that end with capital letters, adds the underscore and makes them lowercase:
import re
s = "courseID pathID apiID exampleABC DEF"
def underscore_lower(match):
    return "_" + match.group(1).lower()
pat = re.compile(r'(?<=[^A-Z\s])([A-Z]+)\b')
print(pat.sub(underscore_lower, s))
# course_id path_id api_id example_abc DEF
You might have to play with that regex to get it to do exactly what you want. At the moment, it takes capital letters at the end of words that are preceded by a character that is neither a capital letter nor a space. It then makes those letters lowercase and adds an underscore in front of them.
You have a number of issues with your code:
string.index(letter) gives the index of the first occurrence of letter, so if you have multiple e.g. D, pos will only ever be the position of the first one.
You could correct this by iterating over both position and letter using enumerate, e.g. for pos, letter in enumerate(string):
You are putting underscores before each capital letter, i.e. _i_d
You are overwriting previous edits by referring to string in parsed_string=string[:pos] + '_' + string[pos:]
Correcting all these issues you would have:
def parse_variables(string):
    new_string=''
    for pos, letter in enumerate(string):
        if letter.isupper() and pos+1 < len(string) and string[pos+1].isupper():
            new_string += f'_{letter}'
        else:
            new_string += letter
    return new_string.lower()
But a much simpler method is:
"courseID pathID apiID".replace('ID', '_id')
Update:
Given the variety of strings you want to capture, it seems regex is the tool you want to use:
import re
def parse_variables(string, pattern=r'(?<=[a-z])([A-Z]+)', prefix='_'):
    """Replace patterns in string with prefixed lowercase version.
    Default pattern is any substring of consecutive
    capital letters that occur after a lowercase letter."""
    foo = lambda pat: f'{prefix}{pat.group(1).lower()}'
    # substitute in the argument, not in a global variable
    return re.sub(pattern, foo, string)
text = 'courseID pathProjects apiCode'
print(parse_variables(text))
# course_id path_projects api_code

Python 3 - How to capitalize first letter of every sentence when translating from morse code

I am trying to translate morse code into words and sentences and it all works fine... except for one thing. My entire output is lowercased and I want to be able to capitalize every first letter of every sentence.
This is my current code:
text = input()
if is_morse(text):
    lst = text.split(" ")
    text = ""
    for e in lst:
        text += TO_TEXT[e].lower()
    print(text)
Each element in the split list is a single character (in morse), NOT a word. 'TO_TEXT' is a dictionary. Does anyone have an easy solution to this? I am a beginner in programming and Python, by the way, so I might not understand some solutions...
Maintain a flag telling you whether or not this is the first letter of a new sentence. Use that to decide whether the letter should be upper-case.
text = input()
if is_morse(text):
    lst = text.split(" ")
    text = ""
    first_letter = True
    for e in lst:
        if first_letter:
            this_letter = TO_TEXT[e].upper()
        else:
            this_letter = TO_TEXT[e].lower()
        # Period heralds a new sentence.
        first_letter = this_letter == "."
        text += this_letter
    print(text)
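One thing to watch with a flag like this: if the decoded text contains a space right after the period, the space itself clears the flag before the next letter arrives. A possible variation, sketched below, only clears the flag once an actual letter has been emitted. The tiny `TO_TEXT` table here is hypothetical (the real dictionary from the question is not shown), and using "/" as the code for a word gap is an assumption.

```python
# Hypothetical, deliberately tiny lookup table; the real TO_TEXT from the
# question is not shown, and "/" standing for a word gap is an assumption.
TO_TEXT = {
    "....": "H", "..": "I", ".-.-.-": ".",
    "-...": "B", "-.--": "Y", ".": "E", "/": " ",
}

def decode(morse):
    out = []
    capitalize_next = True
    for code in morse.split(" "):
        ch = TO_TEXT[code].lower()
        if ch.isalpha():
            if capitalize_next:
                ch = ch.upper()
                capitalize_next = False
        elif ch == ".":
            # A period ends the sentence; spaces after it leave the flag set.
            capitalize_next = True
        out.append(ch)
    return "".join(out)

print(decode(".... .. .-.-.- / -... -.-- ."))
# Hi. Bye
```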
From what I can understand of your code, you can use Python's title() string method.
For a more stringent result, you can use the capwords() function from the string module.
This is what you get from Python docs on capwords:
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.
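Both `title()` and `capwords()` capitalize every word, not just the first word of each sentence, so it is worth comparing them on a short sample before choosing. The sketch below does that and adds a sentence-level alternative; the sample text and the assumption that sentences are separated by ". " are made up for illustration.

```python
import string

text = "they're here. it works."

titled = text.title()
capworded = string.capwords(text)
print(titled)      # They'Re Here. It Works.
print(capworded)   # They're Here. It Works.

# Sentence-level alternative: capitalize only the first letter of each
# sentence, assuming sentences are separated by ". ".
sentence_cased = ". ".join(s[:1].upper() + s[1:] for s in text.split(". "))
print(sentence_cased)
# They're here. It works.
```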

Can't convert 'list' object to str implicitly Python

I am trying to import the alphabet but split it so that each character is a separate element of an array rather than part of one string. Splitting it works, but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go about solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).
There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string
year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum
# Number of characters that are ASCII letters:
# total_sum == 12
# Number of all characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = ( 1 - distance / len(correct)) * 10 # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
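If installing a third-party package is not an option, the edit distance can also be computed with a short dynamic-programming function. This is a minimal pure-Python sketch of the standard Levenshtein recurrence, not the implementation used by the package; the function name `levenshtein` is chosen for this example, and the marking formula is the same suggestion as above.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

correct, student = 'because', 'becuase'
distance = levenshtein(correct, student)
mark = (1 - distance / len(correct)) * 10
print(distance, round(mark, 2))
# 2 7.14
```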
I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
join is a method of str (an ordinary instance method, called on the separator), so you can do
''.join(splitalphabet)
or
str.join('', splitalphabet)
To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))
I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])
import string
# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)
def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2
While join creates the string from the split list, you would not have to do that, as you can call find with the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are attempting looks for splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (result -1).
If what you are trying to do is get the indices of all the letters of a word from the word list within the alphabet, then for each letter in the word you need to determine its index within alphabet:
j = []
for c in word:
    j.append(alphabet.find(c))
print(j)
On the other hand, if you are attempting to find the index within the word of each character of the alphabet, then you need to loop over splitalphabet to get an individual character to find within the word. That is:
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print(l)
This gives the list of tuples showing the characters found and their indices.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this, as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ASCII characters in it and you want to count the number of ASCII characters in that word? I think you need to be more specific about what you want to determine.
If what you are doing is attempting to determine whether the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either call word.isalpha() and get True or False, or check each character of word with isalpha().
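For the spelling-bee scoring described in the question (comparing characters and their locations), one simple possibility is to count how many positions hold the same letter in both words. This is only a sketch of that idea; the function name `matching_positions` is made up, and how the count is turned into a mark is left open.

```python
def matching_positions(correct, attempt):
    # zip stops at the shorter word, so extra trailing letters never match.
    return sum(a == b for a, b in zip(correct, attempt))

correct, attempt = 'because', 'becuase'
hits = matching_positions(correct, attempt)
print(hits, 'of', len(correct))
# 5 of 7
```

A position-by-position count is strict about shifted letters (one missing character early in the word misaligns everything after it), which is where an edit distance tends to be more forgiving.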
