Find multiple list items within a string - python

Working on a problem in which I am trying to get a count of the number of vowels in a string. I wrote the following code:
def vowel_count(s):
    count = 0
    for i in s:
        if i == 'a' or i == 'e' or i == 'i' or i == 'o' or i == 'u':
            count += 1
    print(count)
vowel_count(s)
While the above works, I would like to know how to do this more simply by creating a list of all vowels and checking each character against it in my if statement, instead of chaining multiple boolean checks. I'm sure there's an even more elegant way to do this with imported modules, but I'm interested in this type of solution.
Relative noob...appreciate the help.

No need to create a list, you can use a string like 'aeiou' to do this:
>>> vowels = 'aeiou'
>>> s = 'fooBArSpaM'
>>> sum(c.lower() in vowels for c in s)
4

You can actually treat a string similarly to how you would a list in python (as they are both iterables), for example
vowels = 'aeiou'
sum(1 for i in s if i.lower() in vowels)
For completeness' sake: others suggest vowels = set('aeiou') so that multi-character checks such as 'eio' in vowels do not match. Note, however, that if you iterate over your string one character at a time in a for loop, you won't run into this problem.
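Applied to the function from the question, the membership test replaces the chain of or comparisons. This is a minimal sketch: the VOWELS constant name is an assumption, and the function returns the count instead of printing it.

```python
VOWELS = set('aeiou')  # hypothetical constant name; a plain 'aeiou' string also works

def vowel_count(s):
    """Count the vowels in s, ignoring case."""
    count = 0
    for ch in s:
        if ch.lower() in VOWELS:
            count += 1
    return count

print(vowel_count('fooBArSpaM'))  # 4
```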

A weird way around this is the following:
vowels = len(s) - len(s.translate(None, 'aeiou'))
What you are doing with s.translate(None, 'aeiou') is creating a copy of the string removing all vowels. And then checking how the length differed.
Special note: the way I'm using it is even part of the official documentation
What is a vowel?
Note, though, that the method presented here only removes exactly the characters present in the second parameter of the translate string method. In particular, this means that it will not remove uppercase versions of those characters, let alone accented ones (like áèïôǔ).
Uppercase vowels
Solving the uppercase ones is fairly easy: just do the replacement on a copy of the string that has been converted to lowercase:
vowels = len(s) - len(s.lower().translate(None, 'aeiou'))
Accented vowels
This one is a little bit more convoluted, but thanks to this other SO question we know the best way to do it. The resulting code would be:
from unicodedata import normalize
# translate special characters to unaccented versions
normalized_str = normalize('NFD', s).encode('ascii', 'ignore')
# compare against the normalized length, since encode() may drop characters
vowels = len(normalized_str) - len(normalized_str.lower().translate(None, 'aeiou'))
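The translate(None, 'aeiou') call above is Python 2 syntax. In Python 3 the same idea might look like the sketch below, where str.maketrans builds the deletion table. The function name count_vowels is an assumption, not part of the original answer.

```python
from unicodedata import normalize

def count_vowels(s):
    """Count vowels in s, treating accented and uppercase vowels as vowels."""
    # NFD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops the marks (and any other non-ASCII).
    ascii_only = normalize('NFD', s).encode('ascii', 'ignore').decode('ascii').lower()
    # The third argument of str.maketrans lists the characters to delete.
    without_vowels = ascii_only.translate(str.maketrans('', '', 'aeiou'))
    return len(ascii_only) - len(without_vowels)

print(count_vowels('fooBArSpaM'))  # 4
```

Measuring the difference against the normalized string, rather than the original, avoids miscounting when the input contains characters that have no ASCII equivalent.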

You can filter using a list comprehension, like so:
len([letter for letter in s if letter in 'aeiou'])

Related

Checking if a word is a subset of another word with same amount of letters [duplicate]

This question already has answers here:
Python Counter Comparison as Bag-type
(3 answers)
Closed 7 months ago.
I am making a word game program where I get a list of ~80,000 words from a text file, then use those words as a lexicon of words to choose from. The user requests a word of a certain length which is then given to them scrambled. They then guess words that are of the same length or less and that use the same letters in the same amount or less. I have this list comprehension in order to get all the words from the lexicon that are subsets of the scrambled word and are also in the lexicon. However it allows more occurrences of letters than appear in the original word. For example: If the scrambled word was 'minute', then 'in' should be a correct answer but 'inn' should not. The way I have it written now allows that though. Here is the list comprehension:
correct_answers = [
    word for word in word_list
    if set(word).issubset(random_length_word)
    and word in word_list
    and len(word) <= len(random_length_word)]
So I'm looking for something like issubset but that only allows the same number of letters or less. Hopefully that makes sense. Thanks in advance.
I wrote a function that does this for playing the Countdown letters game. I called the desired input a "subset-anagram", but there's probably a better technical term for it.
Essentially, what you're looking for is a multiset (from word) that is a subset of another multiset (from random_length_word). You can do this with collections.Counter, but I actually found it much faster to do it a different way: make a list out of random_length_word, then remove each character of word. It's probably faster due to the overhead of creating new Counter objects.
def is_subset_anagram(str1, str2):
    """
    Check if str1 is a subset-anagram of str2.
    Return true if str2 contains at least as many of each char as str1.
    >>> is_subset_anagram('bottle', 'belott') # Just enough
    True
    >>> is_subset_anagram('bottle', 'belot') # less
    False
    >>> is_subset_anagram('bottle', 'bbeelloott') # More
    True
    """
    list2 = list(str2)
    try:
        for char in str1:
            list2.remove(char)
    except ValueError:
        return False
    return True
>>> [w for w in ['in', 'inn', 'minute'] if is_subset_anagram(w, 'minute')]
['in', 'minute']
For what it's worth, here's the Counter implementation:
from collections import Counter

def is_subset_anagram(str1, str2):
    delta = Counter(str1) - Counter(str2)
    return not delta
This works because Counter.__sub__() produces a multiset, that is, counts less than 1 are removed.
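For the word game in the question, the Counter for the scrambled word only needs to be built once. A minimal sketch under that assumption follows; the helper name subset_anagrams and the sample word list are hypothetical.

```python
from collections import Counter

def subset_anagrams(words, target):
    """Return the words that use no letter more often than target does."""
    target_counts = Counter(target)  # built once, reused for every candidate
    # Counter subtraction keeps only positive counts, so an empty result
    # means the candidate never exceeds the target for any letter.
    return [w for w in words if not (Counter(w) - target_counts)]

print(subset_anagrams(['in', 'inn', 'minute', 'mint'], 'minute'))
# ['in', 'minute', 'mint']
```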
Your approach loses the information about how often a certain character appears, because set(answer) does not contain it any more.
Anyway, I think you are over-complicating things with your approach. There is a more efficient way for checking whether an answer is correct, instead of creating a list of all possible answers:
We could just check whether the answer has matching character frequencies with any of the words in word_list. More specifically, "matching character frequencies" means that every character appears in the answer no more often than in the candidate from the word list.
Getting the character frequencies of a string is the classic job that collections.Counter was invented for.
Checking that the character frequencies match means that every character in the answer has a count less than or equal to its count in the word.
Finally, checking that an answer is correct means that this condition is true for any of the words in word_list.
from collections import Counter
from typing import List

def correct_answer(word_list: List[str], answer: str) -> bool:
    return any(
        all(
            # each character must occur in the answer no more often than in the word
            Counter(answer)[character] <= Counter(word)[character]
            for character in Counter(answer).keys()
        )
        for word in word_list
    )
This is more efficient than your approach, because it takes way less memory space. Thanks to any and all being short-circuit, it is also quite time-efficient.

Unable to properly parse a string via character modification

I'm running into an issue where my Python code is not correctly returning a function call designed to add an underscore character before each capital letter and I'm not sure where I'm going wrong. For an output, only the "courseID" word in the string is getting touched whereas the other two words are not.
I thought cycling thru the letters in a word, looking for capitalized letters would work, but it doesn't appear to be so. Could someone let me know where my code might be going wrong?
def parse_variables(string):
    new_string=''
    for letter in string:
        if letter.isupper():
            pos=string.index(letter)
            parsed_string=string[:pos] + '_' + string[pos:]
            new_string=''.join(parsed_string+letter)
        else:
            new_string=''.join(letter)
            # new_string=''.join(letter)
    return new_string.lower()
parse_variables("courseID pathID apiID")
Current output is a single letter lowercase d and the expected output should be course_id path_id api_id.
The issue with your revised code is that index only finds the first occurrence of the capital letter in the string. Since you have repeated instances of the same capital letters, the function never finds the subsequent instances. You could simplify your approach and avoid this issue by simply concatenating the letters with or without underscores depending on whether they are uppercase as you iterate.
For example:
def underscore_caps(s):
    result = ''
    for c in s:
        if c.isupper():
            result += f'_{c.lower()}'
        else:
            result += c
    return result

print(underscore_caps('courseID pathID apiID'))
# course_i_d path_i_d api_i_d
Or a bit more concisely using list comprehension and join:
def underscore_caps(s):
    return ''.join([f'_{c.lower()}' if c.isupper() else c for c in s])

print(underscore_caps('courseID pathID apiID'))
# course_i_d path_i_d api_i_d
I think a regex solution would be easier to understand here. This takes words that end with capital letters and adds the underscore and makes them lowercase
import re
s = "courseID pathID apiID exampleABC DEF"
def underscore_lower(match):
    return "_" + match.group(1).lower()
pat = re.compile(r'(?<=[^A-Z\s])([A-Z]+)\b')
print(pat.sub(underscore_lower, s))
# course_id path_id api_id example_abc DEF
You might have to play with that regex to get it to do exactly what you want. At the moment, it takes capital letters at the end of words that are preceded by a character that is neither a capital letter nor a space. It then makes those letters lowercase and adds an underscore in front of them.
You have a number of issues with your code:
string.index(letter) gives the index of the first occurrence of letter, so if you have multiple e.g. D, pos will only update to the position of the first one.
You could correct this by iterating over both position and letter using enumerate e.g. for pos, letter in enumerate(string):
You are putting underscores before each capital letter i.e. _i_d
You are overwriting previous edits by referring to string in parsed_string=string[:pos] + '_' + string[pos:]
Correcting all these issues you would have:
def parse_variables(string):
    new_string=''
    for pos, letter in enumerate(string):
        if letter.isupper() and pos+1 < len(string) and string[pos+1].isupper():
            new_string += f'_{letter}'
        else:
            new_string += letter
    return new_string.lower()
But a much simpler method is:
"courseID pathID apiID".replace('ID', '_id')
Update:
Given the variety of strings you want to capture, it seems regex is the tool you want to use:
import re
def parse_variables(string, pattern=r'(?<=[a-z])([A-Z]+)', prefix='_'):
    """Replace patterns in string with prefixed lowercase version.

    Default pattern is any substring of consecutive
    capital letters that occur after a lowercase letter."""
    foo = lambda pat: f'{prefix}{pat.group(1).lower()}'
    # substitute in the string argument, not in a global variable
    return re.sub(pattern, foo, string)

text = 'courseID pathProjects apiCode'
parse_variables(text)
# 'course_id path_projects api_code'

Separating between Hebrew and English strings

So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string
data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check if a string contains any ASCII letters (ie. non-Hebrew) use:
re.search('[' + string.ascii_letters + ']', s)
If this returns true, your string is not pure Hebrew.
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed, it filters out the English-only strings, but you need it to do the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
Python has extensive Unicode support. It depends on what you're asking for. Is a Hebrew word one that contains only Hebrew characters and whitespace, or is it simply a word that contains no Latin characters? Either way, you can do so directly. Just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iteration through string.ascii_letters.
Please note that I do not speak Hebrew so I may have missed a letter or two of the alphabet.
import string

def is_hebrew(word):
    hebrew = set("אבגדהוזחטיכךלמנסעפצקרשתםןףץ" + string.whitespace)
    for char in word:
        if char not in hebrew:
            return False
    return True

def contains_latin(word):
    return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())
    # a generator expression like this is a terser way of expressing the
    # above concept.

hebrew_words = [word for word in words if is_hebrew(word)]
non_latin_words = [word for word in words if not contains_latin(word)]
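Since the question asks for a regex that works with Hebrew, here is one possible sketch using the Unicode Hebrew block (U+0590 to U+05FF). It assumes that "pure Hebrew" means only characters from that block plus whitespace; the names HEBREW_ONLY and is_pure_hebrew are hypothetical. Strings containing digits or punctuation would be rejected unless those characters are added to the class.

```python
import re

# Assumption: "pure Hebrew" = characters from the Unicode Hebrew block
# (U+0590 to U+05FF) plus whitespace, and nothing else.
HEBREW_ONLY = re.compile(r'[\u0590-\u05FF\s]+')

def is_pure_hebrew(s):
    """Return True if s consists only of Hebrew-block characters and whitespace."""
    return HEBREW_ONLY.fullmatch(s) is not None

# Sample list: a Hebrew word, an English word, and a mixed string.
slist = ['\u05e9\u05dc\u05d5\u05dd', 'hello', '\u05e9\u05dc\u05d5\u05dd hello']
data = [s for s in slist if is_pure_hebrew(s)]
print(len(data))  # 1
```

Compiling the pattern once keeps the per-string cost low for a very large list.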
Another option would be to create a dictionary of hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary ignoring case. This will work much faster than other approaches (O(n) where n is the length of your list of words).
The downside is that you need to get all or most of hebrew words somewhere. I think it's possible to find it on the web in csv or some other form. Parse it and put it into python dictionary.
However, it makes sense only if you need to parse such lists of words very often and quite quickly. Another problem is that the dictionary may not contain all Hebrew words, in which case the answer will not be completely right.
Try this (note that in Python 3, \w also matches Hebrew letters unless the re.ASCII flag is passed):
>>> import re
>>> filter(lambda x: re.match(r'^[^\w]+$',x),s)

Can't convert 'list' object to str implicitly Python

I am trying to import the alphabet but split it so that each character is in one array but not one string. splitting it works but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go around solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).
There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string

year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum
# Number of characters that are ASCII letters:
# total_sum == 12
# Number of characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = ( 1 - distance / len(correct)) * 10 # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
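If installing a package is not an option, the same distance can be computed in plain Python. The sketch below is a standard dynamic-programming version written for illustration; it is not the Python-Levenshtein implementation, and the function name levenshtein is an assumption.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein('because', 'becuase'))  # 2
```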
I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
join is a method of str, so you can do
''.join(splitalphabet)
or
str.join('', splitalphabet)
To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))
I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])
import string
# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)

def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2
While join creates the string from the split, you would not have to do that, as you can issue the find on the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are trying attempts to find the splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (-1 result).
If what you are trying to do is to get the indices of all the letters of the word list within the alphabet, then you would need to handle it as
for each letter in the word of the word list, determine the index within alphabet.
j = []
for c in word:
    j.append(alphabet.find(c))
print(j)
On the other hand if you are attempting to find the index of each character within the alphabet within the word, then you need to loop over splitalphabet to get an individual character to find within the word. That is
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print(l)
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ascii characters in it and you want to count the number of ascii characters in that word? I think that you need to be more specific in what you want to determine.
If what you are doing is attempting to determine if the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either say word.isalpha() and get True or False or check each character of word to be isalpha()

Sequence of vowels count

This is not a homework question, it is an exam preparation question.
I should define a function syllables(word) that counts the number of syllables in a word in the following way:
• a maximal sequence of vowels is a syllable;
• a final e in a word is not a syllable (or the vowel sequence it is a part of).
I do not have to deal with any special cases, such as a final e in a one-syllable word (e.g., 'be' or 'bee').
>>> syllables('honour')
2
>>> syllables('decode')
2
>>> syllables('oiseau')
2
Should I use regular expression here or just list comprehension ?
I find regular expressions natural for this question. (I think a non-regex answer would take more coding. I use two string methods, 'lower' and 'endswith' to make the answer more clear.)
import re
def syllables(word):
    word = word.lower()
    if word.endswith('e'):
        word = word[:-1]
    count = len(re.findall('[aeiou]+', word))
    return count

for word in ('honour', 'decode', 'decodes', 'oiseau', 'pie'):
    print(word, syllables(word))
Which prints:
honour 2
decode 2
decodes 3
oiseau 2
pie 1
Note that 'decodes' has one more syllable than 'decode' (which is strange, but fits your definition).
Question. How does this help you? Isn't the point of the study question that you work through it yourself? You may get more benefit in the future by posting a failed attempt in your question, so you can learn exactly where you are lacking.
Use regexps - most languages will let you count the number of matches of a regexp in a string.
Then special-case the terminal-e by checking the right-most match group.
I don't think regex is the right solution here.
It seems pretty straightforward to write this treating each string as a list.
Some pointers:
[abc] matches a, b or c.
A + after a regex token allows the token to match once or more
$ matches the end of the string.
(?<=x) matches the current position only if the previous character is an x.
(?!x) matches the current position only if the next character is not an x.
EDIT:
I just saw your comment that since this is not homework, actual code is requested.
Well, then:
[aeiou]+(?!(?<=e)$)
If you don't want to count final vowel sequences that end in e at all (like the u in tongue or the o in toe), then use
[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$
I'm sure you'll be able to figure out how it works if you read the explanation above.
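To see how the two patterns differ, they can be counted with re.findall. This is a small sketch; the constant names and the optional pattern parameter are assumptions added for illustration.

```python
import re

# First pattern: a final vowel group ending in e still counts, minus the e itself.
COUNT_FINAL_GROUP = re.compile(r'[aeiou]+(?!(?<=e)$)')
# Second pattern: a final vowel group ending in e is not counted at all.
SKIP_FINAL_E_GROUP = re.compile(r'[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$')

def syllables(word, pattern=COUNT_FINAL_GROUP):
    """Count the matches of the given vowel-group pattern in word."""
    return len(pattern.findall(word.lower()))

for word in ('honour', 'decode', 'oiseau', 'tongue'):
    print(word, syllables(word), syllables(word, SKIP_FINAL_E_GROUP))
```

Both patterns give 2 for the three words in the question; they differ only on words like tongue, where the first counts the u and the second does not.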
Here's an answer without regular expressions. My real answer (also posted) uses regular expressions. Untested code:
def syllables(word):
    word = word.lower()
    if word.endswith('e'):
        word = word[:-1]
    vowels = 'aeiou'
    in_vowel_group = False
    vowel_groups = 0
    for letter in word:
        if letter in vowels:
            if not in_vowel_group:
                in_vowel_group = True
                vowel_groups += 1
        else:
            in_vowel_group = False
    return vowel_groups
Both ways work. You said yourself that it was for exam preparation. Use whichever is going to be on the exam. If they're both on the exam, use which you need more practice for. Just remember:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. ~Jamie Zawinski
So in my opinion, don't use regex unless you need the practice.
Regular expressions would be way too complex, and a list comprehension probably wouldn't be robust enough. You will probably be able to solve this easily using a grammar lexer like PyParsing. Give it a shot!
Use a regex that matches a,e,i,o, or u, convert the string to a list, then iterate through the list... 1 for first true, 1 for next false, 2 for next true, 2 for next false, etc.
To handle the case where the last letter is 'e' following a consonant (as in ate), just check the last two letters of the word before you start. If they match that pattern truncate the final e and process as normal.
This pattern works for your definition:
(?!e$)([aeiouy]+)
Just count how many times it occurs.
