This is not a homework question; it is an exam preparation question.
I should define a function syllables(word) that counts the number of syllables in a word in the following way:
• a maximal sequence of vowels is a syllable;
• a final e in a word is not a syllable (or the vowel sequence it is a part of).
I do not have to deal with any special cases, such as a final e in a one-syllable word (e.g., 'be' or 'bee').
>>> syllables('honour')
2
>>> syllables('decode')
2
>>> syllables('oiseau')
2
Should I use a regular expression here or just a list comprehension?
I find regular expressions natural for this question. (I think a non-regex answer would take more code. I use two string methods, 'lower' and 'endswith', to make the answer clearer.)
import re

def syllables(word):
    word = word.lower()
    # A final e does not count as a syllable, so drop it first.
    if word.endswith('e'):
        word = word[:-1]
    # Each maximal run of vowels counts as one syllable.
    count = len(re.findall('[aeiou]+', word))
    return count

for word in ('honour', 'decode', 'decodes', 'oiseau', 'pie'):
    print(word, syllables(word))
Which prints:
honour 2
decode 2
decodes 3
oiseau 2
pie 1
Note that 'decodes' has one more syllable than 'decode' (which is strange, but fits your definition).
Question. How does this help you? Isn't the point of the study question that you work through it yourself? You may get more benefit in the future by posting a failed attempt in your question, so you can learn exactly where you are lacking.
Use regexps - most languages will let you count the number of matches of a regexp in a string.
Then special-case the terminal-e by checking the right-most match group.
I don't think regex is the right solution here.
It seems pretty straightforward to write this treating each string as a list.
Some pointers:
[abc] matches a, b or c.
A + after a regex token allows the token to match one or more times.
$ matches the end of the string.
(?<=x) matches the current position only if the previous character is an x.
(?!x) matches the current position only if the next character is not an x.
EDIT:
I just saw your comment that since this is not homework, actual code is requested.
Well, then:
[aeiou]+(?!(?<=e)$)
If you don't want to count final vowel sequences that end in e at all (like the u in tongue or the o in toe), then use
[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$
I'm sure you'll be able to figure out how it works if you read the explanation above.
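For anyone who wants to try the two patterns above, here is a minimal sketch that counts their matches with re.findall. The helper name count_matches, the constant names, and the test words are my own assumptions for illustration; only the two patterns come from the answer.

```python
import re

# The two patterns quoted above, unchanged.
SKIP_FINAL_E = r'[aeiou]+(?!(?<=e)$)'
SKIP_FINAL_GROUP_ENDING_IN_E = r'[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$'

def count_matches(pattern, word):
    # Hypothetical helper: the number of non-overlapping matches is the syllable count.
    return len(re.findall(pattern, word.lower()))

for word in ('honour', 'decode', 'oiseau', 'tongue'):
    print(word,
          count_matches(SKIP_FINAL_E, word),
          count_matches(SKIP_FINAL_GROUP_ENDING_IN_E, word))
```

The two patterns only disagree on words such as 'tongue', where the final e is part of a longer vowel sequence.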
Here's an answer without regular expressions. My real answer (also posted) uses regular expressions. Untested code:
def syllables(word):
    word = word.lower()
    if word.endswith('e'):
        word = word[:-1]
    vowels = 'aeiou'
    in_vowel_group = False
    vowel_groups = 0
    for letter in word:
        if letter in vowels:
            if not in_vowel_group:
                in_vowel_group = True
                vowel_groups += 1
        else:
            in_vowel_group = False
    return vowel_groups
Both ways work. You said yourself that it was for exam preparation. Use whichever is going to be on the exam. If they're both on the exam, use whichever you need more practice with. Just remember:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. ~Jamie Zawinski
So in my opinion, don't use regex unless you need the practice.
Regular expressions would be way too complex, and a list comprehension probably wouldn't be robust enough. You will probably be able to solve this easily using a grammar lexer like PyParsing. Give it a shot!
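Use a regex that matches a, e, i, o, or u, convert the string to a list of characters, then iterate through the list, keeping a running count that goes up each time you move from a non-vowel to a vowel: 1 for the first true, 1 for the next false, 2 for the next true, 2 for the next false, etc.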
To handle the case where the last letter is 'e' following a consonant (as in ate), just check the last two letters of the word before you start. If they match that pattern truncate the final e and process as normal.
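The two steps described above could be sketched roughly as follows. This is only one reading of the description: it swaps the manual true/false counter for itertools.groupby, and the name VOWELS is my own choice.

```python
from itertools import groupby

VOWELS = set("aeiou")  # assumed vowel set; y is not treated as a vowel here

def syllables(word):
    word = word.lower()
    # Drop a final e only when it follows a consonant (as in "ate").
    if len(word) >= 2 and word[-1] == "e" and word[-2] not in VOWELS:
        word = word[:-1]
    # groupby splits the word into alternating vowel and non-vowel runs;
    # every vowel run counts as one syllable.
    runs = groupby(word, key=lambda ch: ch in VOWELS)
    return sum(1 for is_vowel_run, _ in runs if is_vowel_run)

for w in ("honour", "decode", "oiseau", "ate"):
    print(w, syllables(w))
```

Note that with this rule a word like "tongue" keeps its final "ue" group, because the e follows a vowel rather than a consonant.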
This pattern works for your definition:
(?!e$)([aeiouy]+)
Just count how many times it occurs.
Related
I have a number of long strings and I want to match those that contain all words of a given list.
keywords=['special','dreams']
search_string1="This is something that manifests especially in dreams"
search_string2="This is something that manifests in special cases in dreams"
I want only search_string2 matched. So far I have this code:
if all(x in search_text for x in keywords):
    print("matched")
The problem is that it will also match search_string1. Obviously I need to include some regex matching that uses \w or \b, but I can't figure out how to include a regex in the if all statement.
Can anyone help?
You can use a regex to do the same, but I prefer to just use plain Python.
Strings in Python can be split into a list of words with split() (and join() turns a list back into a string). Testing word in list_of_words then tells you whether the word is in the list.
keywords = ['special', 'dreams']
found = True
for word in keywords:
    if word not in search_string1.split():
        found = False
This may not be the best idea, but we could check whether one set is a subset of another:
keywords = ['special', 'dreams']
strs = [
    "This is something that manifests especially in dreams",
    "This is something that manifests in special cases in dreams"
]
_keywords = set(keywords)
for s in strs:
    s_set = set(s.split())
    if _keywords.issubset(s_set):
        print(f"Matched: {s}")
Axe319's comment works and is closest to my original question of how to solve the problem using regex. To quote the solution again:
all(re.search(fr'\b{x}\b', search_text) for x in keywords)
Thanks to everyone!
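For completeness, here is a small self-contained sketch of the quoted \b approach. The function name contains_all_words is my own, and the re.escape call is an addition I am assuming is wanted so that keywords containing regex metacharacters are matched literally.

```python
import re

def contains_all_words(text, keywords):
    # \b anchors each keyword at word boundaries, so "special" does not
    # match inside "especially". re.escape is an added safety step.
    return all(re.search(rf'\b{re.escape(word)}\b', text) for word in keywords)

keywords = ['special', 'dreams']
search_string1 = "This is something that manifests especially in dreams"
search_string2 = "This is something that manifests in special cases in dreams"

print(contains_all_words(search_string1, keywords))
print(contains_all_words(search_string2, keywords))
```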
I'm currently using the find function and found a slight problem.
theres gonna be a fire here
If I have a sentence with the words "here" and "theres" and I use find() to get the index of "here", I instead get the index of the match inside "theres".
I thought find() would be like
if thisword in thatword:
as it would find the word, not a substring within a string.
Is there another function that may work similarly? I'm using find() quite heavily and would like to know of alternatives before I clog the code with string.split() and then iterate until I find the exact match, with an index counter on the side.
MainLine = str('theres gonna be a fire here')
WordtoFind = str('here')
#String_Len = MainLine.find(WordtoFind)
split_line = MainLine.split()
indexCounter = 0
for i in range (0,len(split_line)):
    indexCounter += (len(split_line[i]) + 1)
    if WordtoFind in split_line[i]:
        #String_Len = MainLine.find(split_line[i])
        String_Len = indexCounter
        break
The best route would be regular expressions. To find a "word", just make sure that the characters immediately before and after it are not alphanumeric. It uses no splits, has no exposed loops, and even works when you run into a weird sentence like "There is a fire,here". A find_word_start function might look like this:
import re

def find_word_start(word, string):
    # Lookarounds reject matches that touch a letter or digit on either side.
    pattern = "(?<![a-zA-Z0-9])" + re.escape(word) + "(?![a-zA-Z0-9])"
    result = re.search(pattern, string)
    # re.search returns None when the word is absent; report -1 like str.find().
    return result.start() if result else -1

>>> find_word_start("here", "There is a fire,here")
16
The regex I made uses lookarounds to make sure that the characters immediately before and after the word are not letters or digits; see https://www.regular-expressions.info/lookaround.html for details. The term [a-zA-Z0-9] is a character class that matches a single character from the ranges a-z, A-Z, and 0-9. Look up the Python re module to find out more about regular expressions.
So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string
data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check if a string contains any ASCII letters (i.e. non-Hebrew) use:
re.search('[' + string.ascii_letters + ']', s)
If this returns a match object (which is truthy) rather than None, your string is not pure Hebrew.
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed that it keeps the English-only strings, but you need it to do it the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
Python has extensive unicode support. It depends on what you're asking for. Is a hebrew word one that contains only hebrew characters and whitespace, or is it simply a word that contains no latin characters? Either way, you can do so directly. Just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iteration through string.ascii_letters.
Please note that I do not speak hebrew so I may have missed a letter or two of the alphabet.
import string

def is_hebrew(word):
    # Allowed characters: the Hebrew letters (including final forms) plus whitespace.
    hebrew = set("אבגדהוזחטיכךלמנס עפצקרשתםןףץ" + string.whitespace)
    for char in word:
        if char not in hebrew:
            return False
    return True

def contains_latin(word):
    # a generator expression like this is a terser way of expressing the
    # above concept.
    return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())

hebrew_words = [word for word in words if is_hebrew(word)]
non_latin_words = [word for word in words if not contains_latin(word)]
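If a regex is preferred over a hand-typed alphabet, the same membership test can be sketched with the Unicode Hebrew block (U+0590 to U+05FF). The names HEBREW_ONLY and is_pure_hebrew are my own, and treating "pure Hebrew" as "Hebrew-block characters plus whitespace" is an assumption; digits and punctuation would be rejected unless added to the class.

```python
import re

# Assumption: "pure Hebrew" means only characters from the Unicode Hebrew
# block (U+0590-U+05FF) and whitespace.
HEBREW_ONLY = re.compile(r'[\u0590-\u05FF\s]+')

def is_pure_hebrew(s):
    # fullmatch succeeds only if every character is in the allowed class.
    return HEBREW_ONLY.fullmatch(s) is not None

# "\u05e9\u05dc\u05d5\u05dd" is the Hebrew word "shalom", written with escapes.
slist = ["\u05e9\u05dc\u05d5\u05dd", "\u05e9\u05dc\u05d5\u05dd world", "hello"]
data = [s for s in slist if is_pure_hebrew(s)]
print(len(data))
```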
Another option would be to create a dictionary of hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary ignoring case. This will work much faster than other approaches (O(n) where n is the length of your list of words).
The downside is that you need to get all or most of hebrew words somewhere. I think it's possible to find it on the web in csv or some other form. Parse it and put it into python dictionary.
However, it makes sense if you need to process such lists of words very often and quite quickly. Another problem is that the dictionary may not contain every Hebrew word, in which case the answer will not be completely right.
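Try this (in Python 3, pass re.ASCII so that \w matches only ASCII word characters; without it, \w also matches Hebrew letters):
>>> import re
>>> [x for x in slist if re.match(r'^[^\w]+$', x, re.ASCII)]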
I want a regular expression (in Python) that given a sentence like:
heyy how are youuuuu, it's so cool here, cooool.
converts it to:
heyy how are youu, it's so cool here, cool.
In other words, a character may be repeated at most once (two in a row at most); any further repetitions should be removed.
heyy ==> heyy
youuuu ==> youu
cooool ==> cool
You can use a back-reference in the pattern to match repeated characters and then replace the run with two instances of the matched character. Here (.)\1+ matches the same character occurring two or more times in a row, and the replacement \1\1 keeps only two instances:
import re
re.sub(r"(.)\1+", r"\1\1", s)
# "heyy how are youu, it's so cool here, cool."
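As a slight generalisation of the same back-reference idea, the sketch below makes the allowed run length a parameter. The function name limit_repeats and the max_run parameter are hypothetical, not part of the answer above.

```python
import re

def limit_repeats(text, max_run=2):
    # (.)\1{max_run,} matches a character followed by at least max_run more
    # copies of itself, i.e. a run longer than max_run characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    # Replace each over-long run with exactly max_run copies of the character.
    return pattern.sub(lambda m: m.group(1) * max_run, text)

s = "heyy how are youuuuu, it's so cool here, cooool."
print(limit_repeats(s))
```

With the default max_run=2 this behaves like the one-liner above; runs of exactly two characters are left untouched.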
Create a new empty string and append each character to it unless that character starts a run of three identical consecutive characters:
text = "heyy how are youuuuu, it's so cool here, cooool."
new_text = ''
for i in range(len(text)):
    try:
        if text[i] == text[i+1] == text[i+2]:
            pass
        else:
            new_text += text[i]
    except IndexError:
        # Too close to the end of the string for a run of three to start here.
        new_text += text[i]
print(new_text)
>>>heyy how are youu, it's so cool here, cool.
eta: hmmm just noticed you requested "regular expressions" so approved answer is better; though this works
Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s) you could do like so -
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile & basic approach to dealing with this problem. You could so easily get strings from users which will break the above regular expressions. What I am trying to say is that this approach requires lot of maintenance in terms of observing the patterns of mistakes the users make & then creating case specific regular expressions for them.
SECOND APPROACH -
Instead have you considered using difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here for you is SequenceMatcher. To paraphrase from official documentation-
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous & junk-free matching subsequence.
import difflib as dl
x = dl.SequenceMatcher(lambda x : x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x : x == ' ', "omgggg","omg")
avg = (x.ratio()+y.ratio())/2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, as a rule of thumb a ratio() over 0.6 means the sequences are close matches. You might need to tweak the threshold for your data. If you need stricter matching, I found that any value over 0.8 serves well.
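With a list of some 400 abbreviations, comparing pairs by hand gets tedious. A possible shortcut, sketched here under the assumption that the same 0.6 cutoff is acceptable, is difflib.get_close_matches, which ranks candidates by the same ratio. The list SHORT_WORDS and the function closest_short_word are made up for illustration.

```python
import difflib

# Hypothetical sample of the abbreviation list.
SHORT_WORDS = ["lol", "omg", "lmao", "haha"]

def closest_short_word(token, cutoff=0.6):
    # get_close_matches returns the best candidates whose similarity ratio
    # is at least `cutoff`, best first; n=1 keeps only the top one.
    matches = difflib.get_close_matches(token.lower(), SHORT_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_short_word("omgggg"))
print(closest_short_word("basketball"))
```

A fuzzy cutoff can still mis-match short ordinary words, so the threshold would need checking against real tweets.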
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
    r"""(?ix)\b    # assert position at a word boundary
    (?=lol)        # assert that "lol" can be matched here
    \S*            # match any number of characters except whitespace
    (\S+)          # match at least one character (to be repeated later)
    (?<=\blol)     # until we have reached exactly the position after the 1st "lol"
    \1*            # then repeat the preceding character(s) any number of times
    \b             # and ensure that we end up at another word boundary""",
    "lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
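To apply the same pattern to a whole list of shortened words, one option is to build it per word. The sketch below is an illustration only: build_pattern and normalize are hypothetical names, and re.escape is added on the assumption that the words should be matched literally.

```python
import re

def build_pattern(word):
    # Same structure as the pattern above, with the word substituted in.
    w = re.escape(word)
    return re.compile(rf"\b(?={w})\S*(\S+)(?<=\b{w})\1*\b", re.IGNORECASE)

def normalize(text, words):
    # Collapse every stretched variant of each word back to its base form.
    for word in words:
        text = build_pattern(word).sub(word, text)
    return text

print(normalize("omgggg that was funny lololol", ["omg", "lol"]))
```

For a list of 400 words it would probably be worth compiling the patterns once and reusing them rather than rebuilding them on every call.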