Python regular expression for consonants and vowels

Python regular expression for consonants and vowels - python

I was trying to create a regular expression which would allow any number of consonants , or any number of vowels , or a mix of consonants and vowels such that we only have any number of consonants in the beginning followed by any number of vowels ONLY, no consonant should be allowed after the vowels, it would be more clear from the example:
The following cases should pass:
TR, EE, TREE, Y, BY.
But the following should not pass the expression :
TROUBLE, OATS, TREES, IVY, TROUBLES, PRIVATE, OATEN, ORRERY.
So generally it can be visualized as : [C] [V]
C - Consonants
V - Vowels
[ ] - where the square brackets denote arbitrary presence of their contents.
And I reached upto this piece of code:
import re
def find_m(word):
if re.match("[^aeiou]*?[aeiou]*?",word):
print "PASS"
else:
print "FAIL"
find_m("tr")
find_m("ee")
find_m("tree")
find_m("y")
find_m("by")
find_m("trouble")
find_m("oats")
find_m("trees")
find_m("ivy")
find_m("aaabbbbaaa")
But it is passing for all the cases, I need a correct expressions which gives the desired results.

All you need to do is to add another anchor $ at the end of you regex as
if re.match("[^aeiou]*[aeiou]*$",word):
$ achors the regex at the end of the string. Allows nothing after The vowels
Note
You can drop the non greedy ? from the regex as the non greedy does not have any effect on the specified character class
The regex would match empty strings as well.
The mandatory part in the string can be specified by replacing the * with +. Say for example if the input must contain vowels, then the regex must be
if re.match("[^aeiou]*[aeiou]+$",word):
Or you can check if the string is empty using a lookahead so as to ensure the string is non empty as
if re.match("^(?=.+)[^aeiou]*[aeiou]*$",word):
Test
$ cat test.py
import re
def find_m(word):
if re.match("[^aeiou]*[aeiou]*$",word):
print "PASS"
else:
print "FAIL"
find_m("tr")
find_m("ee")
find_m("tree")
find_m("y")
find_m("by")
find_m("trouble")
find_m("oats")
find_m("trees")
find_m("ivy")
find_m("aaabbbbaaa")
$ python test.py
PASS
PASS
PASS
PASS
PASS
FAIL
FAIL
FAIL
FAIL
FAIL

Fill in the code to check if the text passed contains the vowels a, e and i, with exactly one occurrence of any other character in between.
import re
def check_aei (text):
result = re.search(r"a.e.i", text)
return result != None
print(check_aei("academia")) # True
print(check_aei("aerial")) # False
print(check_aei("paramedic")) # True

Related

How can I check if a string contains only English characters, exclamation marks at the end?

For example, "hello!!" should return true, whereas "45!!","!!ok" should return false. The only case where it should return true is when the string has English characters (a-z) with 0 or more exclamation marks in the end.
The following is my solution using an iterative method. However, I want to know some clean method having fewer lines of code (maybe by using some Python library).
def fun(str):
i=-1
for i in range(0,len(str)):
if str[i]=='!':
break
elif (str[i]>='a' and str[i]<='z'):
continue
else:
return 0
while i<len(str):
if(str[i]!='!'):
return 0
i+=1
return 1
print(fun("hello!!"))

Regex can help you out here.
The regular expression you're looking for here is:
^[a-z]+!*$
This will allow one or more English letters (lowered case, you can add upper case as well if you'll go with ^[a-zA-Z]+!*$, or any other letters you'd like to add inside the square brackets)
and zero or more exclamation marks at the end of the word.
Wrapping it up with python code:
import re
pattern = re.compile(r'^[a-z ]+!*$')
word = pattern.search("hello!!")
print(f"Found word: {word.group()}")

If i in string but partial is even ok python

I am sure this is simple but I can't see or possibly can't find a solution.
Suppose I have a string, for example something--abcd--something and I want to find abcc in the string. I want to allow for one mismatch, meaning that the output of the below code should be True.
my_string='something--abcd--something'
substring = 'abcc'
if substring in my_string:
print('True')
else:
print('False')
I know that the substring is not in my_string but then what I want is to allow for one mismatch then the output will be True.
How can I achieve that?

There are certainly finer ways to do it, but one solution is to search for it with regexes in which one of the characters is replaced by a dot (or '\w' if you want the character to be a letter and nothing else).
We use a generator to lazily generate the regexes by replacing one of the letters each time, then check if any of these regexes match:
import re
def with_one_dot(s):
for i in range(len(s)):
yield s[:i] + '.' + s[i+1:]
def match_all_but_one(string, target):
return any(re.search(fuzzy_target, string) for fuzzy_target in with_one_dot(target))
def find_fuzzy(string, target):
" Return the start index of the fuzzy match, -1 if not found"
for fuzzy_target in with_one_dot(target):
m = re.search(fuzzy_target, string)
if m:
return m.start()
return -1
my_string = 'something--abcd--something'
print(match_all_but_one(my_string, 'abcc')) # 1 difference
# True
print(find_fuzzy(my_string, 'abcc'))
# 11
print(match_all_but_one(my_string,'abbb')) # 2 differences
# False
print(find_fuzzy(my_string, 'abbb'))
# -1
The with_one_dot(s) generator yields s with one letter replaced by a dot on each iteration:
for reg in with_one_dot('abcd'):
print(reg)
outputs:
.bcd
a.cd
ab.d
abc.
Each of these strings is used as a regex and tested on my_string. The dot . in a regex means 'match anything', so it allows any symbol instead of the original letter.
any returns True immediately if any of theses regexes matches, False if none does.

How to write regular expression to find combination of characters, but each can only appear once in python

I would like to find whether "xy" in a string, "xy" is optional, for each character it can only appear once. For example:
def findpat(texts, pat):
for text in texts:
if re.search(pat, t):
print re.search(pat, t).group()
else:
print None
pat = re.compile(r'[xy]*?b')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# it prints
# xyb
# xb
# yb
# yxb
# b
# xyxb
For the last one, my desired output is "yxb".
How should I modify my regex? Many thanks

You may use the following approach: match and capture the two groups, ([xy]*)(b). Then, once a match is found, check if the length of the value in Group 1 is the same as the number of unique chars in this value. If not, remove the chars from the start of the group value until you get a string with the length of the number of unique chars.
Something like:
def findpat(texts, pat):
for t in texts:
m = re.search(pat, t) # Find a match
if m:
tmp = set([x for x in m.group(1)]) # Get the unqiue chars
if len(tmp) == len(m.group(1)): # If Group 1 length is the same
print re.search(pat, t).group() # Report a whole match value
else:
res = m.group(1)
while len(tmp) < len(res): # While the length of the string is not
res = res[1:] # equal to the number of unique chars, truncate from the left
print "{}{}".format(res, m.group(2)) # Print the result
else:
print None # Else, no match
pat = re.compile(r'([xy]*)(b)')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# => [xyb, xb, yb, yxb, b, yxb]
See the Python demo

You can use this pattern
r'(x?y?|yx)b'
To break down, the interesting part x?y?|yx will match:
empty string
only x
only y
xy
and on the alternative branch, yx
As an advice, when you aren't very comfortable with regex and your number of scenarios are small, you could simply brute force the pattern. It's ugly, but it makes clear what your cases are:
r'b|xb|yb|xyb|yxb'
Part 2.
For a generic solution, that will do the same, but for any number of characters instead of just {x, y}, the following regex style can be used:
r'(?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)(?=[^z]*z?[^z]*b)[xyz]*b'
I'll explain it a bit:
By using lookaheads you advance the regex cursor and for each position, you just "look ahead" and see if what follows respects a certain condition. By using this technique, you may combine several conditions into a single regex.
For a cursor position, we test each character from our set to appear at most once from the position, until we match our target b character. We do this with this pattern [^x]*x?[^x]*, which means match not-x if there are any, match at most one x, then match any number of not x
Once the test conditions are met, we start advancing the cursor and matching all the characters from our needed set, until we find a b. At this point we are guaranteed that we won't match any duplicates, because we performed our lookahead tests.
Note: I strongly suspect that this has poor performance, because it does backtracking. You should only use it for small test strings.
Test it.

Well, the regexp that literally passes your test cases is:
pat = re.compile(r'(x|y|xy|yx)?b$')
where the "$" anchors the string at the end and thereby ensures it's the last match found.
However it's a little more tricky to use the regexp mechanism(s) to ensure that only one matching character from the set is used ...

From Wiktor Stribiżew's comment & demo, I got my answer.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')
Thanks you all!

Regular Expression Testing

So i have been working on this project for myself to understand regular expressions There are 6 lines of input. The first line will contain 10 character strings. The last 5 lines will contain a valid regular expression string.
For the output, each regular expression print all the character strings that are matches to the strings according to line 1; if none match then print none. # is used to say it is an empty string. I have gotten everything but the empty string part so here is my code
and example input that would be
1)#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb
and i would like the second input to be
2)b*
the output im trying to get is #
and so far it outputs nothing
import re
inp = input("Search String:").upper().split(',')
for runs in range(50):
temp = []
query = input("Search Query:").replace("?", "[A-Z_0-9]+?+$").upper()
for item in inp:
search = re.match(query, item)
if search:
if search.group() not in temp:
temp.append(search.group())
if len(temp) > 0:
print(" ".join(temp))
else:
print("NONE")

b matches only the literal character 'b', so your search string will only match a sequence of zero or more b's, such as
b
or
bbbb
or
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb (and so on)
Your match string will not match anything else.
I don't know why you are using a specific letter, but I assume you intended an escape sequence, like "\b*", although that only matches transitions between types of characters, so it won't match # in this context. If you use \W*, it will match # (not sure whether it will match the other stuff you want).
If you haven't already, check out the following resources on regular expressions, including all of the escape characters and metacharacters:
Wikipedia
Python.org 2.7

Python regular expression to remove space and capitalize letters where the space was?

I want to create a list of tags from a user supplied single input box, separated by comma's and I'm looking for some expression(s) that can help automate this.
What I want is to supply the input field and:
remove all double+ whitespaces, tabs, new lines (leaving just single spaces)
remove ALL (single's and double+) quotation marks, except for comma's, which there can be only one of
in between each comma, i want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
def no_whitespace(s):
"""Remove all whitespace & newlines. """
return re.sub(r"(?m)\s+", "", s)
# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))
# split into a list at comma's
tag_list = tag_list.split(',')
# remove any empty strings (since I currently don't know how to remove double comma's)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost though when it comes to modifying that regex to remove all the punctuation except comma's and I don't even know where to begin for the capitalizing.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs = desired_outputs
form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
# First, we split the string on commas.
rawTags = str.split(',')
# Then, we sanitize each of the tags. If sanitizing gives us back None,
# then the tag was invalid, so we leave those cases out of our final
# list of tags. We can use None as the predicate because sanitizeTag
# will never return '', which is the only falsy string.
return filter(None, map(sanitizeTag, rawTags))
# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
# First, we turn non-alphanumeric characters into whitespace. You could
# also use a regular expression here; see below.
str = ''.join(c if c.isalnum() else ' ' for c in str)
# Next, we split the string on spaces, ignoring leading and trailing
# whitespace.
words = str.split()
# There are now three possibilities: there are no words, there was one
# word, or there were multiple words.
numWords = len(words)
if numWords == 0:
# If there were no words, the string contained only spaces (and/or
# punctuation). This can't be made into a valid tag, so we return
# None.
return None
elif numWords == 1:
# If there was only one word, that word is the tag, no
# post-processing required.
return words[0]
else:
# Finally, if there were multiple words, we camel-case the string:
# we lowercase the first word, capitalize the first letter of all
# the other words and lowercase the rest, and finally stick all
# these words together without spaces.
return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that it's worth clarifying. First is the use of str.split() in sanitizeTags. This will turn a b c into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
words = ''.join(c if c.isalnum() else ' ' for c in str).split()
numWords = len(words)
if numWords == 0:
return None
elif numWords == 1:
return words[0]
else:
return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
words = filter(lambda c: c.isalnum() or c.isspace(), str).split()
numWords = len(words)
if numWords == 0:
return None
elif numWords == 1:
return words[0]
else:
words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']

You could use a white list of characters allowed to be in a word, everything else is ignored:
import re
def camelCase(tag_str):
words = re.findall(r'\w+', tag_str)
nwords = len(words)
if nwords == 1:
return words[0] # leave unchanged
elif nwords > 1: # make it camelCaseTag
return words[0].lower() + ''.join(map(str.title, words[1:]))
return '' # no word characters
This example uses \w word characters.
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps

I think this should work
def toCamelCase(s):
# remove all punctuation
# modify to include other characters you may want to keep
s = re.sub("[^a-zA-Z0-9\s]","",s)
# remove leading spaces
s = re.sub("^\s+","",s)
# camel case
s = re.sub("\s[a-z]", lambda m : m.group(0)[1].upper(), s)
# remove all punctuation and spaces
s = re.sub("[^a-zA-Z0-9]", "", s)
return s
tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
the key here is to make use of re.sub to make the replacements you want.
EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT : Moved "if s" after the toCamelCase call

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python regular expression for consonants and vowels - python

Related

How can I check if a string contains only English characters, exclamation marks at the end?

If i in string but partial is even ok python

How to write regular expression to find combination of characters, but each can only appear once in python

Regular Expression Testing

Python regular expression to remove space and capitalize letters where the space was?

Categories

Resources