Using regular expressions to find letters in a string in Python

I want to find the first vowel in a word, remove all the letters before it, and return the rest of the word. I thought I could do this with a list: first find 'a' in the word and take the part after it, then do the same for 'e', and so on. I would rather simplify it with a regular expression. Is there a way to search for all five vowels at once and get the index of the first one? That would make the next step easy. I am new to regular expressions, so any ideas are welcome.
I have run into another problem. This is the code I wrote following the suggestion made by @Martijin.
import re
def pigify():
    user_input=raw_input()
    sentence=re.sub(r'\b([aeiou])([a-z]*)\b',r'\1\2'+'hay',user_input,re.I)
    sentence1=re.sub(r'\b(qu)([a-z]*)\b',r'\2\1'+'ay',sentence,re.I)
    sentence2=re.sub(r'\b([^aeiou]*)(\w*)\b',r'\2\1'+'ay',sentence1,re.I)
    print sentence2
    return
pigify()
If I input:
quiet askhj a dhjsadf skdhyksj qdksdj y
I would like to get:
ietquay askhjhay ahay adfdhjsay yksjskdhay qdksdjay yay
So far I have only completed the first two steps: 1. find each word that starts with a vowel and add 'hay' to the end of it; 2. find each word that starts with 'qu', move the 'qu' to the end, and add 'ay'. The third step is to take the remaining words in the sentence and, for each one, find the first vowel or 'y' (when 'y' is not the first letter), move all the letters before it to the end, and add 'ay'. The code currently produces this:
ietquayayaskhjhay ay ahay dhjsadf skdhyksj qdksdj y
I guess I didn't use \b the right way, because re.sub replaces each matched block. How do I get it right? By the way, I have written another version with a 'for' loop and 'if'/'else'. This is the code; I think there must be a way to simplify it.
def SieveWord(user_input):
    return user_input.split(' ')
def UpperToLower(user_input):
    return user_input.lower()
vowel=['a','e','i','o','u']
transform_input=UpperToLower(raw_input())
input_list=SieveWord(transform_input)
u=[]
for word in input_list:
    if len(word)!=1:
        if word[0] in vowel:
            word+='h'
        else:
            if word[0]+word[1]=='qu':
                word=word[2:]+'qu'
            else:
                for letter in word:
                    if letter in vowel or (letter=='y' and word[0]!='y'):
                        position=word.index(letter)
                        removepart=word[0:position]
                        word=word[position:]+removepart
                        break
    elif word in vowel:
        word+='h'
    u.append(word+'ay')
for d in u:
    print d,

You can use a regular expression to remove all non-vowels at the start of a word:
re.sub(r'\b[^aeoui]*', '', inputstring, flags=re.I)
Demo:
>>> import re
>>> inputstring = 'School'
>>> re.sub(r'\b[^aeoui]*', '', inputstring, flags=re.I)
'ool'
The [^...] negated character class matches anything that is not a vowel (with the re.I flag making sure it ignores case). The \b anchor matches the position in a string just before or after a word. In the example above, \b matches the start of the string, and the negated class matches the Sch characters, as they are not in the class.
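For the follow-up in the question (the full pig-latin transformation), one possible approach is to let re.sub call a function for each word instead of chaining three substitutions. The sketch below is only an illustration under the rules stated in the question; the names FIRST_VOWEL, pigify_word and pigify are made up for this example. It uses re.search and m.start() to get the index of the first vowel, and it passes flags= by keyword, because the fourth positional argument of re.sub is count, not flags.

```python
import re

# First vowel, or a "y" that is not the first letter of the word.
# (Hypothetical helper names; the rules follow the question's description.)
FIRST_VOWEL = re.compile(r"[aeiou]|(?<=.)y", re.I)

def pigify_word(word):
    lower = word.lower()
    if lower[0] in "aeiou":            # vowel-initial word: append "hay"
        return word + "hay"
    if lower.startswith("qu"):         # "qu" moves to the end as a unit
        return word[2:] + word[:2] + "ay"
    m = FIRST_VOWEL.search(word)
    if m is None:                      # no vowel at all: just append "ay"
        return word + "ay"
    # Move everything before the first vowel to the end.
    return word[m.start():] + word[:m.start()] + "ay"

def pigify(sentence):
    # flags must be passed by keyword; the 4th positional argument is count.
    return re.sub(r"[a-z]+", lambda m: pigify_word(m.group(0)),
                  sentence, flags=re.I)

print(pigify("quiet askhj a dhjsadf skdhyksj qdksdj y"))
# ietquay askhjhay ahay adfdhjsay yksjskdhay qdksdjay yay
```

Handling each word in a callback avoids the problem of a later substitution matching words that an earlier one has already transformed.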

Related

How to change the i-th letter of a word to a capital letter in Python?

I want to capitalize the second-to-last letter of each word. But when my sentence contains a one-letter word, the program raises IndexError: string index out of range. Here is my code. It works for words with more than one letter. For example, str="Python is best programming language" works because it contains no one-letter word.
str ="I Like Studying Python Programming"
array1=str.split()
result =[]
for i in array1:
    result.append(i[:-2].lower()+i[-2].upper()+i[-1].lower())
print(" ".join(result))
Your problem is quite amenable to regular expressions, so I would recommend them here:
import re
str = "I Like Studying Python Programming"
output = re.sub(r'(\w)(?=\w\b)', lambda m: m.group(1).upper(), str)
print(output)
This prints:
I LiKe StudyiNg PythOn ProgrammiNg
Note that this approach does not touch single-letter words, since their only letter is not followed by another word character.
Another regex option is to narrow the match to characters that can actually be uppercased: the negated character class [^\W_\d] matches a word character other than a digit or underscore, and the lookahead requires it to be followed by exactly one more non-whitespace character before the next whitespace or the end of the string.
This will, for example, uppercase a) to A) but will not match the 3 in 3d.
Explanation
[^\W_\d](?=\S(?!\S))
[^\W_\d] Match a word char except _ or a digit
(?= Positive lookahead, assert what is directly to the right is
\S(?!\S) Match a non whitespace char followed by a whitespace boundary
) Close lookahead
Example
import re
regex = r"[^\W_\d](?=\S(?!\S))"
s = ("I Like Studying Python Programming\n\n"
"a) This is a test with 3d\n")
output = re.sub(regex, lambda m: m.group(0).upper(), s)
print(output)
Output
I LiKe StudyiNg PythOn ProgrammiNg
A) ThIs Is a teSt wiTh 3d
Using the PyPI regex module, you could also use \p{Ll} to match a lowercase letter that has an uppercase variant.
\p{Ll}(?=\S(?!\S))
Simply check whether the length of each word is greater than one. Only then convert the second-to-last letter to uppercase and append the word to result; if the word has length one, append it to result unchanged.
Here is the code:
str ="I Like Studying Python Programming"
array1=str.split()
result =[]
for i in array1:
    if len(i) > 1:
        result.append(i[:-2].lower()+i[-2].upper()+i[-1].lower())
    else:
        result.append(i)
print(" ".join(result))
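The loop and the regex answers do not behave identically, which a small side-by-side sketch may make clearer. The function names below are invented for this illustration. The loop version lowercases every other letter of a word, while the regex version only uppercases the second-to-last letter and leaves the rest untouched.

```python
import re

def second_last_upper_loop(sentence):
    # Length check from the answer above: one-letter words pass through.
    result = []
    for word in sentence.split():
        if len(word) > 1:
            result.append(word[:-2].lower() + word[-2].upper() + word[-1].lower())
        else:
            result.append(word)
    return " ".join(result)

def second_last_upper_regex(sentence):
    # Uppercase a word character that is followed by exactly one more
    # word character before a word boundary.
    return re.sub(r"(\w)(?=\w\b)", lambda m: m.group(1).upper(), sentence)

text = "I Like Studying Python Programming"
print(second_last_upper_loop(text))   # I liKe studyiNg pythOn programmiNg
print(second_last_upper_regex(text))  # I LiKe StudyiNg PythOn ProgrammiNg
```

Which one is preferable depends on whether the original capitalization of the other letters should survive.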

Search through a list of strings for a word that has a variable character

Basically, I start with a word such as "brand", replace a single character in it with an underscore, and try to find all words that match the remaining characters. For example:
"b_and" would return: "band", "brand", "bland" .... etc.
I started by using re.sub to substitute the underscore, but I'm really lost on where to go next. I only want words that differ at the underscore, either by dropping it or by replacing it with one letter. For example, if the word "under" were run through the list, I wouldn't want it to return "understood" or "thunder", only words that differ by a single character. Any ideas would be great!
I tried replacing the underscore with every letter of the alphabet and then checking whether the resulting word is in the dictionary, but that took a long time. I would like to know if there is a faster way.
from itertools import chain
import re,string
dictionary=open("Scrabble.txt").read().split('\n')
# The fragment below sits inside a function; find_matches is a placeholder name.
def find_matches(word):
    #after replacing the word with "_", we find words in the dictionary that match the pattern
    new=[]
    for letter in string.ascii_lowercase:
        underscore=re.sub('_', letter, word)
        if underscore in dictionary:
            new.append(underscore)
    if new == []:
        pass
    else:
        return new
IIUC this should do it. I'm doing it outside a function so you have a working example, but it's straightforward to move it inside one.
import re
string = 'band brand bland cat dand bant bramd branding blandisher'
word='brand'
new=[]
for n,letter in enumerate(word):
    pattern=word[:n]+r'\w?'+word[n+1:]
    new.extend(re.findall(pattern,string))
new=list(set(new))
Output:
['bland', 'brand', 'bramd', 'band']
Explanation:
We're using a regex to do what you're looking for. In every iteration we take one letter out of "brand" and look for any word that matches the rest. So it'll look for:
_rand, b_and, br_nd, bra_d, bran_
For the case of "b_and" the pattern is b\w?and, which means: find a 'b', then one optional word character, and then 'and'.
It then adds all matching words to the list.
Finally I remove duplicates with list(set(new)).
Edit: forgot to add the string variable.
Here's a version of Juan C's answer that's a bit more Pythonic
import re
dictionary = open("Scrabble.txt").read().split('\n')
pattern = "b_and" # change to what you need
pattern = pattern.replace('_', '.?')
pattern += '\\b'
matching_words = [word for word in dictionary if re.match(pattern, word)]
Edit: fixed the regex according to your comment, quick explanation:
pattern = "b_and"
pattern = pattern.replace('_', '.?') # pattern is now b.?and, .? matches any one character (or none at all)
pattern += '\\b' # \b prevents matching with words like "bandit" or words longer than "b_and"
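As a variation on the idea above, the pattern can be compiled once and checked with fullmatch, which requires the whole word to match and so makes the trailing \b unnecessary. This is only a sketch: the helper names and the small word list are invented for illustration, and it assumes the dictionary words are lowercase ASCII.

```python
import re

def compile_blank_pattern(template):
    # Escape the literal parts so that only "_" acts as a wildcard,
    # then allow zero or one lowercase letter where the "_" was.
    parts = [re.escape(part) for part in template.split("_")]
    return re.compile("[a-z]?".join(parts))

def matching_words(template, words):
    regex = compile_blank_pattern(template)  # compiled once, reused per word
    return [w for w in words if regex.fullmatch(w)]

# Hypothetical stand-in for the contents of Scrabble.txt.
words = ["band", "brand", "bland", "bandit", "thunder", "and"]
print(matching_words("b_and", words))  # ['band', 'brand', 'bland']
```

Compiling once and scanning the list a single time avoids the 26 separate dictionary lookups per pattern that the question describes as slow on a list.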

Lowercase letter after certain character?

I like some aspects of how string.capwords() behaves and some aspects of how .title() behaves, but neither does everything I need.
I need abbreviations capitalized, which .title() does but string.capwords() does not. I also need letters after single quotes left lowercase, which string.capwords() does but .title() does not. So I need a combination of the two: I want to use .title() and then lowercase the single letter after an apostrophe, but only if there are no spaces in between.
For example, here's a user's input:
string="it's e.t.!"
And I want to convert it to:
>>> "It's E.T.!"
.title() would capitalize the 's', and string.capwords() would not capitalize the "e.t.".
You can use regular expression substitution (See re.sub):
>>> s = "it's e.t.!"
>>> import re
>>> re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), s)
"It's E.T.!"
[a-z] matches a lowercase ASCII letter, but not directly after an apostrophe: (?<!') is a negative lookbehind assertion. The letter must also come right after a word boundary, so the t in it's is not matched.
The second argument to re.sub is a lambda that returns the replacement string (the uppercase version of the letter).
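To see how the three behaviours differ on the sample input, the substitution can be wrapped in a small function and compared with the two built-ins. The name smart_title is made up for this sketch.

```python
import re
import string

def smart_title(text):
    # Uppercase a lowercase letter at a word boundary unless an
    # apostrophe comes directly before it.
    return re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), text)

sample = "it's e.t.!"
print(sample.title())           # It'S E.T.!
print(string.capwords(sample))  # It's E.t.!
print(smart_title(sample))      # It's E.T.!
```

One consequence of the lookbehind is that a word opening a single-quoted phrase, such as 'hello', also stays lowercase, since its first letter directly follows an apostrophe.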
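A version without regular expressions, splitting on "." and then on " ":
def upper_first(word):
    return word[:1].upper() + word[1:]
a = ".".join( [upper_first(word) for word in "it's e.t.!".split(".")] )
b = " ".join( [upper_first(word) for word in a.split(" ")] )
print(b)
Only the first character of each piece is uppercased. str.capitalize() would not work here, because it also lowercases the rest of each piece and would produce "It's E.t.!". This solution doesn't handle other whitespace characters; for that I would go with falsetru's solution.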
If you don't want to use a regex, you can use this simple for loop:
s = "it's e.t.!"
capital_s = ''
pos_quote = s.index("'")
for pos, alpha in enumerate(s):
    if pos not in [pos_quote-1, pos_quote+1]:
        alpha = alpha.upper()
    capital_s += alpha
print(capital_s)
Note that it uppercases every character except the two next to the first apostrophe, so it only suits inputs shaped like this one.
Hope this helps :)

Python 3 - Regular Expression - Match string with one character less

I want to write a regex that also matches a word that is one character shorter than the pattern expects. For example:
import re
wordList = ['inherit', 'inherent']
for word in wordList:
    if re.match('^inhe....', word):
        print(word)
In theory it would print both inherit and inherent, but I can only get it to print inherent. How can I match a word that is one letter shorter without just removing one of the dots (.)?
(Edited)
For matching only inherent, you could use .{4}:
re.match('^inhe.{4}', word)
Or ....$:
re.match('^inhe....$', word)
A regex may not be the best tool here. If you just want to know whether word Y starts with the first N-1 letters of word X, do this:
if Y.startswith( X[:-1] ):
    # Do whatever you were trying to do.
    pass
X[:-1] gets all but the last character of X (or the empty string if X is the empty string).
Y.startswith( 'blah' ) returns True if Y starts with 'blah'.
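If the goal is to accept both lengths with a single pattern, one option not shown in the answers above is a bounded quantifier: .{3,4} matches either three or four characters. A minimal sketch, with a made-up word list that includes two words that should not match:

```python
import re

word_list = ["inherit", "inherent", "inhe", "inheritance"]

# "inhe" followed by three or four more characters, and nothing else.
pattern = re.compile(r"^inhe.{3,4}$")

matched = [w for w in word_list if pattern.match(w)]
print(matched)  # ['inherit', 'inherent']
```

The $ anchor matters here: without it, longer words such as inheritance would also match, because only their first seven or eight characters would be checked.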

Find a lowercase letter surrounded by three uppercase letters

I have a string with a mix of uppercase and lowercase letters. I want to find every lowercase letter that is surrounded by 3 uppercase letters on each side and extract it from the string.
For instance, given ZZZaZZZ, I want to extract the a.
I have written a script that is able to extract ZZZaZZZ but not the a alone. I know I need to use nested regex expressions to do this, but I cannot wrap my mind around how to implement it. The following is what I have:
import string, re
if __name__ == "__main__":
    #open the file
    eqfile = open("string.txt")
    gibberish = eqfile.read()
    eqfile.close()
    r = re.compile("[A-Z]{3}[a-z][A-Z]{3}")
    print(r.findall(gibberish))
EDIT:
Thanks for the answers guys! I guess I should have been more specific. I need to find the lowercase letter that is surrounded by three uppercase letters that are exactly the same, such as in my example ZZZaZZZ.
You are so close! Read about the .group* methods of match objects. For example, if your script ended with
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
print(r.search(gibberish).group(1))
then you'd capture the desired character inside the first group. (search is used rather than match here because match only succeeds at the very start of the string.)
To address the new constraint of matching repeated letters, you can use backreferences:
r = re.compile(r'([A-Z])\1{2}(?P<middle>[a-z])\1{3}')
m = r.search(gibberish)
if m is not None:
    print(m.group('middle'))
That reads like:
Match a letter A-Z and remember it.
Match two occurrences of the first letter found.
Match your lowercase letter and store it in the group named middle.
Match three more consecutive instances of the first letter found.
If a match was found, print the value of the middle group.
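The steps above find one occurrence. To collect every such letter in a longer string, the same backreference pattern can be used with finditer. This is a sketch with an invented sample string and invented names:

```python
import re

# Same uppercase letter three times, a lowercase letter, then the same
# uppercase letter three more times.
SAME_TRIPLE = re.compile(r"([A-Z])\1{2}(?P<middle>[a-z])\1{3}")

def middles(text):
    return [m.group("middle") for m in SAME_TRIPLE.finditer(text)]

print(middles("xxZZZaZZZyyQQQbQQQ ABCdEFG"))  # ['a', 'b']
```

The d in ABCdEFG is skipped because its neighbours are uppercase but not all the same letter.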
r = re.compile("(?<=[A-Z]{3})[a-z](?=[A-Z]{3})")
(?<=...) indicates a positive lookbehind and (?=...) is a positive lookahead.
From the documentation of the re module:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
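Because lookarounds do not consume characters, they can also find letters whose uppercase neighbours are shared with another match. The short comparison below, on a made-up input, shows the difference from a pattern that consumes the surrounding letters:

```python
import re

text = "AAAbCCCdEEE"

# Consuming pattern: CCC is used up by the first match, so "d" is missed.
consuming = re.findall(r"[A-Z]{3}([a-z])[A-Z]{3}", text)

# Lookaround pattern: the neighbours are only checked, never consumed.
lookaround = re.findall(r"(?<=[A-Z]{3})[a-z](?=[A-Z]{3})", text)

print(consuming)   # ['b']
print(lookaround)  # ['b', 'd']
```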
You need to capture the part of the string you are interested in with parentheses, and then access it with the match object's group() method. Use search rather than match unless the pattern must start at the beginning of the string:
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
m = r.search(gibberish)
if m:
    print("Match! Middle letter was " + m.group(1))
else:
    print("No match.")
