Regex divide with upper-case - python

I would like to turn strings like 'HDMWhoSomeThing' into 'HDM Who Some Thing' with a regex.
So I would like to extract words which start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word Who, and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regexes similar to [A-Z][a-z]+ but without success. [A-Z][a-z]+ gives me 'Who Some Thing', without 'HDM' of course.
Any ideas?
Thanks,
Rukki

#! /usr/bin/env python
import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while chunks:
    buf = chunks.popleft()
    if len(buf) == 0:
        continue
    # A lone capital starts a new word: glue the following chunk onto it.
    if re.match(r'^[A-Z]$', buf) and chunks:
        buf += chunks.popleft()
    result.append(buf)
print(' '.join(result))
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))
Output:
HDM Who Some MONKEY Thing X

Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
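A minimal Python sketch of both suggestions, assuming Python 3.7 or later (earlier versions do not split on empty matches) and using only the example string from the question. In Python the replacement " $1" is written r' \1'.

```python
import re

s = 'HDMWhoSomeThing'

# Split at every position that is followed by an upper-case letter and then
# a lower-case letter. Empty strings (possible when the input itself starts
# with such a word) are dropped.
words = [w for w in re.split(r'(?=[A-Z][a-z])', s) if w]
print(words)

# Substitution variant: put a space before every upper-case letter that is
# not followed by another upper-case letter, then trim the edges.
spaced = re.sub(r'([A-Z])(?![A-Z])', r' \1', s).strip()
print(spaced)
```

Note that the substitution variant also treats an upper-case letter at the very end of the string as the start of a word, so an input ending in an all-caps word would need extra handling.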

One-liner:
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using the regex
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))

So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
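The pattern above can be tried with re.findall. This is only a sketch: the helper name split_caps is made up for illustration, and the outer capturing group is dropped because findall already returns the whole match when the pattern has no groups.

```python
import re

# All-caps run not followed by a lower-case letter, or one capital
# followed by any number of lower-case letters.
pattern = re.compile(r'[A-Z]+(?![a-z])|[A-Z][a-z]*')

def split_caps(text):
    # Hypothetical helper: join the words found by the pattern with spaces.
    return ' '.join(pattern.findall(text))

print(split_caps('HDMWhoSomeThing'))
print(split_caps('HDMWhoSomeMONKEYThingXYZ'))
```

With these two inputs the sketch prints 'HDM Who Some Thing' and 'HDM Who Some MONKEY Thing XYZ'.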

Maybe '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re
def find_stuff(text):
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    m = p.findall(text)
    result = ''
    for x in m:
        result += x + ' '
    print(result)
find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing

Related

find and replace the word on a string based on the condition

Hello, helpful people at Stack Overflow,
I have a couple of questions about manipulating a string in Python.
First question:
If I have a string like:
'What's the use?'
and I want to locate the first letter after 'the'
(in "What's the use?" that letter is u),
how could I do it in the best way possible?
Second question:
If I want to change something in this string based on the first letter I found in the first question,
how could I do it?
Thanks for helping!
You could use a regex replacement to remove all content up to and including the first 'the' (along with any following whitespace). Then, just access the first character of that output.
import re
inp = "What's the use?"
inp = re.sub(r'^.*?\bthe\b\s*', '', inp)
print("First character after first 'the' is: " + inp[0])
This prints:
First character after first 'the' is: u
Another re take:
import re
sample = "What is the use?"
pattern = r"""
(?<=\bthe\b) # look-behind to ensure 'the' is there. This is non-capturing.
\s+ # one or more whitespace characters
(\w) # Only one alphanumeric or underscore character
"""
# re.X is for verbose, which handles multi-line patterns
m = re.search(pattern, sample, flags=re.X)
# re.search returns None when nothing matches, so check before reading the group
if m is not None:
    print(f"First character after first 'the' is: {m.group(1)}")
You can find the index of 'u' by using the str.index() method. Then you can extract string before and after using slice operation.
s = "What's the use?"
character_index = s.lower().index('the ') + 4
print(character_index)
# 11
print(s[:character_index] + '*' + s[character_index+1:])
# What's the *se?
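For the second question (changing the string depending on which letter was found), one possible sketch passes a callback to re.sub. The rule used here, masking the letter only when it is a vowel, and the function name mask_vowel_after_the are invented purely for illustration; any other condition could go in the callback.

```python
import re

def mask_vowel_after_the(text):
    # Hypothetical rule: replace the first letter after the first 'the'
    # with '*' only when that letter is a vowel.
    def repl(m):
        prefix, letter = m.group(1), m.group(2)
        return prefix + ('*' if letter.lower() in 'aeiou' else letter)
    # count=1 restricts the change to the first occurrence of 'the'.
    return re.sub(r"(\bthe\b\s+)(\w)", repl, text, count=1)

print(mask_vowel_after_the("What's the use?"))
print(mask_vowel_after_the("What's the point?"))
```

The first call prints "What's the *se?" and the second leaves its input unchanged, because 'p' is not a vowel.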

how to change ith letter of a word in capital letter in python?

I want to change the second-to-last letter of each word to a capital letter, but when my sentence contains a one-letter word the program raises an error (IndexError: string index out of range). Here is my code. It works when every word has more than one letter: if I write, for example, str="Python is best programming language", it works because there is no one-letter word.
str ="I Like Studying Python Programming"
array1=str.split()
result =[]
for i in array1:
    result.append(i[:-2].lower()+i[-2].upper()+i[-1].lower())
print(" ".join(result))
Your problem is quite amenable to using regular expressions, so I would recommend that here:
import re
str = "I Like Studying Python Programming"
output = re.sub(r'(\w)(?=\w\b)', lambda m: m.group(1).upper(), str)
print(output)
This prints:
I LiKe StudyiNg PythOn ProgrammiNg
Note that this approach will not target any single-letter words, since their only letter is not followed by another word character.
Another option using a regex is to narrow the match to only the characters that should be uppercased: use the negated character class [^\W_\d] to match a word character other than a digit or underscore, followed by a single non-whitespace character that ends the word.
This will, for example, uppercase a) to A) but will not match 3 in 3d.
Explanation
[^\W_\d](?=\S(?!\S))
[^\W_\d] Match a word char except _ or a digit
(?= Positive lookahead, assert that what is directly to the right is
\S(?!\S) a non-whitespace char followed by a whitespace boundary
) Close lookahead
Example
import re
regex = r"[^\W_\d](?=\S(?!\S))"
s = ("I Like Studying Python Programming\n\n"
"a) This is a test with 3d\n")
output = re.sub(regex, lambda m: m.group(0).upper(), s)
print(output)
Output
I LiKe StudyiNg PythOn ProgrammiNg
A) ThIs Is a teSt wiTh 3d
Using the PyPI regex module, you could also use \p{Ll} to match a lowercase letter that has an uppercase variant.
\p{Ll}(?=\S(?!\S))
Simply check whether the length of each word is greater than one; only then convert the second-to-last letter to uppercase and append it to the result. If the length of the word is one, append the word as it is.
Here is the code:
str ="I Like Studying Python Programming"
array1=str.split()
result =[]
for i in array1:
    if len(i) > 1:
        result.append(i[:-2].lower()+i[-2].upper()+i[-1].lower())
    else:
        result.append(i)
print(" ".join(result))

Find a match word and printing letters by their sides

I am trying to find a word on a string, match it with a query word, and then print them with some of their neighboring letters, like this:
input = aaxxYYxxaa
match = YY
requested_output = xxYYxx
So far I have tried the re module, but I cannot get beyond the ‘match’ part:
import re
teststring = "aaxxYYxxaa"
word = re.findall (r"YY", teststring)
print(word)
output = YY
What could I do here to print the letters on each side of the ‘YY’ word?
Thank you.
It looks as if you want to match any 0 to 2 chars before and after the YY value. Add .{0,2} on both sides of the pattern:
re.findall(r".{0,2}YY.{0,2}", teststring)
A complete example:
import re
teststring = "aaxxYYxxaa"
word = re.findall (r".{0,2}YY.{0,2}", teststring)
print(word) # => ['xxYYxx']
You would write your regex so that it matches arbitrary characters before and after your known search term.
. matches any character (except a newline by default)
{m,n} repeats at least m times and at most n times
So to match xxYYxx you would write .{2,2}YY.{2,2}, which can be shortened to .{2}YY.{2}
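The difference between the {0,2} and {2,2} quantifiers shows up when the search term sits near the edge of the string. The sketch below wraps both variants in one function. The name with_context and its parameters are assumptions made for this illustration, and re.escape is added so that the search term is treated literally.

```python
import re

def with_context(text, word, width=2, exact=False):
    # Hypothetical helper. exact=False allows 0..width neighbouring
    # characters on each side; exact=True demands exactly `width` of them.
    low = width if exact else 0
    pattern = r".{%d,%d}%s.{%d,%d}" % (low, width, re.escape(word), low, width)
    return re.findall(pattern, text)

print(with_context("aaxxYYxxaa", "YY"))           # ['xxYYxx']
print(with_context("aYYxx", "YY"))                # ['aYYxx']
print(with_context("aYYxx", "YY", exact=True))    # []
```

The last call finds nothing because only one character precedes YY, which is why the {0,2} form is the more forgiving choice.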

Lowercase letter after certain character?

I like some aspects of how string.capwords() behaves, and some aspects of how .title() behaves, but neither does everything I need.
I need abbreviations capitalized, which .title() does, but not string.capwords(), and string.capwords() does not capitalize letters after single quotes, so I need a combination of the two. I want to use .title(), and then I need to lowercase the single letter after an apostrophe only if there are no spaces between.
For example, here's a user's input:
string="it's e.t.!"
And I want to convert it to:
>>> "It's E.T.!"
.title() would capitalize the 's', and string.capwords() would not capitalize the "e.t.".
You can use regular expression substitution (See re.sub):
>>> s = "it's e.t.!"
>>> import re
>>> re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), s)
"It's E.T.!"
[a-z] matches a lowercase ASCII letter, but not one directly after ' (that is what (?<!'), a negative look-behind assertion, enforces). The letter must also come right after a word boundary, so the t in "it's" is not matched.
The second argument to re.sub, the lambda, returns the substitution string (the upper-case version of the letter), which is used as the replacement.
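To see the three behaviours side by side, here is a small comparison using the question's input. It is only a sketch; the variable names are arbitrary.

```python
import re
import string

s = "it's e.t.!"

titled = s.title()               # capitalizes after every non-letter
capworded = string.capwords(s)   # capitalizes only whitespace-separated words
fixed = re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), s)

print(titled)      # It'S E.T.!
print(capworded)   # It's E.t.!
print(fixed)       # It's E.T.!
```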
a = ".".join( [word.capitalize() for word in "it's e.t.!".split(".")] )
b = " ".join( [word.capitalize() for word in a.split(" ")] )
print(b)
Edited to use the capitalize function instead. Now it's starting to look like something usable :). But this solution doesn't work with other whitespace characters. For that I would go with falsetru's solution.
If you don't want to use a regex, you can always use this simple for loop:
s = "it's e.t.!"
capital_s = ''
pos_quote = s.index("'")
# Upper-case every character except the ones next to the apostrophe
for pos, alpha in enumerate(s):
    if pos not in [pos_quote-1, pos_quote+1]:
        alpha = alpha.upper()
    capital_s += alpha
print(capital_s)
hope this helps :)

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
What this means is: match one or more word characters (letters, digits or underscore), as indicated by the '+', then a '-', followed by one or more word characters again.
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
Quoting the question: "a '-' (minus sign) in it but not at the beginning and not at the end of the word"
Since "-" is not a word character, you can't use word boundaries (\b) to keep words with hyphens at the beginning or end from matching. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
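A quick check of that claim, as a sketch with a made-up sample string: both simple patterns pick up the inner part of "-not-wanted-".

```python
import re

sample = "-not-wanted- semi-column"

with_boundaries = re.findall(r"\b\w+-\w+\b", sample)
without_boundaries = re.findall(r"\w+-\w+", sample)

# Both lists contain 'not-wanted', which is the unwanted match.
print(with_boundaries)
print(without_boundaries)
```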
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)
# ['semi-column', 'one-word', 'multi-hyphenated-word']
You can also use the following regex:
>>> import re
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this: centering on the hyphen, I match until there is a space in either direction from the hyphen. I also check whether the word is surrounded by hyphens (e.g. -test-cats-), and if it is, I make sure not to include them. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
    print(m.group(1))
