How do I replace a repeated first letter of a word with a single letter?
For instance,
string = 'hhappy'
And I want to get
happy
I tried with
re.sub(r'(.)\1+', r'\1', string)
But, this gives
hapy
Thank you!
You need to add a caret (^) so the pattern matches only at the start of the string.
re.sub(r'^(.)\1+', r'\1', string)
Example:
import re
string = 'hhappy'
print(re.sub(r'^(.)\1+', r'\1', string))
Prints:
happy
The above works only at the start of the string. If you need this for each word, use:
re.sub(r'\b(\w)\1+', r'\1', string)
The regex would be
\b(\w)\1+
\b checks for a word boundary.
Check it out here at regex101.
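As a rough sketch of the word-boundary version applied to a whole sentence (the function name collapse_leading_repeats and the sample sentence are made up for illustration):

```python
import re

def collapse_leading_repeats(text):
    # \b anchors at the start of a word, (\w) captures its first character,
    # and \1+ consumes one or more immediate repeats of that character.
    return re.sub(r'\b(\w)\1+', r'\1', text)

print(collapse_leading_repeats('hhappy bbirthday, ddear happy'))
# happy birthday, dear happy
```

The doubled "pp" inside "happy" is left alone because there is no word boundary in front of it.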
Or you could simply slice:
string = 'hhappy'
func = lambda s: s[1:] if len(s) > 1 and s[0] == s[1] else s
new_string = func(string)
# happy
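Note that the slice drops exactly one duplicate, whereas the regex with \1+ drops all of them. A minimal sketch that handles any number of leading repeats and also tolerates empty strings, assuming that is the behaviour you want (strip_leading_repeats is a hypothetical name):

```python
def strip_leading_repeats(s):
    # An empty string has no first character to compare against.
    if not s:
        return s
    first = s[0]
    # lstrip(first) removes every leading copy of the first character,
    # so exactly one copy is put back in front.
    return first + s.lstrip(first)

print(strip_leading_repeats('hhappy'))   # happy
print(strip_leading_repeats('hhhappy'))  # happy
```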
In my data cleaning process I found some strings that contain stray single characters, which might bias my analysis,
e.g. 'hello please help r me with this s question'.
Until now I have only found tools to remove specific chars, like
char = 's'
def char_remover(text):
    # keep every character except the one stored in char
    spec_char = ''.join(i for i in text if i not in char)
    return spec_char
or the rsplit() and split() functions, which are good for deleting the first or last char of a string.
In the end, I want to code a function that removes all single chars (whitespace char whitespace) from my string/dataframe.
My own thoughts on that question:
def spec_char_remover(text):
    spec_char_rem= ''.join(i for i in text if i not len(i) <= 1)
    return spec_char_rem
But that obviously didn't work.
Thanks in advance.
You could use regex:
>>> import re
>>> s = 'hello please help r me with this s question'
>>> re.sub(' . ', ' ', s)
'hello please help me with this question'
"." in regex matches any character. So " . " matches any character surrounded by spaces. You could also use "\s.\s" to match any character surrounded by any whitespace.
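One caveat: because " . " consumes the spaces on both sides, it skips every second character in a run like "a b c", and it misses a single character at the very start or end of the string. A sketch using lookarounds instead, assuming any single non-whitespace token should go (remove_single_chars is a hypothetical name; this also drops real one-letter words such as "a" or "I"):

```python
import re

def remove_single_chars(text):
    # (?<!\S) and (?!\S) require whitespace or a string edge on each side
    # without consuming it, so neighbouring single characters all match.
    without_singles = re.sub(r'(?<!\S)\S(?!\S)', '', text)
    # Collapse the whitespace runs that the removals leave behind.
    return ' '.join(without_singles.split())

print(remove_single_chars('hello please help r me with this s question'))
# hello please help me with this question
print(remove_single_chars('a b c word d'))
# word
```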
I'm relatively new to Python and regex, and I want to check whether a string's first and last characters are the same.
If the first and last characters are the same, then return 'True' (Ex: 'aba')
If the first and last characters are not the same, then return 'False' (Ex: 'ab')
Below is the code I've written:
import re
string = 'aba'
pattern = re.compile(r'^/w./1w$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
But from the above code, I don't see any output
If and only if you really want to use a regex (for learning purposes):
import re
string = 'aba'
string2 = 'no match'
pattern = re.compile(r'^(.).*\1$')
if re.match(pattern, string):
    print('ok')
else:
    print('nok')
if re.match(pattern, string2):
    print('ok')
else:
    print('nok')
output:
ok
nok
Explanations:
^(.).*\1$
^ start of line anchor
(.) match the first character of the line and store it in a group
.* match any character (except a newline), zero or more times
\1 backreference to the first group, in this case the first character to impose that the first char and the last one are equal
$ end of line anchor
Demo: https://regex101.com/r/DaOPEl/1/
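One thing to be aware of: ^(.).*\1$ needs at least two characters, so a one-character string such as 'a' does not match. If that case should count as "same first and last", a variant along these lines might do. This is only a sketch; same_ends_regex, the optional group and the DOTALL flag are additions for illustration, not part of the pattern above:

```python
import re

# Assumption: a one-character string counts as a match.
# The optional group lets the ".*\1" tail be skipped in that case;
# DOTALL lets "." also cross newlines.
SAME_ENDS = re.compile(r'(.)(?:.*\1)?', re.DOTALL)

def same_ends_regex(s):
    # fullmatch anchors at both ends, so ^ and $ are not needed.
    return SAME_ENDS.fullmatch(s) is not None

for sample in ('aba', 'ab', 'a', ''):
    print(repr(sample), same_ends_regex(sample))
```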
Otherwise the best approach is to simply use the comparison string[0] == string[-1]
string = 'aba'
if string[0] == string[-1]:
    print('same')
output:
same
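Indexing raises IndexError on an empty string, so if empty input is possible a small guard helps. A sketch (same_ends is a hypothetical helper name, and treating '' as "not the same" is an assumption):

```python
def same_ends(s):
    # bool(s) is False for '', which short-circuits before s[0] is evaluated.
    return bool(s) and s[0] == s[-1]

print(same_ends('aba'))  # True
print(same_ends('ab'))   # False
print(same_ends(''))     # False
```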
Why overengineer this with a regex at all? One principle of programming should be keeping it simple, like:
string[0] == string[-1]
Or is there a need for regex?
The above answer by @Tobias is perfect and simple, but if you want a solution using regex, try the code below.
Code:
import re
string = 'abbaaaa'
pattern = re.compile(r'^(.).*\1$')
matches = pattern.finditer(string)
for match in matches:
    print(match)
Output :
<_sre.SRE_Match object; span=(0, 7), match='abbaaaa'>
I think this is the regex you are trying to execute:
Code:
import re
string = 'aba'
pattern = re.compile(r'^(\w).(\1)$')
matches = pattern.finditer(string)
for match in matches:
    print(match.group(0))
Output:
aba
If you want to check with a regex, use the code below:
import re
string = 'aba is a cowa'
pat = r'^(.).*\1$'
if re.findall(pat, string):
    print(string)
This matches when the first and last characters of the string are the same. In that case findall returns the matching character and the string is printed; otherwise nothing is printed.
I want to replace every 'A' in the middle of a string with '*' using regex in Python.
I tried this
re.sub(r'[B-Z]+([A]+)[B-Z]+', r'*', 'JAYANTA ')
but it outputs - '*ANTA '
I would want it to be 'J*Y*NTA'
Can someone provide the required code? I would like an explanation of what is wrong in my code if possible.
Using the non-wordboundary \B.
To make sure that the A's are surrounded by word characters:
import re
text = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
text = re.sub(r'\BA+\B', r'*', text)
print(text)
Prints:
J*Y*NTA POKED AG*STYA WITH B*MBOO
Alternatively, if you want to be more specific and require that the A's be surrounded by upper-case letters, you can use a lookbehind and a lookahead instead.
text = re.sub(r'(?<=[A-Z])A+(?=[A-Z])', r'*', text)
>>> re.sub(r'(?!^)[Aa](?!$)','*','JAYANTA')
'J*Y*NTA'
My regex searches for an A that is neither at the start of the string, (?!^), nor at the end of the string, (?!$).
Lookahead assertion:
>>> re.sub(r'A(?=[A-Z])', r'*', 'JAYANTA ')
'J*Y*NTA '
In case the word starts and ends with 'A':
>>> re.sub(r'(?<=[A-Z])A(?=[A-Z])', r'*', 'AJAYANTA ')
'AJ*Y*NTA '
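The answers above differ in one detail worth spelling out: A+ replaces a whole run of A's with a single '*', while a bare A replaces each one separately. A small sketch of the two behaviours side by side (the function names are hypothetical):

```python
import re

def mask_inner_a_runs(text):
    # A+ treats a whole run of A's as one match, so it becomes a single '*'.
    return re.sub(r'(?<=[A-Z])A+(?=[A-Z])', '*', text)

def mask_inner_a_each(text):
    # Without +, every A with an upper-case neighbour on both sides gets its own '*'.
    return re.sub(r'(?<=[A-Z])A(?=[A-Z])', '*', text)

print(mask_inner_a_runs('AJAYANTA BAAAAMBOO'))  # AJ*Y*NTA B*MBOO
print(mask_inner_a_each('AJAYANTA BAAAAMBOO'))  # AJ*Y*NTA B****MBOO
```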
I like some aspects of how string.capwords() behaves and some aspects of how .title() behaves, but neither one does everything I need.
I need abbreviations capitalized, which .title() does, but not string.capwords(), and string.capwords() does not capitalize letters after single quotes, so I need a combination of the two. I want to use .title(), and then I need to lowercase the single letter after an apostrophe only if there are no spaces between.
For example, here's a user's input:
string="it's e.t.!"
And I want to convert it to:
>>> "It's E.T.!"
.title() would capitalize the 's', and string.capwords() would not capitalize the "e.t.".
You can use regular expression substitution (See re.sub):
>>> s = "it's e.t.!"
>>> import re
>>> re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), s)
"It's E.T.!"
[a-z] matches a lowercase letter, but not directly after ' ((?<!') is a negative look-behind assertion). The letter must also appear right after a word boundary, so the t in "it" is not matched.
The second argument to re.sub, the lambda, returns the substitution string (the upper-case version of the letter), which is used as the replacement.
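Wrapped in a function, the same substitution can be tried on a few more inputs. This is a sketch; smart_title and the extra sample strings are made up for illustration:

```python
import re

def smart_title(text):
    # Upper-case a lowercase letter that starts a word,
    # unless an apostrophe sits directly in front of it.
    return re.sub(r"\b(?<!')[a-z]", lambda m: m.group().upper(), text)

print(smart_title("it's e.t.!"))     # It's E.T.!
print(smart_title("don't stop"))     # Don't Stop
print(smart_title("'quoted' word"))  # 'quoted' Word
```

The last line shows a limitation: a word that opens with a single quote is left in lowercase, because the look-behind cannot tell an opening quote from an apostrophe.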
a = ".".join([word.capitalize() for word in "it's e.t.!".split(".")])
# upper-case only the first letter here: capitalize() would lower-case the "T" again
b = " ".join([word[:1].upper() + word[1:] for word in a.split(" ")])
print(b)
Edited to use the capitalize function instead. Now it's starting to look like something usable :). But this solution doesn't work with other whitespace characters. For that I would go with falsetru's solution.
If you don't want to use regex, you can always use this simple for loop:
s = "it's e.t.!"
capital_s = ''
pos_quote = s.index("'")
for pos, alpha in enumerate(s):
    # upper-case everything except the letters directly around the apostrophe
    if pos not in [pos_quote-1, pos_quote+1]:
        alpha = alpha.upper()
    capital_s += alpha
print(capital_s)
hope this helps :)
I would like to convert strings like 'HDMWhoSomeThing' to 'HDM Who Some Thing' with regex.
So I would like to extract words which start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word Who and should not be included in the word HDM.
What is the correct regex to achieve this goal? I have tried many regexes similar to [A-Z][a-z]+ but without success. [A-Z][a-z]+ gives me 'Who Some Thing', without 'HDM' of course.
Any ideas?
Thanks,
Rukki
#! /usr/bin/env python
import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
    buf = chunks.popleft()
    if len(buf) == 0:
        continue
    # a lone capital is the first letter of the next chunk, so glue them together
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        buf += chunks.popleft()
    result.append(buf)
print(' '.join(result))
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit with re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
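In Python specifically, both suggestions might look like the sketch below. The helper names are hypothetical, Python writes the group reference as \1 rather than $1, and the zero-width split relies on Python 3.7 or later, where re.split accepts patterns that can match an empty string:

```python
import re

def split_before_capitalized(text):
    # Zero-width split: break wherever an upper-case letter followed by a
    # lower-case letter begins; drop the empty piece a match at index 0 leaves.
    return [part for part in re.split(r'(?=[A-Z][a-z])', text) if part]

def add_spaces(text):
    # Put a space in front of every upper-case letter that is not followed
    # by another upper-case letter, then trim a possible leading space.
    return re.sub(r'([A-Z])(?![A-Z])', r' \1', text).strip()

print(split_before_capitalized('HDMWhoSomeThing'))  # ['HDM', 'Who', 'Some', 'Thing']
print(add_spaces('HDMWhoSomeThing'))                # HDM Who Some Thing
```

Neither pattern splits between a lower-case letter and a following all-caps word, so 'SomeHDMWho' keeps 'SomeHDM' together.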
one liner :
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
One uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
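A quick way to try that pattern is with re.findall. This is a sketch; split_words and the WORDS constant are illustrative names, and the sample strings are taken from elsewhere in this thread:

```python
import re

# All-caps run not followed by a lower-case letter, or one capital plus lower-case letters.
WORDS = re.compile(r'([A-Z]+(?![a-z])|[A-Z][a-z]*)')

def split_words(text):
    # findall returns the text of the single capture group for each match.
    return ' '.join(WORDS.findall(text))

print(split_words('HDMWhoSomeThing'))           # HDM Who Some Thing
print(split_words('SomeHDMWhoThing'))           # Some HDM Who Thing
print(split_words('HDMWhoSomeMONKEYThingXYZ'))  # HDM Who Some MONKEY Thing XYZ
```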
Maybe '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re
def find_stuff(text):
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    m = p.findall(text)
    result = ''
    for x in m:
        result += x + ' '
    print(result)
find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing