Replace all a in the middle of string by * using regex - python

I want to replace every 'A' in the middle of a string with '*' using a regex in Python.
I tried this:
re.sub(r'[B-Z]+([A]+)[B-Z]+', r'*', 'JAYANTA ')
but it outputs '*ANTA '.
I want it to be 'J*Y*NTA'.
Can someone provide the required code? I would like an explanation of what is wrong in my code if possible.

Use the non-word-boundary assertion \B to make sure that the A's are surrounded by word characters:
import re
text = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
# A+ collapses a run of A's into a single '*'
text = re.sub(r'\BA+\B', r'*', text)
print(text)
Prints:
J*Y*NTA POKED AG*STYA WITH B*MBOO
Alternatively, if you want to require that the A's are surrounded by upper-case letters specifically, use a lookbehind and a lookahead instead:
text = re.sub(r'(?<=[A-Z])A+(?=[A-Z])', r'*', text)
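As for what goes wrong in the original attempt: the [B-Z]+ on either side of the A is part of the match, so re.sub replaces the neighbouring letters together with the A. The Y that was consumed is then no longer available as the left neighbour of the next A. Lookarounds only check the neighbours without consuming them. A small sketch of the difference (the variable names are illustrative only):

```python
import re

text = 'JAYANTA '

# Original pattern: the letters around the A belong to the match itself.
consumed = [m.group(0) for m in re.finditer(r'[B-Z]+([A]+)[B-Z]+', text)]
print(consumed)  # ['JAY']

# So the whole of 'JAY' is replaced, not just the 'A'.
broken = re.sub(r'[B-Z]+([A]+)[B-Z]+', '*', text)
print(repr(broken))  # '*ANTA '

# Lookarounds assert that a letter is present but leave it out of the match.
fixed = re.sub(r'(?<=[A-Z])A(?=[A-Z])', '*', text)
print(repr(fixed))  # 'J*Y*NTA '
```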

>>> re.sub(r'(?!^)[Aa](?!$)','*','JAYANTA')
'J*Y*NTA'
This regex matches an A (or a) that is neither at the start of the string, (?!^), nor at the end of the string, (?!$).
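Note that ^ and $ refer to the whole string here, not to each word, so this pattern behaves differently from \B once the input has several words. A short comparison, assuming a two-word input chosen only for illustration:

```python
import re

text = 'JAYANTA AGASTYA'

# Anchored to the string: only the very first and very last characters are protected.
per_string = re.sub(r'(?!^)[Aa](?!$)', '*', text)
print(per_string)  # J*Y*NT* *G*STYA

# \B protects the first and last letter of every word.
per_word = re.sub(r'\B[Aa]\B', '*', text)
print(per_word)  # J*Y*NTA AG*STYA
```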

Lookahead assertion:
>>> re.sub(r'A(?=[A-Z])', r'*', 'JAYANTA ')
'J*Y*NTA '
If the word may also start or end with 'A', add a lookbehind as well:
>>> re.sub(r'(?<=[A-Z])A(?=[A-Z])', r'*', 'AJAYANTA ')
'AJ*Y*NTA '

Related

How to remove non-specific char of a string/dataframe[i] in Python

In my data cleaning process I found some strings that contain stray single characters that might bias my analysis,
e.g. 'hello please help r me with this s question'.
Until now I have only found tools to remove specific characters, like
char = 's'
def char_remover(text):
    spec_char = ''.join(i for i in text if i not in char)
    return spec_char
or the rsplit() and split() functions, which are good for deleting the first or last character of a string.
In the end, I want to write a function that removes all single characters (whitespace, char, whitespace) from my string/dataframe.
My own thoughts on that question:
def spec_char_remover(text):
    spec_char_rem = ''.join(i for i in text if not len(i) <= 1)
    return spec_char_rem
But that obviously didn't work.
Thanks in advance.
You could use regex:
>>> import re
>>> s = 'hello please help r me with this s question'
>>> re.sub(' . ', ' ', s)
'hello please help me with this question'
"." in regex matches any character except a newline. So " . " matches any such character surrounded by spaces. You could also use "\s.\s" to match any character surrounded by any whitespace.

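A caveat on the ' . ' approach for removing single characters: two adjacent single characters share the space between them, so the pattern skips every other one, and a single character at the very start or end of the string is never matched. A minimal sketch of a token-based alternative, assuming that "single char" means a whitespace-separated token of length one (remove_single_chars is a hypothetical helper name):

```python
import re

def remove_single_chars(text):
    """Drop whitespace-separated tokens that are exactly one character long."""
    # split() with no argument splits on any run of whitespace,
    # so the kept tokens are re-joined with single spaces.
    return ' '.join(token for token in text.split() if len(token) > 1)

s = 'hello please help r me with this s question'
print(remove_single_chars(s))  # hello please help me with this question

# Adjacent single characters share a space, so ' . ' misses some of them.
print(re.sub(' . ', ' ', 'x a b c word'))  # x b word
print(remove_single_chars('x a b c word'))  # word
```

This also removes legitimate one-letter words such as "a" or "I", and it collapses runs of whitespace. For a pandas column, the function could be applied element-wise, for example with Series.apply.
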
Split a string at second occurence of Comma

My string is like below:
Str=S1('amm','string'),S2('amm_sec','string'),S3('amm_','string')
How can I Split the string so that my str_list item becomes:
Str_List[0]=S1('amm','string')
Str_List[1]=S2('amm_sec','string')
...
If I use Str.split(',') then the output is:
Str_List[0]=S1('amm'
...
You can use a regex with the re module in Python:
import re
Str = "S1('amm','string'),S2('amm_sec','string'),S3('amm_','string')"
# raw string, so that \d and \( are passed to the regex engine unchanged
lst = re.findall(r"S\d\(.*?\)", Str)
This will give you:
["S1('amm','string')", "S2('amm_sec','string')", "S3('amm_','string')"]
To explain the regex a little more:
S      first, match a literal 'S'
\d     next, look for a digit
\(     then the '(' character
.*?    then any number of characters in the middle (but as few as possible)
\)     followed by the closing ')' character
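The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace inside the pattern and allows # comments. A sketch using the string from the question (the name item is only illustrative):

```python
import re

Str = "S1('amm','string'),S2('amm_sec','string'),S3('amm_','string')"

item = re.compile(r"""
    S       # a literal 'S'
    \d      # exactly one digit
    \(      # an opening parenthesis
    .*?     # as few characters as possible ...
    \)      # ... up to the first closing parenthesis
""", re.VERBOSE)

lst = item.findall(Str)
print(lst)
```

Because \d matches a single digit, a name such as S10 would not match; \d+ would be needed if the numbering can go past 9.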
My first thought would be to replace ',S' with ' S' using regex and split on spaces.
import re
Str = re.sub(',S',' S',Str)
Str_list = Str.split()
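The replace-then-split idea relies on the values containing no spaces. A variant that avoids the intermediate replacement is to split directly at the commas that follow a closing parenthesis. This is a sketch that assumes the sequence "), " never occurs inside a quoted value:

```python
import re

Str = "S1('amm','string'),S2('amm_sec','string'),S3('amm_','string')"

# Split only at commas that directly follow a ')'.
# The lookbehind keeps the ')' as part of the preceding item.
Str_list = re.split(r"(?<=\)),", Str)
print(Str_list)
```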

regex and python

I have a string:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
I want to remove the "T" character from the timestamp, i.e. change it to:
"123ABC,'2009-12-23 23:45:58.544-04:00'"
I am trying this:
newString = re.sub('(?:\-\d{2})T(?:\d{2}\:)', ' ', myString)
BUT, the returned string is:
"123ABC,'2009-12 45:58.544-04:00'"
The "non capturing groups" don't appear to be "non capturing", and it's removing everything. What am I doing wrong?
You can use lookarounds (positive lookbehind and -ahead):
(?<=\d)T(?=\d)
See a demo on regex101.com.
In Python this would be:
import re
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
rx = r'(?<=\d)T(?=\d)'
# match a T surrounded by digits
new_string = re.sub(rx, ' ', myString)
print(new_string)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
See a demo on ideone.com.
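On the question of why the non-capturing groups seemed to capture: a (?:...) group still consumes the text it matches and so becomes part of what re.sub replaces. It only skips creating a numbered group. Lookarounds, by contrast, check the surrounding text without including it in the match. A small sketch (the names consuming and asserting are illustrative):

```python
import re

myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"

# Non-capturing groups still consume text; they just create no numbered group.
consuming = re.search(r'(?:-\d{2})T(?:\d{2}:)', myString)
print(consuming.group(0))  # -23T23:
print(consuming.groups())  # ()

# Lookarounds check the digits without making them part of the match.
asserting = re.search(r'(?<=\d)T(?=\d)', myString)
print(asserting.group(0))  # T
```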
Regex seems a bit of an overkill, as long as "T" appears nowhere else in the string:
myString.replace("T", " ")
I'd use capturing groups, since unanchored lookbehinds are costly in terms of regex performance:
(\d)T(\d)
Then replace with the r'\1 \2' replacement pattern, whose backreferences put back the digits before and after the T.
Python demo:
import re
s = "123ABC,'2009-12-23T23:45:58.544-04:00'"
reg = re.compile(r'(\d)T(\d)')
s = reg.sub(r'\1 \2', s)
print(s)
That T sits between digits and is always the last T in the string, so you could use rsplit and join:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
Trying this on a string that has another T earlier on:
myString = "123ATC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# 123ATC,'2009-12-23 23:45:58.544-04:00'

Replace if the first two letters are repeating in python

How can I collapse the first two letters of a word into one when they are the same letter?
For instance,
string = 'hhappy'
And I want to get
happy
I tried with
re.sub(r'(.)\1+', r'\1', string)
But, this gives
hapy
Thank you!
You need to add a caret (^) so that the pattern matches only at the start of the string.
re.sub(r'^(.)\1+', r'\1', string)
Example:
import re
string = 'hhappy'
print(re.sub(r'^(.)\1+', r'\1', string))
Prints:
happy
The above works only at the start of the string. If you need this for each word, use a word boundary instead:
re.sub(r'\b(\w)\1+', r'\1', string)
The regex would be
\b(\w)\1+
\b checks for a word boundary.
Check it out here at regex101.
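A short sketch of the word-boundary version on a multi-word input (the sample sentences are made up for illustration). Note that the pattern cannot tell a typo from a word that legitimately starts with a doubled letter:

```python
import re

sentence = 'hhappy bbirthday to yyou'
cleaned = re.sub(r'\b(\w)\1+', r'\1', sentence)
print(cleaned)  # happy birthday to you

# Legitimate doubled initials are collapsed as well.
overcorrected = re.sub(r'\b(\w)\1+', r'\1', 'llama eel')
print(overcorrected)  # lama el
```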
Or you could simply slice:
string = 'hhappy'
# compare one-character slices so strings shorter than two characters don't raise IndexError
func = lambda s: s[1:] if s[:1] == s[1:2] else s
new_string = func(string)
# happy

regex: match all characters between 2 words, returns strange output

in this text:
"IPAddress":"10.0.0.18","PolicerID":"","IPAddress":"","PolicerID":""
I want to capture all IPs; in this example they are 10.0.0.18 and the empty string.
I tried to use this regex:
(?<="IPAddress":")(.*?)(?=")
which returns me 10.0.0.18 and ",
it took the first " from PolicerID instead of the last " in IPAddress.
Can you please help me?
Thanks
You can keep it simple and just use a capturing group:
>>> import re
>>> text = r'"IPAddress":"10.0.0.18","PolicerID":"","IPAddress":"","PolicerID":""'
>>> print(re.findall(r'"IPAddress":"([^"]*)', text))
['10.0.0.18', '']
However, if you have to use a lookbehind assertion, then use this regex:
(?<="IPAddress":")([^"]*)
([^"]*) is a negated pattern to match 0 or more of any character that is not a double quote.
If you only want the actual IPs in that text (empty values are skipped), I would suggest this regex:
[0-9]+(?:\.[0-9]+){3}
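The two suggestions return different things for the sample text. The sketch below compares them (the variable names are illustrative). The dotted-quad pattern does not validate octet ranges, so something like 999.999.999.999 would also match:

```python
import re

text = '"IPAddress":"10.0.0.18","PolicerID":"","IPAddress":"","PolicerID":""'

# Capture whatever sits between the quotes after "IPAddress":, including nothing at all.
values = re.findall(r'"IPAddress":"([^"]*)', text)
print(values)  # ['10.0.0.18', '']

# Match only dotted quads; empty fields produce no entry.
addresses = re.findall(r'[0-9]+(?:\.[0-9]+){3}', text)
print(addresses)  # ['10.0.0.18']
```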
