regex and python - python

I have a string:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
I want to remove the "T" character from the timestamp, i.e. change it to:
"123ABC,'2009-12-23 23:45:58.544-04:00'"
I am trying this:
newString = re.sub('(?:\-\d{2})T(?:\d{2}\:)', ' ', myString)
BUT, the returned string is:
"123ABC,'2009-12 45:58.544-04:00'"
The "non capturing groups" don't appear to be "non capturing", and it's removing everything. What am I doing wrong?

You can use lookarounds (positive lookbehind and -ahead):
(?<=\d)T(?=\d)
See a demo on regex101.com.
In Python this would be:
import re
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
rx = r'(?<=\d)T(?=\d)'
# match a T surrounded by digits
new_string = re.sub(rx, ' ', myString)
print(new_string)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
See a demo on ideone.com.
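As for why the original attempt removed too much: a non-capturing group still consumes the text it matches; it only skips assigning a group number. re.sub replaces the whole match, so everything inside those groups disappears as well. A minimal sketch of the difference (the variable names are illustrative, not from the question):

```python
import re

my_string = "123ABC,'2009-12-23T23:45:58.544-04:00'"

# Non-capturing groups are still part of the overall match,
# so re.sub would replace "-23T23:" as a whole.
consumed = re.search(r'(?:-\d{2})T(?:\d{2}:)', my_string).group(0)
print(consumed)  # -23T23:

# Lookarounds only assert what surrounds the T; the match is the T alone.
only_t = re.search(r'(?<=\d)T(?=\d)', my_string).group(0)
print(only_t)  # T

fixed = re.sub(r'(?<=\d)T(?=\d)', ' ', my_string)
print(fixed)  # 123ABC,'2009-12-23 23:45:58.544-04:00'
```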

A regex seems like overkill here:
myString.replace("T", " ")
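One caveat, sketched here under the assumption that other capital T's may appear in the input: str.replace swaps every T, not only the one in the timestamp. The second input is a hypothetical variant with an extra T in the first field.

```python
s1 = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s2 = "123ATC,'2009-12-23T23:45:58.544-04:00'"  # extra T in the first field

r1 = s1.replace("T", " ")
r2 = s2.replace("T", " ")

print(r1)  # 123ABC,'2009-12-23 23:45:58.544-04:00'
print(r2)  # 123A C,'2009-12-23 23:45:58.544-04:00'
```

If the first field can never contain a T, plain replace is the simplest option.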

I'd use capturing groups, since unanchored lookbehinds can be costly in terms of regex performance:
(\d)T(\d)
Then replace with r'\1 \2', a replacement pattern containing backreferences to the digits before and after the T. See the regex demo.
Python demo:
import re
s = "123ABC,'2009-12-23T23:45:58.544-04:00'"
reg = re.compile(r'(\d)T(\d)')
s = reg.sub(r'\1 \2', s)
print(s)

That T is trapped between digits and will always be the rightmost T in the string, so you could use rsplit and join:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ABC,'2009-12-23 23:45:58.544-04:00'"
Trying this with an extra T earlier in the string:
myString = "123ATC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ATC,'2009-12-23 23:45:58.544-04:00'"

Related

how to skip backslash followed by integer?

I have this regex: https://regex101.com/r/2H5ew6/1
(\!|\#)(1)
Hello!1 World
I want to keep the first mark (! or #) and change the number 1 to another number, 2.
I tried
{\1}2_
\1\\2_
but it adds extra text, and I just want to change the number.
I expect the result to be
Hello!2_World
and, if using #, to be
Hello#2_World
Match and capture either ! or # in a named capture group, here called char, if followed by one or more digits and a whitespace:
(?P<char>[!#])\d+\s
Substitute with the named capture, \g<char> followed by 2_:
\g<char>2_
Demo
If you only want the substitution if there's a 1 following either ! or #, replace \d+ with 1.
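A runnable sketch of the named-group substitution described above. The group name char comes from the answer; the variable names and the extra test strings are illustrative.

```python
import re

# Capture ! or # as "char", then consume the digits and one whitespace character.
pattern = re.compile(r'(?P<char>[!#])\d+\s')

result_bang = pattern.sub(r'\g<char>2_', "Hello!1 World")
result_hash = pattern.sub(r'\g<char>2_', "Hello#1 World")
result_multi = pattern.sub(r'\g<char>2_', "Hello!12 World")  # \d+ also swallows "12"

print(result_bang)   # Hello!2_World
print(result_hash)   # Hello#2_World
print(result_multi)  # Hello!2_World
```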
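If you want the replacement to be just 2_ instead of {\1}2_, the pattern must not consume the ! or #. Put the mark in a lookbehind and match the whitespace after the 1:
import re
string = "Hello!1 World"
pattern = r"(?<=[!#])1\s"
replacement = "2_"
result = re.sub(pattern, replacement, string)
# Hello!2_World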
Why not: string.replace('!1 ', '!2_').replace('#1 ', '#2_') ?
>>> string = "Hello!1 World"
>>> repl = lambda s: s.replace('!1 ', '!2_').replace('#1 ', '#2_')
>>> string2 = repl(string)
>>> string2
'Hello!2_World'
>>> string = "Hello!12 World"
>>> string2 = repl(string)
>>> string2
'Hello!12 World'
The replacement for your pattern should be \g<1>2_
Regex demo
You could also shorten your pattern to a single capture with a character class [!#] and a match and use the same replacement as above.
([!#])1
Regex demo
Or with a lookbehind assertion without any groups and replace with 2_
(?<=[!#])1
Regex demo
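The \g<1> form matters because a backreference written as \1 and followed directly by the digit 2 is read as a reference to group 12. A small sketch of that, plus the lookbehind variant. Note that the trailing \s in these patterns is an addition made here so that the space is consumed and the output matches the expected Hello!2_World.

```python
import re

text = "Hello!1 World"

# \g<1> ends the group reference explicitly, so "2_" is literal text.
ok = re.sub(r'([!#])1\s', r'\g<1>2_', text)
print(ok)  # Hello!2_World

# r'\12_' is parsed as "group 12", which does not exist in this pattern.
try:
    re.sub(r'([!#])1\s', r'\12_', text)
    error = None
except re.error as exc:
    error = exc
print(error)

# Lookbehind variant: the mark is not consumed, so no group is needed.
lookbehind = re.sub(r'(?<=[!#])1\s', '2_', text)
print(lookbehind)  # Hello!2_World
```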

Replace all a in the middle of string by * using regex

I want to replace every 'A' in the middle of a string with '*' using a regex in Python.
I tried this:
re.sub(r'[B-Z]+([A]+)[B-Z]+', r'*', 'JAYANTA ')
but it outputs '*ANTA '.
I want it to be 'J*Y*NTA'.
Can someone provide the required code? I would like an explanation of what is wrong in my code if possible.
Use the non-word-boundary \B to make sure that the A's are surrounded by word characters:
import re
s = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
s = re.sub(r'\BA+\B', r'*', s)
print(s)
Prints:
J*Y*NTA POKED AG*STYA WITH B*MBOO
Alternatively, if you want to be more specific and require that the A's are surrounded by upper-case letters, you can use a lookbehind and a lookahead instead:
s = re.sub(r'(?<=[A-Z])A+(?=[A-Z])', r'*', s)
>>> re.sub(r'(?!^)[Aa](?!$)','*','JAYANTA')
'J*Y*NTA'
My regex searches for an A (or a) that is neither at the start of the string, (?!^), nor at the end of the string, (?!$).
Lookahead assertion:
>>> re.sub(r'A(?=[A-Z])', r'*', 'JAYANTA ')
'J*Y*NTA '
In case the word starts and ends with 'A':
>>> re.sub(r'(?<=[A-Z])A(?=[A-Z])', r'*', 'AJAYANTA ')
'AJ*Y*NTA '
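On what went wrong in the original attempt: the [B-Z]+ parts consume the neighbouring letters, and re.sub replaces the whole match rather than only the captured A. A short sketch that makes the match visible (the variable names are illustrative):

```python
import re

text = 'JAYANTA '

# The whole match is "JAY": J, then A, then Y. All three characters are replaced.
matches = [m.group(0) for m in re.finditer(r'[B-Z]+([A]+)[B-Z]+', text)]
print(matches)  # ['JAY']

# After "JAY" is consumed, the next A has no unconsumed letter before it,
# and the final A has no letter after it, so nothing else matches.
attempt = re.sub(r'[B-Z]+([A]+)[B-Z]+', '*', text)
print(repr(attempt))  # '*ANTA '

# Lookarounds check the neighbours without consuming them.
fixed = re.sub(r'(?<=[A-Z])A+(?=[A-Z])', '*', text)
print(repr(fixed))  # 'J*Y*NTA '
```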

Replace if the first two letters are repeating in python

How can I collapse the first two letters of a word into one when they are the same letter?
For instance,
string = 'hhappy'
And I want to get
happy
I tried with
re.sub(r'(.)\1+', r'\1', string)
But, this gives
hapy
Thank you!
You need to add a caret (^) to match only the start of the line.
re.sub(r'^(.)\1+', r'\1', string)
Example:
import re
string = 'hhappy'
print(re.sub(r'^(.)\1+', r'\1', string))
Prints:
happy
The above works only for the start of the line. If you need this for each word you need to do this:
re.sub(r'\b(\w)\1+', r'\1', string)
The regex would be
\b(\w)\1+
\b checks for a word boundary.
Check it out here at regex101.
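A side-by-side sketch of the three patterns on a two-word input. The sample string is made up for illustration.

```python
import re

text = 'hhappy bbirthday'

# No anchor: every run of repeated characters is collapsed, including "pp".
no_anchor = re.sub(r'(.)\1+', r'\1', text)
print(no_anchor)   # hapy birthday

# ^ anchor: only a run at the very start of the string is collapsed.
start_only = re.sub(r'^(.)\1+', r'\1', text)
print(start_only)  # happy bbirthday

# \b anchor: a run at the start of each word is collapsed.
per_word = re.sub(r'\b(\w)\1+', r'\1', text)
print(per_word)    # happy birthday
```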
Or you could simply slice:
string = 'hhappy'
# slices avoid an IndexError on strings shorter than two characters
func = lambda s: s[1:] if s[:1] == s[1:2] else s
new_string = func(string)
# happy

Python regex extracting substrings containing numbers and letters

I am attempting to extract a substring that contains numbers and letters:
string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
I only want 11m56.95s and 13m31.14s
I have tried doing this:
re.findall('\d+', string)
That doesn't give me what I want. I also tried this:
re.findall('\d{2}[m]+\d[.]+\d|\+)
That did not work either. Any other suggestions?
Try this:
re.findall(r"[0-9]{2}m[0-9]{2}\.[0-9]{2}s", string)
Output:
['11m56.95s', '13m31.14s']
Your current regular expression does not match what you expect it to.
You could use the following regular expression to extract those substrings.
re.findall(r'\d+m\d+\.\d+s', string)
Live Demo
Example:
>>> import re
>>> s = 'LINE : 11m56.95s CPU 13m31.14s TODAY'
>>> for x in re.findall(r'\d+m\d+\.\d+s', s):
...     print(x)
...
11m56.95s
13m31.14s
Your Regex pattern is not formed correctly. It is currently matching:
\d{2} # Two digits
[m]+ # One or more m characters
\d # A digit
[.]+ # One or more . characters
\d|\+ # A digit or +
Instead, you should use:
>>> import re
>>> string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
>>> re.findall(r'\d+m\d+\.\d+s', string)
['11m56.95s', '13m31.14s']
>>>
Below is an explanation of what the new pattern matches:
\d+ # One or more digits
m # m
\d+ # One or more digits
\. # .
\d+ # One or more digits
s # s
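The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is a sketch; reading the fields as minutes and seconds is an assumption about the data, and the name duration is illustrative.

```python
import re

duration = re.compile(r"""
    \d+   # one or more digits (assumed: minutes)
    m     # literal m
    \d+   # one or more digits (assumed: whole seconds)
    \.    # literal dot
    \d+   # one or more digits (assumed: fractional seconds)
    s     # literal s
""", re.VERBOSE)

string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
found = duration.findall(string)
print(found)  # ['11m56.95s', '13m31.14s']
```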
Another option is the pattern \b\d+.*?s\b, which breaks down as:
\b #word boundary
\d+ #starts with digit
.*? #anything (non-greedy so its the smallest possible match)
s #ends with s
\b #word boundary
If your lines are all like your example split will work:
s = "LINE : 11m56.95s CPU 13m31.14s TODAY"
spl = s.split()
a,b = spl[2],spl[4]
print(a,b)
('11m56.95s', '13m31.14s')

Unicode Substitutions using Regex , Python

I have a string as follows:
str1 = "heylisten\uff08there is something\uff09to say \uffa9"
I need to replace the unicode values detected by my regex with the same values surrounded by a space on either side.
Desired output string:
out = "heylisten \uff08 there is something \uff09 to say \uffa9 "
I have used an re.findall to get all the matches and then replace them. It looks like:
p1 = re.findall(r'\uff[0-9a-e][0-9]', str1, flags = re.U)
out = str1
for item in p1:
    print item
    print out
    out = re.sub(item, r" " + item + r" ", out)
And this outputs:
'heylisten\\ uff08 there is something\\ uff09 to say \\ uffa9 '
What is wrong with the above that it prints an extra "\" and also separates it from uff? I even tried with re.search but it seems to only separate \uff08. Is there a better way?
I have a string as follows:
str1 = "heylisten\uff08there is something\uff09to say \uffa9"
I need to replace the unicode values ...
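You don't have any unicode values. You have a bytestring: in Python 2, a plain "..." literal does not interpret \uXXXX escapes. Use a unicode literal instead: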
str1 = u"heylisten\uff08there is something\uff09to say \uffa9"
...
p1 = re.sub(ur'([\uff00-\uffe9])', r' \1 ', str1)
print re.sub(r"(\\uff[0-9a-e][0-9])", r" \1 ", x)
You can directly use this re.sub. See demo.
http://regex101.com/r/sU3fA2/67
import re
p = re.compile(r'(\\uff[0-9a-e][0-9])')
# raw string: the text holds literal backslash-u sequences, as in the question's byte string
test_str = r"heylisten\uff08there is something\uff09to say \uffa9"
subst = r" \1 "
result = re.sub(p, subst, test_str)
Output:
heylisten \uff08 there is something \uff09 to say \uffa9
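For comparison, a sketch of the character-range approach in Python 3, where every str is Unicode and the re module itself understands \uXXXX escapes in patterns. The range \uff00-\uffe9 is taken from the answer above; whether it covers every character you care about is an assumption to check against your data.

```python
import re

str1 = "heylisten\uff08there is something\uff09to say \uffa9"

# Match any single character in the range and surround it with spaces.
out = re.sub(r'([\uff00-\uffe9])', r' \1 ', str1)
print(out)
```

Because the input already has a space before the last character, that spot ends up with two spaces; collapse runs of whitespace afterwards if that matters.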
