Regular expression to replace second occurrence of dot

String is Hello.world.hello. I wanted to replace the second occurrence of the dot with '_'.
str = "Hello. world. Hello!"
x = re.sub(r'^((.){1}).', r'\1_', str)
#x = str.find(str.find('.')
print(x)
The output I am getting is 'H_llo. world. Hello!'. What would be the correct solution?

You can use
import re
text = "Hello. world. Hello!"
print( re.sub(r'^([^.]*\.[^.]*)\.', r'\1_', text) )
# => Hello. world_ Hello!
See the Python demo and the regex demo.
Details:
^ - start of string
([^.]*\.[^.]*) - Group 1: any zero or more chars other than a ., a dot and again any 0+ non-dots
\. - a dot.
The replacement is Group 1 value + _.
It is also possible to do without a regex:
text = "Hello. world. Hello!"
chunks = text.split('.', 2)  # split on the first two dots only
if len(chunks) > 2:  # more than 2 items means a second dot exists
    print(f'{".".join(chunks[0:2])}_{chunks[2]}')  # rejoin, replacing the second dot
else:
    print(text)  # fewer than two dots: print the original
# => Hello. world_ Hello!
See the Python demo.
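The same idea extends to any occurrence number. Below is a minimal sketch, assuming a hypothetical helper named `replace_nth` (not a standard library function) that walks to the n-th occurrence with `str.find`:

```python
def replace_nth(text, old, new, n):
    """Replace only the n-th (1-based) occurrence of `old` in `text`.

    Hypothetical helper for illustration; returns `text` unchanged
    when there are fewer than n occurrences.
    """
    if n < 1 or not old:
        return text
    pos = -1
    for _ in range(n):
        # Search for the next occurrence after the previous hit.
        pos = text.find(old, pos + 1)
        if pos == -1:
            return text
    return text[:pos] + new + text[pos + len(old):]


print(replace_nth("Hello. world. Hello!", ".", "_", 2))
# => Hello. world_ Hello!
print(replace_nth("Hello world.", ".", "_", 2))
# => Hello world.
```

Because the search resumes one character after the previous hit, overlapping occurrences of a multi-character `old` are counted separately; for a single character such as the dot this makes no difference.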

With your shown samples, please try the following. Written and tested in Python 3.8.
>>> import re
>>> str1 = "Hello. world. Hello!"
>>> re.sub(r'^(.*?\.)([^.]*)\.(.*)$', r'\1\2_\3', str1)
'Hello. world_ Hello!'
Explanation: Import the re module, then create the str1 variable holding the value. re.sub then replaces the 2nd dot with _ as required. The regex passed as the first argument matches everything around the 2nd dot in 3 capturing groups, and the replacement puts those groups back with _ in place of the 2nd dot.
Explanation of regex:
^(.*?\.) ## 1st capturing group: matches from the start of the value through the 1st dot.
([^.]*) ## 2nd capturing group: matches everything just before the 2nd dot.
\. ## Matches the literal 2nd dot.
(.*)$ ## 3rd capturing group: matches and keeps everything else up to the end of the value.
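A different technique from the patterns above is to count matches in a replacement callback, which avoids encoding the occurrence number in the regex itself. This is only a sketch; `sub_nth` is a hypothetical helper name, and `repl` is inserted literally rather than expanded as a template:

```python
import re
from itertools import count


def sub_nth(pattern, repl, text, n):
    """Replace only the n-th (1-based) match of `pattern` in `text`.

    Hypothetical helper for illustration; `repl` is inserted literally.
    """
    counter = count(1)

    def pick(match):
        # Replace the n-th match, put every other match back unchanged.
        return repl if next(counter) == n else match.group(0)

    return re.sub(pattern, pick, text)


print(sub_nth(r"\.", "_", "Hello. world. Hello!", 2))
# => Hello. world_ Hello!
```

re.sub calls the function once per match, from left to right, so the counter reflects the position of each match in the string.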

Related

Pandas regex to remove digits before consecutive dots

I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
I want to remove all the numbers that come right after a word and right before a dot (or the end of the string).
The first part of the string, i.e. "Node57Name123", should be left as is.
Digits inside words should not be removed.
I tried re.sub(r"\d+","",string), but it removed all the other digits as well.
The output should look like this: "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me in the right direction?

You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed by a dot or the end of string (the lookahead is equivalent to (?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
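The claim that the negative lookahead `(?![^.])` behaves like `(?=\.|$)` can be checked with a short sketch. The variable names are illustrative only, and the code assumes Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string:

```python
import re

text = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'

# Negative lookahead: the digits must not be followed by a non-dot character.
negative = re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
# Positive lookahead: the digits must be followed by a dot or the end of string.
positive = re.sub(r'^([^.]*\.)|\d+(?=\.|$)', r'\1', text)

print(negative == positive)
# => True
print(positive)
# => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
```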

Just to give you a non-regex alternative using rstrip(): we can feed this function a set of characters to strip from the right end of the string, e.g. rstrip('0123456789'). Alternatively, we can use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape

How can I remove a specific character from multi line string using regex in python

I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with a ':' I'm trying to ignore that colon.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone please tell me where I am going wrong?

You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Demo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
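Taking the asker's rule literally ("if it starts with a ':'"), a per-line variant is also possible. The sketch below is not the pattern from this answer: it uses re.MULTILINE on a simplified sample (an assumption: real newlines and no literal "\n" text), and it strips a leading colon from every line that has one, not just the second line:

```python
import re

# Simplified sample (assumption): real newlines, no literal "\n" text.
st = "emp:firstinfo\n:secondinfo\nthirdinfo\n"

# With re.MULTILINE, ^ matches at the start of every line, so only a colon
# that begins a line (after optional spaces or tabs) is removed.
cleaned = re.sub(r'^[ \t]*:', '', st, flags=re.MULTILINE)
print(cleaned)
```

The colon in `emp:firstinfo` survives because it is not at the start of its line.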

You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
Regex Demo
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# count=0 (the default) replaces every match; pass a positive count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)

# import the regex module
import re
# remove every lowercase ASCII letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)

Remove duplicate words in a string using regex

I'm working on my regex skills and I found that some of my strings have a duplicated word at the start. I would like to remove the duplicate and keep just one copy of the word:
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I still get both server_server in my output:
((.*?))_(?!\D)
How can I reduce the output to a single server_ when there are two or more, and leave it as is when there is only one server_?
The output doesn't have to contain the digits or the part after the dot, i.e. .zzz, .xyz, etc.
Expected output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check

You could backreference the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and put a quantifier on the backreference so that it still works when there are more than 2 occurrences:
>>> s = "server_server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hardest part; just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
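One thing to keep in mind with the last pattern: `\1{1,}` requires at least one repetition, so a name without a duplicated prefix does not match and comes back unchanged, suffix included. The sketch below contrasts it with an assumed variation (not from the answer) that makes the first group lazy and the repetition optional:

```python
import re

names = ["server_server_dev1_check_1233.zzz", "server_dev1_1233.zzz"]

strict = r"(.*_)\1{1,}(.*)(_\d+\..*)"   # pattern from the answer above
relaxed = r"(.*?_)\1*(.*)(_\d+\..*)"    # assumed variation: lazy group, optional repeat

for name in names:
    print(re.sub(strict, r"\1\2", name), re.sub(relaxed, r"\1\2", name))
# => server_dev1_check server_dev1_check
# => server_dev1_1233.zzz server_dev1
```

The lazy `(.*?_)` matters here: with a greedy group and `\1*`, the first group could swallow `server_server_dev1_` and the duplicate would survive.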

You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
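For comparison, here is a non-regex sketch of the same clean-up. It swaps in a different technique (splitting on `_` and collapsing repeats with itertools.groupby), and `clean_name` is a hypothetical helper name. Unlike the regex, it collapses any run of consecutive identical words, not only a run at the start:

```python
from itertools import groupby


def clean_name(name):
    """Hypothetical helper: dedupe repeated words, drop numeric tail and extension."""
    stem = name.rsplit('.', 1)[0]                           # drop ".zzz", ".log", ...
    parts = [word for word, _ in groupby(stem.split('_'))]  # collapse consecutive repeats
    if parts and parts[-1].isdigit():
        parts.pop()                                         # drop the trailing number
    return '_'.join(parts)


for s in ["server_server_dev1_check_1233.zzz", "server_dev1_1233.zzz",
          "data_data_dev9_check_660.log"]:
    print(clean_name(s))
# => server_dev1_check
# => server_dev1
# => data_dev9_check
```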

Python Regex for End of Line

I am trying to write a regex which adds a space before and after a dot.
However I only want this if there is a space or end of line after the dot.
However I am unable to do so for end of line cases.
Eg.
I want a hotel. >> I want a hotel .
my email is zob#gmail.com >> my email is zob#gmail.com
I have to play. bye! >> I have to play . bye!
Following is my code:
# If "Dot and space" after word or number put space before and after
utterance = re.sub(r'(?<=[a-z0-9])[.][ $]',' . ',utterance)
How do I correct my regex so that my 1st example above also works? I tried putting a $ sign inside the square brackets, but that doesn't work.

The main issue is that $ inside a character class denotes a literal $ symbol; you need a grouping construct here instead.
I suggest using the following code:
import re
regex = r"([^\W_])\.(?:\s+|$)"
ss = ["I want a hotel.","my email is zob#gmail.com", "I have to play. bye!"]
for s in ss:
    result = re.sub(regex, r"\1 . ", s).rstrip()
    print(result)
See the Python demo.
If you need to apply this on lines only without affecting line breaks, you can use
import re
regex = r"([^\W_])\.(?:[^\S\n\r]+|$)"
text = "I want a hotel.\nmy email is zob#gmail.com\nI have to play. bye!"
print( re.sub(regex, r"\1 . ", text, flags=re.M).rstrip() )
See this Python demo.
Output:
I want a hotel .
my email is zob#gmail.com
I have to play . bye!
Details:
([^\W_]) - Group 1 matching any letter or digit
\. - a literal dot
(?:\s+|$) - a grouping matching either 1+ whitespaces or end of string anchor (here, $ matches the end of string.)
The rstrip will remove the trailing space added during replacement.
If you are using Python 3, the [^\W_] will match all Unicode letters and digits by default. In Python 2, re.U flag will enable this behavior.
Note that \s+ in the last (?:\s+|$) will "shrink" multiple whitespaces into 1 space.

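Use a lookahead assertion (?=...) to find a . followed by a space or the end of a line. (?=\n) only succeeds when a newline character actually follows, so $ with re.M is used here to also cover the end of the string. The lookahead does not consume the following space, so the replacement only needs to add the space before the dot:
utterance = re.sub(r'\.(?= |$)', ' .', utterance, flags=re.M)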

[ $] defines a class of characters consisting of a space and a dollar sign, so it matches a space or a literal dollar. To match a space or the end of the line, use ( |$) (in this case, $ keeps its special meaning).
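The difference between `$` inside a character class and `$` inside a group can be seen with a small sketch (the sample strings are illustrative):

```python
import re

s = "I want a hotel."

# Inside [...] the $ is a literal character, so nothing matches at the end.
print(re.search(r"\.[ $]", s))
# => None

# Inside a group, $ is the end-of-string anchor again.
print(re.search(r"\.(?: |$)", s).group())
# => .

# The character class does match a literal dollar sign.
print(re.search(r"\.[ $]", "price.$").group())
# => .$
```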

Regular expression in python doesn't seem to be working like I expect

My code doesn't seem to be working like it's supposed to:
x = "engniu4nwi5u"
print(re.sub(r"\D(\d)\D", r"\1abc", x))
My desired output is: engniuabcnwiabcu
But the output actually given is: engni4abcw5abc

You are grouping the wrong characters. It must be written as
>>> x = "engniu4nwi5u"
>>> re.sub(r"(\D)\d(\D)", r"\1abc\2", x)
'engniuabcnwiabcu'
(\D) Matches a non digit and captures it in \1
\d Matches the digit
(\D) Matches the following non-digit and captures it in \2
How does it match?
engniu4nwi5u
     |
     \D => \1
engniu4nwi5u
      |
      \d
engniu4nwi5u
       |
       \D => \2
Another Solution
You can also use lookarounds to do the same:
>>> x = "engniu4nwi5u"
>>> re.sub(r"(?<=\D)\d(?=\D)", r"abc", x)
'engniuabcnwiabcu'
(?<=\D) Lookbehind assertion. Checks that the digit is preceded by a non-digit, which is not captured.
\d Matches the digit
(?=\D) Lookahead assertion. Checks that the digit is followed by a non-digit, which is also not captured.
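The two solutions give the same result for the sample, but they differ when a single non-digit sits between two digits. A short sketch with a hypothetical input shows this:

```python
import re

x = "a1b2c"  # hypothetical input: one non-digit between two digits

# The groups consume "a1b", so the "b" is no longer available
# as the non-digit in front of "2".
print(re.sub(r"(\D)\d(\D)", r"\1abc\2", x))
# => aabcb2c

# Lookarounds only check the neighbours without consuming them.
print(re.sub(r"(?<=\D)\d(?=\D)", "abc", x))
# => aabcbabcc
```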

This is because you replaced the wrong part:
Let's consider the first match. \D\d\D matches the following:
engniu4nwi5u
     ^^^
4 is captured as \1. Then you replace the whole match with: \1abc, which becomes 4abc.
You have a couple of solutions here:
Capture what you want to keep: (\D)\d(\D) and replace it with \1abc\2
Use lookarounds: (?<=\D)\d(?=\D) and replace this with abc

Based on your regexp:
>>> re.sub(r"(\D)\d", r"\1abc", x)
'engniuabcnwiabcu'
Although I would do this instead:
>>> re.sub(r"\d", "abc", x)
'engniuabcnwiabcu'

If you plan to check also the beginning and end of string, you need to add ^ and $ to the regex:
(\D|^)\d(?=$|\D)
And replace with \1abc.
See demo
Sample code on IDEONE:
import re
p = re.compile(r'(\D|^)\d(?=$|\D)')
test_str = "1engniu4nwi5u"
subst = r"\1abc"  # raw string, so \1 stays a backreference instead of the character \x01
print(p.sub(subst, test_str))
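As a variation on the same idea (not taken from the answer), negative lookarounds handle the start and end of the string without explicit `^` and `$` alternatives, and no capture group is needed in the replacement:

```python
import re

test_str = "1engniu4nwi5u"

# (?<!\d) and (?!\d) succeed at the string boundaries too,
# so no explicit ^ or $ alternatives are needed.
print(re.sub(r"(?<!\d)\d(?!\d)", "abc", test_str))
# => abcengniuabcnwiabcu

# Runs of two or more digits are left alone.
print(re.sub(r"(?<!\d)\d(?!\d)", "abc", "ab12cd"))
# => ab12cd
```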
