python re.sub, only replace part of match [duplicate] - python

This question already has answers here:
Why does re.sub replace the entire pattern, not just a capturing group within it?
(4 answers)
Closed 2 years ago.
I am very new to Python.
I need to match all cases with a single regular expression and do a replacement. This is a sample substring --> desired result:
<cross_sell id="123" sell_type="456"> --> <cross_sell>
I am trying to do this in my code:
myString = re.sub(r'\<[A-Za-z0-9_]+(\s[A-Za-z0-9_="\s]+)', "", myString)
Instead of replacing only what follows <cross_sell, it replaces the whole match and just returns '>'.
Is there a way for re.sub to replace only the capturing group instead of the entire pattern?

You can use substitution groups:
>>> import re
>>> my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'
>>> re.sub(r'(\<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)', r"\1", my_string)
'<cross_sell> --> <cross_sell>'
Notice that I put the first group (the one you want to keep) in parentheses and then kept it in the output by using the "\1" backreference (first group) in the replacement string.
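As a minimal sketch of the same idea wrapped in a function: the name strip_attributes is hypothetical (not from the question), and \g<1> is simply the unambiguous spelling of the \1 backreference. Only the part you want to keep is captured, so writing group 1 back drops the rest of the match.

```python
import re

def strip_attributes(text):
    # Group 1 captures "<tag_name"; the remainder of the pattern consumes the
    # attribute list, which disappears because only group 1 is written back.
    pattern = r'(<[A-Za-z0-9_]+)\s[A-Za-z0-9_="\s]+'
    return re.sub(pattern, r'\g<1>', text)

print(strip_attributes('<cross_sell id="123" sell_type="456">'))  # <cross_sell>
```

Tags without attributes do not match the pattern and are left untouched.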

You can use a group reference to match the first word and a negated character class to match the rest of the string between <> :
>>> s='<cross_sell id="123" sell_type="456">'
>>> re.sub(r'(\w+)[^>]+',r'\1',s)
'<cross_sell>'
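\w is equivalent to [A-Za-z0-9_] for ASCII input; in Python 3 it also matches Unicode word characters unless the re.ASCII flag is set.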

Since the input data is XML, you'd better parse it with an XML parser.
Built-in xml.etree.ElementTree is one option:
>>> import xml.etree.ElementTree as ET
>>> data = '<cross_sell id="123" sell_type="456"></cross_sell>'
>>> cross_sell = ET.fromstring(data)
>>> cross_sell.attrib = {}
>>> ET.tostring(cross_sell)
'<cross_sell />'
lxml.etree is another option.
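If nested elements may also carry attributes, the parser approach can be extended along these lines. This is only a sketch: the function name strip_all_attributes and the nested <item> sample are assumptions for illustration. Note that in Python 3, ET.tostring returns bytes by default; passing encoding="unicode" returns a str instead.

```python
import xml.etree.ElementTree as ET

def strip_all_attributes(xml_text):
    root = ET.fromstring(xml_text)
    # iter() walks the root element and every descendant.
    for element in root.iter():
        element.attrib.clear()
    # encoding="unicode" makes tostring return str rather than bytes.
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample with a nested element.
sample = '<cross_sell id="123" sell_type="456"><item sku="9" /></cross_sell>'
result = strip_all_attributes(sample)
print(result)
```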

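The code below, tested under Python 3.6, does not use a group. The pattern includes the whitespace before each attribute so that no stray spaces are left behind:
import re
test = '<cross_sell id="123" sell_type="456">'
resp = re.sub(r'\s+\w+="\w+"', r'', test)
print(resp)
<cross_sell>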

Related

Multiple results when using consecutive patterns regex

Test String: "Version 3.1.A"
RegEx: "(\d\.){2}."
Returning: [('1.3.A', '3.')]
Why does this return two items, when the second is only a single, non-repeated (\d\.)?
Is there a way to force only the complete match (1.3.A) to be returned while still using the {n} quantifier (rather than an explicit \d.\d..)?
By using a non-capturing group you can get what you want, like the following:
>>> import re
>>> text = "Version 3.1.A"
>>> re.findall(r"((?:\d\.){2}.)", text)
['3.1.A']
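To see why the group matters, the following sketch compares the three behaviours on the same test string (the variable names are illustrative only). When a pattern contains a capturing group, re.findall returns the group rather than the whole match, and a repeated group only keeps its last repetition.

```python
import re

text = "Version 3.1.A"

# Capturing group: findall returns the group's last repetition only.
last_repetition = re.findall(r"(\d\.){2}.", text)
# Non-capturing group: no groups, so findall returns the whole match.
whole_match = re.findall(r"(?:\d\.){2}.", text)
# finditer exposes match objects, so group(0) is always the whole match.
via_finditer = [m.group(0) for m in re.finditer(r"(\d\.){2}.", text)]

print(last_repetition)  # ['1.']
print(whole_match)      # ['3.1.A']
print(via_finditer)     # ['3.1.A']
```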

RegEx fails to capture groups in loops [duplicate]

This question already has answers here:
'NoneType' object has no attribute 'group'
(4 answers)
Closed 3 years ago.
Regex search fails to get a string from match object when being used in a for loop.
row_values = result_script_name.split('^')
for row in row_values:
    table_name = re.search(r"(?<=')(.*)(?=')", row).group(0)
AttributeError: 'NoneType' object has no attribute 'group'
But the same regex pattern finds string perfectly fine when used outside the loop.
table_name = re.search(r"(?<=')(.*)(?=')", row_values[0]).group(0)
I wanted to get "lifetime" out of the string below:
^WORKFLOW_NAME='lifetime'
I believe what is happening is that certain rows do not match at all, and therefore you are trying to access a capture group (the zeroth one, in this case), which does not even exist. Here is the pattern you should be using:
text = "^WORKFLOW_NAME='lifetime'"
match = re.search(r"(?<=')(.*)(?=')", text)
if match:
    print(match.group(0))
That is, you should first check whether the call to search was successful, and only then print. I don't know exactly what your loop is supposed to be doing, but you can easily adapt the above script to your needs.
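Adapted to a loop, the check might look like the sketch below. The sample value of result_script_name is an assumption (the question does not show the full input); it includes an empty field and a field without quotes to show how non-matching rows are skipped instead of raising AttributeError.

```python
import re

# Hypothetical input: fields separated by '^', only some containing quotes.
result_script_name = "^WORKFLOW_NAME='lifetime'^NO_QUOTES_HERE^TABLE='orders'"

table_names = []
for row in result_script_name.split('^'):
    match = re.search(r"(?<=')(.*)(?=')", row)
    if match is None:
        # re.search returns None when nothing matches; skip such rows.
        continue
    table_names.append(match.group(0))

print(table_names)  # ['lifetime', 'orders']
```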
Here, we might want to simplify our expression, maybe to something similar to:
.+?'(.+?)'
where our data is saved in capturing group 1 (\1).
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r".+?'(.+?)'"
test_str = "^WORKFLOW_NAME='lifetime'"
subst = "\\1"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
If this expression wasn't desired, it can be modified or changed in regex101.com.
Demo
const regex = /.+?'(.+?)'/gm;
const str = `^WORKFLOW_NAME='lifetime'
WORKFLOW_NAME='Any other data that we want'
WORKFLOW_NAME='Any other data that we want'`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);

How to get String Before Last occurrence of substring?

I want to get the string before the last occurrence of a given substring.
My string is:
path =
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov
My substring, 1001-1010, occurs twice. All I want is the string before its last occurrence.
Note: My substring is dynamic, with varying padding, but consists only of digits.
I want:
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v
I have done this using regex and slicing:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> q = re.findall(r"\d*-\d*", p)
>>> q[-1].join(p.split(q[-1])[:-1])
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v'
>>>
Is there a better way to do this purely with regex?
Please note that I have tried many approaches, e.g.:
regular expression to match everything until the last occurrence of /
Regex Last occurrence?
I got the answer using regex with slicing, but I want to achieve it using regex alone.
Why use regex? Just use built-in string methods:
path = "D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov"
index = path.rfind("1001-1010")
print(path[:index])
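One caveat with rfind is that it returns -1 when the substring is absent, in which case path[:index] silently drops the last character. A possible guard, sketched here with a hypothetical helper name (before_last), is str.rpartition, which reports whether the separator was found:

```python
def before_last(text, marker):
    # rpartition splits at the LAST occurrence of marker and returns
    # (head, marker, tail); if marker is absent it returns ('', '', text).
    head, sep, _tail = text.rpartition(marker)
    return head if sep else text

path = ("D:/me/vol101/Prod/cent/2019_04_23_01/image/"
        "AVEN_000_3400_img_pic_p1001-1010/pxy/"
        "AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov")
print(before_last(path, "1001-1010"))
```

Returning the unchanged text when the marker is missing is a design choice made for this sketch; raising an error may suit other callers better.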
You can use a simple greedy match and a capture group:
(.*)1001-1010
Your match is in capture group #1
Since .* is greedy, it matches as much as possible, so the capture group extends up to the last occurrence of your keyword 1001-1010.
As per comments below if keyword is not a static string then you may use this regex:
r'(.*\D)\d+-\d+'
Python Code:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> print (re.findall(r'(.*\D)\d+-\d+', p))
['D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v']
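If the keyword is only known at runtime but is still a literal string, the greedy pattern can be built from it. The sketch below is one way to do that, assuming a hypothetical helper named prefix_before_last; re.escape is applied so that any regex metacharacters in the keyword are treated literally.

```python
import re

def prefix_before_last(text, keyword):
    # Greedy (.*) pushes the keyword match to its last occurrence.
    match = re.match(r'(.*)' + re.escape(keyword), text)
    return match.group(1) if match else None

p = ('D:/me/vol101/Prod/cent/2019_04_23_01/image/'
     'AVEN_000_3400_img_pic_p1001-1010/pxy/'
     'AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov')
print(prefix_before_last(p, '1001-1010'))
```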
Thanks @anubhava.
My first regex was:
.*(\d*-\d*)\/
I have now corrected it to:
.*(\d*-\d*)
or
(.*)(\d*-\d*)
which gives me:
>>> q = re.search(r'.+(\d*-\d*)', p)
>>> q.group()
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010'
>>>
(.*\D)\d+-\d+
This gives me exactly what I want:
>>> q = re.search(r'(.*\D)\d+-\d+', p)
>>> q.groups()
('D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v',)
>>>

Python regex if all whole words in string [duplicate]

This question already has answers here:
Do regular expressions from the re module support word boundaries (\b)?
(5 answers)
Closed 4 years ago.
I have the following string, and I need to check whether
it contains App2 and iPhone,
but not App and iPhone.
I wrote the following:
campaign_keywords = "App2 iPhone"
my_string = "[Love]App2 iPhone Argentina"
pattern = re.compile("r'\b" + campaign_keywords + "\b")
print pattern.search(my_string)
It prints None. Why?
The raw string notation is wrong: the r should not be inside the quotes, and the second \b must also be written in a raw string.
The match function only tries to match at the start of the string, so you need to use search or findall.
Difference between re.search and re.match
Example
>>> pattern = re.compile(r"\b" + campaign_keywords + r"\b")
>>> pattern.findall(my_string)
['App2 iPhone']
>>> pattern.match(my_string)
>>> pattern.search(my_string)
<_sre.SRE_Match object at 0x10ca2fbf8>
>>> match = pattern.search(my_string)
>>> match.group()
'App2 iPhone'
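When the keywords come from data rather than being hard-coded, it may be safer to escape them before adding the word boundaries. A small sketch, with contains_phrase as a hypothetical helper name:

```python
import re

def contains_phrase(phrase, text):
    # re.escape neutralises any regex metacharacters in the phrase;
    # the raw-string \b on each side enforces whole-word boundaries.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
    return pattern.search(text) is not None

my_string = "[Love]App2 iPhone Argentina"
print(contains_phrase("App2 iPhone", my_string))  # True
print(contains_phrase("App iPhone", my_string))   # False
```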

Python - Most elegant way to extract a substring, being given left and right borders [duplicate]

This question already has answers here:
How to extract the substring between two markers?
(22 answers)
Closed 4 years ago.
I have a string - Python :
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
Expected output is :
"Atlantis-GPS-coordinates"
I know that the expected output is ALWAYS surrounded by "/bar/" on the left and "/" on the right :
"/bar/Atlantis-GPS-coordinates/"
Proposed solution would look like :
a = string.find("/bar/")
b = string.find("/",a+5)
output = string[a+5:b]
This works, but I don't like it.
Does anyone know a more elegant function or trick?
You can use split:
>>> string.split("/bar/")[1].split("/")[0]
'Atlantis-GPS-coordinates'
Adding a maxsplit of 1 should be slightly more efficient:
>>> string.split("/bar/", 1)[1].split("/", 1)[0]
'Atlantis-GPS-coordinates'
Or use partition:
>>> string.partition("/bar/")[2].partition("/")[0]
'Atlantis-GPS-coordinates'
Or a regex:
>>> re.search(r'/bar/([^/]+)', string).group(1)
'Atlantis-GPS-coordinates'
Depends on what speaks to you and your data.
What you have isn't all that bad. I'd write it as:
start = string.find('/bar/') + 5
end = string.find('/', start)
output = string[start:end]
as long as you know that /bar/WHAT-YOU-WANT/ is always going to be present. Otherwise, I would reach for the regular expression knife:
>>> import re
>>> PATTERN = re.compile('^.*/bar/([^/]*)/.*$')
>>> s = '/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/'
>>> match = PATTERN.match(s)
>>> match.group(1)
'Atlantis-GPS-coordinates'
import re
pattern = '(?<=/bar/).+?/'
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
result = re.search(pattern, string)
print string[result.start():result.end() - 1]
# "Atlantis-GPS-coordinates"
That is a Python 2.x example. What it does first is:
1. (?<=/bar/) means only process the following regex if this precedes it (so that /bar/ must be before it)
2. '.+?/' means any amount of characters up until the next '/' char
Hope that helps some.
If you need to do this kind of search a bunch it is better to 'compile' this search for performance, but if you only need to do it once don't bother.
Using re (slower than other solutions):
>>> import re
>>> string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
>>> re.search(r'(?<=/bar/)[^/]+(?=/)', string).group()
'Atlantis-GPS-coordinates'
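If this extraction is needed in several places, the regex variant could be wrapped in a small reusable function. The sketch below assumes a hypothetical helper named between; it escapes both markers and returns None rather than raising when either marker is missing.

```python
import re

def between(text, left, right):
    # Lazy (.*?) stops at the first occurrence of `right` after `left`.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
print(between(string, "/bar/", "/"))  # Atlantis-GPS-coordinates
```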
