python regular expression substitution with matched group - python

I'm trying to substitute the channel name in AndroidManifest.xml so that I can batch-generate a group of channel APK packages for release. The value to replace is in this element of the XML file:
<meta-data android:value="CHANNEL_NAME_TO_BE_DETERMINED" android:name="UMENG_CHANNEL"/>
The channel configs are saved in a config file, something like:
channel_name output_postfix valid
"androidmarket" "androidmarket" true
Here is what I tried:
import re
manifest_original_xml_fh = open("../AndroidManifest_original.xml", "r")
manifest_xml_fh = open("../AndroidManifest.xml", "w")
pattern = re.compile('<meta-data\sandroid:value=\"(.*)\"\sandroid:name=\"UMENG_CHANNEL\".*')
# channel_name comes from the config file (parsing not shown)
for each_config_line in manifest_original_xml_fh:
    each_config_line = re.sub(pattern, channel_name, each_config_line)
    print(each_config_line)
It replaces the whole <meta-data android:value="CHANNEL_NAME_TO_BE_DETERMINED" android:name="UMENG_CHANNEL"/> element with androidmarket, which is obviously not what I need. I then figured out that pattern.match(each_config_line) returns a match object, and one of its groups is "CHANNEL_NAME_TO_BE_DETERMINED". I've also tried passing a replacement function, but that failed too.
So, since I've successfully matched the pattern, how can I replace only the matched group?

I suggest a different approach: save your xml as a template, with placeholders to be replaced with standard Python string operations.
E.g.
AndroidManifest_template.xml:
<meta-data android:value="%(channel_name)s" android:name="UMENG_CHANNEL"/>
Python:
manifest_original_xml_fh = open("../AndroidManifest_template.xml", "r")
manifest_xml_fh = open("../AndroidManifest.xml", "w")
for each_config_line in manifest_original_xml_fh:
    # fill in the %(channel_name)s placeholder; other lines pass through unchanged
    each_config_line = each_config_line % {'channel_name': channel_name}
    print(each_config_line)
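One caveat with %-formatting is that any literal % elsewhere in the template has to be doubled as %%. A minimal sketch of the same template idea using the standard library's string.Template, assuming the placeholder is written as $channel_name instead; the template text and the helper name render_manifest are made up for illustration:

```python
from string import Template

# Hypothetical template content; in practice this would be read from
# AndroidManifest_template.xml.
TEMPLATE_TEXT = (
    '<application android:label="100% demo">\n'
    '<meta-data android:value="$channel_name" android:name="UMENG_CHANNEL"/>\n'
    '</application>\n'
)


def render_manifest(template_text, channel_name):
    """Fill in the $channel_name placeholder and return the new text."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_text).safe_substitute(channel_name=channel_name)


rendered = render_manifest(TEMPLATE_TEXT, "androidmarket")
print(rendered)
```

With this variant a literal $ in the template would need to be written as $$, so pick whichever placeholder character is rarer in your manifest.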

I think your misunderstanding is this: everything that has been matched will be replaced. If you want to keep part of what the pattern matched, you have to capture it and reinsert it in the replacement string.
Or match only what you want to replace by using lookaround assertions.
Try this:
pattern = re.compile('(?<=<meta-data\sandroid:value=\")[^"]+')
for each_config_line in manifest_original_xml_fh:
    each_config_line = re.sub(pattern, channel_name, each_config_line)
(?<=<meta-data\sandroid:value=\") is a positive lookbehind assertion. It ensures that this text comes immediately before the match, but it is not part of the match (so it will not be replaced).
[^"]+ will then match one or more characters that are not a ".
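A minimal, self-contained sketch of the lookbehind approach applied to the line from the question. The value of channel_name is just the example from the question's config file:

```python
import re

# Lookbehind: the text before the value must be present but is not consumed,
# so only the value itself is replaced.
pattern = re.compile(r'(?<=<meta-data\sandroid:value=")[^"]+')

line = ('<meta-data android:value="CHANNEL_NAME_TO_BE_DETERMINED" '
        'android:name="UMENG_CHANNEL"/>')
channel_name = "androidmarket"  # example value taken from the question

new_line = pattern.sub(channel_name, line)
print(new_line)
```

Note that this pattern never looks at android:name, so it would rewrite the value of any meta-data element written in this attribute order, not only the UMENG_CHANNEL one.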

To capture just the value of the meta-data tag you need to change the regex:
<meta-data\sandroid:value=\"([^"]*)\"\sandroid:name=\"UMENG_CHANNEL\".*
Specifically I changed this part:
\"(.*)\" - this is a greedy match, so it will go ahead and match as many characters as possible as long as the rest of the expression matches
to
\"([^"]*)\" - which will match anything that's not the double quote. The matching result will still be in the first capturing group
If you want to do the replacement, a better idea might be to capture what you want to keep. I'm not a Python expert, but something like this would probably work:
re.sub(r'(<meta-data\sandroid:value=\")[^"]*(\"\sandroid:name=\"UMENG_CHANNEL\".*)', r'\1YourNewValue\2', s)
\1 is backreference 1, i.e. it is replaced with whatever the first capturing group matched; \2 does the same for the second group.
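A sketch of the capture-and-reinsert idea in Python. It uses a function as the replacement rather than a template string, because r'\1' followed directly by a channel name that starts with a digit (say 360market) would be read as a reference to a different group. The helper name set_channel and the second channel name are made up for illustration:

```python
import re

pattern = re.compile(
    r'(<meta-data\sandroid:value=")[^"]*("\sandroid:name="UMENG_CHANNEL")'
)


def set_channel(line, channel_name):
    """Replace only the android:value of the UMENG_CHANNEL meta-data tag."""
    # Groups 1 and 2 hold the text to keep; the function glues the new
    # value between them, so no escape processing is applied to it.
    return pattern.sub(lambda m: m.group(1) + channel_name + m.group(2), line)


line = ('<meta-data android:value="CHANNEL_NAME_TO_BE_DETERMINED" '
        'android:name="UMENG_CHANNEL"/>')
print(set_channel(line, "androidmarket"))
print(set_channel(line, "360market"))
```

If you prefer a replacement string, r'\g<1>' + channel_name + r'\g<2>' avoids the digit ambiguity, although backslashes in the name would still be interpreted as escapes.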

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the first solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself we have to escape it; hence \\.
.* matches any character, zero or more times.
? specifies that the match is not greedy (so we are implicitly assuming here that \feature\ only shows up once after \App\).
The pattern in general also assumes that there are no \ characters between \App\ and \feature\.
The full code would be something like:
import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# re.escape turns each literal backslash into \\ for the regex engine
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
We can do that with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from just after the first start marker up to the last end marker
print(path[path.find(start) + len(start):path.rfind(end)])
Output:
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capturing group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
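A short Python sketch of the group-based approach, assuming the path contains single backslashes as separators. With a raw-string pattern, \\ matches one literal backslash:

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# (?:...) groups without capturing; (...) is group 1, the part we want.
pattern = re.compile(r'(?:\\App\\)([A-Za-z0-9]*)(?:\\feature\\)')

match = pattern.search(path)
module = match.group(1) if match else None
print(module)  # Module
```

Checking match before calling group(1) avoids an AttributeError when the path does not contain \App\...\feature\ at all.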

Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while maintaining the rest of the string.
EDIT: forgot to add that the filepath might contain a dot, so I edited "username" to "user.name"
# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""
I've found many answers that work for a single filepath, but they fail when there's more than one, because they also replace the characters in between.
import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))
>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."
I've also tried a "stop at the first whitespace after the pattern" approach, but I couldn't get it to work:
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches
Note that {1,255} is a greedy quantifier and will match as many chars as possible; to make it lazy you need to add ? after it.
However, just using a lazy {1,255}? quantifier won't solve the problem. You need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed by whitespace or the end of the string.
Hence, use
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"
The (?!\S) negative lookahead will fail any match if, immediately to the right of the current location, there is a non-whitespace char. .{1,255}? will match any 1 to 255 chars, as few as possible.
Use in Python as
re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
The re.MULTILINE (re.M) flag only redefines ^ and $ anchor behavior making them match start/end of lines rather than the whole string. The re.S flag allows . to match any chars, including line break chars.
Please never use (\w|\W){1,255}?; use .{1,255}? with the re.S flag to match any char, otherwise performance will suffer.
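Put together as a runnable sketch on the text from the question. re.subn is used here only so that the number of replacements can be checked as well:

```python
import re

text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

# Lazy body, then an extension that must be followed by whitespace or
# the end of the string; re.S lets . cross line breaks.
fp_pattern = re.compile(r"file:///.{1,255}?\.\w{3,4}(?!\S)", flags=re.S)

result, count = fp_pattern.subn("*IGOTREPLACED*", text)
print(count)  # 2
print(result)
```

One limitation to keep in mind: because the body is lazy, a directory name such as "my.doc folder" would end the match early at ".doc", since that also looks like an extension followed by whitespace.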
You can try re.findall to find out how many times the regex matches in the string. Hope this helps.
import re
len(re.findall(pattern, string_to_search))

How to search for and replace a term within another search term

I have a url I get from parsing a swagger's api.json file in Python.
The URL looks something like this and I want to replace the dashes with underscores, but only inside the curly brackets.
10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name
So, {pet-owner} will become {pet_owner}, but pet-store-account will remain the same.
I am looking for a regular expression that will allow me to perform a non-greedy search and then do a search-replace on each of the first search's findings.
A Python re approach is what I am looking for, but I would also appreciate a Vim one-liner.
The expected final result is:
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
Provided that all '{...}' blocks are well-formed, you can use a lookahead to determine whether a given dash is inside a block: the dash only has to be followed by '...}', where no character in '...' is a '{'.
exp = re.compile(r'(?=[^{]*})-')
...
substituted_url = re.sub(exp,'_',url_string)
Using lookahead and lookbehind in Vim:
s/\({[^}]*\)\#<=-\([^{]*}\)\#=/_/g
The pattern has three parts:
\({[^}]*\)\#<= matches, but does not consume, an opening brace followed by anything except a closing brace, immediately behind the next part.
- matches a hyphen.
\([^{]*}\)\#= matches, but does not consume, anything except an opening brace, followed by a closing brace, immediately ahead of the previous part.
The same technique can't be exactly followed in Python regular expressions, because they only allow fixed-width lookbehinds.
Result:
Before
outside-braces{inside-braces}out-again{in-again}out-once-more{in-once-more}
After
outside-braces{inside_braces}out-again{in_again}out-once-more{in_once_more}
Because it checks for braces in the right place both before and after the hyphen, this solution (unlike others which use only lookahead assertions) behaves sensibly in the face of unmatched braces:
Before
b-c{d-e{f-g}h-i
b-c{d-e}f-g}h-i
b-c{d-e}f-g{h-i
b-c}d-e{f-g}h-i
After
b-c{d-e{f_g}h-i
b-c{d_e}f-g}h-i
b-c{d_e}f-g{h-i
b-c}d-e{f_g}h-i
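Since Python's re module cannot use the variable-width lookbehind, one way to get the same brace-aware behaviour is to match each complete {...} block and rewrite it with a function. This is a sketch of that swapped-in technique, not a translation of the Vim pattern; the names BRACED, LOOKAHEAD_ONLY and dashes_to_underscores are illustrative. The lookahead-only pattern is included for comparison:

```python
import re

BRACED = re.compile(r'\{[^{}]*\}')           # one complete {...} block, no nesting
LOOKAHEAD_ONLY = re.compile(r'(?=[^{]*})-')  # checks only what follows the dash


def dashes_to_underscores(text):
    """Replace '-' with '_' only inside complete {...} blocks."""
    return BRACED.sub(lambda m: m.group(0).replace('-', '_'), text)


samples = [
    'b-c{d-e{f-g}h-i',
    'b-c{d-e}f-g}h-i',
    'b-c{d-e}f-g{h-i',
    'b-c}d-e{f-g}h-i',
]
for s in samples:
    print(s, '->', dashes_to_underscores(s), '|', LOOKAHEAD_ONLY.sub('_', s))
```

On these four inputs the block-matching version produces the same "After" lines as the Vim command, while the lookahead-only pattern also rewrites f-g in the second sample because a stray } follows it.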
Use a two-step approach:
import re
url = "10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name"
rx = re.compile(r'{[^{}]+}')
def replacer(match):
    return match.group(0).replace('-', '_')
url = rx.sub(replacer, url)
print(url)
Which yields
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
This looks for pairs of { and } and replaces every - with _ inside it.
There may be solutions with just one line but this one is likely to be understood in a couple of months as well.
Edit: For one-line-gurus:
url = re.sub(r'{[^{}]+}',
lambda x: x.group(0).replace('-', '_'),
url)
Solution in Vim:
%s/\({.*\)\#<=-\(.*}\)\#=/_/g
Explanation of matched pattern:
\({.*\)\#<=-\(.*}\)\#=
\({.*\)\#<= Forces the match to have a {.* behind
- Specifies a dash (-) as the match
\(.*}\)\#= Forces the match to have a .*} ahead
Use a Python lookahead to match only the dashes enclosed within curly brackets {}:
Description:
(?=...):
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Solution
a = "10.147.48.10:8285/pet-store-account/**{pet-owner}**/version/**{pet-type-id}**/pet-details-and-name"
import re
re.sub(r"(?=[^{]*})-", "_", a)
Output:
'10.147.48.10:8285/pet-store-account/**{pet_owner}**/version/**{pet_type_id}**/pet-details-and-name'
Another way to do in Vim is to use a sub-replace-expression:
:%s/{\zs[^}]*\ze}/\=substitute(submatch(0),'-','_','g')/g
Using \zs and \ze, we restrict the match to the text between the { and } characters. \={expr} evaluates {expr} as the replacement for each substitution. Here we call Vimscript's substitute({text}, {pat}, {replace}, {flag}) function on the entire match, submatch(0), to convert - to _.
For more help see:
:h sub-replace-expression
:h /\zs
:h submatch()
:h substitute()

How to search part of pattern in regex python

I can match a pattern as it is. But can I search for only part of the pattern, or do I have to pass that part separately?
e.g. pattern = '/(\w+)/(.+?)'
I can search this pattern using re.search and then use group to get individual groups.
But can I search only for say (\w+) ?
e.g.
pattern = '/(\w+)/(.+?)'
pattern_match = re.search(pattern, string)
print(pattern_match.group(1))
Can I search for just part of the pattern, e.g. with pattern.group(1) or something like that?
You can make any part of a regular expression optional by wrapping it in a non-capturing group followed by a ?, i.e. (?: ... )?.
pattern = '/(\w+)(?:/(.+))?'
This will match /abc/def as well as /abc.
In both examples pattern_match.group(1) will be abc, but pattern_match.group(2) will be def in the first one and None in the second one, because the optional group did not take part in the match.
For further reference, have a look at (?:x) in the special characters table at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
EDIT
Changed the second group to (.+), since I assume you want to match more than one character. .+ is called a "greedy" match, which will try to match as much as possible. .+? on the other hand is a "lazy" match that will only match the minimum number of characters necessary. In case of /abc/def, this will only match the d from def.
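A quick sketch to check the behaviour described above, comparing the greedy and lazy versions of the optional second group (the variable names greedy and lazy are only for illustration):

```python
import re

greedy = re.compile(r'/(\w+)(?:/(.+))?')
lazy = re.compile(r'/(\w+)(?:/(.+?))?')

results = []
for s in ('/abc/def', '/abc'):
    m = greedy.search(s)
    results.append((m.group(1), m.group(2)))
    print(s, m.group(1), m.group(2))

# With the lazy quantifier nothing forces the group to grow past one char.
lazy_second = lazy.search('/abc/def').group(2)
print(lazy_second)  # d
```

Because an unmatched optional group yields None in Python, test it with "is not None" before using it as a string.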
That pattern is merely a character string; send the needed slice however you want. For instance:
re.search(pattern[:6], string)
uses only the first 6 characters of your pattern. If you need to detect the end of the first pattern -- and you have no intervening right-parens -- you can use
rparen_pos = pattern.index(')')
re.search(pattern[:rparen_pos+1], string)
Another possibility is
pat1 = '/(\w+)'
pat2 = '/(.+?)'
big_match = re.search(pat1+pat2, string)
small_match = re.search(pat1, string)
You can get more inventive with backreferences (written \1, \2, etc. in Python; $1, $2 in some other languages); see the links below for more help.
http://flockhart.virtualave.net/RBIF0100/regexp.html
https://docs.python.org/2/howto/regex.html

Match everything except a specific string

I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I am looking for is the output should find anything after the colon (:). I also thought I can do something like using group 2 as my match... but the problem with that is, if there are spaces in the name then I won't be able to get the correct name.
(SecurityGroup):(\w{1,})
Why not just do
security_string.split(':')[1]
to grab the second part of the string, after the colon?
You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
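A small sketch of the simpler capture-group form on the strings from the question, assuming the surrounding double quotes are not part of the data. For comparison it also uses str.partition, which splits on the first colon only, so a name that itself contains a colon is kept whole:

```python
import re

lines = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

prefix_re = re.compile(r"SecurityGroup:(.*)")

by_regex = []
by_partition = []
for line in lines:
    m = prefix_re.search(line)
    by_regex.append(m.group(1) if m else None)
    # partition returns (before, separator, after) for the first ':'
    by_partition.append(line.partition(":")[2])

print(by_regex)
print(by_partition)
```

Both lists contain the same three names here; they would differ only for lines that do not start with the SecurityGroup: prefix.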
Maybe this:
([^:"]+[^\s](?="))
