Assume that I want to modify all patterns in a script, take one line as an example:
line = "assert Solution().oddEvenList(genNode([2,1,3,5,6,4,7])) == genNode([2,3,6,7,1,5,4]), 'Example 2'"
Notice that the function genNode takes a List[int] as its parameter. What I want is to remove the list brackets and keep all the integers in the list, so that the function actually takes *nums as its parameters.
Expecting:
line = "assert Solution().oddEvenList(genNode(2,1,3,5,6,4,7)) == genNode(2,3,6,7,1,5,4), 'Example 2'"
I've come up with a re pattern
r"([g][e][n][N][o][d][e][(])([[][0-9\,\s]*[]])([)])"
but I am not sure how I could use this... I can't get re.sub to work as it requires me to replace with a fixed string.
How can I achieve my desired result?
You can do:
re.sub(r'(genNode\()\[([^]]+)\]', r'\1\2', line)
(genNode\() matches genNode( and puts it in captured group 1
\[ matches a literal [
([^]]+) matches up to the next ] and puts it in captured group 2
\] matches a literal ]
In the replacement, we use only the captured groups, i.e. we drop [ and ].
You can get rid of the first captured group by using a zero-width positive lookbehind to match the portion before [:
re.sub(r'(?<=genNode\()\[([^]]+)\]', r'\1', line)
Example:
In [444]: line = "assert Solution().oddEvenList(genNode([2,1,3,5,6,4,7])) == genNode([2,3,6,7,1,5,4]), 'Example 2'"
In [445]: re.sub(r'(genNode\()\[([^]]+)\]', r'\1\2', line)
Out[445]: "assert Solution().oddEvenList(genNode(2,1,3,5,6,4,7)) == genNode(2,3,6,7,1,5,4), 'Example 2'"
In [446]: re.sub(r'(?<=genNode\()\[([^]]+)\]', r'\1', line)
Out[446]: "assert Solution().oddEvenList(genNode(2,1,3,5,6,4,7)) == genNode(2,3,6,7,1,5,4), 'Example 2'"
FWIW, using the typical non-greedy pattern .*? instead of [^]]+ would work as well:
re.sub(r'(?<=genNode\()\[(.*?)\]', r'\1', line)
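The two variants are not fully interchangeable on an empty list argument. The sketch below is not from the original answer; it assumes a hypothetical line containing genNode([]) and compares how [^]]+ (one or more characters) and .*? (zero or more) treat it.

```python
import re

# Hypothetical input: the second call passes an empty list
line = "assert genNode([1,2]) == genNode([])"

# [^]]+ needs at least one character between the brackets
with_negated_class = re.sub(r'(?<=genNode\()\[([^]]+)\]', r'\1', line)
# .*? also accepts zero characters between the brackets
with_lazy_dot = re.sub(r'(?<=genNode\()\[(.*?)\]', r'\1', line)

print(with_negated_class)  # assert genNode(1,2) == genNode([])
print(with_lazy_dot)       # assert genNode(1,2) == genNode()
```

Which behaviour is preferable depends on whether genNode() with no arguments is meaningful in the script being rewritten.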
Instead of writing [g][e][n][N][o][d][e][(] you could write genNode\(
The character class that you use, [0-9\,\s]*, matches 0+ times any of the listed characters. It could, for example, match only commas, and it does not ensure that the digits are comma separated.
To match comma-delimited integers, you could match 1+ digits followed by a repeating group that matches a comma and 1+ digits.
At the end, use a positive lookahead to assert the closing parenthesis, or capture it in group 3 and also use that group in the replacement.
With this pattern use r'\1\2' as the replacement.
(genNode\()\[(\d+(?:,\d+)*)\](?=\))
Explanation
(genNode\() Capture in group 1 matching genNode(
\[ Match [
( Capturing group 2
\d+(?:,\d+)* Match 1+ digits and repeat 0+ times a comma and 1+ digits (to also support a single digit)
) Close group 2
\] Match ]
(?=\)) Positive lookahead, assert what is on the right is a closing parenthesis )
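To make the difference between the character class and the stricter pattern concrete, here is a small sketch. The loose pattern is a simplified form of the one in the question, and the sample strings are hypothetical inputs chosen only to show where the two disagree.

```python
import re

# Character-class version, simplified from the pattern in the question
loose = re.compile(r"genNode\(\[[0-9,\s]*\]\)")
# Stricter version explained above
strict = re.compile(r"(genNode\()\[(\d+(?:,\d+)*)\](?=\))")

# Hypothetical inputs
samples = ["genNode([1,2,3])", "genNode([7])", "genNode([,,,])", "genNode([1, 2])"]

# Map each sample to (matched by loose, matched by strict)
results = {s: (bool(loose.search(s)), bool(strict.search(s))) for s in samples}
for sample, (is_loose, is_strict) in results.items():
    print(f"{sample:20} loose={is_loose} strict={is_strict}")
```

The loose pattern accepts all four samples, including the comma-only list. The strict pattern rejects the comma-only list, and also the list with a space after the comma, so a script that formats its lists with spaces would need something like ,\s* in place of the plain comma.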
For example
import re
regex = r"(genNode\()\[(\d+(?:,\d+)*)\](?=\))"
line = "assert Solution().oddEvenList(genNode([2,1,3,5,6,4,7])) == genNode([2,3,6,7,1,5,4]), 'Example 2'"
# re.sub always returns a string; it equals the input when nothing matches
result = re.sub(regex, r"\1\2", line)
if result:
    print(result)
Result
assert Solution().oddEvenList(genNode(2,1,3,5,6,4,7)) == genNode(2,3,6,7,1,5,4), 'Example 2'
Related
I was trying out to solve a problem on regex:
There is an input sentence which is of one of these forms: Number1,2,3 or Number1/2/3 or Number1-2-3 these are the 3 delimiters: , / -
The expected output is: Number1,Number2,Number3
Pattern I've tried so far:
(?<=,)[^,]+(?=,)
but this misses out on the edge cases i.e. 1st element and last element. I am also not able to generate for '/'.
You could separate out the key from values, then use a list comprehension to build the output you want.
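import re
inp = "Number1,2,3"
# Group 1: the non-digit key, group 2: the rest (the delimited numbers)
matches = re.search(r'(\D+)(.*)', inp)
# Split on any of the three delimiters: , / -
output = [matches[1] + x for x in re.split(r'[,/-]', matches[2])]
print(output) # ['Number1', 'Number2', 'Number3']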
You can do it in several steps: 1) validate the string to match your pattern, and once validated 2) add the first non-digit chunk to the numbers while replacing - and / separator chars with commas:
import re
texts = ['Number1,2,3', 'Number1/2/3', 'Number1-2-3']
for text in texts:
    m = re.search(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$', text)
    if m:
        print( re.sub(r'(?<=,)(?=\d)', m.group(1).replace('\\', '\\\\'), text.replace('/',',').replace('-',',')) )
    else:
        print(f"NO MATCH in '{text}'")
See this Python demo.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
The ^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$ regex validates your three types of input:
^ - start of string
(\D+) - Group 1: one or more non-digits
(\d+(?=([,/-]))(?:\3\d+)*) - Group 2: one or more digits, and then zero or more repetitions of ,, / or - and one or more digits (and the separator chars should be consistent due to the capture used in the positive lookahead and the \3 backreference to that value used in the non-capturing group)
$ - end of string.
The re.sub pattern, (?<=,)(?=\d), matches a location between a comma and a digit, the Group 1 value is placed there (note the .replace('\\', '\\\\') is necessary since the replacement is dynamic).
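The escaping step matters because re.sub processes backslash escapes in a replacement string. The sketch below is an illustration under an assumed input: the text C:\tmp1,2,3 is hypothetical and was picked so that the prefix contains \t. It also shows a function replacement as an alternative, since the return value of a function is inserted literally.

```python
import re

text = r"C:\tmp1,2,3"                     # hypothetical input with a backslash in the prefix
prefix = re.match(r"\D+", text).group()   # 'C:\tmp' with a literal backslash

pattern = r"(?<=,)(?=\d)"                 # position between a comma and a digit

# Unescaped: the \t in the replacement template is turned into a tab character
naive = re.sub(pattern, prefix, text)
# Escaped as in the answer above: the backslash survives
escaped = re.sub(pattern, prefix.replace("\\", "\\\\"), text)
# Function replacement: the returned string is used as is, no template processing
via_func = re.sub(pattern, lambda m: prefix, text)

print(repr(naive))
print(escaped)   # C:\tmp1,C:\tmp2,C:\tmp3
print(via_func)  # C:\tmp1,C:\tmp2,C:\tmp3
```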
import re
for text in ("Number1,2,3", "Number1-2-3", "Number1/2/3"):
    print(re.sub(r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)", r"\1\2,\1\3,\1\4", text))
\D+ matches "Number" or any other non-number text
\d+ matches a number (or more than one)
[/,-] matches any of /, ,, -
The rest is copy paste 3 times.
The substitution consists of backreferences to the matched "Number" string (\1) and then each group of the (\d+)s.
This works if you're sure that it's always three numbers divided by that separator. This does not ensure that it's the same separator between each number. But it's short.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
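The limits mentioned for this short version can be checked directly. This is a sketch with hypothetical inputs (mixed separators, two numbers, four numbers) that are not part of the original question.

```python
import re

pattern = r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)"
repl = r"\1\2,\1\3,\1\4"

# Mixed separators are accepted, because each [/,-] is matched independently
mixed = re.sub(pattern, repl, "Number1,2/3")
# Only two numbers: the pattern does not match, so the text is returned unchanged
two = re.sub(pattern, repl, "Number1,2")
# Four numbers: the first three are rewritten, the fourth is left as it was
four = re.sub(pattern, repl, "Number1,2,3,4")

print(mixed)  # Number1,Number2,Number3
print(two)    # Number1,2
print(four)   # Number1,Number2,Number3,4
```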
If you can make use of the pypi regex module you can use the captures collection with a named capture group.
([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)
([^\d\s,/]+) Capture group 1, match 1+ chars other than the listed
(?<num>\d+) Named capture group num matching 1+ digits
([,/-]) Capture either , / - in group 3
(?<num>\d+) Named capture group num matching 1+ digits
(?:\3(?<num>\d+))* Optionally repeat a backreference to group 3 to keep the separators the same and match 1+ digits in group num
(?!\S) Assert a whitespace boundary to the right to prevent a partial match
import regex as re  # third-party module from PyPI, not the standard library re
pattern = r"([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)"
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
for m in re.finditer(pattern, s):
    # captures("num") returns every value the repeated group num has matched
    print(','.join(m.group(1) + c for c in m.captures("num")))
Output
Number1,Number2,Number3
Number4,Number5,Number6
I have a DataFrame with list of strings as below
df
text
,info_concern_blue,replaced_mod,replaced_rad
,info_concern,info_concern_red,replaced_unit
,replaced_link
I want to replace every word that starts with info_concern (e.g. info_concern_blue or info_concern_red) with plain info_concern, up to the next comma.
I tried the following regex:
df['replaced_text'] = [re.sub(r'info_concern[^,]*.+?,', 'info_concern,',
x) for x in df['text']]
But this is giving me incorrect results.
Desired output:
replaced_text
,info_concern,replaced_mod,replaced_rad
,info_concern,info_concern,replaced_unit
,replaced_link
Please suggest/advise.
You can use
df['replaced_text'] = df['text'].str.replace(r'(info_concern)[^,]*', r'\1', regex=True)
See the regex demo.
If you want to make sure the match starts right after a comma or start of string, add the (?<![^,]) lookbehind at the start of the pattern:
df['replaced_text'] = df['text'].str.replace(r'(?<![^,])(info_concern)[^,]*', r'\1', regex=True)
See this regex demo. Details:
(?<![^,]) - right before, there should be either , or start of string
(info_concern) - Group 1: info_concern string
[^,]* - zero or more chars other than a comma.
The \1 replacement replaces the match with Group 1 value.
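The effect of the lookbehind only shows up when info_concern appears in the middle of a longer word. The sketch below uses plain re.sub rather than pandas so that it runs on its own, and the sample row is a hypothetical one that is not in the question's data.

```python
import re

# Hypothetical row: the first field merely contains info_concern
text = ",my_info_concern_blue,info_concern_red"

# Without the lookbehind, the match may start in the middle of a field
plain = re.sub(r'(info_concern)[^,]*', r'\1', text)
# With (?<![^,]), the match must start right after a comma or at the start of the string
anchored = re.sub(r'(?<![^,])(info_concern)[^,]*', r'\1', text)

print(plain)     # ,my_info_concern,info_concern
print(anchored)  # ,my_info_concern_blue,info_concern
```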
The issue is that in the pattern info_concern[^,]*.+?, the [^,]* part already matches up to (but not including) the first comma.
Then the .+?, part matches at least one more character (which can itself be a comma, because . matches it) and continues up to the next comma.
So if there is a second comma, the pattern overmatches and removes too much.
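The overmatch can be reproduced on the rows from the question. This is a sketch using plain re.sub on a list instead of a DataFrame, to keep it self-contained; the comparison pattern (info_concern)[^,]* is the one suggested earlier in this thread.

```python
import re

rows = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link",
]

# Pattern from the question: .+?, runs on to the second comma and swallows a field
overmatched = [re.sub(r'info_concern[^,]*.+?,', 'info_concern,', x) for x in rows]
# Pattern without the extra .+?, part
fixed = [re.sub(r'(info_concern)[^,]*', r'\1', x) for x in rows]

for before, after in zip(overmatched, fixed):
    print(f"{before:35} | {after}")
```

In the first row the field replaced_mod is lost, and in the second row the whole info_concern_red field is lost, which matches the explanation above.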
You could also assert info_concern to the left, and match any char except a comma to be removed by an empty string.
If there has to be a comma to the right, you can assert it.
(?<=\binfo_concern)[^,]*(?=,)
The pattern matches:
(?<=\binfo_concern) Positive lookbehind, assert info_concern to the left
[^,]* Match 0+ times any char except ,
(?=,) Positive lookahead, assert , directly to the right
If the comma is not mandatory, you can omit the lookahead
(?<=\binfo_concern)[^,]*
For example
import pandas as pd
texts = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link"
]
df = pd.DataFrame(texts, columns=["text"])
df['replaced_text'] = df['text'].str.replace(r'(?<=\binfo_concern)[^,]*(?=,)', '', regex=True)
print(df)
Output
text replaced_text
0 ,info_concern_blue,replaced_mod,replaced_rad ,info_concern,replaced_mod,replaced_rad
1 ,info_concern,info_concern_red,replaced_unit ,info_concern,info_concern,replaced_unit
2 ,replaced_link ,replaced_link
I would like to extract a filename from a path using regular expression:
mysting = '/content/drive/My Drive/data/happy (463).jpg'
How do I extract 'happy.jpg'?
I have tried this: '[^/]*$' but the result still includes the number in parenthesis which I do not want: 'happy (463).jpg'
How could I improve it?
You could use 2 capturing groups. First match up to the last / and capture 1+ word chars in group 1.
Then match 1+ digits between parentheses and capture .jpg in group 2, asserting the end of the string.
^.*/(\w+)\s*\(\d+\)(\.jpg)$
In parts that will match
^.*/ Match until last /
(\w+) Capture group 1, match 1+ word chars
\s* Match 0+ whitespace chars
\(\d+\) Match 1+ digits between parentheses
(\.jpg) Capture group 2, match .jpg
$ End of string
Then use group 1 and group 2 in the replacement to get happy.jpg
import re
regex = r"^.*/(\w+)\s*\(\d+\)(\.jpg)$"
test_str = "/content/drive/My Drive/data/happy (463).jpg"
# count is passed by keyword; the pattern is anchored, so at most one match is possible
result = re.sub(regex, r"\1\2", test_str, count=1)
if result:
    print(result)
Output
happy.jpg
Without Regex; str methods (str.partition and str.rpartition):
In [185]: filename = mysting.rpartition('/')[-1]
In [186]: filename
Out[186]: 'happy (463).jpg'
In [187]: f"{filename.partition(' ')[0]}.{filename.rpartition('.')[-1]}"
Out[187]: 'happy.jpg'
With Regex; re.sub:
re.sub(r'.*/(?!.*/)([^\s]+)[^.]+(\..*)', r'\1\2', mysting)
.*/ greedily matches up to the last /
The zero-width negative lookahead (?!.*/) ensures there is no / anywhere further on
([^\s]+) matches up to the next whitespace and is put in the first captured group
[^.]+ matches up to the next .
(\..*) matches a literal . followed by any number of characters and is put in the second captured group; if you want to match more conservatively, e.g. exactly 3 characters or even a literal .jpg, you can do that as well
in the replacement, only the captured groups are used
Example:
In [183]: mysting = '/content/drive/My Drive/data/happy (463).jpg'
In [184]: re.sub(r'.*/(?!.*/)([^\s]+)[^.]+(\..*)', r'\1\2', mysting)
Out[184]: 'happy.jpg'
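Both the str-method version and the re.sub version above cut the name at the first whitespace, so a base name that itself contains a space would be shortened. As a possible variation, and only a sketch, the path handling can be left to pathlib and the regex limited to the trailing (digits) part. The helper name clean_name and the second and third sample paths are hypothetical.

```python
import re
from pathlib import PurePosixPath

def clean_name(path):
    """Hypothetical helper: return the file name without a trailing ' (N)' counter."""
    name = PurePosixPath(path).name  # last path component, e.g. 'happy (463).jpg'
    # Remove optional whitespace plus '(digits)' only when it sits right before the extension
    return re.sub(r'\s*\(\d+\)(?=\.[^.]+$)', '', name)

print(clean_name('/content/drive/My Drive/data/happy (463).jpg'))  # happy.jpg
print(clean_name('/data/my photo (2).png'))                        # my photo.png
print(clean_name('/data/plain.jpg'))                               # plain.jpg
```

PurePosixPath is used here because the path in the question uses forward slashes; it only parses the string and does not touch the file system.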
I use JavaScript.
In JavaScript, you can do:
const myString = "happy (463).jpg";
const result = myString.replace(/\s\(\d*\)/, '');
After you split the path on the slash separator, you can apply this code to the last segment.
Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen followed by 0+ whitespace characters, and then captures 1+ digits, letters or underscores. When the pattern contains a capturing group, re.findall returns only the captured texts, so you only get the Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
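Following the note above about checking for a match before reading Group 1, the indexed pattern could be wrapped in a small function. This is a sketch: the helper name nth_hyphen_token is hypothetical, and n is assumed to be a positive integer.

```python
import re

def nth_hyphen_token(s, n):
    """Hypothetical helper: return the n-th (1-based) word that follows a hyphen, or None."""
    # Build ^(?:.*?-\s*(\w+)){n}; Group 1 keeps the capture of the last repetition
    m = re.search(r'^(?:.*?-\s*(\w+)){%d}' % n, s)
    return m.group(1) if m else None

s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
print(nth_hyphen_token(s, 1))  # AUSVERSION
print(nth_hyphen_token(s, 3))  # TESTAGAIN
print(nth_hyphen_token(s, 5))  # None, there are only four hyphens
```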
I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use look-behind and look-ahead as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
This should work:
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
( # Match the regular expression below and capture its match into backreference number 1
[^'"] # Match a single character NOT present in the list “'"” from this character class (aka any character matches except a single and double quote)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
"""
re.sub accepts a function as the replacement. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")
yields
"ABC"