Python RegEx or-ing problems - python

Hey there, I'm just trying to do some simple regex. What I want is anything in between a ? and a &, or a & and a &, or a & and the end of a string. I've been reading the docs, and I feel like I should at least be getting close with patterns such as:
p = re.compile('(\?.*?&)|(&.*?&)|(&.*?$)')
or
re.compile('[&\?](.*?)&')
but all variants I try are a little wonky. An explanation of what you did would also be nice. An Example:
?k=091910918&ack=901828312&p=999998
and it should yield:
k=091910918, ack=901828312, and p=999998
as answers. Thanks !

You can use the following regular expression:
>>> import re
>>> re.findall(r'[?&]([^?&]+)', '?k=091910918&ack=901828312&p=999998')
['k=091910918', 'ack=901828312', 'p=999998']
Regular expression:
[?&] # any character of: '?', '&'
( # group and capture to \1:
[^?&]+ # any character except: '?', '&' (1 or more times)
) # end of \1
You could just split here as well... assuming your string looks like this:
>>> list(filter(None, re.split('[?&]', '?k=091910918&ack=901828312&p=999998')))
['k=091910918', 'ack=901828312', 'p=999998']
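To see why the patterns in the question come out wonky, it can help to run one of them next to a variant that does not consume the trailing delimiter. This is only a sketch: the lookahead variant and the variable names are my own assumptions, not part of the answer above.

```python
import re

query = '?k=091910918&ack=901828312&p=999998'

# The question's second attempt consumes the closing '&', so the next
# match cannot start on it and only the first pair is found.
consuming = re.findall(r'[&?](.*?)&', query)

# A lookahead checks for '&' (or end of string) without consuming it.
non_consuming = re.findall(r'[?&](.*?)(?=&|$)', query)

print(consuming)      # ['k=091910918']
print(non_consuming)  # ['k=091910918', 'ack=901828312', 'p=999998']
```

The negated character class `[^?&]+` used in the answer sidesteps the problem in a different way: it simply can never run past a delimiter.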

If you don't mind only having one matched group, use this:
[\?&](\w+\=\d+)
If you want to have two matched groups for each one, use this:
[\?&](\w+)\=(\d+)
Pretty much the main problem you were having was that you were giving the regex too much flexibility on what to capture by using .*. If you restrict what each group can be a little, it ends up being much more cooperative.
Here's a demo of the first regex on Regex101
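As a quick check, the two patterns can be run with re.findall on the example string from the question. The variable names and the dict() step are assumptions added for illustration only.

```python
import re

query = '?k=091910918&ack=901828312&p=999998'

# One group per pair: findall returns a list of strings.
pairs = re.findall(r'[\?&](\w+\=\d+)', query)

# Two groups per pair: findall returns a list of (key, value) tuples,
# which can be passed straight to dict().
params = dict(re.findall(r'[\?&](\w+)\=(\d+)', query))

print(pairs)   # ['k=091910918', 'ack=901828312', 'p=999998']
print(params)  # {'k': '091910918', 'ack': '901828312', 'p': '999998'}
```

Note that \d+ only accepts numeric values; a value with letters in it would need a looser class such as [^&]+.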

Related

how to replace multiple consecutive repeating characters into 1 character in python?

I have a string in Python and I want to collapse multiple consecutive repeating characters into one.
For example:
st = "UUUURRGGGEENNTTT"
print(st.replace(r'(\w){2,}',r'\1'))
But this command doesn't seem to be working. Can anybody help me find what's wrong with it?
There is one more way to solve this, but I wanted to understand why the above command fails and whether there is any way to correct it:
print(re.sub(r"([A-Z])\1+",r"\1",st))  # prints URGENT
You need to use a regex, for example:
import re
re.sub(r'[^\w\s]|(.)(?=\1)', '', 'UUURRRUU')
The result is URU: each run of repeated characters collapses to a single character.
Here is a breakdown of the (.)(?=\1) part:
(.) means: capture any single character except a newline (line break)
(?=...) means: lookahead, i.e. check what comes next without consuming it
\1 means: match the text captured by group 1, which is the U or R ...
so (.)(?=\1) matches a character only when the same character follows it directly
[^\w\s] additionally matches anything that is neither a word character nor whitespace
then all matches are replaced with ''
You can also read up on this:
lookahead
Also check this tool, which I use to work out my regexes;
it describes everything and you can learn a lot from it:
regexer
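A small experiment may make the role of the lookahead clearer. The sketch below compares a lookahead for the very next character with one that has .* in front of the backreference; the 'AABA' test string and the variable names are assumptions chosen for illustration.

```python
import re

st = "UUUURRGGGEENNTTT"

# Remove a character only when the very next character is identical.
collapsed = re.sub(r'(.)(?=\1)', '', st)

# With '.*' in the lookahead, a character is removed whenever the same
# character appears anywhere later, not just directly after it.
adjacent_only = re.sub(r'(.)(?=\1)', '', 'AABA')
anywhere_later = re.sub(r'(.)(?=.*\1)', '', 'AABA')

print(collapsed)       # URGENT
print(adjacent_only)   # ABA
print(anywhere_later)  # BA
```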
Your code does not work because str.replace does not support regex; it can only replace a literal substring with another string. You will need to use the re module if you want to replace by matching a regex pattern.
Secondly, your regex pattern is also incorrect: (\w){2,} matches any run of 2 or more word characters (they don't have to be the same character), so it will not work. You will need to do something like this:
import re
st = "UUUURRGGGEENNTTT"
print(re.sub(r'(\w)\1+',r'\1', st))
# URGENT
Now this will only match the same character 2 or more times.
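The three behaviours discussed here can be compared side by side. This is just a sketch with made-up variable names, run on the string from the question.

```python
import re

st = "UUUURRGGGEENNTTT"

# str.replace looks for the literal text '(\w){2,}', which is not in st,
# so the string comes back unchanged.
literal = st.replace(r'(\w){2,}', r'\1')

# (\w){2,} matches the whole run of word characters in one go; the group
# keeps only its last repetition, so everything collapses to 'T'.
any_run = re.sub(r'(\w){2,}', r'\1', st)

# (\w)\1+ requires the repeats to equal the captured character.
same_run = re.sub(r'(\w)\1+', r'\1', st)

print(literal)   # UUUURRGGGEENNTTT
print(any_run)   # T
print(same_run)  # URGENT
```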
An alternative, “unique” solution to this is that you can use the unique_justseen recipe that itertools provides:
from itertools import groupby
from operator import itemgetter
st = "UUUURRGGGEENNTTT"
new ="".join(map(next, map(itemgetter(1), groupby(st))))
print(new)
# URGENT
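If the map/itemgetter chain is hard to read, roughly the same idea can be written with a generator expression. This is a sketch of an equivalent form, not the itertools recipe verbatim; the name collapsed is an assumption.

```python
from itertools import groupby

st = "UUUURRGGGEENNTTT"

# groupby yields (key, group) pairs for each run of equal characters;
# keeping only the key drops the repeats.
collapsed = "".join(key for key, _group in groupby(st))
print(collapsed)  # URGENT
```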
string.replace(s, old, new[, maxreplace]) only does substring replacement:
>>> '(\w){2,}'.replace(r'(\w){2,}',r'\1')
'\\1'
That's why it fails: str.replace cannot work with regular expressions, so there is no way to correct the first command without switching to re.sub.

REGEX formulating conditions

Just started learning python and regex.
My regex:
\b\d+\s+([A-Za-z]* |[A-Za-z]*\s+[A-Za-z]*)\s+\D+..
using https://regex101.com/
string 1: https://i.imgur.com/XNuXftW.jpg (why does Beer have whitespace while carrot/chocolate don't?)
string 2: https://i.imgur.com/nrl2FPB.jpg (adding more \s+[A-Za-z] to the capture group doesn't seem to work anymore. Why?)
string 3: https://i.imgur.com/qH0Z7Hi.jpg (same problem as string 2)
My question is: how do I extend the regex so that it covers the above cases? Thank you.
In case you need to test it yourself, I have provided the strings below.
You could use this regex, which takes advantage of look-behind (?<=) and look-ahead (?=) so it only captures the product names:
(?<=\s\s)\w+(?:\s\w+)*(?=\s\s)
See demo on regex101.com.
Use it with the g modifier.
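In Python the equivalent of the g modifier is re.findall (or re.finditer), which already returns every match. The sketch below uses hypothetical sample lines, since the original test strings are not shown here; it assumes the columns are separated by two or more spaces.

```python
import re

# Hypothetical sample lines; the real test strings are not shown above.
# Columns are assumed to be separated by two or more spaces.
lines = [
    "3  Beer  2.00",
    "12  Chocolate cake  4.50",
    "7  Ice cream cone  3.25",
]

name_pattern = re.compile(r'(?<=\s\s)\w+(?:\s\w+)*(?=\s\s)')

# findall returns every match in a line, so no 'g' flag is needed.
names = [name_pattern.findall(line) for line in lines]
print(names)  # [['Beer'], ['Chocolate cake'], ['Ice cream cone']]
```

Because the name is allowed any number of single-space-separated words, one, two, or three word names are handled by the same pattern.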
Try this:
\d+\s+([A-Za-z ]*)\b *\D+
See on regex101.
I guess the space before "|" is what causes "Beer " to be captured with a trailing space in string 1. "Chocolate cake" does not end up like "Beer " because it is matched by the second alternative, which is
[A-Za-z]*\s+[A-Za-z]*
For string 2, the alternative
[A-Za-z]*\s+[A-Za-z]*
matches exactly two words.
How about trying the regex below, modified from trincot's answer:
(?<=\s\s)(\w+\s)+(\w+)(?=\s\s)

How to use regex substitute using different capture and match strings?

I want the regex machine to look for a certain pattern, and then only replace a subset of that pattern. The strings look like this:
string1 = 'r|gw|gwe|bbbss|gwe | s'
And, I want to replace some of the strings using a regex like this:
re.sub('\|(gw.*)\|','nn',string1)
So, I want to look for the stuff between the |'s, but I only want to replace what's between them, and not the entire |(gw.*)|.
Is there a concise way to do this?
If you want to retain the pipe characters and match overlapping context, you need to use lookaround assertions. Because * is a greedy operator, it will consume as much as possible.
In this case you can use a negated character class or *? to prevent greediness.
>>> re.sub(r'(?<=\|)gw[^|]*(?=\|)', 'nn', string1)
'r|nn|nn|bbbss|nn| s'
Or you could take a more general approach perhaps:
>>> '|'.join(['nn' if i.startswith('gw') else i for i in string1.split('|')])
'r|nn|nn|bbbss|nn| s'
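To see what greediness and overlapping context mean in practice, here is a sketch that runs the question's pattern, a version that consumes both pipes, and the lookaround version on the same input. The middle pattern and the variable names are assumptions added for comparison.

```python
import re

string1 = 'r|gw|gwe|bbbss|gwe | s'

# Greedy .* runs to the last pipe, so one big match swallows everything.
greedy = re.sub(r'\|(gw.*)\|', 'nn', string1)

# Consuming both pipes fixes the greediness but skips 'gwe', because its
# leading pipe was already used up by the previous match.
consuming = re.sub(r'\|gw[^|]*\|', '|nn|', string1)

# Lookarounds assert the pipes without consuming them.
lookaround = re.sub(r'(?<=\|)gw[^|]*(?=\|)', 'nn', string1)

print(greedy)      # rnn s
print(consuming)   # r|nn|gwe|bbbss|nn| s
print(lookaround)  # r|nn|nn|bbbss|nn| s
```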

Match everything except a specific string

I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I am looking for is the output should find anything after the colon (:). I also thought I can do something like using group 2 as my match... but the problem with that is, if there are spaces in the name then I won't be able to get the correct name.
(SecurityGroup):(\w{1,})
Why not just do
security_string.split(':')[1]
To grab the second part of the String after the colon?
You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<= # positive lookbehind. Matches things preceded by the following group
SecurityGroup: # pattern you want your matches preceded by
) # end positive lookbehind
( # start matching group
.* # any number of characters
) # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
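For comparison, here is a sketch that runs the character-class attempt from the question, the capture-group pattern, and a plain split on the three sample strings. The maxsplit argument of 1 and the variable names are my own additions; maxsplit only matters if the name itself could contain a colon.

```python
import re

names = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

# The negated character class skips any leading characters found in the
# prefix, so "Group-Name" loses its first word.
char_class = [re.search(r'[^S^e^c^r^i^t^y^G^r^o^u^p^:].*', n).group() for n in names]

# Matching the literal prefix and capturing the rest keeps spaces and
# repeated words intact.
captured = [re.search(r'SecurityGroup:(.*)', n).group(1) for n in names]

# Splitting once on the first colon gives the same result without a regex.
split_once = [n.split(':', 1)[1] for n in names]

print(char_class)  # ['Pub HDP SG', '-Name', 'TestName']
print(captured)    # ['Pub HDP SG', 'Group-Name', 'TestName']
print(split_once)  # ['Pub HDP SG', 'Group-Name', 'TestName']
```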
Maybe this:
([^:"]+[^\s](?="))
Regex live here.

Regex - Using * with a set of characters

I'm fairly new at regex, and I've run into a problem that I cannot figure out:
I am trying to match a string that starts with an arbitrary number of A-Z, 0-9, and _ characters, optionally followed by a number enclosed in a single set of parentheses, which may or may not be separated from the preceding characters by a space.
Examples of what this should find:
_ABCD1E
_123FD(13)
ABDF1G (2)
This is my current regex expression:
[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}
It's finding everything just fine, but a problem exists if I have the following:
_ABCDE )
It should only grab _ABCDE and not the " )" but it currently grabs '_ABCDE )'
Is there some way I can grab the (#) but not get extra characters if that entire pattern does not exist?
If possible, please explain syntax as I am aiming to learn, not just get the answer.
ANSWER: The following code is working for what I needed so far:
[A-Z_0-9]+(\s*\([\d]+\)){0,1}
# or, as has been mentioned, the above can be simplified
# and cleaned up a bit to be
[A-Z_0-9]+(\s*\(\d+\))?
# The [] around \d are unnecessary and {0,1} is equivalent to ?
Adding the parentheses around the (#) pattern allows for the use of ? or {0,1} on the entire pattern. I also changed the [\d]* to be [\d]+ to ensure at least one number inside of the parentheses.
Thanks for the fast answers, all!
Your regex says that each paren (open & closed) may or may not be there, INDEPENDENTLY. Instead, you should say that the number-enclosed-in-parens may or may not be there:
(\([\d]*\)){0,1}
Note that this allows for there to be nothing in the parens; that's what your regex said, but I'm not clear that's what you actually want.
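The difference between independently optional parentheses and an optional group can be checked on the examples from the question. This is a sketch only: the variable names are made up, and the grouped pattern uses \d+ and a non-capturing group as assumptions about what is wanted.

```python
import re

samples = ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )']

# Each parenthesis is optional on its own, so a stray ')' is accepted.
independent = re.compile(r'[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}')

# The whole "(digits)" unit is optional; it matches completely or not at all.
grouped = re.compile(r'[A-Z_0-9]+(?:\s*\(\d+\))?')

independent_matches = [independent.match(s).group() for s in samples]
grouped_matches = [grouped.match(s).group() for s in samples]

print(independent_matches)  # ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )']
print(grouped_matches)      # ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE']
```

Putting \s* inside the group also means the trailing space is only consumed when a parenthesised number actually follows.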
how about
^[A-Z0-9_]+\s*(\([0-9]+\))?$
btw, from your example, the first part accepts not only [A-Z_], but also [0-9]
This seems to do the job.
[0-9A-Z_]+\s*(?:\([0-9]*\))?
It seems like you want the following regex:
^[A-Z\d_]+(\s*\(\d+\))?$
I used a non-capturing group to avoid grouping matching in results:
>>> import re
>>> pattern = r'[A-Z_]+\s*(?:\(\d+\)|\d*)'
>>> l = ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )', 'A_B (15)', 'E (345']
>>> [re.search(pattern , i).group() for i in l]
['_ABCD1', '_123', 'ABDF1', '_ABCDE ', 'A_B (15)', 'E ']
