Python Regex Negative Lookbehind - python

The pattern (?<!(asp|php|jsp))\?.* works in PCRE, but it doesn't work in Python.
So what can I do to get this regex working in Python? (Python 2.7)

It works perfectly fine for me. Are you maybe using it wrong? Make sure to use re.search instead of re.match:
>>> import re
>>> s = 'somestring.asp?1=123'
>>> re.search(r"(?<!(asp|php|jsp))\?.*", s)
>>> s = 'somestring.xml?1=123'
>>> re.search(r"(?<!(asp|php|jsp))\?.*", s)
<_sre.SRE_Match object at 0x0000000002DCB098>
Which is exactly how your pattern should behave. As glglgl mentioned, you can get the match if you assign that Match object to a variable (say m) and then call m.group(). That yields ?1=123.
By the way, you can leave out the inner parentheses. This pattern is equivalent:
(?<!asp|php|jsp)\?.*
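For reference, here is a minimal sketch of the whole check in one place, written for Python 3. The names QUERY_RE and query_part are made up for this example and are not part of the question.

```python
import re

# Same pattern as above, without the redundant inner group.
# All three alternatives are 3 characters long, which matters because
# the standard re module only accepts fixed-width lookbehinds.
QUERY_RE = re.compile(r"(?<!asp|php|jsp)\?.*")

def query_part(url):
    """Return the '?...' part unless it directly follows asp, php or jsp."""
    m = QUERY_RE.search(url)
    return m.group() if m else None

print(query_part("somestring.xml?1=123"))  # ?1=123
print(query_part("somestring.asp?1=123"))  # None
```

Calling .group() only after checking for None avoids an AttributeError when the lookbehind rejects the string.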

Related

Regex: define a particular group and avoid repeating it each time [duplicate]

Consider this (very simplified) example string:
1aw2,5cx7
As you can see, it is two digit/letter/letter/digit values separated by a comma.
Now, I could match this with the following:
>>> from re import match
>>> match("\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>
The problem is though, I have to write \d\w\w\d twice. With small patterns, this isn't so bad but, with more complex Regexes, writing the exact same thing twice makes the end pattern enormous and cumbersome to work with. It also seems redundant.
I tried using a named capture group:
>>> from re import match
>>> match("(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>
But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.
Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used later on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?
No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.
You can always do so by re-using Python variables, of course:
digit_letter_letter_digit = r'\d\w\w\d'
then use string formatting to build the larger pattern:
match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)
or, using Python 3.6+ f-strings:
dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)
I often do use this technique to compose larger, more complex patterns from re-usable sub-patterns.
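When there are several sub-patterns, one way to keep them manageable is to collect them in a dict and substitute them by name. This is only a sketch using the standard library; the FRAGMENTS dict and its keys are hypothetical names chosen for illustration.

```python
import re

# Hypothetical fragment names, chosen only for this example.
FRAGMENTS = {
    "id": r"\d\w\w\d",  # digit, two word characters, digit
    "sep": r",",
}

# str.format substitutes each fragment wherever its name appears.
# Literal braces in the template (e.g. a quantifier like {2}) would
# have to be doubled as {{2}} so that format leaves them alone.
PAIR_RE = re.compile(r"{id}{sep}{id}".format(**FRAGMENTS))

print(PAIR_RE.pattern)                   # \d\w\w\d,\d\w\w\d
print(bool(PAIR_RE.match("1aw2,5cx7")))  # True
```

The compiled pattern is an ordinary regex, so nothing about matching changes; only the way the pattern string is assembled does.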
If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number, re-uses the pattern of an already used (implicitly numbered) capturing group:
(\d\w\w\d),(?1)
^........^ ^..^
|           \
|            re-use pattern of capturing group 1
\
 capturing group 1
You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern matched by groupname (the latter two forms are alternatives for compatibility with other engines).
And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:
(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
                  |             \       /
       creates 'dlld' pattern   uses 'dlld' pattern twice
Just to be explicit: the standard library re module does not support subroutine patterns.
Note: this will work with PyPi regex module, not with re module.
You could use the notation (?group-number), in your case:
(\d\w\w\d),(?1)
it is equivalent to:
(\d\w\w\d),(\d\w\w\d)
Be aware that \w includes \d (and the underscore). If only letters should be allowed between the digits, the regex will be:
(\d[a-zA-Z]{2}\d),(?1)
I was troubled with the same problem and wrote this snippet
import nre
my_regex=nre.from_string('''
a=\d\w\w\d
b={{a}},{{a}}
c=(?P<id>{{a}}),(?P=id)
''')
my_regex["b"].match("1aw2,5cx7")
For lack of a more descriptive name, I named the partial regexes as a,b and c.
Accessing them is as easy as {{a}}
import re
digit_letter_letter_digit = re.compile(r"\d\w\w\d")  # compile the pattern once so it can be reused
all_finds = re.findall(digit_letter_letter_digit, "1aw2,5cx7")  # finditer could be used instead of findall
for value in all_finds:
    print(re.match(digit_letter_letter_digit, value))
Since you're already using re, why not use string processing to manage the pattern repetition as well:
pattern = "P,P".replace("P",r"\d\w\w\d")
re.match(pattern, "1aw2,5cx7")
OR
P = r"\d\w\w\d"
re.match(f"{P},{P}", "1aw2,5cx7")
Try using a backreference. I believe it works something like the pattern below, which is meant to match
1aw2,5cx7
You could use
(\d\w\w\d),\1
See here for reference http://www.regular-expressions.info/backref.html
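One caveat worth checking before relying on this: in the standard re module a backreference such as \1 repeats the text that group 1 captured, not its pattern. That is the same behaviour the question ran into with (?P=id). A quick sketch to verify it:

```python
import re

backref = re.compile(r"(\d\w\w\d),\1")

# \1 must match exactly what group 1 captured.
print(backref.match("1aw2,1aw2"))  # a match object
print(backref.match("1aw2,5cx7"))  # None
```

So a backreference fits inputs where the same value appears twice, while repeating the pattern itself needs one of the other approaches in this thread.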

Best regex to match IEEE Time Stamp format in Python

I did some searching but didn't find this specifically, and I'm sure it's going to be a quick answer.
I have a python script parsing IEEE date and time stamps out of strings, but I think I'm using python's match objects wrong.
import re
stir = "foo_2015-07-07-17-58-26.log"
timestamp = re.search("([0-9]+-){5}[0-9]+", stir).groups()
print timestamp
Produces
58-
When my intent is to get
2015-07-07-17-58-26
Is there a pre-canned regex that would work better here? Am I getting tripped up on re's capture groups? Why is the length of the groups() tuple only 1?
Edit
I was misinterpreting the way capture groups work in python's re module - there is only one set of parentheses in the statement, so the re module returned the most recently grabbed capture group - the "58-".
The way I ended up doing it was by referencing group(0), as Dawg suggests below.
timestamp = re.search("([0-9]+-){5}[0-9]+", stir)
print timestamp.group(0)
2015-07-07-17-58-26
You need a single capture group or groups:
(\d\d\d\d-\d\d-\d\d-\d\d-\d\d-\d\d)
Or, use a capture group with a nested non-capturing group:
>>> re.search(r'(\d{4}(?:-\d{2}){5})', 'foo_2015-07-07-17-58-26.log')
<_sre.SRE_Match object at 0x100b49dc8>
>>> _.group(1)
'2015-07-07-17-58-26'
Or, you can use your pattern and just use group(0) instead of groups():
>>> re.search("([0-9]+-){5}[0-9]+", "foo_2015-07-07-17-58-26.log").group(0)
'2015-07-07-17-58-26'
Or, use findall with an additional capture group (and the other a non capture group):
>>> re.findall("((?:[0-9]+-){5}[0-9]+)", 'foo_2015-07-07-17-58-26.log')
['2015-07-07-17-58-26']
But that will find the digits that are not part of the timestamp.
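To see side by side why groups() gave only "58-", here is a small sketch in Python 3 syntax (the question used Python 2's print statement). It reuses the question's own pattern and string.

```python
import re

stir = "foo_2015-07-07-17-58-26.log"
m = re.search(r"([0-9]+-){5}[0-9]+", stir)

# A repeated group keeps only its last repetition.
print(m.groups())  # ('58-',)
print(m.group(1))  # 58-
# group(0) (or group() with no argument) is always the entire match.
print(m.group(0))  # 2015-07-07-17-58-26
```

The tuple has length 1 because the pattern contains one pair of capturing parentheses, no matter how many times the quantifier repeats it.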
If you want the timestamp in one match object, I think this should work:
\d{4}(?:-\d{2}){5}
Then use group() or group(0).
Also, match.groups() returns a tuple of the captured subgroups rather than the whole match, so you should use .group() instead. With your pattern, the repeated group ([0-9]+-){5} keeps only its last repetition, which is why groups() gives just "58-" and drops the rest of the timestamp.
I'd use the pattern below:
_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.
The _ and \. mark the start and the end.
import re
r = r'_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.'
s = 'some string'
lst = re.findall(r, s)  # the pattern comes first, then the string to search
You might want
re.findall(r"([0-9-]+)", stir)
>>> import re
>>> stir = "foo_2015-07-07-17-58-26.log"
>>> re.findall(r"([0-9-]+)", stir)
['2015-07-07-17-58-26']

How can Python's regular expressions work with patterns that have escaped special characters?

Is there a way to get Python's regular expressions to work with patterns that have escaped special characters? As far as my limited understanding can tell, the following example should work, but the pattern fails to match.
import re
string = r'This a string with ^g\.$s' # A string to search
pattern = r'^g\.$s' # The pattern to use
string = re.escape(string) # Escape special characters
pattern = re.escape(pattern)
print(re.search(pattern, string)) # This prints "None"
Note:
Yes, this question has been asked elsewhere (like here). But as you can see, I'm already implementing the solution described in the answers and it's still not working.
Why on earth are you applying re.escape to the string?! You want to find the "special" characters in that! If you just apply it to the pattern, you'll get a match:
>>> import re
>>> string = r'This a string with ^g\.$s'
>>> pattern = r'^g\.$s'
>>> re.search(re.escape(pattern), re.escape(string)) # nope
>>> re.search(re.escape(pattern), string) # yep
<_sre.SRE_Match object at 0x025089F8>
For bonus points, notice that you just need to re.escape the pattern one more time than the string:
>>> re.search(re.escape(re.escape(pattern)), re.escape(string))
<_sre.SRE_Match object at 0x025D8DE8>
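Put together, a minimal sketch of the intended usage looks like this: escape only the literal text you are looking for and leave the searched string alone. The variable names text, literal and escaped are just for illustration.

```python
import re

text = r"This a string with ^g\.$s"  # text to search, left untouched
literal = r"^g\.$s"                  # substring to find verbatim

escaped = re.escape(literal)
print(escaped)                       # \^g\\\.\$s
m = re.search(escaped, text)
print(m.group())                     # ^g\.$s

# For a purely literal search, no regex is needed at all.
print(literal in text)               # True
```

re.escape earns its place when the literal text is only one piece of a larger pattern; for a plain substring test the in operator or str.find is simpler.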

Regular Expression with python

I have a tricky regular expression and I can't get it to work.
I need the regular expression for this :
AEBE52E7-03EE-455A-B3C4-E57283966239
I use it for an identification like this :
url(r'^user/(?P<identification>\<regular expression>)$', 'view_add')
I tried some expressions like these ones:
\[A-Za-z0-9]{8}^-{1}[A-Za-z0-9]{4}^-{1}[A-Za-z0-9]{4}^-{1}[A-Za-z0-9]{4}^-{1}[A-Za-z0-9]{12}
\........^-....^-....^-....^-............
Can someone help me?
Thanks.
Just remove all the ^ symbols present in your regex.
>>> s = 'AEBE52E7-03EE-455A-B3C4-E57283966239'
>>> re.match(r'[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}$', s)
<_sre.SRE_Match object; span=(0, 36), match='AEBE52E7-03EE-455A-B3C4-E57283966239'>
>>> re.match(r'[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}$', s).group()
'AEBE52E7-03EE-455A-B3C4-E57283966239'
-{1} can be written simply as -. It also seems that all the delimited words are hex codes, so you could use [0-9a-fA-F] instead of [A-Za-z0-9].
>>> re.match(r'[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$', s).group()
'AEBE52E7-03EE-455A-B3C4-E57283966239'
You don't need the ^, and the - doesn't need {1}. You can use the following pattern:
\w{8}-\w{4}-\w{4}-\w{4}-\w{12}
Note that \w matches any word character (A-Za-z0-9 and the underscore).
Or:
\w{8}-(\w{4}-){3}\w{12}
And, as mentioned in the comments, if you are matching a UUID, a more precise pattern is:
[a-fA-F\d]{8}(-[a-fA-F\d]{4}){3}-[a-fA-F\d]{12}
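Outside of a URL route, the same pattern can be checked directly in Python. The sketch below assumes Python 3.4+ for re.fullmatch and uses a non-capturing group; UUID_RE and is_uuid_text are hypothetical names. The standard library's uuid module is shown as an alternative way to parse the value.

```python
import re
import uuid

UUID_RE = re.compile(r"[a-fA-F\d]{8}(?:-[a-fA-F\d]{4}){3}-[a-fA-F\d]{12}")

def is_uuid_text(value):
    """True if value is hex digits in the 8-4-4-4-12 layout and nothing else."""
    return UUID_RE.fullmatch(value) is not None

s = "AEBE52E7-03EE-455A-B3C4-E57283966239"
print(is_uuid_text(s))        # True
print(is_uuid_text(s + "x"))  # False
print(uuid.UUID(s))           # aebe52e7-03ee-455a-b3c4-e57283966239
```

fullmatch plays the role that ^ and $ play in the url() pattern: the whole string has to fit, not just a prefix.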

Python match regex always returning None

I have a Python regex whose match method always returns None. I tested it on the pythex site and the pattern seems OK.
But when I try with re module, the result is always None:
import re
a = re.match(re.compile("\.aspx\?.*cp="), 'page.aspx?cpm=549&cp=168')
What am I doing wrong?
re.match() only matches at the start of a string. Use re.search() instead:
re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168')
Demo:
>>> import re
>>> re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168')
<_sre.SRE_Match object at 0x105d7e440>
>>> re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168').group(0)
'.aspx?cpm=549&cp='
Note that every re function that takes a pattern also accepts a plain string and will call re.compile() for you (which caches compilation results). You only need to use re.compile() if you want to store the compiled expression for re-use, at which point you can call pattern.search() on it:
pattern = re.compile(r"\.aspx\?.*cp=")
pattern.search('page.aspx?cpm=549&cp=168')
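The difference between the two methods can be seen side by side in a short sketch using the question's own pattern and URL. The third call is only an illustration of how match() behaves once the pattern itself covers the start of the string.

```python
import re

pattern = re.compile(r"\.aspx\?.*cp=")
url = "page.aspx?cpm=549&cp=168"

# match() is anchored at position 0, where the string has "page", not ".aspx".
print(pattern.match(url))           # None
# search() scans forward until the pattern fits somewhere.
print(pattern.search(url).group())  # .aspx?cpm=549&cp=
# match() succeeds once the pattern accounts for the leading text.
print(re.match(r".*\.aspx\?.*cp=", url).group())  # page.aspx?cpm=549&cp=
```

Online testers such as pythex typically behave like search(), which is why a pattern can look fine there and still return None from match().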
