How to search part of pattern in regex python

I can match the pattern as it is. But can I search for only part of the pattern, or do I have to pass that part separately?
e.g. pattern = '/(\w+)/(.+?)'
I can search this pattern using re.search and then use group to get individual groups.
But can I search only for say (\w+) ?
e.g.
pattern = r'/(\w+)/(.+?)'
pattern_match = re.search(pattern, string)
print(pattern_match.group(1))
Can I just search for part of the pattern, e.g. pattern.group(1) or something?

You can make any part of a regular expression optional by wrapping it in a non-capturing group followed by a ?, i.e. (?: ... )?.
pattern = '/(\w+)(?:/(.+))?'
This will match /abc/def as well as /abc.
In both examples pattern_match.group(1) will be abc, but pattern_match.group(2) will be def in the first one and None in the second one, because the optional group did not take part in the match.
For further reference, have a look at (?:x) in the special characters table at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
EDIT
Changed the second group to (.+), since I assume you want to match more than one character. .+ is called a "greedy" match, which will try to match as much as possible. .+? on the other hand is a "lazy" match that will only match the minimum number of characters necessary. In case of /abc/def, this will only match the d from def.
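As a minimal sketch of the behaviour described above (the helper name split_path and the sample strings are made up for illustration), the optional group and the lazy variant can be checked like this:

```python
import re

# Pattern from the answer above: the second path segment is optional.
PATH_RE = re.compile(r'/(\w+)(?:/(.+))?')


def split_path(text):
    """Return (first, rest) for the first match in text, or None if no match."""
    m = PATH_RE.search(text)
    if m is None:
        return None
    # group(2) is None when the optional part did not take part in the match.
    return m.group(1), m.group(2)


both = split_path('/abc/def')
only_first = split_path('/abc')

# The lazy variant from the question stops after a single character.
lazy_rest = re.search(r'/(\w+)/(.+?)', '/abc/def').group(2)

print(both, only_first, lazy_rest)
```

Checking for None before calling group() keeps the helper safe on strings that do not match at all.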

That pattern is merely a character string; send the needed slice however you want. For instance:
re.search(pattern[:6], string)
uses only the first 6 characters of your pattern. If you need to detect the end of the first pattern -- and you have no intervening right-parens -- you can use
rparen_pos = pattern.index(')')
re.search(pattern[:rparen_pos+1], string)
Another possibility is
pat1 = '/(\w+)'
pat2 = '/(.+?)'
big_match = re.search(pat1+pat2, string)
small_match = re.search(pat1, string)
You can get more inventive with group references (written $1, $2, etc. in Perl-style syntax; Python uses \1, \2 or \g<1>); see the links below for more help.
http://flockhart.virtualave.net/RBIF0100/regexp.html
https://docs.python.org/2/howto/regex.html
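A small sketch of the "build the pattern from parts" idea, assuming the fragments are kept as separate strings and compiled once. The names FIRST_PART and SECOND_PART are hypothetical, and the second fragment uses a greedy (.+) rather than the lazy (.+?) from the question so that it captures the whole segment:

```python
import re

# Hypothetical names; the two fragments mirror pat1 and pat2 above.
FIRST_PART = r'/(\w+)'
SECOND_PART = r'/(.+)'

# Compile the part and the whole separately, then use whichever is needed.
first_re = re.compile(FIRST_PART)
full_re = re.compile(FIRST_PART + SECOND_PART)

text = '/abc/def'
first_only = first_re.search(text).group(1)
first_and_second = full_re.search(text).groups()

print(first_only, first_and_second)
```

Keeping the fragments as named strings tends to be less fragile than slicing the combined pattern by character position.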

Related

Why doesn't \0 work in Python regexp substitutions, i.e. with sub() or expand(), while match.group(0) does, and also \1, \2, ...?

Why doesn't \0 work (i.e. to return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and also \1, \2, ... ?
This simple example (executed in Python 3.7) says it all:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by method: {}'.format(match.group(0)))
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
    print('Capture group 1, by method: {}'.format(match.group(1)))
    print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))
The output from this is:
Full match, by method: 123
Full match, by template:
Capture group 1, by method: 2
Capture group 1, by template: 2
Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?
Is this a Python bug?
Huh, you're right, that is annoying!
Fortunately, Python's way ahead of you. The docs for sub say this:
In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number.... The backreference \g<0> substitutes in the entire substring matched by the RE.
So your code example can be:
import re
subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'
regexp_obj = re.compile(regexp_pattern)
match = regexp_obj.search(subject)
if match:
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
You also asked the far more interesting question of "why?". The rationale in the docs explains that you can use this to replace with more than 10 capture groups, because it's not clear whether \10 should be substituted with the 10th group, or with the first capture group followed by a zero, but doesn't explain why \0 doesn't work. I've not been able to find a PEP explaining the rationale, but here's my guess:
We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of \0 "backreferencing" to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can't match with a backreference that looks like that, you shouldn't be able to replace with it either.
I can't say I agree with this logic, \g<> is already an arbitrary extension, but it's an argument that I can see someone making.
If you look in the docs, you will find the following:
The backreference \g<0> substitutes in the entire substring matched by the RE.
A bit deeper in the docs (dating back to 2003) you will find this tip:
There is a group 0, which is the entire matched pattern, but it can't be referenced with \0; instead, use \g<0>.
So, you need to follow these recommendations and use \g<0>:
expand_template_full = r'\g<0>'
Quoting from https://docs.python.org/3/library/re.html
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
To summarize:
Use \1, \2 up to \99 provided no more digits are present after the numbered backreference
Use \g<0>, \g<1>, etc (not limited to 99) to robustly backreference a group
As far as I know, \g<0> is useful in the replacement section to refer to the entire matched portion, but it wouldn't make sense in the search section
If you use the third-party regex module, then (?0) is useful in the search section as well, for example to create recursively matching patterns
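To make the summary concrete, here is a short sketch using only the standard re module (the sample text is made up). It shows \g<0> inserting the whole match, \g<1>0 avoiding the \10 ambiguity, and \0 being read as an octal escape:

```python
import re

text = 'item 7 costs 42'

# \g<0> inserts the entire match.
bracketed = re.sub(r'\d+', r'[\g<0>]', text)

# \g<1>0 is group 1 followed by a literal zero; \10 would mean group 10.
times_ten = re.sub(r'(\d+)', r'\g<1>0', text)

# \0 in a replacement template is an octal escape, i.e. the NUL character,
# which is why the "Full match, by template" line in the question looks empty.
nul = re.sub(r'\d+', r'\0', 'a1')

print(bracketed)
print(times_ten)
print(repr(nul))
```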

How to search/extract patterns in a string?

I have a pattern I want to search for in my message.
The patterns are:
1. "aaa-b3-c"
2. "a3-b6-c"
3. "aaaa-bb-c"
I know how to search for one of the patterns, but how do I search for all 3?
Also, how do you identify and extract dates in this format: 5/21 or 5/21/2019.
found = re.findall(r'.{3}-.{2}-.{1}', message)
Try this:
found = re.findall(r'a{2,4}-b{2}-c', message)
You could use
a{2,4}-bb-c
as a pattern.
Now you need to check the match for truthiness:
match = re.search(pattern, string)
if match:
    # do sth. here
    pass
As of Python 3.8 you can use the walrus operator, as in
if (match := re.search(pattern, string)) is not None:
    # do sth. here
    pass
Try this:
re.findall(r'a.*-b.*-c',message)
The first part could use a quantifier {2,4} instead of 3. The dot matches any character except a newline, whereas [a-zA-Z0-9] matches only an upper- or lowercase character a-z or a digit:
\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b
You could add word boundaries \b or anchors ^ and $ on either side if the characters should not be part of a longer word.
For the dates you could use \d with a quantifier to match the digits, and an optional group to match the part with / and 4 digits:
\d{1,2}/\d{2}(?:/\d{4})?
Note that this pattern only checks the format; it does not validate the date itself. Perhaps this page can help you create or customize a more specific date format.
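As a rough sketch of how the two patterns above could be used together (the sample message and the names CODE_RE and DATE_RE are assumptions for illustration), findall returns whole matches here because neither pattern contains a capturing group:

```python
import re

# Patterns taken from the answer above; only non-capturing groups are used,
# so findall returns the full matched strings.
CODE_RE = re.compile(r'\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b')
DATE_RE = re.compile(r'\d{1,2}/\d{2}(?:/\d{4})?')

# Made-up sample message containing all three codes and both date forms.
message = 'Ship aaa-b3-c and a3-b6-c by 5/21, then aaaa-bb-c by 5/21/2019.'

codes = CODE_RE.findall(message)
dates = DATE_RE.findall(message)

print(codes)
print(dates)
```

If the dates also need to be valid calendar dates, the matched strings would still have to be checked separately, for example with datetime.strptime.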
Here, we might just write three expressions, one per pattern, and connect them with logical ORs (|); if we had more patterns, we could simply add them in the same way:
([a-z]+-[a-z]+[0-9]+-[a-z]+)
([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])
([a-z]+-[a-z]+-[a-z])
which combine into:
([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z])
Then, we might want to bound it with start and end anchors:
^([a-z]+-[a-z]+[0-9]+-[a-z]+)$|^([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])$|^([a-z]+-[a-z]+-[a-z])$
or
^(([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z]))$
RegEx
If this expression isn't what you need, it can be modified or changed on regex101.com.
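The alternation above could be assembled in Python roughly as follows. This is only a sketch: the helper name is_code is hypothetical, and re.fullmatch is used in place of the explicit ^ and $ anchors:

```python
import re

# One alternative per pattern family, copied from the expressions above.
ALTERNATIVES = [
    r'[a-z]+-[a-z]+[0-9]+-[a-z]+',       # e.g. aaa-b3-c
    r'[a-z]+[0-9]+-[a-z]+[0-9]+-[a-z]',  # e.g. a3-b6-c
    r'[a-z]+-[a-z]+-[a-z]',              # e.g. aaaa-bb-c
]

# Wrap each alternative in a non-capturing group and join them with |.
COMBINED = re.compile('|'.join('(?:{})'.format(alt) for alt in ALTERNATIVES))


def is_code(text):
    """Return True if the whole string matches one of the alternatives."""
    return COMBINED.fullmatch(text) is not None


print([is_code(s) for s in ('aaa-b3-c', 'a3-b6-c', 'aaaa-bb-c', 'aaa-b3')])
```

Adding another pattern then only means appending one more entry to the list.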

Replacing single words with re.sub in python

Spoiler: Yes this is an assignment. It is solved, but for personal interest I want to know the below.
So at the moment I'm working on a syntax highlighter for an assignment, in which we read in a file and, using a dictionary of regexes, colour the keywords accordingly.
Having some issues, though.
for i in iterations:
    pass
For the line above, using the regex
r'(\t*for.*in.*?:.?)' will work, but it will colour the entire line. While that is allowed, I would really like it to mark only for/in.
Trying with r'(\bfor\b|\bin\b)' is not being kind, nor r'(for)', or r'(\sfor\s)'.
I read the whole code into one string and use re.sub() to replace all occurrences with colour + r'\1' + colour_end where colour specifies colour sequences.
You may use capturing and backreferences:
^(\t*)(for\b)(.*)\b(in)\b(.*?:)
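Replace with $1<color>$2</color>$3<color>$4</color>$5 (in Python's re.sub the same template is written \1<color>\2</color>\3<color>\4</color>\5). Since the whole file is read into one string, compile the pattern with re.MULTILINE so that ^ matches at the start of every line. See the regex demo.
Here, the expression is split into 5 subparts with (...) capturing groups. In the replacement pattern, the captured values are referred to with backreferences of the form $n (\n or \g<n> in Python), where n is the number of the capturing group inside the pattern.
If you have no chance to run 1 regex with multiple capturing groups, run two in sequence: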
^(\t*)for\b(?=.*\bin\b.*?:) --> $1<color>for</color> (see this demo)
^(\t*for\b.*)\bin\b(?=.*?:) --> $1<color>in</color> (see another demo).
The single capturing group is around the part before the word, and the part after the word is not matched but checked with a positive lookahead.
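A minimal Python sketch of the five-group approach, assuming the file has been read into a single string (the variable names and the <color> tags are placeholders). In re.sub the backreferences are written \1 to \5 rather than $1 to $5, and re.MULTILINE is needed so that ^ matches at every line start:

```python
import re

# Stand-in for the file contents read into one string.
source = "for i in iterations:\n    pass\n"

# Same pattern as above; MULTILINE makes ^ match at the start of each line.
FOR_IN_RE = re.compile(r'^(\t*)(for\b)(.*)\b(in)\b(.*?:)', re.MULTILINE)

# Groups 2 and 4 are the keywords; groups 1, 3 and 5 are put back unchanged.
highlighted = FOR_IN_RE.sub(r'\1<color>\2</color>\3<color>\4</color>\5', source)

print(highlighted)
```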
Here's my solution:
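import re
STR = """
for i in iterations:
    pass
"""
pattern = r'(\b)(for|in|pass)(\b)'
change = r'\1<COLOR>\2</COLOR>\3'
print(re.sub(pattern, change, STR))
So I'm capturing the keywords between word boundaries and putting the boundaries back as \1 and \3. (Since \b is zero-width, groups 1 and 3 are always empty, so r'\b(for|in|pass)\b' with the replacement r'<COLOR>\1</COLOR>' would behave the same.)
This gives:
<COLOR>for</COLOR> i <COLOR>in</COLOR> iterations:
    <COLOR>pass</COLOR>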
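A variation on the same idea, sketched with a replacement function instead of a template string. The function name mark_keywords and its start/end parameters are hypothetical; the point is that a callable avoids any backslash processing if the colour markers are terminal escape sequences:

```python
import re

# \b keeps "in" from matching inside longer words such as "print".
KEYWORD_RE = re.compile(r'\b(for|in|pass)\b')


def mark_keywords(code, start='<COLOR>', end='</COLOR>'):
    """Wrap each keyword in code with the given start and end markers."""
    # A function replacement is used so that start and end are inserted
    # literally, without being parsed as a replacement template.
    return KEYWORD_RE.sub(lambda m: start + m.group(1) + end, code)


marked = mark_keywords('for i in iterations:\n    pass\n')
print(marked)
```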

Python regex: Matching a URL

I have some confusion regarding the pattern matching in the following expression. I tried to look up online but couldn't find an understandable solution:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing? I understood up until the first asterisk, but I can't figure out what is happening after that.
Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
So this starts with an imgur URL http://i.imgur.com/(.*), which is mandatory and contains any characters until an optional '?' is encountered, followed by any characters after the '?'. Notice that the '?' has been escaped so that it loses its special meaning.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
EDIT:
These groups can then be used as:
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = p.match('http://i.imgur.com/abc.jpg')
m.group(0)  # 'http://i.imgur.com/abc.jpg' (the entire match)
m.group(2)  # 'abc.jpg'
To improve the regex, you should restrict each part to the characters you actually expect, like:
(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens
[^/] excludes /
The (.*) means any character repeated any number of times, and the (\?.*)? is meant to match the query string of a URL, for example (an imgur search for "cat"):
http://i.imgur.com/search?q=cat
http://i.imgur.com/search is what the (http://i.imgur.com/(.*)) part is intended to match (the search specifically by the (.*)), and ?q=cat is what the (\?.*)? part is intended to match. Because (.*) is greedy, however, it actually consumes the query string as well, so the last group is left empty. The ? at the end means optional: there might or might not be a query string. There is no query string in the URL http://www.imgur.com.
The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.
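To see what each group really captures, here is a small sketch comparing the original pattern with a tightened variant. The variant and the sample URL are assumptions for illustration: it escapes the dots and replaces (.*) with ([^?]*) so that the path stops at the first question mark:

```python
import re

ORIGINAL = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
# Assumed variant: dots escaped, and the path may not contain a "?".
TIGHTENED = re.compile(r'(http://i\.imgur\.com/([^?]*))(\?.*)?')

url = 'http://i.imgur.com/abc.jpg?x=1'

# The greedy (.*) swallows the query string, so group 3 stays None.
original_groups = ORIGINAL.match(url).groups()

# With ([^?]*) the query string is left for the optional third group.
tightened_groups = TIGHTENED.match(url).groups()

print(original_groups)
print(tightened_groups)
```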

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
    # output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
    # output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have capture groups in your regex pattern, findall returns only what those groups captured (a list of strings for a single group, a list of tuples for several groups), not the entire match, i.e. not group(0).
So, either you use a non-capturing group - (?:[\"\']), which means giving up the \1 backreference, or don't use any group at all, as in your 2nd case.
P.S: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, if the regex is used inside a loop, compile it outside the loop with re.compile, so that it is not re-compiled on every iteration.
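The difference in return shape can be seen side by side in a short sketch (the inline HTML string stands in for the contents of ex06-11.html, which is an assumption):

```python
import re

# Stand-in for the file contents.
html = '<div id="header">Some random text</div>'

DIV_RE = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")

# Two capturing groups: findall returns one tuple of groups per match.
as_tuples = DIV_RE.findall(html)

# finditer yields match objects, so any group (including group 0) can be picked.
texts = [m.group(2) for m in DIV_RE.finditer(html)]
whole = [m.group(0) for m in DIV_RE.finditer(html)]

# A single capturing group: findall returns plain strings.
single = re.findall(r'<div[^>]*>(.*?)</div>', html)

print(as_tuples, texts, whole, single)
```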
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, as Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML div whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they refer back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print(p.search(s).group(2))
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
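For comparison, here is a rough sketch of the parser route using only the standard library's html.parser (the class name DivTextById and the helper div_text are hypothetical, and this is not the BeautifulSoup approach mentioned above). It collects the text inside the div with a given id and tolerates either quote style:

```python
from html.parser import HTMLParser


class DivTextById(HTMLParser):
    """Collect the text inside the first-level <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'div':
                self.depth += 1  # nested div inside the target
        elif tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the text found inside the div whose id equals target_id."""
    parser = DivTextById(target_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


print(div_text("<div id='header'>Some random text</div>", 'header'))
```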
