Python regex: Matching a URL


I am confused about the pattern matching in the following expression. I tried looking it up online but couldn't find an understandable explanation:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing? I understood everything up to the first asterisk, but I can't figure out what is happening after that.

Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
Debuggex Demo
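So the match starts with the mandatory Imgur URL prefix http://i.imgur.com/ followed by any characters (.*). After that comes an optional part: a literal '?' followed by any characters. Notice that this '?' is escaped as \? so that it loses its usual meaning as a quantifier. Note also that .* is greedy, so in practice it consumes the '?' and everything after it, which leaves the optional group unmatched. The pink highlights in the diagram indicate the capture groups.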

(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
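EDIT:
These groups can then be used as:
import re
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = p.match('http://i.imgur.com/abc123')  # illustrative URL; match() returns None for a non-matching string such as 'ab'
m.group(0)  # the whole match
m.group(2)  # the text captured by (.*)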
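To improve the regex, limit the engine to the characters you actually need, for example:
(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]*)?
[A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens (the range A-z would also admit punctuation such as [ and _)
\. matches a literal dot instead of any character
[^/] excludes /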
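
One way to see why the original pattern is considered poor is to run it against a URL that carries a query string. The sketch below is illustrative only: the sample URL and the tightened pattern (which swaps .* for [^?]* and escapes the dots) are assumptions, not something taken from the answers above.

```python
import re

# Sample URL with a query string (hypothetical, for illustration only).
url = 'http://i.imgur.com/abc123.jpg?1'

# Original pattern: the greedy .* runs to the end of the string,
# so the optional third group is left unmatched.
original = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
greedy_groups = original.match(url).groups()

# Assumed alternative: [^?]* stops at the first '?', which lets the
# third group capture the query string.
tightened = re.compile(r'(http://i\.imgur\.com/([^?]*))(\?.*)?')
tight_groups = tightened.match(url).groups()

print(greedy_groups)  # third group is None
print(tight_groups)   # third group is '?1'
```

With the original pattern the query string ends up inside groups 1 and 2, which is probably not what the three-group layout was meant to achieve.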

The (.*) means any character repeated any number of times, and the (\?.*)? is meant to match the query string of a URL. For example, take a search for "cat" (written here with the i.imgur.com host that the pattern expects):
http://i.imgur.com/search?q=cat
http://i.imgur.com/search is the part intended for (http://i.imgur.com/(.*)) (search specifically is the part intended for (.*)). The ?q=cat is the part intended for (\?.*)?. The ? at the end of the regex means optional, so there might or might not be a query string. There is no query string in the URL http://www.imgur.com. Be aware that .* is greedy, so as written it also swallows the query string and the optional group then matches nothing. The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.

Related

Python regex conditional, don't match if

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!

You can use a negative lookahead, (?!-\d), to rule out a match when a hyphen and a digit follow.
If the match should start at the beginning of the string, you can use the anchor ^
Note that you can write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern breaks down as follows:
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
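
As a rough check of the lookahead, the pattern above can be run over a few identifiers. This is only a sketch: the list of test identifiers is chosen here for illustration, and the pattern is applied to the raw strings rather than to strings with hyphens already stripped.

```python
import re

# Pattern from the answer above, applied to the raw identifiers.
pattern = re.compile(r'^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?')

# Illustrative inputs; the last one should be rejected.
identifiers = ['AB1201', 'AB1201T', 'AB1201L1', 'AB1201-02']

for ident in identifiers:
    m = pattern.match(ident)
    if m is None:
        print(ident, '-> no match')
    else:
        print(ident, '->', m.group(1), m.group(2))

# Collect the identifiers that need to be raised as exceptions.
rejected = [i for i in identifiers if pattern.match(i) is None]
```

Because the letter and digit counts are fixed, the engine has no shorter match to fall back on when the lookahead fails, so AB1201-02 is rejected outright.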

Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.

Filtering results of Python LDAP query

I am trying to process the results of an LDAP query using Python and the ldap module. In the returned list of LDAP objects (actually a list of lists where some elements are dictionaries) I have 'cn' attributes with values like this:
tag-<username>,
krh-<username>,
tag-<username>-ab,
tag-<username>-ac,
tag-<username>-ad,
rrt-<username>.
I would like to use just those with the exact pattern tag-<username> (starting with tag- and without -ab, -ac, or -ad at the end).
What would be the easiest way to do it? I assume matching with regular expressions but what would be the right regex to use in this case?
Thanks!

You can use this regex to filter the results: tag-([^-]+)$
The parentheses let you capture the matched username, but you don't necessarily need them.
tag- matches the characters tag- literally (case sensitive).
Capturing group ([^-]+).
[^-]+ matches a single character not present in the list below :
- matches the character - literally (case sensitive).
+ quantifier : matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any).
For example, the string tag-usertest gives a full match, capturing group usertest, and strings like tag-usertest-<any> won't match.
You can run some tests here: https://regex101.com/r/D4a7I7/1/tests
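
Applied in Python, the filter might look like the sketch below. The cn values are made-up placeholders shaped like the ones in the question, and the real values would first have to be pulled out of the LDAP result structure, which is not shown here.

```python
import re

# Hypothetical cn values following the shapes listed in the question.
cn_values = ['tag-alice', 'krh-alice', 'tag-alice-ab',
             'tag-alice-ac', 'tag-alice-ad', 'rrt-alice']

pattern = re.compile(r'tag-([^-]+)$')

# pattern.match anchors at the start of the string; the $ anchors the end.
usernames = [m.group(1) for m in map(pattern.match, cn_values) if m]
print(usernames)
```

Using match rather than search matters here: search would also accept a value that merely ends in tag-<username>.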

re.sub part of string: (?: ...) mystery [duplicate]

I have a character string:
temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'
And I need to get rid of the tabs, but only those between taxonomy terms.
I tried
re.sub(r'(?:\D{1})\t', ',', temp)
It came quite close, but also replaced the letter before tabs:
'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'
I am confused, because the re documentation for (?:...) says:
...the substring matched by the group cannot be retrieved after
performing a match or referenced later in the pattern.
The last letter was within the parenthesis, so how could it be replaced?
PS
I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp

The text matched by (?:...) does not form a capture group, unlike the text matched by (...), and therefore cannot be referred to later with a backreference such as \1. However, it's still part of the overall match, and is part of the text that re.sub() will replace.
The point of non-capturing groups is that they are slightly more efficient, and may be required in uses such as re.split() where the mere existence of capturing groups will affect the output.

According to the documentation, (?:...) specifies a non-capturing group. It explains:
Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.
What this means is that anything that matches the ... expression (in your case, the preceding letter) will not be captured as a group but will still be part of the match. The only thing special about this is that you won't be able to access the part of the input captured by this group using match.group:
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group
In contrast, (?<=...) is a positive lookbehind assertion; the regular expression will check to make sure any matches are preceded by text matching ..., but won't capture that part.
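
The difference can be seen side by side on the string from the question. The third variant, a capturing group restored with \1 in the replacement, is an extra illustration that does not appear in the answers above.

```python
import re

temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'

# Non-capturing group: the letter is still part of the match, so it is replaced too.
non_capturing = re.sub(r'(?:\D)\t', ',', temp)

# Lookbehind: the letter is only checked, not consumed, so it survives.
lookbehind = re.sub(r'(?<=\D)\t', ',', temp)

# Capturing group plus a backreference puts the letter back explicitly.
backreference = re.sub(r'(\D)\t', r'\1,', temp)

print(repr(non_capturing))
print(repr(lookbehind))
print(repr(backreference))
```

The lookbehind and backreference variants give the same result; only the non-capturing version drops the last letter of each term.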

How to search part of pattern in regex python

I can match a pattern as it is. But can I search for only part of the pattern, or do I have to pass that part separately?
e.g. pattern = r'/(\w+)/(.+?)'
I can search for this pattern using re.search and then use group to get the individual groups.
But can I search only for, say, (\w+)?
e.g.
pattern = r'/(\w+)/(.+?)'
pattern_match = re.search(pattern, string)
print(pattern_match.group(1))
Can I just search for part of the pattern, e.g. pattern.group(1) or something?

You can make any part of a regular expression optional by wrapping it in a non-capturing group followed by a ?, i.e. (?: ... )?.
pattern = r'/(\w+)(?:/(.+))?'
This will match /abc/def as well as /abc.
In both examples pattern_match.group(1) will be abc, but pattern_match.group(2) will be def in the first one and None in the second one, because the optional group did not take part in the match.
For further reference, have a look at (?:x) in the special characters table at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
EDIT
Changed the second group to (.+), since I assume you want to match more than one character. .+ is called a "greedy" match, which will try to match as much as possible. .+? on the other hand is a "lazy" match that will only match the minimum number of characters necessary. In case of /abc/def, this will only match the d from def.
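
A short sketch of both points, the optional group and the greedy versus lazy difference, using the /abc/def and /abc inputs mentioned above:

```python
import re

# Optional second part wrapped in a non-capturing group.
optional = re.compile(r'/(\w+)(?:/(.+))?')

with_second = optional.search('/abc/def').groups()
without_second = optional.search('/abc').groups()

# Lazy .+? at the end of a pattern stops after a single character.
lazy = re.search(r'/(\w+)/(.+?)', '/abc/def').group(2)

print(with_second)     # second group is 'def'
print(without_second)  # second group is None
print(lazy)            # 'd'
```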

That pattern is merely a character string; send the needed slice however you want. For instance:
re.search(pattern[:6], string)
uses only the first 6 characters of your pattern. If you need to detect the end of the first pattern -- and you have no intervening right-parens -- you can use
rparen_pos = pattern.index(')')
re.search(pattern[:rparen_pos+1], string)
Another possibility is
pat1 = r'/(\w+)'
pat2 = r'/(.+?)'
big_match = re.search(pat1+pat2, string)
small_match = re.search(pat1, string)
You can get more inventive with references to captured groups (written \1, \2, etc. in Python; $1, $2, etc. in some other tools); see the links below for more help.
http://flockhart.virtualave.net/RBIF0100/regexp.html
https://docs.python.org/2/howto/regex.html

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
import re
values = re.search(r"(.*?)\s(.*)", "24036 -977")
x = values.group(1)
y = values.group(2)
This does the job; however, I was curious why using (.*?) in the second group causes the regex to fail. I tested it in the online regex tester (https://regex101.com/r/bM2nK1/1), and adding the ? causes the second group to return nothing. As far as I know, .*? means to match any character an unlimited number of times, as few times as possible, and .* is just the greedy version of that. What confuses me is why the non-greedy version .*? takes that definition to mean capturing nothing.

Because the lazy quantifier means to match the previous token, the ., as few times as possible, which is 0 times when nothing after it forces it to take more. If you would like it to extend to the end of the string, add a $, which matches the end of the string. If you would like it to match at least one character, use + instead of *.
The reason the first group .*? matches 24036 is that you have the \s token after it, so the fewest characters the .*? could match and still be followed by a \s is 24036.
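
The three variants discussed above can be compared directly on the sample line from the question. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import re

line = '24036 -977'

# Lazy second group with nothing after it: it settles for zero characters.
lazy_groups = re.search(r'(.*?)\s(.*?)', line).groups()

# Adding $ forces the lazy group to stretch to the end of the string.
anchored_groups = re.search(r'(.*?)\s(.*?)$', line).groups()

# Using + instead of * makes it take exactly one character.
plus_groups = re.search(r'(.*?)\s(.+?)', line).groups()

print(lazy_groups)      # ('24036', '')
print(anchored_groups)  # ('24036', '-977')
print(plus_groups)      # ('24036', '-')
```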

@iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.
