Regex to extract part of the url - python

I'm trying to extract part of a url using regex. Ideally I'd like to do this in one line and have it work for both url types.
I'm trying the following but not sure how I should get the second url. I am trying to extract the 4FHP from both.
>>> import re
>>>
>>> a="/url_redirect/4FHP"
>>> b="/url/4FHP/asdfasdfas/"
>>>
>>> re.search('^\/(url_redirect|url)\/(.*)', a).group(2)
'4FHP'
>>> re.search('^\/(url_redirect|url)\/(.*)', b).group(2)
'4FHP/asdfasdfas/'

The following code will extract 4FHP from either string. Notice that I changed .* (match a sequence of any non-newline characters) to [^/]* (match a sequence of any non-/ characters).
re.search('^\/(url_redirect|url)\/([^/]*)', b).group(2)

Your problem is that the * operator is 'greedy', so .* will grab everything to the end of the string, which is why you get '4FHP/asdfasdfas/' in your second example.
You need to stop matching when you see another /. The easiest way is to use a character class that specifically excludes it, e.g. [^/].
You can also use a non-capturing group (?: <regex> ) so that only the group you're interested in is captured:
re.search('^\/(?:url_redirect|url)\/([^/]*)', b).group(1)
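Putting the pieces above together, here is a minimal sketch that handles both url types and also copes with input that does not match at all. The helper name extract_code and the third test path are made up for illustration.

```python
import re

# Non-capturing prefix group, then capture everything up to the next slash.
PATTERN = re.compile(r'^/(?:url_redirect|url)/([^/]*)')

def extract_code(path):
    """Return the first path segment after the prefix, or None if no match."""
    match = PATTERN.search(path)
    return match.group(1) if match else None

print(extract_code("/url_redirect/4FHP"))     # 4FHP
print(extract_code("/url/4FHP/asdfasdfas/"))  # 4FHP
print(extract_code("/other/4FHP"))            # None
```

Checking the match object before calling group() avoids an AttributeError when the path has an unexpected prefix.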

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had the few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but that unfortunately it goes over my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself we have to escape it; hence \\
.* matches any character, zero or more times.
? makes the preceding .* non-greedy, so the match stops at the first \feature\ that follows \App\.
Note that .*? can itself match \ characters, so the result is a single directory name only if there are no \ characters between \App\ and \feature\.
The full code would be something like:
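import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# start and end each begin and end with a backslash; the extra \ on either side of {} escapes them in the regex
pattern = rf"(?<=\{start}\).*?(?=\{end}\)"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module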
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
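As a variation on the lookaround approach, the sketch below builds the pattern with re.escape instead of adding the extra backslashes by hand, so the delimiters can contain any characters. The function name between is hypothetical, and the delimiters are assumed to be plain literal strings.

```python
import re

def between(path, start, end):
    """Return the shortest text between `start` and `end`, or None."""
    # re.escape makes the backslashes (and any other special characters) literal.
    pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
    match = re.search(pattern, path)
    return match.group(0) if match else None

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'
print(between(path, '\\App\\', '\\feature\\'))  # Module
```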
We can also do this with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
print(path[path.find(start)+len(start):path.rfind(end)])
Output:
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a match group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
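For reference, a minimal sketch of the capture-group approach in Python. It assumes the pattern is written as a raw string, where \\ matches one literal backslash, so it uses fewer backslashes than the expression above.

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# In a raw string, \\ in the pattern matches one literal backslash.
pattern = re.compile(r'App\\([A-Za-z0-9]*)\\feature')

match = pattern.search(path)
if match:
    print(match.group(0))  # App\Module\feature
    print(match.group(1))  # Module
```

group(0) is the whole match including the delimiters, while group(1) is only the captured name.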

how to replace multiple consecutive repeating characters into 1 character in python?

I have a string in python and I want to replace each run of consecutive repeated characters with a single character.
For example:
st = "UUUURRGGGEENNTTT"
print(st.replace(r'(\w){2,}',r'\1'))
But this command doesn't seem to be working. Can anybody help me find what's wrong with it?
There is another way to solve this, but I want to understand why the above command fails and whether there is any way to correct it:
print(re.sub(r"([A-Z])\1+",r"\1",st)) # prints URGENT
You need to use a regex, for example:
import re
re.sub(r'[^\w\s]|(.)(?=\1)', '', 'UUURRRUU')
The result is 'URU'.
Here is how the (.)(?=\1) part of the regex works:
(.) matches any character except a line break and captures it as group 1
(?=\1) is a lookahead: it succeeds only if the next character is the same as the one captured in group 1, without consuming it
[^\w\s] is the other alternative; it matches any character that is neither a word character nor whitespace
All matches are then replaced with '', so only the last character of each run is kept.
You can also read up on lookahead, and check the tool I used to work out this regex; it describes everything and you can learn a lot from it:
regexer
Your code does not work because str.replace does not support regex; it can only replace a literal substring with another string. You will need to use the re module if you want to replace by matching a regex pattern.
Secondly, your regex pattern is also incorrect: (\w){2,} matches any run of 2 or more word characters (they don't have to be the same character), so it will not work. You will need to do something like this:
import re
st = "UUUURRGGGEENNTTT"
print(re.sub(r'(\w)\1+', r'\1', st))
# URGENT
Now this will only match the same character 2 or more times.
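To make the difference between the two patterns concrete, here is a small side-by-side sketch using re.sub on the string from the question:

```python
import re

st = "UUUURRGGGEENNTTT"

# (\w){2,} matches any run of 2+ word characters; the group keeps only the last one.
collapsed_wrong = re.sub(r'(\w){2,}', r'\1', st)
print(collapsed_wrong)  # T

# (\w)\1+ matches one character followed by 1+ copies of that same character.
collapsed_right = re.sub(r'(\w)\1+', r'\1', st)
print(collapsed_right)  # URGENT
```

With the first pattern the whole string is one match, and a repeated group only remembers its final iteration, which is why a single T comes back.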
An alternative, “unique” solution to this is that you can use the unique_justseen recipe that itertools provides:
from itertools import groupby
from operator import itemgetter
st = "UUUURRGGGEENNTTT"
new ="".join(map(next, map(itemgetter(1), groupby(st))))
print(new)
# URGENT
string.replace(s, old, new[, maxreplace]) only does substring replacement:
>>> '(\w){2,}'.replace(r'(\w){2,}',r'\1')
'\\1'
That's why it fails: str.replace cannot work with a regular expression, so there is no way to correct the first command without switching to re.

Regular expression to filter out URLs with a literal dot after the last slash

I need the regex to identify urls that after the last forward slash
have a literal dot, such as
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
do not have a literal dot, such as
http://www.example.es/cat1/cat2/cat3
So far I have only found the regular expressions for matching everything before the last forward slash, ^(.*[\\\/]), or after it, [^/]+$, as well as one to match everything after a literal dot after the last slash, (?!.*\.)(.*). Yet I am unable to come up with the above; please help.
\/([^\/]*\.+[^\/]*)$
The first / anchors the match at a slash. The $ forces the end of the string, and
both negated character classes prevent any / in between.
Check it at https://regex101.com/
Well, as usual, using a regex to match an URL is the wrong tool for the wrong job. You can use urlparse (or urllib.parse in python3) to do the job, in a very pythonic way:
>>> from urlparse import urlparse
>>> urlparse('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/some-example_DH148439', params='', query='', fragment='.Rh1-js_4')
>>> urlparse('http://www.example.es/cat1/cat2/cat3')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/cat3', params='', query='', fragment='')
and if you really want a regex, the following regex is an example that would answer your question:
>>> import re
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4') != None
True
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/cat3') != None
True
The regex I'm giving is good enough to answer your question, but it is not a good way to validate a URL or to split it into pieces. I'd say its only interest is that it actually answers your question.
Here's the automaton generated by the regex, to better understand it:
Beware of what you're asking, because JL's regex won't match:
http://www.example.es/cat1/cat2/cat3
as after rereading your question 3×, you're actually asking for the following regex:
\/([^/]*)$
which will match both your examples:
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
http://www.example.es/cat1/cat2/cat3
What @jl-peyret suggests is only how to match a literal dot following a /, which generates the following automaton:
So, whatever you really want:
use urlparse whenever you can to match parts of an URL
if you're trying to define a django route, then trying to match the fragment is hopeless
next time you ask a question, please make it precise, and give an example of what you tried: help us help you.
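As a rough Python 3 sketch of the urlparse route: split the URL first, then look for the dot in the last path segment or in the fragment. The helper name last_segment_and_fragment is made up here, and whether the fragment should count as "after the last slash" is an assumption about what the question wants.

```python
from urllib.parse import urlparse

def last_segment_and_fragment(url):
    """Return (last path segment, fragment) for a URL."""
    parts = urlparse(url)
    last_segment = parts.path.rsplit('/', 1)[-1]
    return last_segment, parts.fragment

def has_dot(url):
    segment, fragment = last_segment_and_fragment(url)
    return '.' in segment or '.' in fragment

print(has_dot('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4'))  # True
print(has_dot('http://www.example.es/cat1/cat2/cat3'))                             # False
```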
I would use a look-ahead like so
(?=.*\.)([^/]+$)
Demo
(?= # Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
\. # "."
) # End of Look-Ahead
( # Capturing Group (1)
[^/] # Character not in [/] Character Class
+ # (one or more)(greedy)
$ # End of string/line
) # End of Capturing Group (1)
or a negative look-ahead like so
(?!.*\.)([^/]+$)
for the opposite case
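One caveat when using these with re.search: the unanchored negative look-ahead can start after the last dot and still match part of the first URL. The sketch below is a variation, not the patterns above verbatim. It anchors both look-aheads on the last slash and restricts them to [^/]*; the names HAS_DOT, NO_DOT and classify are made up for illustration.

```python
import re

WITH_DOT = 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4'
WITHOUT_DOT = 'http://www.example.es/cat1/cat2/cat3'

# Anchoring on the last slash keeps the check tied to the final segment.
HAS_DOT = re.compile(r'/(?=[^/]*\.)([^/]+)$')
NO_DOT = re.compile(r'/(?![^/]*\.)([^/]+)$')

def classify(url):
    """Return 'dot', 'no dot', or None if the URL ends with a slash."""
    if HAS_DOT.search(url):
        return 'dot'
    if NO_DOT.search(url):
        return 'no dot'
    return None

print(classify(WITH_DOT))     # dot
print(classify(WITHOUT_DOT))  # no dot

# Without the leading slash, re.search can start the negative look-ahead
# after the last dot and still find a match in the first URL.
print(re.search(r'(?!.*\.)([^/]+$)', WITH_DOT).group(1))  # Rh1-js_4
```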

How to fix my nonworking Python regex match?

I want to grab the whole number out of this string <some>some 344.3404.3 numbers<tag>.
Using the Pythex emulator website this works with [\d\.]* (a digit or point repeated zero or more times).
In Python I get back the whole string:
Input:
import re
re.match(r'[\d\.]*', '<some>some 344.3404.3 numbers<tag>').string
Output:
'<some>some 344.3404.3 numbers<tag>'
What am I missing?
Running python 3.3.5, win7, 64bit.
The string attribute of a regex match object contains the input string of the match, not the matched content.
If you want the (first) matching part, you need to change three things:
use re.search() because re.match() will only find a match at the start of the string,
access the group() method of the match object,
use + instead of * or you'll get an empty (zero-length) match unless the match happens to be at the start of the string.
Therefore, use
>>> re.search(r'[\d.]+', '<some>some 344.3404.3 numbers<tag>').group()
'344.3404.3'
or
>>> re.findall(r'[\d.]+', '<some>some 344.3404.3 numbers more 234.432<tag>')
['344.3404.3', '234.432']
if you expect more than one match.
You can use this:
re.search(r'[\d.]+', '<some>some 344.3404.3 numbers<tag>').group()
Notes: Your pattern didn't work because [\d.]* will match the empty string at the first position. This is why I have replaced the quantifier with + and changed the method from match to search.
There is no need to escape the dot inside a character class, since it is seen by default as a literal character.
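The three points above can be seen side by side in this short sketch, which uses the string from the question:

```python
import re

text = '<some>some 344.3404.3 numbers<tag>'

# match() only tries position 0; with * it succeeds there with an empty match.
m = re.match(r'[\d.]*', text)
print(repr(m.group()))   # ''
print(m.string == text)  # True: .string is the input, not the matched text

# With + there is nothing to match at position 0, so match() returns None.
print(re.match(r'[\d.]+', text))  # None

# search() scans forward until the pattern matches somewhere.
found = re.search(r'[\d.]+', text).group()
print(found)  # 344.3404.3
```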

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if your regex pattern has more than one capture group, the findall method returns a list of tuples, each containing the captured groups for one match; the whole match (group(0)) is not included.
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S.: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that it is not re-compiled on every iteration. Use re.compile for that.
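To see how the number of capture groups changes what findall returns, here is a small sketch. It runs against an inline sample string rather than the ex06-11.html file, whose contents are not shown here.

```python
import re

s = '<div id="header">Some random text</div>'

no_group = re.compile(r"""<div[^>]*id\s*=\s*["']header["'][^>]*>.*?</div>""")
one_group = re.compile(r"""<div[^>]*id\s*=\s*["']header["'][^>]*>(.*?)</div>""")
two_groups = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")

print(no_group.findall(s))    # ['<div id="header">Some random text</div>']
print(one_group.findall(s))   # ['Some random text']
print(two_groups.findall(s))  # [('"', 'Some random text')]

# finditer yields match objects, so a single group can be picked per match.
texts = [m.group(2) for m in two_groups.finditer(s)]
print(texts)                  # ['Some random text']
```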
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they reference back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print(p.search(s).group(2))
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
