Regex - Python matching between string and first occurrence - python

I'm having a hard time grasping regex no matter how much documentation I read. I'm trying to match everything between a string and the first occurrence of &. This is what I have:
link = "group.do?sys_id=69adb887157e450051e85118b6ff533c&&"
rex = re.compile("group\.do\?sys_id=(.?)&")
sysid = rex.search(link).groups()[0]
I'm using https://regex101.com/#python to help me validate my regex, and I can kind of get rex = re.compile("user_group.do?sys_id=(.*)&") to work, but the .* is greedy and matches up to the last &, and I'm looking to match up to the first &.
I thought .? matches zero or one time.

You don't necessarily need regular expressions here. Use urlparse instead:
>>> from urlparse import urlparse, parse_qs
>>> parse_qs(urlparse(link).query)['sys_id'][0]
'69adb887157e450051e85118b6ff533c'
In case of Python 3 change the import to:
from urllib.parse import urlparse, parse_qs

You can simply match up to the && instead of the final &, like so:
import re
link = "user_group.do?sys_id=69adb887157e450051e85118b6ff533c&&"
rex = re.compile("user_group\.do\?sys_id=(.*)&&")
sysid = rex.search(link).groups()[0]
print(sysid)

.* is greedy, but .*? is not.
.? only matches any character zero or one time, while .*? matches as few characters as possible, stopping at the earliest point where the rest of the pattern matches. I hope that explains it.
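As a quick check of the difference, here is a minimal sketch that runs the quantifiers against the link from the question. The variable names are illustrative only, and the negated character class [^&]* is an extra alternative that the answers above do not mention.

```python
import re

link = "group.do?sys_id=69adb887157e450051e85118b6ff533c&&"
prefix = r"group\.do\?sys_id="

# Greedy: .* grabs as much as it can, so it runs up to the LAST &.
greedy = re.search(prefix + r"(.*)&", link).group(1)

# Lazy: .*? grabs as little as it can, so it stops before the FIRST &.
lazy = re.search(prefix + r"(.*?)&", link).group(1)

# .? allows at most one character before the &, so nothing matches here.
optional = re.search(prefix + r"(.?)&", link)

# Negated class: "anything that is not &", which needs no backtracking.
negated = re.search(prefix + r"([^&]*)", link).group(1)

print(greedy)
print(lazy)
print(optional)
print(negated)
```

The lazy and negated-class versions both return the bare sys_id, the greedy one keeps a trailing &, and the .? attempt finds no match at all.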

Related

Modify variable in django template by rendering middle of variable

I have a variable key.links.self (from JSON output) in a template, which is a URL:
https://ahostnamea.net:666/api/v1/
Now what I would like to do is render only ahostnamea from this variable in the template.
I know it is possible to cut off leading characters when the prefix always has the same length (https:// = 8 characters), but the rest is not that simple because its length varies.
Is there any way to split/cut the string from / to . ? Or any other way?
You could use a pattern with a capturing group and a negated character class [^.]+ matching any char except a dot.
https?://([^.]+)
For example
import re
regex = r"https?://([^.]+)"
test_str = "https://ahostnamea.net:666/api/v1/"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Result
ahostnamea
Edit
As suggested you could also use urllib.parse to get the hostname.
from urllib.parse import urlparse
o = urlparse("https://ahostnamea.net:666/api/v1/")
Then you could get the first part by, for example, splitting on a dot:
s = o.hostname.split('.', 1)[0]
print(s)
Result
ahostnamea
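The two steps above can be wrapped in one small helper. This is only a sketch: the function name first_host_label is hypothetical, and it is plain Python, so wiring it into Django (for example as a custom template filter) is left as an assumption about your project.

```python
from urllib.parse import urlparse


def first_host_label(url):
    """Return the part of the hostname before the first dot, or '' if none."""
    hostname = urlparse(url).hostname  # None when the URL has no network location
    if not hostname:
        return ""
    return hostname.split(".", 1)[0]


print(first_host_label("https://ahostnamea.net:666/api/v1/"))
```

Using hostname rather than netloc drops the port, so :666 never reaches the split.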
A proper solution would be {{ request.META.HTTP_HOST }}

Regular expression to filter out URLs with a literal dot after the last slash

I need the regex to identify urls that after the last forward slash
have a literal dot, such as
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
do not have a literal dot, such as
http://www.example.es/cat1/cat2/cat3
So far I have only found the regular expression for matching everything before the last forward slash, ^(.*[\\\/]), or after it, [^/]+$, as well as one to match everything after a literal point after the last slash, (?!.*\.)(.*). Yet I am unable to come up with the above. Please help.
\/([^\/]*\.+[^\/]*)$
The first / forces you to look after it. The $ forces end of string and
both class negations avoid any / between.
Check it at https://regex101.com/
Well, as usual, using a regex to match an URL is the wrong tool for the wrong job. You can use urlparse (or urllib.parse in python3) to do the job, in a very pythonic way:
>>> from urlparse import urlparse
>>> urlparse('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/some-example_DH148439', params='', query='', fragment='.Rh1-js_4')
>>> urlparse('http://www.example.es/cat1/cat2/cat3')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/cat3', params='', query='', fragment='')
and if you really want a regex, the following regex is an example that would answer your question:
>>> import re
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4') is not None
True
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/cat3') is not None
True
The regex I'm giving is good enough to answer your question, but it is not a good way to validate a URL or to split it into pieces. I'd say its only interest is to actually answer your question.
Here's the automaton generated by the regex, to better understand it: (diagram not reproduced)
Beware of what you're asking, because JL's regex won't match:
http://www.example.es/cat1/cat2/cat3
as after rereading your question 3×, you're actually asking for the following regex:
\/([^/]*)$
which will match both your examples:
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
http://www.example.es/cat1/cat2/cat3
What @jl-peyret suggests only matches a literal dot following a /, which generates the following automaton: (diagram not reproduced)
So, whatever you really want:
use urlparse whenever you can to match parts of an URL
if you're trying to define a django route, then trying to match the fragment is hopeless
next time you ask a question, please make it precise and give an example of what you tried: help us help you.
I would use a look-ahead like so
(?=.*\.)([^/]+$)
(?= # Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
\. # "."
) # End of Look-Ahead
( # Capturing Group (1)
[^/] # Character not in [/] Character Class
+ # (one or more)(greedy)
$ # End of string/line
) # End of Capturing Group (1)
or a negative look-ahead like so
(?!.*\.)([^/]+$)
for the opposite case
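One caveat worth checking: with re.search, the negative look-ahead pattern as written can still match the tail after the last dot (Rh1-js_4 in the first example URL), because the search is free to start in the middle of the last segment. A minimal sketch that anchors on the last slash avoids this. The function name is hypothetical, and the pattern is a simplified variant of the \/([^\/]*\.+[^\/]*)$ answer above.

```python
import re

# "/" followed by a slash-free run that contains a dot, up to end of string.
DOT_AFTER_LAST_SLASH = re.compile(r"/[^/]*\.[^/]*$")


def has_dot_after_last_slash(url):
    """True when the text after the final '/' contains a literal dot."""
    return DOT_AFTER_LAST_SLASH.search(url) is not None


with_dot = "http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4"
without_dot = "http://www.example.es/cat1/cat2/cat3"

print(has_dot_after_last_slash(with_dot))
print(has_dot_after_last_slash(without_dot))

# The unanchored negative look-ahead still finds a match in the dotted URL:
tail = re.search(r"(?!.*\.)([^/]+$)", with_dot)
print(tail.group(1))
```

Negating the result of has_dot_after_last_slash gives the "no dot" case without a second pattern.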

How to extract slug from URL with regular expression in Python?

I'm struggling with Python's re. I don't know how to solve the following problem in a clean way.
I want to extract a part of a URL.
What I tried so far:
url = 'http://www.example.com/this-2-me-4/123456-subj'
m = re.search('/[0-9]+-', url)
m = m.group(0).rstrip('-')
m = m.lstrip('/')
This leaves me with the desired output 123456, but I feel this is not the proper way to extract the slug.
How can I solve this quicker and cleaner?
Use a capturing group by putting parentheses around the part of the regex that you want to capture (...). You can get the contents of a capturing group by passing in its number as an argument to m.group():
>>> m = re.search('/([0-9]+)-', url)
>>> m.group(1)
'123456'
From the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
You may want to use urllib.parse combined with a capturing group for mildly cleaner code.
import urllib.parse, re
url = 'http://www.example.com/this-2-me-4/123456-subj'
parsed = urllib.parse.urlparse(url)
path = parsed.path
slug = re.search(r'/([\d]+)-', path).group(1)
print(slug)
Result:
123456
In Python 2, use urlparse instead of urllib.parse.
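Note that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on URLs without a numeric slug. A minimal sketch that guards against this follows; the function name extract_numeric_slug is hypothetical.

```python
import re
from urllib.parse import urlparse


def extract_numeric_slug(url):
    """Return the digits before the first '-' in a path segment, or None."""
    path = urlparse(url).path
    match = re.search(r"/(\d+)-", path)
    # Guard: re.search gives None when the pattern is absent.
    return match.group(1) if match else None


print(extract_numeric_slug("http://www.example.com/this-2-me-4/123456-subj"))
print(extract_numeric_slug("http://www.example.com/no-digits/"))
```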
If you want to find all the slugs available in a URL, you can use this code:
from slugify import slugify
url = "https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/".split("/")
for i in url:
    # drop any query string attached to this path segment
    i = i.split("?")[0] if "?" in i else i
    if "-" in i and slugify(i) == i:
        print(i)
This prints:
real-poutine
some-name

re.search greedy matching all combinations (letters, special characters, numbers)

I am trying to find a pattern using re.search
How do I search a url like
blah&match=Z-300&
and get what comes after match=
So in this case,
I want to get Z-300
As tends to be my favorite answer to regex questions, don't use regex.
Use urlparse. (in py3, urllib.parse)
from urlparse import parse_qs
parse_qs('blah&match=Z-300&')
Out[22]: {'match': ['Z-300']}
import re
s = 'blah&match=Z-300&'
print(re.search('&match=(.*)', s).group(1))   # Z-300&
print(re.search('&match=(.*)&', s).group(1))  # Z-300
match = re.search('&match=(.*?)&', text).group(1)
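The patterns above behave differently once the string has more parameters or no trailing &. The sketch below compares them on two made-up inputs (the other=1 parameter is an assumption for illustration) and adds a negated character class, [^&]*, as a further option.

```python
import re

more_params = 'blah&match=Z-300&other=1&'
no_trailing = 'blah&match=Z-300'

# Greedy .* runs to the last &, swallowing the next parameter.
greedy = re.search(r'&match=(.*)&', more_params).group(1)

# Lazy .*? stops at the first &, but needs an & to be present at all.
lazy = re.search(r'&match=(.*?)&', more_params).group(1)
lazy_no_amp = re.search(r'&match=(.*?)&', no_trailing)

# [^&]* stops at the first & or at the end of the string.
negated = re.search(r'&match=([^&]*)', more_params).group(1)
negated_no_amp = re.search(r'&match=([^&]*)', no_trailing).group(1)

print(greedy)
print(lazy, lazy_no_amp)
print(negated, negated_no_amp)
```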

How do I use Regex to find the ID in a YouTube link?

When I try to extract this video ID (AIiMa2Fe-ZQ) with a regular expression, I can't get the dash and all the letters after it.
>>> id = re.search('(?<=\?v\=)\w+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
>>> print(id.group(0))
AIiMa2Fe
Instead of \w+ use the pattern below. The word character class (\w) doesn't include a dash; it only includes [a-zA-Z_0-9].
[\w-]+
I don't know the pattern for youtube hashes, but just include the "-" in the possibilities as it is not considered an alpha:
import re
id = re.search('(?<=\?v\=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
print(id.group(0))
I have edited the above because as it turns out:
>>> re.search("[\w|-]", "|").group(0)
'|'
The "|" in the character definition does not act as a special character but does indeed match the "|" pipe. My apologies.
>>> re.search('(?<=v=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group()
'AIiMa2Fe-ZQ'
\w is shorthand for [a-zA-Z0-9_] in Python 2.x; in Python 3 it also matches Unicode word characters unless you pass the re.A flag. Your video ID clearly contains an additional character, a hyphen. I've also removed the redundant escape backslashes from the lookbehind.
Use the urlparse module instead of regex for such kind of things.
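import urlparse  # Python 2; in Python 3 use urllib.parse
url = 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ'
parsed_url = urlparse.urlparse(url)
if parsed_url.netloc.find('youtube.com') != -1 and parsed_url.path == '/watch':
    # the ID normally lives in the query string: watch?v=...
    video = urlparse.parse_qs(parsed_url.query).get('v', None)
    if video is None:
        # fall back to the fragment form: watch#!v=...
        video = urlparse.parse_qs(parsed_url.fragment.strip('!')).get('v', None)
    if video is not None:
        print(video[0])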
EDIT: Updated for the upcoming new youtube url format.
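For Python 3, the same idea could look roughly like the sketch below. It is an adaptation, not the answerer's code: the function name youtube_video_id is hypothetical, and the host check is kept as loose as in the snippet above.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the v= parameter of a YouTube watch URL, or None."""
    parsed = urlparse(url)
    if 'youtube.com' not in parsed.netloc or parsed.path != '/watch':
        return None
    # Normal form: /watch?v=ID
    video = parse_qs(parsed.query).get('v')
    if video is None:
        # Fragment form: /watch#!v=ID
        video = parse_qs(parsed.fragment.strip('!')).get('v')
    return video[0] if video else None


print(youtube_video_id('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ'))
print(youtube_video_id('http://www.youtube.com/watch#!v=AIiMa2Fe-ZQ'))
```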
/(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)/
Explain the RE
There are three alternate YouTube formats: /v/[ID] and watch?v= and the new AJAX watch#!v= This RE captures all three. There is also new YouTube URL for user pages that is of the form /user/[user]?content={complex URI} This is not captured here by any regex...
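The regex above is written with / delimiters, which Python's re module does not use. A sketch of how it might be applied in Python, with the delimiters dropped, follows; the /v/ and watch#!v= URLs are constructed here for illustration from the formats described above.

```python
import re

# Same alternation as above, without the surrounding / delimiters.
YOUTUBE_ID = re.compile(r'(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)')

urls = [
    'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ',
    'http://www.youtube.com/v/AIiMa2Fe-ZQ',
    'http://www.youtube.com/watch#!v=AIiMa2Fe-ZQ',
]

ids = []
for url in urls:
    match = YOUTUBE_ID.search(url)
    ids.append(match.group(1) if match else None)

print(ids)
```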
I'd try this:
>>> import re
>>> a = re.compile(r'.*(\-\w+)$')
>>> a.search('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group(1)
'-ZQ'
