How to extract slug from URL with regular expression in Python?

I'm struggling with Python's re. I don't know how to solve the following problem in a clean way.
I want to extract part of a URL.
What I tried so far:
url = 'http://www.example.com/this-2-me-4/123456-subj'
m = re.search('/[0-9]+-', url)
m = m.group(0).rstrip('-')
m = m.lstrip('/')
This leaves me with the desired output 123456, but I feel this is not the proper way to extract the slug.
How can I solve this quicker and cleaner?

Use a capturing group by putting parentheses around the part of the regex that you want to capture (...). You can get the contents of a capturing group by passing in its number as an argument to m.group():
>>> m = re.search('/([0-9]+)-', url)
>>> m.group(1)
'123456'
From the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
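
One thing the snippet above glosses over is that re.search returns None when nothing matches, so calling .group(1) directly can raise AttributeError. A minimal sketch of a guarded version follows; the names SLUG_ID and extract_id, and the group name "id", are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical helper: same pattern as above, with a named group.
SLUG_ID = re.compile(r"/(?P<id>[0-9]+)-")

def extract_id(url):
    """Return the digits between a '/' and the following '-', or None."""
    m = SLUG_ID.search(url)
    return m.group("id") if m else None

print(extract_id("http://www.example.com/this-2-me-4/123456-subj"))
print(extract_id("http://www.example.com/about"))
```

The named group is optional; m.group(1) would work just as well here.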

You may want to use urllib.parse combined with a capturing group for mildly cleaner code.
import urllib.parse, re
url = 'http://www.example.com/this-2-me-4/123456-subj'
parsed = urllib.parse.urlparse(url)
path = parsed.path
slug = re.search(r'/([\d]+)-', path).group(1)
print(slug)
Result:
123456
In Python 2, use urlparse instead of urllib.parse.

If you want to find all the slugs in a URL, you can use this code. It needs a third-party slugify package.
from slugify import slugify
url = "https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/".split("/")
for i in url:
    i = i.split("?")[0] if "?" in i else i
    if "-" in i and slugify(i) == i:
        print(i)
This produces the following output:
real-poutine
some-name
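
If you would rather not install anything, a rough standard-library approximation is possible. This sketch assumes a slug is made of lowercase letters and digits joined by hyphens, which is narrower than what a slugify library accepts, and it only looks at the path, so some-name (which sits in the query string of the URL above) is skipped. SLUG_RE and find_slugs are hypothetical names.

```python
import re
from urllib.parse import urlsplit

# Assumed slug shape: lowercase alphanumeric words joined by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)+")

def find_slugs(url):
    """Return the path segments of url that look like hyphenated slugs."""
    path = urlsplit(url).path
    return [seg for seg in path.split("/") if SLUG_RE.fullmatch(seg)]

print(find_slugs("https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/"))
```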

Related

Modify variable in django template by rendering middle of variable

I have a variable key.links.self (from JSON output) in a template, which is a URL:
https://ahostnamea.net:666/api/v1/
What I would like to do is render only ahostnamea from this variable in the template.
I know it is possible to slice off a fixed number of characters, and the prefix always has the same length (https:// = 8 characters), but the rest is not that simple because its length varies.
Is there any way to split/cut the string from / to . ? Or any other way?

You could use a pattern with a capturing group and a negated character class [^.]+ matching any char except a dot.
https?://([^.]+)
For example
import re
regex = r"https?://([^.]+)"
test_str = "https://ahostnamea.net:666/api/v1/"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Result
ahostnamea
Edit
As suggested you could also use urllib.parse to get the hostname.
from urllib.parse import urlparse
o = urlparse("https://ahostnamea.net:666/api/v1/")
Then you could get the first part by, for example, splitting on a dot:
s = o.hostname.split('.', 1)[0]
print(s)
Result
ahostnamea

A proper solution would be {{ request.META.HTTP_HOST }}

Regex URL Help: Word or Phrase

I am an absolute noob at regex (I kind of know the basics) and need help matching a word or a phrase. If it is a phrase, each word is separated by a hyphen (-).
This is my current regex, which only matches one word:
r'^streams/search/(?P<stream_query>\w+)/$'
The ?P just allows the URL to take a parameter.
Extra note: I am using python re module with the Django urls.py
Any suggestions?
Here are some examples:
game
gsl
starcraft-2014
final-fantasy-iv
word1-word2-word-3
Updated explanation:
I basically need to expand the current regular expression, inside the same regex rather than in a separate one:
r'^streams/search/(?P<stream_query>\w+)/$'
So include the new regex INSIDE this one, where (?P<stream_query>\w+) is any word that Django considers a parameter (and is passed into a function).
URL definition, which includes the regex:
url(r'^streams/search/(?P<stream_query>\w+)/$', 'stream_search', name='stream_search')
Then, Django passes that parameter into the stream_search function, which takes that parameter:
def stream_search(request, stream_query):
    # here I manipulate the stream_query string, i.e. removing the hyphens
So, once again, I need a regex to match a word or phrase that is passed into the stream_query parameter (or, if necessary, into a second one).
So, what I want stream_query to have is:
word1
or
word1-word2-word3

If I understand your question correctly, you might not need a regex at all.
Based on your example:
example.com/streams/search/rocket-league-fsdfs-fsdfs
It seems that the term you want is always found after the last /. So you can rsplit and then check for -. Here is an example:
url = "example.com/streams/search/rocket-league-fsdfs-fsdfs"
result = url.rsplit("/", 1)[-1]
# url.rsplit("/", 1) == ["example.com/streams/search", "rocket-league-fsdfs-fsdfs"]
# result == "rocket-league-fsdfs-fsdfs"
if "-" in result:
    pass  # phrase: do whatever you want with the string
else:
    pass  # single word: do whatever you want with the string
Alternatively, a regex that matches either word or word-word-word is: [\w-]+
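
To show how [\w-]+ would slot into the pattern from the question, here is a small sketch that uses only the re module, so it runs without Django. STREAM_SEARCH and parse_stream_query are hypothetical names, and splitting on hyphens is just one guess at what the view might do with the value.

```python
import re

# The question's pattern with \w+ widened to [\w-]+ so hyphenated phrases match.
STREAM_SEARCH = re.compile(r"^streams/search/(?P<stream_query>[\w-]+)/$")

def parse_stream_query(path):
    """Return the words of the stream_query parameter, or None if no match."""
    m = STREAM_SEARCH.match(path)
    if m is None:
        return None
    return m.group("stream_query").split("-")

print(parse_stream_query("streams/search/game/"))
print(parse_stream_query("streams/search/final-fantasy-iv/"))
```

Inside a character class the hyphen is literal when it comes last, so [\w-] needs no escaping.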

Try this:
import re
url = "http://example.com/something?id=123&action=yes"
regex = r"(\w+)=(\w+)"
re.findall(regex, url)
# [('id', '123'), ('action', 'yes')]
You can also use Python's urlparse library:
from urlparse import urlparse
parsed = urlparse("http://example.com/something?id=123&action=yes")
Calling urlparse returns:
ParseResult(scheme='http', netloc='example.com', path='/something', params='', query='id=123&action=yes', fragment='')
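
For Python 3, the same idea can go one step further and turn the query string into a dict. This is a sketch, not part of the original answer; query_params is a made-up helper name.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the query string of url as a dict mapping keys to value lists."""
    return parse_qs(urlsplit(url).query)

params = query_params("http://example.com/something?id=123&action=yes")
print(params["id"][0], params["action"][0])
```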

Regex returning extra, unwanted values upon searching for file names in URLS

So, if I have a string "http://www.images.com/place/folder/file_name.gif"
I want a regex that returns:
"file_name.gif"
So far I have this (in python):
re.findall(r'([\w]+\.*?(gif|jpeg|jpg|png))',f)
but it returns
( "file_name.gif" , "gif" )
What am I doing wrong?

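In your expression, you have two capture groups; keep in mind that each set of () is a capture group, and when a pattern has more than one group, re.findall returns a tuple of the groups for each match. You want to combine the file name and the extension in one capture group so that they are returned together. Try this one: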
>>> exp = r'(\w+\.\w+)$'
>>> url = 'http://www.foo.com/hello.html'
>>> re.findall(exp, url)
['hello.html']
This expression is one or more word characters, followed by a ., then one or more word characters, anchored at the end of the string.
You can further enhance this by adding your specific extensions in place of the second \w+ (use a non-capturing group, (?:gif|jpeg|jpg|png), for the alternation). As long as you keep it in one set of (), you'll get the entire result of the expression as one match.
There is a basic flaw: for a valid URL like http://www.example.com/this-link.gif, it returns only part of the file name:
>>> url = 'http://www.example.com/this-link.gif'
>>> re.findall(exp, url)
['link.gif']
This happens because \w does not include -, which is a valid character in a file name. You can fix this by adding it to a character class:
>>> exp = r'([\w-]+\.\w+)$'
>>> re.findall(exp, url)
['this-link.gif']
This is rather inelegant in that it doesn't match URLs that have a fragment or a query string.
It will also be easily fooled if your URL doesn't end in a file name:
>>> url = 'http://www.example.com/this-is-a-valid-url'
>>> re.findall(exp, url)
[]
That is because it is specifically looking for a ., but it will also be tripped up by this:
>>> url = 'http://www.example.com/this.is.a.url.gif'
>>> re.findall(exp, url)
['url.gif']
You could build on that, but since it's difficult to predict the many combinations of possible URL endings beyond the very basic, it is better to use the existing tools:
>>> import os
>>> import urlparse
>>> os.path.basename(urlparse.urlsplit(url).path)
'this.is.a.url.gif'
In Python 3, use urllib.parse.
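
For reference, a Python 3 sketch of the same approach might look like this. It uses posixpath.basename instead of os.path.basename so that the result does not depend on the operating system's path separator; url_basename is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlsplit

def url_basename(url):
    """Return the last path segment of url, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)

print(url_basename("http://www.example.com/this.is.a.url.gif?size=large#top"))
print(url_basename("http://www.images.com/place/folder/file_name.gif"))
```

Because urlsplit separates the query string and fragment first, neither ends up in the file name.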

Python Get Tags from URL

I have the following URL:
http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf
What is the best way in Python to get mytag from the string ~$AA=mytag&~?

Try this,
>>> import re
>>> str = 'http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf'
>>> m = re.search(r'.*\$AA=([^&]*)\&.*', str)
>>> m.group(1)
'mytag'
$ has a special meaning in regex (it anchors the match at the end of the string), so you have to escape it to tell the regex engine that you mean a literal $. & has no special meaning, so escaping it is harmless but not required.

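Use this regex: =(.+)&
import re
regex = r"=(.+)&"
print(re.findall(regex, "http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf")[0])
Note that .+ is greedy: if the URL contained more than one &, the group would run up to the last one. Using [^&]+ instead of .+ avoids that.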

To retrieve the mytag that comes after $AA=, you can use this simple regex:
(?<=\$AA=)[^&]+
In Python:
match = re.search(r"(?<=\$AA=)[^&]+", subject)
Explain Regex
(?<= # look behind to see if there is:
\$ # '$'
AA= # 'AA='
) # end of look-behind
[^&]+ # any character except: '&' (1 or more times
# (matching the most amount possible))

I'm just going to throw this one out there to show there are other ways of doing this:
import urlparse
url = "http://google.com/sadfasdfsd?AA=mytag&SS=sdfsdf"
query = urlparse.urlparse(url).query # Extract the query string from the full URL
parsed_query = urlparse.parse_qs(query) # Parses the query string into a dict
print parsed_query["AA"][0]
# mytag
See here: https://docs.python.org/2/library/urlparse.html for documentation on the urlparse module.
NB parse_qs maps each key to a list of values, so we use [0] to get the first one.
Also, I have assumed the question has a typo and have amended the url so that it represents a traditional query string.
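
If the $ in the question is not a typo, the same parser can still be used by splitting on the $ manually. The following is a Python 3 sketch under that assumption; params_after_dollar is a made-up helper name.

```python
from urllib.parse import parse_qs

def params_after_dollar(url):
    """Parse the key=value pairs that follow the first '$' in url."""
    _, sep, tail = url.partition("$")
    # No '$' at all means there is nothing to parse.
    return parse_qs(tail) if sep else {}

params = params_after_dollar("http://google.com/sadfasdfsd$AA=mytag&SS=sdfsdf")
print(params["AA"][0])
```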

How do I use Regex to find the ID in a YouTube link?

When I try to extract this video ID (AIiMa2Fe-ZQ) with a regular expression, I can't get the dash and all the letters after it.
>>> id = re.search('(?<=\?v\=)\w+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
>>> print id.group(0)
AIiMa2Fe

Instead of \w+, use the character class below. A word character (\w) doesn't include a dash; it only includes [a-zA-Z_0-9].
[\w-]+

I don't know the pattern for youtube hashes, but just include the "-" in the possibilities as it is not considered an alpha:
import re
id = re.search('(?<=\?v\=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
print id.group(0)
I have edited the above because as it turns out:
>>> re.search("[\w|-]", "|").group(0)
'|'
The "|" in the character definition does not act as a special character but does indeed match the "|" pipe. My apologies.

>>> re.search('(?<=v=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group()
'AIiMa2Fe-ZQ'
\w is shorthand for [a-zA-Z0-9_] in Python 2.x; in Python 3, \w also matches Unicode word characters in str patterns unless you pass the re.A (re.ASCII) flag. Your video ID clearly contains an additional character, the hyphen, so it has to be added to the character class. I've also removed the redundant escape backslashes from the lookbehind.
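
A quick way to see the difference in Python 3 is the sketch below. VIDEO_ID is a hypothetical name, and the sample string "é" is only there to show the Unicode behaviour.

```python
import re

# By default in Python 3, \w matches Unicode word characters such as "é".
unicode_match = re.fullmatch(r"\w+", "é")
# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_match = re.fullmatch(r"\w+", "é", re.ASCII)
print(unicode_match is not None, ascii_match is not None)

# The lookbehind pattern from above, compiled with re.ASCII.
VIDEO_ID = re.compile(r"(?<=v=)[\w-]+", re.ASCII)
print(VIDEO_ID.search("http://www.youtube.com/watch?v=AIiMa2Fe-ZQ").group())
```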

Use the urlparse module instead of regex for this kind of thing.
import urlparse
parsed_url = urlparse.urlparse(url)
if parsed_url.netloc.find('youtube.com') != -1 and parsed_url.path == '/watch':
    video = urlparse.parse_qs(parsed_url.query).get('v', None)
    if video is None:
        video = urlparse.parse_qs(parsed_url.fragment.strip('!')).get('v', None)
    if video is not None:
        print video[0]
EDIT: Updated for the upcoming new youtube url format.
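
A Python 3 port of this idea could look roughly like the sketch below. The function name youtube_video_id is made up, and the host check is an assumption: it is stricter than the substring test above and only accepts youtube.com and its subdomains.

```python
from urllib.parse import urlsplit, parse_qs

def youtube_video_id(url):
    """Return the v parameter of a YouTube /watch URL, or None."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_youtube = host == "youtube.com" or host.endswith(".youtube.com")
    if not is_youtube or parts.path != "/watch":
        return None
    values = parse_qs(parts.query).get("v")
    if values is None:
        # Fall back to the "#!v=..." fragment form.
        values = parse_qs(parts.fragment.lstrip("!")).get("v")
    return values[0] if values else None

print(youtube_video_id("http://www.youtube.com/watch?v=AIiMa2Fe-ZQ"))
```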

/(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)/
Explain the RE
There are three alternate YouTube formats: /v/[ID], watch?v= and the new AJAX watch#!v=. This RE captures all three. There is also a new YouTube URL for user pages, of the form /user/[user]?content={complex URI}. This is not captured here by any regex...
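
In Python, the surrounding / delimiters are dropped and the pattern goes into a raw string. A minimal sketch, with YOUTUBE_ID and find_video_id as hypothetical names:

```python
import re

# Same alternation as above, written as a Python raw string.
YOUTUBE_ID = re.compile(r"(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)")

def find_video_id(url):
    """Return the video ID captured by YOUTUBE_ID, or None if no format matches."""
    m = YOUTUBE_ID.search(url)
    return m.group(1) if m else None

for sample in (
    "http://www.youtube.com/v/AIiMa2Fe-ZQ",
    "http://www.youtube.com/watch?v=AIiMa2Fe-ZQ",
    "http://www.youtube.com/watch#!v=AIiMa2Fe-ZQ",
):
    print(find_video_id(sample))
```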

I'd try this:
>>> import re
>>> a = re.compile(r'.*(\-\w+)$')
>>> a.search('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group(1)
'-ZQ'
