Regex URL Help: Word or Phrase - python

I am an absolute noob at regex (I kind of know the basics) and need help matching a word or a phrase in a URL. If it is a phrase, the words are separated by hyphens (-).
This is my current regex, which only matches one word:
r'^streams/search/(?P<stream_query>\w+)/$'
The ?P<stream_query> part just names the group so that the URL can take a parameter.
Extra note: I am using python re module with the Django urls.py
Any suggestions?
Here are some examples:
game
gsl
starcraft-2014
final-fantasy-iv
word1-word2-word-3
Updated explanation:
I basically need a regular expression to expand the current one, so inside the same regex, no other one:
r'^streams/search/(?P<stream_query>\w+)/$'
So include the new regex INSIDE this one, where (?P<stream_query>\w+) is any word that Django considers a parameter (and is passed into a function).
URL definition, which includes the regex:
url(r'^streams/search/(?P<stream_query>\w+)/$', 'stream_search', name='stream_search')
Then, Django passes that parameter into the stream_search function, which takes that parameter:
def stream_search(request, stream_query):
    # here I manipulate the stream_query string, i.e. removing the hyphens
So, once again, I need a regex that matches a word or a phrase, which is then passed into the stream_query parameter (or, if necessary, a second one).
So, what I want stream_query to have is:
word1
or
word1-word2-word3

If I understand your question correctly, you might not have to use regexes at all.
Based on your example:
example.com/streams/search/rocket-league-fsdfs-fsdfs
It seems that the term you want to deal with is always found after the last /. So you can rsplit and then check for -. Here is an example:
url = "example.com/streams/search/rocket-league-fsdfs-fsdfs"
# rsplit("/", 1) gives ["example.com/streams/search", "rocket-league-fsdfs-fsdfs"]
result = url.rsplit("/", 1)[-1]
# result == "rocket-league-fsdfs-fsdfs"
if "-" in result:
    pass  # phrase: do whatever you want with the string
else:
    pass  # single word: do whatever you want with the string
or a regex that would match either word or word-word-word would be: [\w-]+
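For the Django route in the question, that character class can replace \w+ inside the named group. Here is a minimal sketch that checks the idea with the re module alone; the sample paths come from the question's examples, and Django is not needed to try it:

```python
import re

# The question's pattern with \w+ widened to [\w-]+ so hyphens are accepted.
pattern = re.compile(r'^streams/search/(?P<stream_query>[\w-]+)/$')

for path in ['streams/search/game/', 'streams/search/final-fantasy-iv/']:
    match = pattern.match(path)
    # group('stream_query') is the value Django would pass to the view
    print(match.group('stream_query'))

# Inside the view, the hyphens can then be turned back into spaces
words = 'final-fantasy-iv'.replace('-', ' ')
```

Note that [\w-]+ also accepts leading, trailing or doubled hyphens; if that matters, something stricter such as \w+(?:-\w+)* could be used instead.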

Try this,
import re
url = "http://example.com/something?id=123&action=yes"
regex = r"(\w+)=(\w+)"
re.findall(regex, url)
You can also use Python's urlparse function (in the urlparse module on Python 2, in urllib.parse on Python 3):
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
parsed = urlparse("http://example.com/something?id=123&action=yes")
Calling urlparse returns:
ParseResult(scheme='http', netloc='example.com', path='/something', params='', query='id=123&action=yes', fragment='')
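If the goal is to read individual query parameters, the query string from the parsed result can be handed to parse_qs from the same module. A short sketch, assuming Python 3 (on Python 2 both functions live in the urlparse module):

```python
from urllib.parse import urlparse, parse_qs

parsed = urlparse("http://example.com/something?id=123&action=yes")
# parse_qs maps each parameter name to a list of its values
params = parse_qs(parsed.query)
print(params)           # {'id': ['123'], 'action': ['yes']}
print(params["id"][0])  # 123
```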

Related

use dynamic int variable inside regex pattern python

I'm in my initial days of learning Python; sorry if this question has already been asked.
I'm writing here because the existing answers didn't help me. My requirement is to read a file and print all the URLs in it. Inside a for loop, the regex pattern I used is [^https://][\w\W]*, and it worked fine. But I wanted to know whether I can dynamically pass the length of the part of the line that comes after https:// and use it as the number of occurrences instead of *.
I tried [^https://][\w\W]{var}} where var = len(line) - len('https://').
These are some other patterns I tried:
pattern = '[^https://][\w\W]{'+str(int(var))+'}'
pattern = r'[^https://][\w\W]{{}}'.format(var)
pattern = r'[^https://][\w\W]{%s}'%var
I might be misunderstanding your question, but if you know that the url is always starting with https:// then that would be the first eight characters. Then you can get the length after finding the urls:
# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
    print(url[8:])
Out
stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
Instead of a for loop you could find all urls using re.findall
import re
url_pattern = r"((https://)([\w-]+\.)+[\w-]+([\w%/~+#.-]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)
# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))
# print the urls
print(unique_urls)
In your pattern you use [^https://], which is a negated character class: it matches any single character except the ones listed (h, t, p, s, : and /), not the literal text https://.
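A quick way to see the character-class behaviour is to try the class on its own. This is only an illustrative sketch; the sample strings are made up:

```python
import re

# [^https://] is a set: one character that is NOT h, t, p, s, ':' or '/'
char_class = re.compile(r'[^https://]')

print(char_class.match('https://example.com'))  # None: 'h' is in the excluded set
print(char_class.match('example.com').group())  # e (a single character, not a prefix)
```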
One option is to make use of literal string interpolation. Assuming your links do not contain spaces, you could use \S instead of [\w\W] as the latter variant will match any character including spaces and newlines.
\bhttps://\S{{{var}}}(?!\S)
The assertion (?!\S) at the end is a whitespace boundary to prevent partial matches and the word boundary \b will prevent http being part of a larger word.
For example
import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"
print(re.findall(pattern, lines))
Output
['https://www.test.com', 'https://thisisatestt']

Regex to extract part of the url

I'm trying to extract part of a URL using regex. Ideally I'd like to do this in one line and have it work for both URL types.
I'm trying the following but not sure how I should get the second url. I am trying to extract the 4FHP from both.
>>> import re
>>>
>>> a="/url_redirect/4FHP"
>>> b="/url/4FHP/asdfasdfas/"
>>>
>>> re.search('^\/(url_redirect|url)\/(.*)', a).group(2)
'4FHP'
>>> re.search('^\/(url_redirect|url)\/(.*)', b).group(2)
'4FHP/asdfasdfas/'
The following code will extract 4FHP from either string. Notice that I changed .* (match a sequence of any non-newline characters) to [^/]* (match a sequence of any non-/ characters).
re.search(r'^\/(url_redirect|url)\/([^/]*)', b).group(2)
Your problem is that the * operator is 'greedy', so it grabs everything up to the end of the string, which is why you get '4FHP/asdfasdfas/' in your second example.
You need to stop matching when you see another /. The easiest way is to use a character class that specifically excludes it, e.g. [^/].
You can also use a non-capturing group (?:...) so that only the group you're interested in is returned:
re.search(r'^\/(?:url_redirect|url)\/([^/]*)', b).group(1)
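Putting both suggestions together, a self-contained sketch that runs against the two sample paths from the question might look like this (the variable name pattern is just for illustration):

```python
import re

# Non-capturing alternation for the prefix; [^/]* stops at the next slash.
pattern = re.compile(r'^/(?:url_redirect|url)/([^/]*)')

for path in ["/url_redirect/4FHP", "/url/4FHP/asdfasdfas/"]:
    print(pattern.search(path).group(1))  # 4FHP both times
```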

Regular expression to filter out URLs with a literal dot after the last slash

I need a regex to identify URLs that, after the last forward slash, either
have a literal dot, such as
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
do not have a literal dot, such as
http://www.example.es/cat1/cat2/cat3
So far I have only found regular expressions for matching everything before the last forward slash, ^(.*[\\\/]), or after it, [^/]+$, as well as one to match everything after a literal dot after the last slash, (?!.*\.)(.*). Yet I am unable to come up with the above. Please help.
\/([^\/]*\.+[^\/]*)$
The leading \/ anchors the match at a slash, the $ forces the end of the string, and the two negated classes rule out any further / in between, so that slash has to be the last one.
Check it at https://regex101.com/
Well, as usual, using a regex to match a URL is the wrong tool for the job. You can use urlparse (or urllib.parse in Python 3) to do the job, in a very Pythonic way:
>>> from urlparse import urlparse
>>> urlparse('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/some-example_DH148439', params='', query='', fragment='.Rh1-js_4')
>>> urlparse('http://www.example.es/cat1/cat2/cat3')
ParseResult(scheme='http', netloc='www.example.es', path='/cat1/cat2/cat3', params='', query='', fragment='')
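Building on the parsed result, here is a hedged sketch of how the dot check could be done without a regex. It assumes Python 3, and dot_locations is a hypothetical helper name, not part of any library:

```python
from urllib.parse import urlparse  # Python 3 home of urlparse

def dot_locations(url):
    """Report whether the last path segment and the fragment contain a dot."""
    parsed = urlparse(url)
    last_segment = parsed.path.rsplit('/', 1)[-1]
    return '.' in last_segment, '.' in parsed.fragment

print(dot_locations('http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4'))
# (False, True): the dot lives in the fragment, not in the path
print(dot_locations('http://www.example.es/cat1/cat2/cat3'))
# (False, False)
```

Separating the path from the fragment makes it explicit which part of the URL the dot belongs to, which a single regex over the whole string does not.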
If you really want a regex, the following is an example that would answer your question:
>>> import re
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4') != None
True
>>> re.match(r'^[^:]+://([^.]+\.)+[^/]+/([^/]+/)+[^#]+(#.+)?$', 'http://www.example.es/cat1/cat2/cat3') != None
True
The regex I'm giving is good enough to answer your question, but it is not a good way to validate a URL or to split it into pieces. I'd say its only interest is that it actually answers your question.
[Figure: automaton generated by the regex above]
Beware of what you're asking, because JL's regex won't match:
http://www.example.es/cat1/cat2/cat3
After rereading your question three times, it seems you're actually asking for the following regex:
\/([^/]*)$
which will match both your examples:
http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4
http://www.example.es/cat1/cat2/cat3
What @jl-peyret suggests only matches a literal dot following a /.
[Figure: automaton generated by that regex]
So, whatever you really want:
Use urlparse whenever you can to match parts of a URL.
If you're trying to define a Django route, then trying to match the fragment is hopeless, because browsers do not send the fragment to the server.
Next time you ask a question, please make it precise and give an example of what you tried: help us help you.
I would use a look-ahead like so
(?=.*\.)([^/]+$)
(?= # Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
\. # "."
) # End of Look-Ahead
( # Capturing Group (1)
[^/] # Character not in [/] Character Class
+ # (one or more)(greedy)
$ # End of string/line
) # End of Capturing Group (1)
or a negative look-ahead like so
(?!.*\.)([^/]+$)
for the opposite case
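One thing to watch with re.search is that an unanchored look-ahead can be retried at later positions inside the last segment, so the negative form may still match the tail after the final dot. A variant anchored on the last slash may be easier to reason about. The sketch below is an adaptation under that assumption, not the patterns given above; the names HAS_DOT and NO_DOT are illustrative:

```python
import re

# Anchoring on the slash ties the look-ahead to the start of the last segment.
HAS_DOT = re.compile(r'/(?=[^/]*\.)([^/]+)$')
NO_DOT = re.compile(r'/(?![^/]*\.)([^/]+)$')

with_dot = 'http://www.example.es/cat1/cat2/some-example_DH148439#.Rh1-js_4'
without_dot = 'http://www.example.es/cat1/cat2/cat3'

print(bool(HAS_DOT.search(with_dot)), bool(NO_DOT.search(with_dot)))        # True False
print(bool(HAS_DOT.search(without_dot)), bool(NO_DOT.search(without_dot)))  # False True
```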

Python Regex for URL doesn't work

I'm using python and trying to use a regex to see whether there is a url within my string. I've tried multiple different regexes but they always come out with 'None', even if the string is clearly a website.
Example:
>>> print re.search(r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i','www.google.com')
None
Any help would be appreciated!
What about switching to something like:
r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
For a detailed survey of many, many regexes validating URLs, see https://mathiasbynens.be/demo/url-regex ...
If you want to check whether a string is a URL, you can use:
print(re.search(r'(^(https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+)$)', 'www.google.com', re.I))
If you want to check whether a string contains a URL, you only need to remove the ^ and $ anchors:
print(re.search(r'((https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+))', 'www.google.com', re.I))
Remember: re.I is for case-insensitive matching, ^ matches the beginning of the string and $ matches the end of the string (or of each line when re.M is set).
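A likely reason the pattern in the question returns None is its /.../i wrapper. That is JavaScript-style regex literal syntax; in Python the slashes and the trailing i are treated as literal characters to match. A small sketch of the difference, using the question's own pattern:

```python
import re

# As written in the question: the leading "/" and trailing "/i" are literals.
js_style = r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i'
print(re.search(js_style, 'www.google.com'))  # None

# Same pattern without the wrapper; case-insensitivity goes in the flags.
pattern = r'(?:https?://|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*'
match = re.search(pattern, 'WWW.Google.com', re.I)
print(match.group(0))  # WWW.Google.com
```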
The grammar for a valid URL has been explained here in this Wiki. Based on that, this regex matches a string if it starts with a valid URL.
^((?:https?|ftp):\/{2}[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
And in case you want to keep the scheme part of the URL optional:
^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
Output
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','www.google.com').group()
'www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','http://www.google.com').group()
'http://www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','https://www.google.com').group()
'https://www.google.com'
You can see a detailed demo and explanation of how it works here.
I've used the following regex to verify that the inserted string is a URL:
r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'

How to extract slug from URL with regular expression in Python?

I'm struggling with Python's re. I don't know how to solve the following problem in a clean way.
I want to extract a part of an URL,
What I tried so far:
url = 'http://www.example.com/this-2-me-4/123456-subj'
m = re.search('/[0-9]+-', url)
m = m.group(0).rstrip('-')
m = m.lstrip('/')
This leaves me with the desired output 123456, but I feel this is not the proper way to extract the slug.
How can I solve this quicker and cleaner?
Use a capturing group by putting parentheses around the part of the regex that you want to capture (...). You can get the contents of a capturing group by passing in its number as an argument to m.group():
>>> m = re.search('/([0-9]+)-', url)
>>> m.group(1)
'123456'
From the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
You may want to use urllib.parse combined with a capturing group for mildly cleaner code.
import urllib.parse, re
url = 'http://www.example.com/this-2-me-4/123456-subj'
parsed = urllib.parse.urlparse(url)
path = parsed.path
slug = re.search(r'/([\d]+)-', path).group(1)
print(slug)
Result:
123456
In Python 2, use urlparse instead of urllib.parse.
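If you want to find all the slugs available in a URL, you can use this code (slugify comes from a third-party package such as python-slugify):
from slugify import slugify
url = "https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/".split("/")
for i in url:
    # drop any query string attached to this path segment
    i = i.split("?")[0] if "?" in i else i
    # keep hyphenated segments that are already in slug form
    if "-" in i and slugify(i) == i:
        print(i)
This produces the following output: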
real-poutine
some-name
