I'm using Python and trying to use a regex to check whether my string contains a URL. I've tried several different regexes, but they always return None, even when the string is clearly a web address.
Example:
>>> print re.search(r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i','www.google.com')
None
Any help would be appreciated!
What about switching to something like the pattern from "Python Regex for URL doesn't work":
r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
For a detailed survey of many, many regexes validating URLs, see https://mathiasbynens.be/demo/url-regex ...
If you want to check whether a string is a URL, you can use:
print(re.search(r'(^(https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+)$)','www.google.com', re.I))
If you want to check whether a string contains a URL, you only need to remove the ^ and $ anchors:
print(re.search(r'((https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+))','www.google.com', re.I))
Remember: re.I makes the match case-insensitive, ^ matches the beginning of the string and $ matches the end of the string (or of each line when re.MULTILINE is set).
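A minimal sketch of the two checks above, written for Python 3. It also shows why the attempt in the question returns None: the /.../i wrapper is JavaScript syntax, and in Python those slashes are literal characters that the input would have to contain. The function names is_url and contains_url are hypothetical, chosen only for this illustration.

```python
import re

# The pattern from the question, including the JavaScript-style /.../i wrapper.
JS_STYLE = r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i'

# Patterns from the answer above, compiled once with the re.I flag.
URL_WHOLE = re.compile(r'^(https?://|www\.)([a-z0-9-]+\.)+[a-z0-9]+$', re.I)
URL_ANYWHERE = re.compile(r'(https?://|www\.)([a-z0-9-]+\.)+[a-z0-9]+', re.I)


def is_url(text):
    """True if the whole string looks like a URL (hypothetical helper)."""
    return URL_WHOLE.search(text) is not None


def contains_url(text):
    """True if a URL-like substring occurs anywhere (hypothetical helper)."""
    return URL_ANYWHERE.search(text) is not None


# The literal leading '/' never occurs in the input, so there is no match.
print(re.search(JS_STYLE, 'www.google.com'))
print(is_url('www.google.com'), is_url('see www.google.com now'))
print(contains_url('see www.google.com now'))
```

These patterns only cover a scheme or www. prefix followed by dotted host labels; paths and query strings are not handled.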
The grammar for a valid URL is explained in this Wiki. Based on that grammar, the following regex matches a string that starts with a valid URL.
^((?:https?|ftp):\/{2}[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
If you want to make the scheme part of the URL optional:
^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
Output
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','www.google.com').group()
'www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','http://www.google.com').group()
'http://www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','https://www.google.com').group()
'https://www.google.com'
You can see a detailed demo and an explanation of how it works here.
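One thing that may be worth checking before relying on the optional-scheme variant: because every URL-specific part is optional, it also seems to accept ordinary words of two or more characters. The sketch below (Python 3) prints what the pattern accepts for a few inputs; matched_prefix is a hypothetical helper name.

```python
import re

# Optional-scheme pattern copied from the answer above.
OPTIONAL_SCHEME = re.compile(
    r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)'
)


def matched_prefix(text):
    """Return the leading part of text that the pattern accepts, or None."""
    m = OPTIONAL_SCHEME.search(text)
    return m.group() if m else None


# Two URLs and two strings that are not URLs.
for sample in ['https://www.google.com', 'www.google.com', 'hello', 'not a url']:
    print(repr(sample), '->', repr(matched_prefix(sample)))
```

If plain words must be rejected, the variant with the mandatory scheme is probably the safer choice.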
I've used the following regex to verify that the entered string is a URL:
r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:#\-_=#])*'
The Regex I use to find the URL in Python is as follows.
import re
regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
This regex helps me find urls like:
https://google.com
http://google.com
google.com/index.php?q=test
www.test.google.com/index.php
But I also want it to accept a url like below:
//ads.google.com
What do I need to change in the regex above to allow this?
Well, for this I would never try to write my own regex; there are already plenty of them available. For URLs I use this one, which works for your case :)
regex = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))
"
I have a python script using BeautifulSoup to scrape. This is my code:
re.findall('stream:\/\/.+', link)
Which is designed to find links like:
stream://987cds9c8ujru56236te2ys28u99u2s
But it also returns strings like this:
stream://987cds9c8ujru56236te2ys28u99u2s [SD] Spanish - (9.15am)
i.e. with spaces and extra text that I don't want. How can I write the re.findall call so that it returns only the first part, the link itself?
(Thanks in advance)
You can use a non-greedy match (adding ? to the pattern) with a word boundary character '\b':
>>> re.findall(r'stream:\/\/.+?\b', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
Or if you want to match only word characters you can simply use '\w+':
>>> re.findall(r'stream:\/\/\w+', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
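A further variant, as a sketch: if the link may contain characters other than word characters (a dot or a dash, say), \S+ stops at the first whitespace instead. This assumes the link itself never contains spaces; extract_stream_links is a hypothetical name.

```python
import re

link = 'stream://987cds9c8ujru56236te2ys28u99u2s [SD] Spanish - (9.15am)'


def extract_stream_links(text):
    """Return every stream:// link, cut at the first whitespace character."""
    return re.findall(r'stream://\S+', text)


print(extract_stream_links(link))
```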
I have a set of links like:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss']
I'm trying to iterate over them to remove everything that comes after html. So I have:
cleanitems = []
for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))
Which returns:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']
Confused as to why it's including html in the capture group. Thanks for any help.
html is part of the matched text too, not just the (...) group. re.sub() replaces the whole matched text, not only the captured group.
Include the literal html text in the replacement:
cleanitems.append(re.sub(r'html(.*)', 'html', item))
or, alternatively, capture that part in a group instead:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
You may also want to include the . dot, to make sure you are really only matching the .html extension and not html elsewhere in the path. Note that re.sub() still starts at the first .html it finds; the non-greedy .*? followed by the $ end-of-string anchor simply consumes everything from there to the end of the string:
cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
However, if your goal is to remove the query string from a URL, consider parsing the URL using urllib.parse.urlparse(), and re-building it without the query string or fragment identifiers:
from urllib.parse import urlparse
cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
This won't remove the erroneous HTML chunks, however; if you are parsing these URLs from an HTML document, consider using a real HTML parser rather than regex.
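As a rough sketch of how the two steps could be combined (Python 3): first cut the item at the first character that cannot appear in a URL, then let urlparse drop the query string and fragment. The helper name clean_link and the choice of cut characters (double quote, <, whitespace) are assumptions made for this illustration, not part of the original answer.

```python
import re
from urllib.parse import urlparse


def clean_link(item):
    """Strip trailing markup, then the query string and fragment."""
    # Assumption: a URL never contains a double quote, '<' or whitespace,
    # so everything from the first such character on is leftover markup.
    url_part = re.split(r'["<\s]', item, maxsplit=1)[0]
    return urlparse(url_part)._replace(query='', fragment='').geturl()


links = [
    'http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
]
cleanitems = [clean_link(item) for item in links]
print(cleanitems)
```

Unlike the regex-only versions, this does not depend on the path ending in .html.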
Just a complement to Martijn's answer.
You could also use a lookbehind assertion to only match the text following html:
cleanitems.append(re.sub(r'(?<=html).*', '', item))
or use a replacement string to keep the initial part:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
But as Martijn already said, you'd be better off using the urllib module to parse URLs correctly.
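For comparison, here is a small Python 3 sketch that runs the substitutions discussed in this thread on one of the sample links; under these inputs they should all give the same result. The variable names are chosen only for this illustration.

```python
import re

item = ('http://www.nytimes.com/2016/12/31/nyregion/'
        'bronx-murder-40th-precinct-police-residents.html</guid>')

variants = [
    re.sub(r'html(.*)', 'html', item),   # put the literal text back
    re.sub(r'(html).*', r'\1', item),    # put the captured group back
    re.sub(r'(?<=html).*', '', item),    # lookbehind: html is never consumed
]

for cleaned in variants:
    print(cleaned)
```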
I have a string like this
http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/
I would like to extract all URLs / web addresses into an array, for example:
urls = ['http://example.com/path/topage.html','http://twitter.com/p/xyan',.....]
Here is my approach which didn't work.
import re
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)
print(links)
# the result is always the same as strings
The problem is that your regex pattern is too inclusive: it also accepts the characters that start the next URL, so everything ends up in a single match. You can stop each match with a lookahead, (?=...):
Try this:
re.findall("((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)
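Note that when a pattern contains several capturing groups, re.findall returns a list of tuples rather than plain strings. The sketch below (Python 3) applies the same lookahead idea with non-capturing groups so that each result is just the URL; split_urls is a hypothetical name.

```python
import re

strings = ("http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhs"
           "http://httpget.org/get.zipwww.google.com/privacy.html"
           "https://goodurl.net/")

# A URL starts with http://, https:// or www. and runs lazily up to the
# next such prefix (or the end of the string), checked with a lookahead.
URL_CHUNK = re.compile(r'(?:https?://|www\.).+?(?=https?://|www\.|$)')


def split_urls(text):
    """Return the concatenated URLs as a list of strings."""
    return URL_CHUNK.findall(text)


for url in split_urls(strings):
    print(url)
```

A URL such as http://www.example.com stays in one piece, because the lazy .+? must consume at least one character before the lookahead is tried.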
Your problem is that http:// is being accepted as a valid part of a url. This is because of this token right here:
[$-_#.&+]
or more specifically:
$-_
This matches all characters with the range from $ to _, which includes a lot more characters than you probably intended to do.
You can change this to [$\-_#.&+] but this causes problems since now, / characters will not match. So add it by using [$\-_#.&+/]. However, this will again cause problems since http://example.com/path/topage.htmlhttp would be considered a valid match.
The final addition is to add a lookahead to ensure that you are not matching http:// or https://, which just so happens to be the first part of your regex!
http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_#.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
tested here
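To see what the $-_ range actually covers, a quick Python 3 check can list the characters between the two endpoints; the variable name in_range is only illustrative.

```python
import re

# Every character from '$' (0x24) up to '_' (0x5F), as the range $-_ sees it.
in_range = ''.join(chr(code) for code in range(ord('$'), ord('_') + 1))
print(in_range)

# Unescaped, the class accepts ':' and '/', so "http://" is swallowed.
print(bool(re.fullmatch(r'[$-_]', ':')), bool(re.fullmatch(r'[$-_]', '/')))

# With the dash escaped, only the three literal characters match.
print(bool(re.fullmatch(r'[$\-_]', ':')), bool(re.fullmatch(r'[$\-_]', '-')))
```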
A simple answer without getting into much complication:
import re
url_list = []
# l is the input string holding the concatenated URLs
for x in re.split("http://", l):
    url_list.append(re.split("https://", x))
# flatten the list of lists into a single list
url_list = [item for sublist in url_list for item in sublist]
If you want the http:// and https:// prefixes added back to the URLs, change the code accordingly. I hope this conveys the idea.
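One possible way to keep the prefixes, sketched here under the assumption of Python 3.7 or later (where re.split accepts patterns that match the empty string): split on a lookahead, so the delimiter is never consumed. The name split_keep_scheme is hypothetical.

```python
import re

strings = ("http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhs"
           "http://httpget.org/get.zipwww.google.com/privacy.html"
           "https://goodurl.net/")


def split_keep_scheme(text):
    """Split before every http:// or https://, keeping the scheme."""
    # The lookahead matches the empty string just before each scheme,
    # so nothing is removed; empty pieces are filtered out afterwards.
    return [part for part in re.split(r'(?=https?://)', text) if part]


for url in split_keep_scheme(strings):
    print(url)
```

Like the split above, this only separates on the scheme, so the www.google.com address stays attached to the URL before it.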
Here's mine:
(r'http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+')
I am an absolute noob at regex (I kind of know the basics) and need help matching a word or a phrase. If it is a phrase, each word is separated by a hyphen (-).
This is my current regex, which only matches one word:
r'^streams/search/(?P<stream_query>\w+)/$'
The ?P<stream_query> part just names the group so that the URL can take a parameter.
Extra note: I am using python re module with the Django urls.py
Any suggestions?
Here are some examples:
game
gsl
starcraft-2014
final-fantasy-iv
word1-word2-word-3
Updated explanation:
I basically need a regular expression that extends the current one, so the new pattern goes inside the same regex rather than in a separate one:
r'^streams/search/(?P<stream_query>\w+)/$'
So the new regex goes INSIDE this one, where (?P<stream_query>\w+) is any word that Django treats as a parameter (and passes into a function).
URL definition, which includes the regex:
url(r'^streams/search/(?P<stream_query>\w+)/$', 'stream_search', name='stream_search')
Then, Django passes that parameter into the stream_search function, which takes that parameter:
def stream_search(request, stream_query):
    # here I manipulate the stream_query string, i.e. removing the hyphens
    pass
So, once again, I need an re to match a word or phrase, that are passed into the stream_query parameter (or if necessary, a second one).
So, what I want stream_query to have is:
word1
or
word1-word2-word3
If I understand your question correctly, then you might not have to use regexes at all.
Based on your example:
example.com/streams/search/rocket-league-fsdfs-fsdfs
It seems that the term you want to deal with is always found after the last /. So you can rsplit and then check for -. Here is an example:
url = "example.com/streams/search/rocket-league-fsdfs-fsdfs"
# url.rsplit("/", 1) gives ["example.com/streams/search", "rocket-league-fsdfs-fsdfs"]
result = url.rsplit("/", 1)[-1]
# result is now "rocket-league-fsdfs-fsdfs"
if "-" in result:
    pass  # a phrase: do whatever you want with the string
else:
    pass  # a single word: do whatever you want with the string
or a regex that would match either word or word-word-word would be: [\w-]+
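A sketch of how [\w-]+ could slot into the existing pattern, tested here with the plain re module in Python 3 rather than inside Django. It assumes the path is matched without a leading slash, as in the urls.py entry from the question; parse_stream_query is a hypothetical helper.

```python
import re

# Same pattern as in the question, with \w+ widened to [\w-]+ so that
# hyphen-separated phrases are captured as well.
STREAM_SEARCH = re.compile(r'^streams/search/(?P<stream_query>[\w-]+)/$')


def parse_stream_query(path):
    """Return the words of the query, or None if the path does not match."""
    match = STREAM_SEARCH.match(path)
    if match is None:
        return None
    return match.group('stream_query').split('-')


print(parse_stream_query('streams/search/game/'))
print(parse_stream_query('streams/search/final-fantasy-iv/'))
```

In the view, stream_query would then arrive as the whole hyphenated string, and the split on "-" could happen there instead.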
Try this,
import re
url = "http://example.com/something?id=123&action=yes"
regex = r"(query\d+)=(\w+)"
re.findall(regex, url)
You can also use Python's urlparse library:
from urlparse import urlparse  # Python 2; in Python 3 use: from urllib.parse import urlparse
parsed = urlparse("http://example.com/something?id=123&action=yes")
Calling urlparse returns:
ParseResult(scheme='http', netloc='example.com', path='/something', params='', query='id=123&action=yes', fragment='')
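To go one step further and get the individual parameters, the query string can be handed to parse_qs. This is a Python 3 sketch (both functions live in urllib.parse there); the variable names are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

parsed = urlparse("http://example.com/something?id=123&action=yes")

# parse_qs maps each parameter name to a list of values, because a
# name may legally appear more than once in a query string.
params = parse_qs(parsed.query)

print(parsed.netloc, parsed.path)
print(params)
```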