Finding a URL in a string using Python regex

The Regex I use to find the URL in Python is as follows.
import re
regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
This regex helps me find urls like:
https://google.com
http://google.com
google.com/index.php?q=test
www.test.google.com/index.php
But I also want it to accept a url like below:
//ads.google.com
What do I need to change in the regex above to allow this?

Well, for this I would never try to write my own regex; there are already well-tested ones available. For URLs I use this one, which works for your case:
regex = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))
"

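For the first question above, if you would rather keep the original regex, here is a minimal sketch of one way to let it accept protocol-relative URLs such as //ads.google.com. It adds // as an allowed prefix and relaxes the leading \b, which does not match between whitespace (or the start of the string) and a slash. The names START, PREFIX, BODY, TAIL and find_urls are made up for this example, and the lookbehind rule is only an assumption about where a // URL may begin.

```python
import re

# The question's pattern split into pieces; only START and the "//" prefix are new.
START = r"(?:\b|(?<![\w/:])(?=//))"  # assumption: "//" may start a URL unless glued to a word, "/" or ":"
PREFIX = r"(?:https?://|//|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
BODY = r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"
TAIL = r"(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])"

URL_RE = re.compile(START + "(" + PREFIX + BODY + TAIL + ")", re.IGNORECASE)


def find_urls(text):
    """Return every URL-like substring of text (group 1 of each match)."""
    return [m.group(1) for m in URL_RE.finditer(text)]


print(find_urls("see //ads.google.com and https://google.com"))
```

The pattern contains inner capturing groups, so this sketch uses finditer and group(1) rather than findall, which would return tuples.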
Related

How do I remove a part of an URL using regex in Python?

I have a list of URLs that look like this:
'https://www.superpopgadget.com/collections/best-sellers/products/sushi-roll-bazooka?Ffbclid=IwAR3WfVizYJF0RCP2AsSoulLjJK2_OUwQZ0Y1eep_b3Einm1XNJbcF_K3wYI'
I want to trim each one to just get:
'https://www.superpopgadget.com/collections/best-sellers/products/sushi-roll-bazooka'
Not sure if there is any other more efficient method but this might work fine:
(.+)\?(.+)
It matches in the first group everything before the character ? and the second group is everything after it. What you need is the first group.
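As a rough illustration of the first-group idea, here is a small sketch. The URL is a shortened, hypothetical stand-in for the one in the question, and the variable names are invented for the example. It also shows a regex-free alternative that splits once at the first ? character.

```python
import re

# Hypothetical URL with a tracking query string, shaped like the one in the question.
url = "https://example.com/collections/best-sellers/products/sushi-roll-bazooka?fbclid=abc123"

match = re.match(r"(.+)\?(.+)", url)
# Group 1 is everything before the "?"; fall back to the whole URL if there is none.
base = match.group(1) if match else url

# Same result without a regex: split once at the first "?".
base_no_regex = url.split("?", 1)[0]

print(base)
```

Note that (.+) is greedy, so for a URL containing several ? characters the regex keeps everything up to the last one, while the split version cuts at the first.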

re.sub replacing too much text

I have a set of links like:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss']
I'm trying to iterate over them to remove everything that comes after html. So I have:
cleanitems = []
for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))
Which returns:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']
Confused as to why it's including html in the capture group. Thanks for any help.
html is part of the matched text too, not just the (...) group. re.sub() replaces the entire matched text, not only what the group captured.
Include the literal html text in the replacement:
cleanitems.append(re.sub(r'html(.*)', 'html', item))
or, alternatively, capture that part in a group instead:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
You may also want to include the . dot, to make sure you are really only matching the .html extension, and a $ end-of-string anchor, to make it explicit that everything up to the end of the string is removed. Note that the match still starts at the first .html in the string, so this does not protect a URL that contains .html more than once:
cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
However, if your goal is to remove the query string from a URL, consider parsing the URL using urllib.parse.urlparse(), and re-building it without the query string or fragment identifiers:
from urllib.parse import urlparse
cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
This won't remove the erroneous HTML chunks, however; if you are parsing these URLs from an HTML document, consider using a real HTML parser rather than regex.
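A self-contained sketch of the urlparse approach, using two of the clean links from the question. The helper name strip_query is invented for this example.

```python
from urllib.parse import urlparse

links = [
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
    'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss',
]


def strip_query(url):
    """Return url without its query string or fragment (hypothetical helper)."""
    return urlparse(url)._replace(query='', fragment='').geturl()


cleanitems = [strip_query(item) for item in links]
print(cleanitems)
```

urlparse returns a named tuple, which is why _replace is available here; it is the documented way to build a modified copy of such a result.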
Just a complement to Martijn's answer.
You could also use a lookbehind assertion to only match the text following html:
cleanitems.append(re.sub(r'(?<=html).*', '', item))
or use a replacement string to keep the initial part:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
But as Martijn already said, you'd be better off using the urllib module to parse URLs correctly.
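To compare the regex variants from both answers side by side, here is a small sketch that runs each of them on two of the links from the question. The dictionary and function names are just for illustration.

```python
import re

items = [
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
]

# Each variant removes the text after "html" but keeps "html" itself.
variants = {
    'literal replacement': lambda s: re.sub(r'html(.*)', 'html', s),
    'capture group': lambda s: re.sub(r'(html).*', r'\1', s),
    'lookbehind': lambda s: re.sub(r'(?<=html).*', '', s),
    'dot and anchor': lambda s: re.sub(r'\.html.*?$', '.html', s),
}

results = {name: [clean(item) for item in items] for name, clean in variants.items()}
for name, cleaned in results.items():
    print(name, cleaned)
```

On these inputs all four variants give the same result; they differ only in how they express "keep html, drop the rest".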

Regex to extract all urls from string

I have a string like this
http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/
I would like to extract all URLs / web addresses into an array, for example:
urls = ['http://example.com/path/topage.html','http://twitter.com/p/xyan',.....]
Here is my approach which didn't work.
import re
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)
print(links)
# the result is always the same as strings
The problem is that your regex pattern is too inclusive: it matches all of the URLs as a single match. You can use a lookahead, (?=...), to stop each match just before the next URL begins.
Try this:
re.findall(r"((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)
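One thing to watch with the pattern above: it contains capturing groups, so re.findall returns a tuple of groups per match rather than plain strings. Below is a sketch of the same lookahead idea written with non-capturing groups, so that findall returns the URLs directly. The pattern is a simplified variant for illustration, not a general URL matcher.

```python
import re

strings = (
    "http://example.com/path/topage.html"
    "http://twitter.com/p/xyanhs"
    "http://httpget.org/get.zip"
    "www.google.com/privacy.html"
    "https://goodurl.net/"
)

# Start at a scheme or "www.", then lazily consume characters until the next
# scheme, the next "www.", or the end of the string is just ahead.
pattern = r"(?:https?://|www\.)(?:www\.)?.*?(?=https?://|www\.|$)"

urls = re.findall(pattern, strings)
print(urls)
```

Because it splits on every www., this approach would also break a URL that legitimately contains www. in its path.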
Your problem is that http:// is being accepted as a valid part of a url. This is because of this token right here:
[$-_#.&+]
or more specifically:
$-_
This matches every character in the range from $ to _, which includes many more characters than you probably intended (among them : and /).
You can change this to [$\-_#.&+] but this causes problems since now, / characters will not match. So add it by using [$\-_#.&+/]. However, this will again cause problems since http://example.com/path/topage.htmlhttp would be considered a valid match.
The final addition is to add a lookahead to ensure that you are not matching http:// or https://, which just so happens to be the first part of your regex!
http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_#.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
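A short sketch that checks both claims: that the unescaped range $-_ accepts : and /, and that the final pattern with the lookahead splits the sample string at each http:// or https://. The variable names are invented for the example.

```python
import re

# Unescaped, "$-_" is a character range; escaped, it is just three literal characters.
loose = re.compile(r"[$-_]")
strict = re.compile(r"[$\-_]")
range_check = {
    ch: (bool(loose.fullmatch(ch)), bool(strict.fullmatch(ch)))
    for ch in ":/"
}
print(range_check)

strings = (
    "http://example.com/path/topage.html"
    "http://twitter.com/p/xyanhs"
    "http://httpget.org/get.zip"
    "www.google.com/privacy.html"
    "https://goodurl.net/"
)

# The final pattern from the answer: a letter is only accepted if it does not
# start a new "http://" or "https://".
pattern = (
    r"http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_#.&+/]|[!*\(\),]"
    r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
links = re.findall(pattern, strings)
print(links)
```

The www.google.com address has no scheme, so this pattern leaves it attached to the URL before it.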
A simple answer without getting into much complication:
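import re
url_list = []
for x in re.split("http://", strings):  # strings is the input text from the question
    url_list.append(re.split("https://", x))
# flatten the list of lists; note that this can leave empty strings behind
url_list = [item for sublist in url_list for item in sublist]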
If you want to put the http:// and https:// prefixes back on the URLs, adjust the code accordingly. I hope that conveys the idea.
Here's mine:
(r'http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+')

Python Regex for URL doesn't work

I'm using python and trying to use a regex to see whether there is a url within my string. I've tried multiple different regexes but they always come out with 'None', even if the string is clearly a website.
Example:
>>> print re.search(r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i','www.google.com')
None
Any help would be appreciated!
What about switching to something like:
r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
For a detailed survey of many, many regexes validating URLs, see https://mathiasbynens.be/demo/url-regex ...
If you want to check whether a string is a URL, you can use:
print(re.search(r'(^(https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+)$)', 'www.google.com', re.I))
If you want to check whether a string contains a URL, you only need to remove the ^ and $ anchors:
print(re.search(r'((https?://|www\.)([a-z0-9-]+\.)+([a-z0-9]+))', 'www.google.com', re.I))
Remember: re.I makes the match case-insensitive, ^ matches the start of the string and $ matches the end of the string (they only match at line boundaries when re.MULTILINE is set).
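A sketch that puts the two patterns above into compiled form and also shows why the attempt in the question returns None: Python has no /pattern/i literal syntax, so the leading /, the trailing / and the i are treated as literal characters that must appear in the string. The names IS_URL, CONTAINS_URL and JS_STYLE are made up for the example.

```python
import re

# Whole string must be a URL (anchored) versus URL anywhere in the string.
IS_URL = re.compile(r'^(https?://|www\.)([a-z0-9-]+\.)+[a-z0-9]+$', re.I)
CONTAINS_URL = re.compile(r'(https?://|www\.)([a-z0-9-]+\.)+[a-z0-9]+', re.I)

# The pattern from the question, including its JavaScript-style delimiters.
JS_STYLE = re.compile(r'/((?:https?\:\/\/|www\.)(?:[-a-z0-9]+\.)*[-a-z0-9]+.*)/i')

print(IS_URL.search('www.google.com') is not None)
print(IS_URL.search('see www.google.com now') is not None)
print(CONTAINS_URL.search('see www.google.com now').group(0))
print(JS_STYLE.search('www.google.com'))
```

Passing re.I as a flag is the Python counterpart of the trailing i in the original attempt.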
The grammar for a valid URL is explained in this wiki. Based on it, this regex matches a string that starts with a valid URL:
^((?:https?|ftp):\/{2}[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
And in case you want to make the scheme part of the URL optional:
^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)
Output
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','www.google.com').group()
'www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','http://www.google.com').group()
'http://www.google.com'
>>> re.search(r'^((?:https?|ftp)?(?::\/{2})?[\w.\/]+(?::\d{1,4})?\/?[?\w_#\/.]+)','https://www.google.com').group()
'https://www.google.com'
I've used the following regex to verify that the entered string is a URL:
r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:#\-_=#])*'

python regex match querystring path

I'm trying to write a regex to match any path that contains /? to determine whether it is a querystring or not.
a sample string to be matched would be this: /mysite/path/to/whatever/?page=1
so far I thought this would match re.match(r'/\?', '/mysite/path/to/whatever/?page=1')
but it doesn't seem to be matching
This code is already written for you. No need to reinvent the wheel:
from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module
print(urlparse('/mysite/path/to/whatever/?page=1'))
http://docs.python.org/library/urlparse.html#module-urlparse
Your problem is that you're using re.match. That function looks for matches at the beginning of the string. So, either you change your regexp to '.*/\?', or use re.search instead.
You don't need a regular expression here. Just use the in operator: '/?' in the_string.
The problem is that re.match only looks at the beginning of the string.
You could use re.search instead, if you need the power of REs.
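The suggestions in this thread can be compared in a few lines. This sketch uses the sample path from the question and the Python 3 location of urlparse; the variable names are invented for the example.

```python
import re
from urllib.parse import urlparse

path = '/mysite/path/to/whatever/?page=1'

# re.match only tries to match at the start of the string, so this is None.
at_start = re.match(r'/\?', path)

# re.search scans the whole string and finds the "/?" near the end.
anywhere = re.search(r'/\?', path)

# For a fixed substring, the in operator is enough.
has_marker = '/?' in path

# urlparse separates the query string from the path.
query = urlparse(path).query

print(at_start, anywhere is not None, has_marker, query)
```

If the real question is "does this URL have a query string", checking the parsed query is probably the most robust of the four, since it does not depend on a slash appearing right before the ?.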
