I want to check a URL against a list of defined patterns.
My pattern list is:
pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
I used this code to check whether the URL matches one of the items in this list:
import re
f = re.compile('|'.join(pat))
if f.match(self.request.uri):
    self.login = True
else:
    self.login = False
Now, if I request /FoodListAdminCP/Dashboard as the URL, it matches, because the start of this URL matches '/FoodListAdminCP[/]?', which is in my list.
I want the request URL to match an entire list item, not just part of it.
How can I do it?
If you want to match the entire URL against your pattern, you can use '^' and '$' to anchor the beginning and the end of the string to be matched.
In your example you could use
f = re.compile('|'.join('(^' + p + '$)' for p in pat))
to get the regular expression
'(^/FoodListAdminCP/Login[/]?$)|(^/FoodListAdminCP[/]?$)'
from your pat list.
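For illustration, here is a small self-contained sketch of the anchored pattern in action (using a plain uri string in place of self.request.uri from the question):
import re
pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
# Each alternative is anchored, so a prefix match alone no longer counts.
f = re.compile('|'.join('(^' + p + '$)' for p in pat))
print(bool(f.match('/FoodListAdminCP/Dashboard')))  # False
print(bool(f.match('/FoodListAdminCP/')))           # True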
If you'd rather not concatenate the patterns (which may not be flexible enough) but match them separately, you can use the list comprehension [re.compile(p).match(uri) for p in pat] to get a list of match results for all the patterns:
>>> import re
>>> pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
>>> uri = '/FoodListAdminCP/Dashboard'
>>> match_results = [re.compile(p).match(uri) for p in pat]
>>> match_results
[None, <_sre.SRE_Match object at 0x101c05d30>]
Then you can ask whether all of the results are matches using all(), which is what you want your login to be:
>>> login = all(match_results)
>>> login
False
Or in short:
login = all([re.compile(p).match(uri) for p in pat])
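As a side note (my own sketch, not part of the answer above): if the goal is for login to be True whenever any single pattern matches the whole URI, re.fullmatch (available since Python 3.4) combined with any() expresses that directly:
import re
pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
uri = '/FoodListAdminCP/Dashboard'
# True if at least one pattern matches the entire URI.
login = any(re.fullmatch(p, uri) for p in pat)
print(login)  # False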
Another option is to put \Z at the end of your regex:
f = re.compile('(' + '|'.join(pat) + r')\Z')
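A quick check of that variant (a sketch, reusing the pat list from the question):
import re
pat = ['/FoodListAdminCP/Login[/]?', '/FoodListAdminCP[/]?']
f = re.compile('(' + '|'.join(pat) + r')\Z')
print(bool(f.match('/FoodListAdminCP/Dashboard')))  # False: \Z forces the match to reach the end
print(bool(f.match('/FoodListAdminCP/Login')))      # True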
I need to filter URLs with a regex, extracting the last path segment from each of them, except for several that should be skipped. For example:
import re
urls_to_exclude = ["example_1", "example_2", "example_3"]
url_1 = "https://site.com/api/user/endpath"
url_2 = "https://site.com/api/user/other_end?limit=10"
url_3 = "https://site.com/api/customer/example_1#tag"
url_4 = "https://site.com/api/blog/example_2"
>>> re.findall(r"...magic_regex...", url_1)
['endpath']
>>> re.findall(r"...magic_regex...", url_2)
['other_end']
>>> re.findall(r"...magic_regex...", url_3)
['example_1']
>>> re.findall(r"...magic_regex...", url_4)
['example_2']
It should be a regex string or a compiled regex object.
Thank you
You can try this regex. It will not give you exactly the last path segment, but you can easily trim the result using result[1:-1].
Regex:
/[\w\d_-]+[?^"#]
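If you'd rather capture the segment directly without trimming, here is a sketch of an alternative pattern (my own variation, not the pattern above). It anchors on the end of the string and treats a trailing query string or fragment as optional:
import re
urls = [
    "https://site.com/api/user/endpath",
    "https://site.com/api/user/other_end?limit=10",
    "https://site.com/api/customer/example_1#tag",
    "https://site.com/api/blog/example_2",
]
# Last path segment, optionally followed by ?query or #fragment.
last_path = re.compile(r'/([\w-]+)(?:[?#].*)?$')
for url in urls:
    print(last_path.search(url).group(1))
# endpath, other_end, example_1, example_2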
So, I have this URL: https://www.last.fm/music/Limp+Bizkit/Significant+Other
I want to split it, to only keep the Limp+Bizkit and Significant+Other part of the URL. These are variables, and can be different each time. These are needed to create a new URL (which I know how to do).
I want the Limp+Bizkit and Significant+Other to be two different variables. How do I do this?
You can use the str.split method and use the forward slash as the separator.
>>> url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
>>> *_, a, b = url.split("/")
>>> a
'Limp+Bizkit'
>>> b
'Significant+Other'
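An equivalent variant (my own sketch, not from the answer above) slices out the last two path segments directly:
>>> url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
>>> a, b = url.split("/")[-2:]
>>> a, b
('Limp+Bizkit', 'Significant+Other')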
You can remove https://www.last.fm/music/ from the URL with replace to get just Limp+Bizkit/Significant+Other, then split it at the / character to break it into two strings. The result is a list, and you can access the parts with url[0] and url[1]:
>>> url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
>>> url = url.replace("https://www.last.fm/music/",'').split('/')
>>> first_value = url[0]
>>> second_value = url[1]
>>> first_value
'Limp+Bizkit'
>>> second_value
'Significant+Other'
You can use regular expressions to achieve this.
import regex as re
url = "https://www.last.fm/music/Limp+Bizkit/Significant+Other"
match = re.match(r"^.*//.*/.*/(.*)/(.*)", url)
print(match.group(1))  # Limp+Bizkit
print(match.group(2))  # Significant+Other
I want to input a URL and extract the domain name, which is the string that comes after http:// or https:// and contains letters, numbers, dots, underscores, or dashes.
I wrote the regex and used the python's re module as follows:
import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)
My understanding is that m.group(1) will extract the part between () in the re.search.
The output that I expect is: google.co.uk
But I am getting this:
<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>
Can you point to me how to use re to achieve my requirement?
You need to write
print(m.group(1))
Better yet, check that there actually is a match first:
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
Jan has already provided a solution for this. But just to note, we can implement the same thing without using re. All it needs is the punctuation set !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for validation purposes, which is available from the string package as string.punctuation.
def domain_finder(link):
    import string
    dot_splitter = link.split('.')
    # Skip past the 'scheme://' prefix in the first chunk, if present.
    separator_first = 0
    if '//' in dot_splitter[0]:
        separator_first = dot_splitter[0].find('//') + 2
    # Cut the last chunk at the first punctuation character (e.g. '?', '/', '#').
    separator_end = ''
    for ch in dot_splitter[2]:
        if ch in string.punctuation:
            separator_end = ch
            break
    if separator_end:
        end_ = dot_splitter[2].split(separator_end)[0]
    else:
        end_ = dot_splitter[2]
    domain = [dot_splitter[0][separator_first:], dot_splitter[1], end_]
    domain = '.'.join(domain)
    return domain

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # prints ==> google.co.uk
This is just another way of solving the same without re.
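Another non-regex option (my own addition, not part of the answer above) is the standard library's urllib.parse, whose netloc field is the host portion of the URL:
from urllib.parse import urlparse
link = 'https://google.co.uk?link=something'
print(urlparse(link).netloc)  # google.co.uk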
There is a library called tldextract which is very reliable for this case.
Here is how it works:
import tldextract

def extractDomain(url):
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        parsed = ".".join([i for i in parsed if i])
        return parsed
    else:
        return "NA"

op = open("out.txt", 'w')
# with open("test.txt") as ptr:
#     for lines in ptr.read().split("\n"):
#         op.write(str(extractDomain(lines)) + "\n")
print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))
The output is as follows:
test.pythonhosted.org
I'm working on a regular expression and was wondering how to extract a URL from an HTML page.
I want to print out the url from this line:
Website is: http://www.somesite.com
Every time that link is found, I want to extract just the URL that comes after "Website is:".
Any help will be appreciated.
Will this suffice or do you need to be more specific?
In [230]: s = 'Website is: http://www.somesite.com '
In [231]: re.findall(r'Website is:\s+(\S+)', s)
Out[231]: ['http://www.somesite.com']
You could match each line to a regular expression with a capturing group, like so:
for l in page:
    m = re.match("Website is: (.*)", l)
    if m:
        print(m.groups()[0])
This would both check if each line matched the pattern, and extract the link from it.
A few pitfalls:
This assumes that the "Website is" expression is always at the start of the line. If it's not, you could use re.search.
This assumes there is exactly one space between the colon and the website. If that's not true, you could change the expression to something like Website is:\s+(http.*), as sketched after this list.
The specifics will depend on the page you are trying to parse.
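For instance, a small sketch of the looser pattern from the second pitfall above (assuming a line variable holding one line of the page):
import re
line = "Some text here. Website is:   http://www.somesite.com"
m = re.search(r"Website is:\s+(http.*)", line)
if m:
    print(m.group(1))  # http://www.somesite.com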
Regex might be overkill for this since it's so simple.
def main():
    urls = []
    file = prepare_file("<yourfile>.html")
    for i in file:
        if "www" in i or "http://" in i:
            urls.append(i)
    return urls

def prepare_file(filename):
    file = open(filename)
    a = file.readlines()                       # one entry per line
    a = [i.strip() for i in a]                 # remove surrounding whitespace
    a = list(filter(lambda x: x != '', a))     # remove empty elements
    return a
URLs are awkward to capture with regex, according to what I've read.
The following regex pattern will probably work well for you:
pat = 'Website is: (%s)' % fireball
where fireball is a pattern to catch URLs that you'll find here:
daringfireball.net/2010/07/improved_regex_for_matching_urls
For example:
string = "This is a link http://www.google.com"
How could I extract 'http://www.google.com'?
(Each link will be of the same format i.e 'http://')
There may be a few ways to do this, but the cleanest would be to use a regex:
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com
If there can be multiple links, you can use something like the following:
>>> myString = "These are the links http://www.google.com and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
>>>
There is another way to extract URLs from text easily. You can use urlextract to do it for you; just install it via pip:
pip install urlextract
and then you can use it like this:
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']
You can find more info on my github page: https://github.com/lipoja/URLExtract
NOTE: It downloads a list of TLDs from iana.org to keep you up to date. But if the program does not have internet access then it's not for you.
In order to find a web URL in a generic string, you can use a regular expression (regex).
A simple regex for URL matching like the following should fit your case.
regex = r'('
# Scheme (HTTP, HTTPS, FTP and SFTP):
regex += r'(?:(https?|s?ftp):\/\/)?'
# www:
regex += r'(?:www\.)?'
regex += r'('
# Host and domain (including ccSLD):
regex += r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'
# TLD:
regex += r'([A-Z]{2,6})'
# IP Address:
regex += r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
regex += r')'
# Port:
regex += r'(?::(\d{1,5}))?'
# Query path:
regex += r'(?:(\/\S+)*)'
regex += r')'
If you want to be even more precise, in the TLD section, you should ensure that the TLD is a valid TLD (see the entire list of valid TLDs here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt):
# TLD:
regex += r'(com|net|org|eu|...)'
Then, you can simply compile the former regex and use it to find possible matches:
import re
string = "This is a link http://www.google.com"
find_urls_in_string = re.compile(regex, re.IGNORECASE)
url = find_urls_in_string.search(string)
if url is not None and url.group(0) is not None:
    print("URL parts: " + str(url.groups()))
    print("URL: " + url.group(0).strip())
Which, in case of the string "This is a link http://www.google.com" will output:
URL parts: ('http://www.google.com', 'http', 'google.com', 'com', None, None)
URL: http://www.google.com
If you change the input with a more complex URL, for example "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore" the output will be:
URL parts: ('https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo', 'https', 'host.domain.com', 'com', '80', '/path/page.php?query=value&a2=v2#foo')
URL: https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo
NOTE: If you are looking for more URLs in a single string, you can still use the same regex, but just use findall() instead of search().
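For instance, a short sketch (assuming the regex string assembled above is in scope). With several capturing groups in the pattern, findall() returns one tuple per match, and the first element of each tuple is the outer group, i.e. the whole URL:
import re
text = "Links: http://www.google.com and https://example.org/page?q=1 here."
matches = re.findall(regex, text, re.IGNORECASE)
urls = [groups[0] for groups in matches]
print(urls)  # ['http://www.google.com', 'https://example.org/page?q=1']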
This extracts all URLs with parameters; somehow none of the above examples worked for me.
import re
data = 'https://net2333.us3.list-some.com/subscribe/confirm?u=f3cca8a1ffdee924a6a413ae9&id=6c03fa85f8&e=6bbacccc5b'
WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))"""
print(re.findall(WEB_URL_REGEX, data))
You can extract any URL from a string using the following patterns:
1.
>>> import re
>>> string = "This is a link http://www.google.com"
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?'
>>> re.search(pattern, string).group()
'http://www.google.com'
>>> TWEET = ('New Pybites article: Module of the Week - Requests-cache '
'for Repeated API Calls - http://pybit.es/requests-cache.html '
'#python #APIs')
>>> re.search(pattern, TWEET).group()
'http://pybit.es/requests-cache.html'
>>> tweet = ('Pybites My Reading List | 12 Rules for Life - #books '
'that expand the mind! '
'http://pbreadinglist.herokuapp.com/books/'
'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter'
' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter']
To take the above pattern to the next level, we can also detect hashtags along with URLs in the following way:
2.
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?|#[.\w]*'
>>> re.findall(pattern, tweet)
['#books', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']
The above example for extracting URLs and hashtags can be shortened to:
>>> pattern = r'((?:#|http)\S+)'
>>> re.findall(pattern, tweet)
['#books', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']
The pattern below matches two alphanumeric strings separated by "." as a URL:
>>> pattern = r'(?:http://)?\w+\.\S*[^.\s]'
>>> tweet = ('PyBites My Reading List | 12 Rules for Life - #books '
'that expand the mind! '
'www.google.com/telephone/wire.... '
'http://pbreadinglist.herokuapp.com/books/'
'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter '
"http://-www.pip.org "
"google.com "
"twitter.com "
"facebook.com"
' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['www.google.com/telephone/wire', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', 'www.pip.org', 'google.com', 'twitter.com', 'facebook.com']
You can try any complicated URL with patterns 1 and 2 above.
To learn more about the re module in Python, check out REGEXES IN PYTHON by Real Python.
Cheers!
I've used a slight variation of Abhijit's accepted answer.
This one uses \S instead of [^\s], which is equivalent but more concise. It also doesn't use a named group, because there is just one and we can omit the name for simplicity:
import re
my_string = "This is my tweet check it out http://example.com/blah"
print(re.search(r'(https?://\S+)', my_string).group())
Of course, if there are multiple links to extract, just use .findall():
print(re.findall(r'(https?://\S+)', my_string))