I'm trying to split a URL into parts so that I can work with each part separately.
For example, the URL:
'https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34'
How can I split this into:
1) the source/origin (i.e. protocol + subdomain + domain)
2) path '/api/addresses'
3) Query: '?postcode=XXSDF&houseNo=34'
You can just use Python's urlparse (on Python 2; the same function lives in urllib.parse on Python 3).
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
The urlparse function, found in urllib.parse in Python 3, is designed for this. Example adapted from the documentation:
>>> from urllib.parse import urlparse
>>> o = urlparse('https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34')
>>> o
ParseResult(scheme='https', netloc='api.somedomain.co.uk', path='/api/addresses', params='', query='postcode=XXSDF&houseNo=34', fragment='')
>>> o.scheme
'https'
>>> o.port
None
>>> o.geturl()
'https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34'
In order to get host, path and query, the API is straightforward:
>>> print(o.hostname, o.path, o.query)
Returns:
api.somedomain.co.uk /api/addresses postcode=XXSDF&houseNo=34
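To rebuild the origin the question asks for (scheme plus host), you can combine scheme and netloc yourself; a minimal sketch:
>>> f"{o.scheme}://{o.netloc}"
'https://api.somedomain.co.uk'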
In order to get the subdomain itself, the only way seems to be to split the hostname on '.'.
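A naive sketch (this breaks for hosts without a subdomain, and multi-part suffixes like .co.uk make the general problem harder):
>>> o.hostname.split('.')[0]
'api'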
Note that urllib.parse.urlsplit should be used instead of urlparse, according to the documentation (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit):
This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted
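For comparison, urlsplit on the same URL returns a SplitResult with no params field:
>>> from urllib.parse import urlsplit
>>> urlsplit('https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34')
SplitResult(scheme='https', netloc='api.somedomain.co.uk', path='/api/addresses', query='postcode=XXSDF&houseNo=34', fragment='')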
You probably want the stdlib module urlparse on Python 2, or urllib.parse on Python 3. This will split the URL up more finely than you're asking for, but it's not difficult to put the pieces back together again.
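For instance, urlunparse puts the tuple produced by urlparse back together; a quick sketch:
>>> from urllib.parse import urlparse, urlunparse
>>> parts = urlparse('https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34')
>>> urlunparse(parts)
'https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34'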
Related
I am trying to use urlparse Python library to parse some custom URIs.
I noticed that for some well-known schemes params are parsed correctly:
>>> from urllib.parse import urlparse
>>> urlparse("http://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='http', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
>>> urlparse("ftp://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='ftp', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
...but for custom ones they are not: the params field remains empty, and the params are instead treated as part of the path:
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint;param1=value1;param2=othervalue2', params='', query='query1=val1&query2=val2', fragment='fragment')
Why is there a difference in parsing depending on the scheme? How can I parse params with the urlparse library when using a custom scheme?
This is because urlparse assumes that only a fixed set of schemes use parameters in their URL format. You can see that check in the source code:
if scheme in uses_params and ';' in url:
    url, params = _splitparams(url)
else:
    params = ''
Which means urlparse will attempt to parse parameters only if the scheme is in uses_params (which is a list of known schemes).
uses_params = ['', 'ftp', 'hdl', 'prospero', 'http', 'imap',
'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips',
'mms', 'sftp', 'tel']
So to get the expected output you can append your custom scheme to the uses_params list and perform the urlparse call again.
>>> from urllib.parse import uses_params, urlparse
>>>
>>> uses_params.append('scheme')
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
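Note that uses_params is module-level state, so appending to it affects every later urlparse call in the process. If you'd rather not mutate it, a minimal sketch that splits the params off the path yourself (roughly what the private _splitparams helper does):
>>> parsed = urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
>>> path, _, params = parsed.path.partition(';')
>>> path, params
('/some/nested/endpoint', 'param1=value1;param2=othervalue2')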
Can you remove the custom scheme from the URL? That will always return the params:
urlparse("//some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
I'm trying to make a simple browser using PyQt5 (by following a tutorial). It's mostly working except for one tiny problem -
def navigate_to_url(self):
    q = QUrl(self.urlbar.text())
    print(type(q))
    if q.scheme() == "":
        q.setScheme("http")
    self.tabs.currentWidget().setUrl(q)
Whenever I type something in the address bar it looks it up, but it prepends 'http://'. If I type something like 'cats' I want it to work like a normal browser, i.e. bring me links that are associated with cats.
However, because 'http://' is added, it gives me a NAME_NOT_RESOLVED error.
Is there any way to fix this?
You could try checking whether the input is just a normal word, and if so skip adding the http://. For example, I've got a text document with A LOT of English words that you can use to check whether it's a normal word, like so:
import re

# dictionary.txt holds the list of plain English words
with open('dictionary.txt') as fh:
    contents = fh.read()

# word1 is the text typed into the address bar
if re.findall(r'\b' + re.escape(word1) + r'\b', contents, re.MULTILINE):
    # it's an ordinary word, so treat it as a search rather than a URL
    ...
Consider setting the URL to a search-engine query explicitly when the input does or doesn't match some criteria.
At its most basic, you could use urllib.parse.urlparse for this, though it may not be an exact fit for all addresses, as it expects a scheme prefix, which most people don't bother typing and instead let the browser add the http(s) implicitly
>>> import urllib.parse
>>> urllib.parse.urlparse("https://example.com") # full example
ParseResult(scheme='https', netloc='example.com', path='', params='', query='', fragment='')
>>> urllib.parse.urlparse("cats") # search works
ParseResult(scheme='', netloc='', path='cats', params='', query='', fragment='')
>>> urllib.parse.urlparse("example.com") # fails for missing scheme
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='')
A quick heuristic that a scheme-less input is still intended as a URL (i.e. that it is a netloc) is to check whether the parsed path contains a .
Alternatively, you could require searches to start with some marker (perhaps a space, or a keyword like d or s before the query)
You may also need to URL-encode your string (exchanging spaces for +, ? for %3F, etc.), which can also be done by urllib.parse's urllib.parse.quote_plus
>>> urllib.parse.quote_plus("What does a url-encoded cat query look like?")
'What+does+a+url-encoded+cat+query+look+like%3F'
Duck Duck Go Search Parameters
All together
import urllib.parse

url_search_template = "https://duckduckgo.com/?q={}"
keyword_search = "d "

def probably_a_search(s):
    # check for the prefix first to prevent matches against a search like 3.1415
    if s.startswith(keyword_search):
        return True, s[len(keyword_search):]  # slice off the search prefix
    parsed_url = urllib.parse.urlparse(s)
    if parsed_url.scheme or parsed_url.netloc:
        return False, s
    if "." in parsed_url.path:
        return False, s
    return True, s

text = self.urlbar.text()
is_search, text = probably_a_search(text)
if is_search:
    text = url_search_template.format(urllib.parse.quote_plus(text.strip()))
q = QUrl(text)
To get a more accurate test against the TLD (rather than the simple presence of .), a 3rd-party library like https://pypi.org/project/tld/ may work better for you
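If you'd rather not hand-roll the heuristic at all, tldextract (a different third-party library) can check whether the input ends in a recognized public suffix; a small sketch of the same idea:
import tldextract  # third-party: pip install tldextract

print(bool(tldextract.extract("example.com").suffix))  # True -> treat as an address
print(bool(tldextract.extract("cats").suffix))         # False -> treat as a search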
I just want to validate a LinkedIn public profile URL. I tried something like the below:
a = "https://in.linkedin.com/afadasdf"
p = re.compile('(http(s?)://|[a-zA-Z0-9\-]+\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
p.match(a)
The above works fine, but when I give the URL https://www.linkedin.com it doesn't match. Can anyone help me validate both cases?
It is the OR-ing between the http(s) and the www. which has given you the above problem. You could change them to * (i.e. 0 or more).
import re
a = "https://www.linkedin.com/afadasdf"
p = re.compile('((http(s?)://)*([a-zA-Z0-9\-])*\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
print p.match(a)
Although you might want to restrict it to www rather than any numbers or letters? So maybe:
p = re.compile('((http(s?)://)*([www])*\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
This pattern may help.
^((http|https):\/\/)?(www\.linkedin\.com\/)[a-z]+\/[a-zA-Z0-9-]{5,30}$
I have tested it and it works fine for me.
Instead of matching the url with a regex you could use the urllib module:
In [1]: import urllib.parse
In [2]: u = "https://in.linkedin.com/afadasdf"
In [3]: urllib.parse.urlparse(u)
Out[3]: ParseResult(scheme='https', netloc='in.linkedin.com', path='/afadasdf', params='', query='', fragment='')
Now you can check the netloc and path properties.
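A minimal sketch of such a check (the exact rules for what counts as a public profile URL are an assumption here, and is_linkedin_profile is a hypothetical helper):
from urllib.parse import urlparse

def is_linkedin_profile(url):
    parsed = urlparse(url)
    host = parsed.hostname or ''
    # accept linkedin.com plus country subdomains like in.linkedin.com
    host_ok = host == 'linkedin.com' or host.endswith('.linkedin.com')
    # a profile URL needs a non-empty path such as /afadasdf
    return parsed.scheme in ('http', 'https') and host_ok and parsed.path not in ('', '/')

print(is_linkedin_profile("https://in.linkedin.com/afadasdf"))  # True
print(is_linkedin_profile("https://www.linkedin.com"))          # False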
How do I truncate the below URLs right after the domain "com" using Python, i.e. keep youtube.com only?
youtube.com/video/AiL6nL
yahoo.com/video/Hhj9B2
youtube.com/video/MpVHQ
google.com/video/PGuTN
youtube.com/video/VU34MI
Is it possible to truncate like this?
Check out Python's urlparse library. It is part of the standard library, so nothing else needs to be installed.
So you could do the following:
import urlparse
import re

url_list = ['youtube.com/video/AiL6nL', 'yahoo.com/video/Hhj9B2',
            'youtube.com/video/MpVHQ', 'google.com/video/PGuTN',
            'youtube.com/video/VU34MI']

def check_and_add_http(url):
    # checks if 'http://' is present at the start of the URL and adds it if not
    http_regex = re.compile(r'^http[s]?://')
    if http_regex.match(url):
        # 'http://' or 'https://' is already present
        return url
    else:
        # add 'http://' for urlparse to work
        return 'http://' + url

for url in url_list:
    url = check_and_add_http(url)
    print(urlparse.urlsplit(url)[1])
You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
You can use split():
myUrl.split(r"/")[0]
to get "youtube.com"
and:
myUrl.split(r"/", 1)[1]
to get everything else
I'd use the function urlsplit from the standard library:
from urlparse import urlsplit # python 2
from urllib.parse import urlsplit # python 3
myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1] # returns 'docs.python.org'
No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.
>>> url = 'youtube.com/bla/foo'
>>> urlparse.urlsplit('//' + url)[1]
'youtube.com'
Just a crazy alternative solution using tldextract:
>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'
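Newer versions of tldextract also expose this directly as ext.registered_domain, which returns the same 'youtube.com' without the manual join.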
For your particular input, you could use str.partition() or str.split():
print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com
Note: the urlparse module (which you could use in general to parse a URL) doesn't work in this case:
import urlparse
urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
# query='', fragment='')
In general, it is safe to use a regex here if you know that all lines start with a hostname and each line otherwise contains a well-formed URI:
import re

# `text` holds the input, one URL per line as in the question
print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))
Output
youtube.com
yahoo.com
youtube.com
google.com
youtube.com
Note: it doesn't remove the optional port part (host:port).
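If you also want to drop the port, one variant (still assuming, as above, that no line carries a scheme) is to exclude ':' from the character class as well:
print("\n".join(re.findall(r"(?m)^\s*([^\/?#:]*)", text)))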
I can be given a string in any of these formats:
url: e.g. http://www.acme.com:456
string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and if present a port. If the port value is not present I would like it to default to 80.
I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
You can use urlparse to get the hostname from a URL string:
from urlparse import urlparse
print urlparse("http://www.website.com/abc/xyz.html").hostname # prints www.website.com
>>> from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
>>>
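When the port is absent, .port is None, so defaulting to 80 is a one-liner:
>>> urlparse('http://www.acme.com').port or 80
80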
I'm not that familiar with urlparse, but using regex you'd do something like:
p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match 'www.abc.com 123'
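Since the port group matches the empty string when no port is present, the defaulting mentioned above is a one-liner (a sketch, reusing m from the example):
port = int(m.group('port') or '80')  # '' falls back to the default 80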
The reason it fails for:
www.acme.com 456
is because it is not a valid URI. Why don't you just:
Replace the space with a :
Parse the resulting string by using the standard urlparse method
Try and make use of default functionality as much as possible, especially when it comes to parsing well-known formats like URIs.
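Putting those two steps together with the '//' trick shown earlier in this thread, a minimal sketch (host_port is a hypothetical helper, defaulting the port to 80 as the question asks):
from urllib.parse import urlparse

def host_port(s, default_port=80):
    s = s.replace(' ', ':')  # "www.acme.com 456" -> "www.acme.com:456"
    if '//' not in s:
        s = '//' + s         # force urlparse to treat the input as a netloc
    parsed = urlparse(s)
    return parsed.hostname, parsed.port or default_port

print(host_port('http://www.acme.com:456'))  # ('www.acme.com', 456)
print(host_port('www.acme.com 456'))         # ('www.acme.com', 456)
print(host_port('www.acme.com'))             # ('www.acme.com', 80)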
Method using urllib -
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output -
ParseResult(scheme='https', netloc='stackoverflow.com',
path='/questions', params='', query='', fragment='')
Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python