I can be given a string in any of these formats:
url: e.g http://www.acme.com:456
string: e.g www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and if present a port. If the port value is not present I would like it to default to 80.
I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
You can use urlparse to get hostname from URL string:
from urlparse import urlparse
print urlparse("http://www.website.com/abc/xyz.html").hostname # prints www.website.com
>>> from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
>>>
I'm not that familiar with urlparse, but using regex you'd do something like:
p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match 'www.abc.com 123'
The reason it fails for:
www.acme.com 456
is because it is not a valid URI. Why don't you just:
Replace the space with a :
Parse the resulting string by using the standard urlparse method
Try and make use of default functionality as much as possible, especially when it comes to things like parsing well know formats like URI's.
Method using urllib -
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output -
ParseResult(scheme='https', netloc='stackoverflow.com',
path='/questions', params='', query='', fragment='')
Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
Related
I am trying to use urlparse Python library to parse some custom URIs.
I noticed that for some well-known schemes params are parsed correctly:
>>> from urllib.parse import urlparse
>>> urlparse("http://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='http', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
>>> urlparse("ftp://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='ftp', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
...but for custom ones - they are not. params field remains empty. Instead, params are treated as a part of path:
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint;param1=value1;param2=othervalue2', params='', query='query1=val1&query2=val2', fragment='fragment')
Why there is a difference in parsing depending on schema? How can I parse params within urlparse library using custom schema?
This is because urlparse assumes that only a set of schemes will uses parameters in their URL format. You can see that check with in the source code.
if scheme in uses_params and ';' in url:
url, params = _splitparams(url)
else:
params = ''
Which means urlparse will attempt to parse parameters only if the scheme is in uses_params (which is a list of known schemes).
uses_params = ['', 'ftp', 'hdl', 'prospero', 'http', 'imap',
'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips',
'mms', 'sftp', 'tel']
So to get the expected output you can append your custom scheme into uses_params list and perform the urlparse call again.
>>> from urllib.parse import uses_params, urlparse
>>>
>>> uses_params.append('scheme')
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
Can you remove that custom schemes from the url?
That allways will return the params
urlparse("//some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
SO I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I'd need a function that can strip just query2 for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
I think maybe urllib.parse or furl can do this in an easy and clean way?
You should use urllib.parse as it's designed exactly for these purposes. I'm unclear the reason for anyone reinventing the wheel here.
Basically 3 steps:
Use urlparse to parse the url into it's component parts
Use parse_qs to parse the query string part of that keeping blanks (if relevant intact)
Remove the erroneous query2 and re-encode the query string and url back
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del(query_string['query2'])
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>>['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>>{'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>>https://foo.bar?query1=value1&query3=value3
If you want by position:
url="https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1=url.find("&")
findindex2=url.find("&",findindex1+1)
url=url[0:findindex1]+url[findindex2:len(url)]
if you want by the name:
url="https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1=url.find("query2")
findindex2=url.find("&",findindex1+1)
if findindex2==-1:
url=url[0:findindex1-1]
else:
url=url[0:findindex1-1]+url[findindex2:len(url)]
Hi you could try it with regular expressions.
re.sub("ThePatternOfTheURL","ThePatternYouWantToHave", "TheInput")
so it could look something like that
pattern = "'(https\:\/\/)([a-zA-Z.?0-9=]+)([&]query2=value2)([&][a-zA-Z0-9=]+)'"
#filters the third group out with query2
filter = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL=re.sub(pattern, filter, yourUrl)
I think this should work for you
This question already has answers here:
How can I split a URL string up into separate parts in Python?
(6 answers)
Closed 6 years ago.
I'm trying to split a url into parts so that I can work with these separately.
For e.g. the url:
'https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34'
How can I split this into:
1) the source/origin (i.e. protocol + subdomain + domain)
2) path '/api/addresses'
3) Query: '?postcode=XXSDF&houseNo=34'
You can just use python's urlparse.
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
The urlparse library, found in urllib in Python3, is designed for this. Example adapted from the documentation:
>>> from urllib.parse import urlparse
>>> o = urlparse('https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34')
>>> o
ParseResult(scheme='https', netloc='api.somedomain.co.uk', path='/api/addresses', params='', query='postcode=XXSDF&houseNo=34', fragment='')
>>> o.scheme
'http'
>>> o.port
None
>>> o.geturl()
'https://api.somedomain.co.uk/api/addresses?postcode=XXSDF&houseNo=34'
In order to get host, path and query, the API is straighforward:
>>> print(o.hostname, o.path, o.query)
Returns:
api.somedomain.co.uk /api/addresses postcode=XXSDF&houseNo=34
In order to get the subdomain itself, the only way seems to split by ..
Note that the urllib.parse.urlsplit should be used instead urlparse, according to the documentation:
This should generally be used instead of urlparse(https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit) if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted
You probably want the stdlib module urlparse on Python 2, or urllib.parse on Python 3. This will split the URL up more finely than you're asking for, but it's not difficult to put the pieces back together again.
I jus want to validate the linkedin public profile url. I tried the concept like below
a = "https://in.linkedin.com/afadasdf"
p = re.compile('(http(s?)://|[a-zA-Z0-9\-]+\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
p.match(a)
The above concept is working fine. But when i give the url https://www.linkedin.com means that it's not working. Can anyone help me to validate both concepts.
It is the oring between the http(s) and www. which has given you the above problem. You could change them to * (i.e. 0 or more).
import re
a = "https://www.linkedin.com/afadasdf"
p = re.compile('((http(s?)://)*([a-zA-Z0-9\-])*\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
print p.match(a)
Although you might want to restrict it to www rather than any numbers or letters? So maybe:
p = re.compile('((http(s?)://)*([www])*\.|[linkedin])[linkedin/~\-]+\.[a-zA-Z0-9/~\-_,&=\?\.;]+[^\.,\s<]')
This pattern may help.
^((http|https):\/\/)?+(www.linkedin.com\/)+[a-z]+(\/)+[a-zA-Z0-9-]{5,30}+$
I have tested it and it works fine for me.
Instead of matching the url with a regex you could use the urllib module:
In [1]: import urllib
In [2]: u = "https://in.linkedin.com/afadasdf"
In [3]: urllib.parse.urlparse(u)
Out[3]: ParseResult(scheme='https', netloc='in.linkedin.com', path='/afadasdf', params='', query='', fragment='')
Now you can check for the netloc and path property.
This question already has answers here:
Get protocol + host name from URL
(16 answers)
Closed 9 years ago.
How do i truncate the below URL next to the domain "com" using python. i.e you tube.com only
youtube.com/video/AiL6nL
yahoo.com/video/Hhj9B2
youtube.com/video/MpVHQ
google.com/video/PGuTN
youtube.com/video/VU34MI
Is it possible to truncate like this?
Check out Pythons urlparse library. It is a standard library so nothing else needs to be installed.
So you could do the following:
import urlparse
import re
def check_and_add_http(url):
# checks if 'http://' is present at the start of the URL and adds it if not.
http_regex = re.compile(r'^http[s]?://')
if http_regex.match(url):
# 'http://' or 'https://' is present
return url
else:
# add 'http://' for urlparse to work.
return 'http://' + url
for url in url_list:
url = check_and_add_http(url)
print(urlparse.urlsplit(url)[1])
You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
You can use split():
myUrl.split(r"/")[0]
to get "youtube.com"
and:
myUrl.split(r"/", 1)[1]
to get everything else
I'd use the function urlsplit from the standard library:
from urlparse import urlsplit # python 2
from urllib.parse import urlsplit # python 3
myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1] # returns 'docs.python.org'
No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.
>>> url = 'youtube.com/bla/foo'
>>> urlparse.urlsplit('//' + url)[1]
> 'youtube.com'
Just a crazy alternative solution using tldextract:
>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'
For your particular input, you could use str.partition() or str.split():
print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com
Note: urlparse module (that you could use in general to parse an url) doesn't work in this case:
import urlparse
urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
# query='', fragment='')
In general, it is safe to use a regex here if you know that all lines start with a hostname and otherwise each line contains a well-formed uri:
import re
print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))
Output
youtube.com
yahoo.com
youtube.com
google.com
youtube.com
Note: it doesn't remove the optional port part -- host:port.