How to extract some data from url using Python

How to extract some data from url using Python - python

I have an url as follows:
https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400
What I need is to get the string after v2/, thus ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=
I use furl to extract the parameter value. I do it as follows:
furl(url).args['category'] // gives PASSENGER
But here I do not have the name of the parameter.
How can I do that?

If you don't need a generalized solution but for the url you have provided in question. Then you can do the following:
url="https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400"
answer=url.split('/')[5]

Use following code:
l=url.split('/')
m=l[l.index('v2')+1]
print(m)

Desired output using re.
import re
url = "https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400"
re.findall(r'v2/(.*)/', url)
Resulting with ['ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0='].
But it's safer to use split() the way other mentioned, because when api version changes to v3 this re code won't work anymore.

The string that you are after is not a query parameter, it is part of the URL path.
In the general case you can use the urllib.parse module to parse the URL into its components, then access the path. Then extract the required part of the path:
import base64
from urllib.parse import urlparse, parse_qs
parsed_url = urlparse(url)
s = parsed_url.path.split('/')[-2] # second last component of path
>>> s
'ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0='
>>> base64.b64decode(s)
b'eLNfekw2jMDqWm0CDoFWzK+DAchq8p0eglRxD4+g2LxdArxObu3WZQ=='
The keys and values of the query string can also be processed into a dictionary and accessed by key:
params = parse_qs(parsed_url.query)
>>> params
{'category': ['PASSENGER'], 'make': ['30'], 'model': ['124'], 'regmonth': ['3'], 'regdate': ['2015-03'], 'body': ['443,4781'], 'facelift': ['252'], 'seats': ['4'], 'bodyHeight': ['443'], 'bodyLength': ['443'], 'weight': ['-1'], 'engine': ['1394'], 'wheeldrive': ['196'], 'transmission': ['400']}
>>> params['category']
['PASSENGER']

Related

urlparse doesn't return params for custom schema

I am trying to use urlparse Python library to parse some custom URIs.
I noticed that for some well-known schemes params are parsed correctly:
>>> from urllib.parse import urlparse
>>> urlparse("http://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='http', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
>>> urlparse("ftp://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='ftp', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')
...but for custom ones - they are not. params field remains empty. Instead, params are treated as a part of path:
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint;param1=value1;param2=othervalue2', params='', query='query1=val1&query2=val2', fragment='fragment')
Why there is a difference in parsing depending on schema? How can I parse params within urlparse library using custom schema?

This is because urlparse assumes that only a set of schemes will uses parameters in their URL format. You can see that check with in the source code.
if scheme in uses_params and ';' in url:
url, params = _splitparams(url)
else:
params = ''
Which means urlparse will attempt to parse parameters only if the scheme is in uses_params (which is a list of known schemes).
uses_params = ['', 'ftp', 'hdl', 'prospero', 'http', 'imap',
'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips',
'mms', 'sftp', 'tel']
So to get the expected output you can append your custom scheme into uses_params list and perform the urlparse call again.
>>> from urllib.parse import uses_params, urlparse
>>>
>>> uses_params.append('scheme')
>>> urlparse("scheme://some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='scheme', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')

Can you remove that custom schemes from the url?
That allways will return the params
urlparse("//some.domain/some/nested/endpoint;param1=value1;param2=othervalue2?query1=val1&query2=val2#fragment")
ParseResult(scheme='', netloc='some.domain', path='/some/nested/endpoint', params='param1=value1;param2=othervalue2', query='query1=val1&query2=val2', fragment='fragment')

Delete only a specific query from an URL

SO I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I'd need a function that can strip just query2 for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
I think maybe urllib.parse or furl can do this in an easy and clean way?

You should use urllib.parse as it's designed exactly for these purposes. I'm unclear the reason for anyone reinventing the wheel here.
Basically 3 steps:
Use urlparse to parse the url into it's component parts
Use parse_qs to parse the query string part of that keeping blanks (if relevant intact)
Remove the erroneous query2 and re-encode the query string and url back
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del(query_string['query2'])
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>>['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>>{'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>>https://foo.bar?query1=value1&query3=value3

If you want by position:
url="https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1=url.find("&")
findindex2=url.find("&",findindex1+1)
url=url[0:findindex1]+url[findindex2:len(url)]
if you want by the name:
url="https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1=url.find("query2")
findindex2=url.find("&",findindex1+1)
if findindex2==-1:
url=url[0:findindex1-1]
else:
url=url[0:findindex1-1]+url[findindex2:len(url)]

Hi you could try it with regular expressions.
re.sub("ThePatternOfTheURL","ThePatternYouWantToHave", "TheInput")
so it could look something like that
pattern = "'(https\:\/\/)([a-zA-Z.?0-9=]+)([&]query2=value2)([&][a-zA-Z0-9=]+)'"
#filters the third group out with query2
filter = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL=re.sub(pattern, filter, yourUrl)
I think this should work for you

Python- regular expression to print word within link

I am using Jupyter Notebook to get docid=PE209374738 as my output using reg ex. It is currently stored in a dictionary in this format:
{'Url': 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'}.
This is my code:
results= xmldoc.getElementsByTagName("result")
dict= {}
for a in results:
url= 'Url'
dict[url] = a.getElementsByTagName("url")[0].childNodes[0].nodeValue
docid= re.search(r'\?(.*?)&')
Does anyone have any suggestions on how to print that id?

The standard library already has methods for parsing URLs properly, no need for regex.
In Python 3:
from urllib.parse import urlparse, parse_qs
url = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
print(parse_qs(urlparse(url).query)['docid'][0]) # PE209374738
In Python 2 the first line is:
from urlparse import urlparse, parse_qs

#alex-hall is correct, you probably should better parse this using a proper URL parser.
That said, your original question was about doing it with using regexps, so here is the solution (which you nearly nailed already):
s = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
m = re.search(r'\?docid=(.*?)&', s)
print m.groups()[0]
This will print the desired PE209374738.

Parsing hostname and port from string or url

I can be given a string in any of these formats:
url: e.g http://www.acme.com:456
string: e.g www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and if present a port. If the port value is not present I would like it to default to 80.
I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.

You can use urlparse to get hostname from URL string:
from urlparse import urlparse
print urlparse("http://www.website.com/abc/xyz.html").hostname # prints www.website.com

>>> from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
>>>

I'm not that familiar with urlparse, but using regex you'd do something like:
p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match 'www.abc.com 123'

The reason it fails for:
www.acme.com 456
is because it is not a valid URI. Why don't you just:
Replace the space with a :
Parse the resulting string by using the standard urlparse method
Try and make use of default functionality as much as possible, especially when it comes to things like parsing well know formats like URI's.

Method using urllib -
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output -
ParseResult(scheme='https', netloc='stackoverflow.com',
path='/questions', params='', query='', fragment='')
Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python

How to add custom parameters to an URL query string with Python?

I need to add custom parameters to an URL query string using Python
Example:
This is the URL that the browser is fetching (GET):
/scr.cgi?q=1&ln=0
then some python commands are executed, and as a result I need to set following URL in the browser:
/scr.cgi?q=1&ln=0&SOMESTRING=1
Is there some standard approach?

You can use urlsplit() and urlunsplit() to break apart and rebuild a URL, then use urlencode() on the parsed query string:
from urllib import urlencode
from urlparse import parse_qs, urlsplit, urlunsplit
def set_query_parameter(url, param_name, param_value):
"""Given a URL, set or replace a query parameter and return the
modified URL.
>>> set_query_parameter('http://example.com?foo=bar&biz=baz', 'foo', 'stuff')
'http://example.com?foo=stuff&biz=baz'
"""
scheme, netloc, path, query_string, fragment = urlsplit(url)
query_params = parse_qs(query_string)
query_params[param_name] = [param_value]
new_query_string = urlencode(query_params, doseq=True)
return urlunsplit((scheme, netloc, path, new_query_string, fragment))
Use it as follows:
>>> set_query_parameter("/scr.cgi?q=1&ln=0", "SOMESTRING", 1)
'/scr.cgi?q=1&ln=0&SOMESTRING=1'

Use urlsplit() to extract the query string, parse_qsl() to parse it (or parse_qs() if you don't care about argument order), add the new argument, urlencode() to turn it back into a query string, urlunsplit() to fuse it back into a single URL, then redirect the client.

You can use python's url manipulation library furl.
import furl
f = furl.furl("/scr.cgi?q=1&ln=0")
f.args['SOMESTRING'] = 1
print(f.url)

import urllib
url = "/scr.cgi?q=1&ln=0"
param = urllib.urlencode({'SOME&STRING':1})
url = url.endswith('&') and (url + param) or (url + '&' + param)
the docs

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract some data from url using Python - python

Use following code: l=url.split('/') m=l[l.index('v2')+1] print(m)

Related

urlparse doesn't return params for custom schema

Delete only a specific query from an URL

Python- regular expression to print word within link

Parsing hostname and port from string or url

How to add custom parameters to an URL query string with Python?

Categories

Resources