Delete only a specific query from a URL - Python

So I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I need a function that can strip just query2, for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
I think maybe urllib.parse or furl can do this in an easy and clean way?

You should use urllib.parse, as it's designed exactly for this purpose; there's no need to reinvent the wheel here.
Basically three steps:
Use urlparse to parse the URL into its component parts
Use parse_qs to parse the query string part of that, keeping blank values intact (if relevant)
Remove the unwanted query2, re-encode the query string, and reassemble the URL
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del query_string['query2']
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>>['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>>{'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>>https://foo.bar?query1=value1&query3=value3
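If this comes up more than once, the same three steps can be wrapped in a small helper; a minimal sketch (the name remove_query_param is just illustrative, not from the answer above):
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse

def remove_query_param(url, param):
    # Split the URL, drop the given query key if present, and rebuild it
    parts = list(urlparse(url))
    query = parse_qs(parts[4], keep_blank_values=True)
    query.pop(param, None)
    parts[4] = urlencode(query, doseq=True)
    return urlunparse(parts)

print(remove_query_param("https://foo.bar?query1=value1&query2=value2&query3=value3", "query2"))
# https://foo.bar?query1=value1&query3=value3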

If you want to do it by position:
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1 = url.find("&")
findindex2 = url.find("&", findindex1 + 1)
url = url[0:findindex1] + url[findindex2:len(url)]
If you want to do it by name:
url = "https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1 = url.find("query2")
findindex2 = url.find("&", findindex1 + 1)
if findindex2 == -1:
    url = url[0:findindex1 - 1]
else:
    url = url[0:findindex1 - 1] + url[findindex2:len(url)]
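For the sample URLs above, a quick check (not part of the original answer) shows the result is the expected string:
print(url)
# https://foo.bar?query1=value1&query3=value3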

You could try it with regular expressions.
re.sub("ThePatternOfTheURL", "ThePatternYouWantToHave", "TheInput")
so it could look something like this:
import re
pattern = r"(https://)([a-zA-Z.?0-9=]+)(&query2=value2)(&[a-zA-Z0-9=]+)"
# filters the third group (query2=value2) out
replacement = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL = re.sub(pattern, replacement, yourUrl)
I think this should work for you.
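For the sample URL, printing the result should confirm the substitution:
print(newURL)
# https://foo.bar?query1=value1&query3=value3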

How to print only a specific link in Python

I'm still a newbie in Python, but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it always follows the same pattern, with only the stream id (represented as X below) changing:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation: in the expression, .*? lazily matches whatever comes before the target, and the part enclosed in \b...\b (the literal ".m3u8") must be present for the match to succeed.
For example:
import re
link = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\b\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
There are a few ways to go about this. One that springs to mind, and that others have touched on, is using regex with findall, which returns a list of the matched URLs from our url_list.
Another option could be BeautifulSoup, but without more information regarding the HTML structure it may not be the best tool here (a minimal sketch follows the regex example below).
Using Regex
from re import findall
from requests import get

def check_link(response):
    # Pull any text ending in ".m3u8" out of the response body
    result = findall(
        r'.*?\b\.m3u8\b',
        str(response.content),
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
            ))

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly, I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8", then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string like that returns a list of strings, where each item is a different part of the original string (the first part being "https:", etc.). The last one of these (index [-1]) will be the file name you want.
This will extract all URLs from the webpage and keep only those containing your required keyword ".m3u8":
import requests
import re
def get_desired_url(data):
    urls = []
    # Find every http(s) URL in the page and keep those containing ".m3u8"
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)
Try this; I think this will be robust:
import re
links = [re.sub(r'^<[ ]*a[ ]+.*href[ ]*=[ ]*"', '', re.sub(r'"[ ]*>$', '', link))
         for link in re.findall(r'<[ ]*a[ ]+.*href[ ]*=[ ]*"http[s]*://.+?\.m3u8"[ ]*>', channel2.text)]

How to convert URL to subdomain using Python

So I have a list of URLs in a urls.txt file, containing URLs like the examples given below:
https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html
https://nikpeachey.blogspot.com/2020/01/digital-tools-for-teachers-trainers.html
https://blogurls245.blogspot.com/
Now I want to convert all URLs in that urls.txt to just the subdomain, like the examples given below:
https://benetech.blogspot.com
https://nikpeachey.blogspot.com
https://blogurls245.blogspot.com
I tried to do it using the TLD module, but being an extreme beginner in Python I couldn't figure it out.
It'd be great if someone could help me get this done in Python.
Use the urllib.parse module to parse the URL into its constituent parts and assemble it back together, omitting parts you're not interested in:
from urllib.parse import urlsplit, urlunsplit
url = 'https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html'
base = urlunsplit(urlsplit(url)[:2] + ('', '', ''))
print(base) # https://benetech.blogspot.com
Using the urllib.parse module from the standard library (note that urlparse returns a named tuple, so its fields can't be assigned directly; use _replace instead):
import urllib.parse
url_parts = urllib.parse.urlparse(url)
url_parts = url_parts._replace(path="", query="", fragment="")
domain_only_url = urllib.parse.urlunparse(url_parts)
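With url set to one of the example links from the question, this should print the same base as the other answers:
print(domain_only_url)
# https://benetech.blogspot.com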
from urllib.parse import urlparse
sample_url = 'https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html'
parsed_url = urlparse(sample_url)
subdomain = f'{parsed_url.scheme}://{parsed_url.hostname}'
print(subdomain)
Output:
https://benetech.blogspot.com
Do it like this:
url = 'https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html'
parts = url.split('/')
subdomain = parts[0] + '//' + parts[2]
subdomain will be --> https://benetech.blogspot.com
split('/') will split the string into parts on '/'.
e.g. 'my/name/is/Amirreza' becomes ['my', 'name', 'is', 'Amirreza']
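Since the question is about a whole urls.txt file rather than a single URL, here is a minimal sketch of applying the urlsplit approach above line by line (the output file name subdomains.txt is just an assumption):
from urllib.parse import urlsplit, urlunsplit

with open('urls.txt') as infile, open('subdomains.txt', 'w') as outfile:
    for line in infile:
        url = line.strip()
        if not url:
            continue
        # Keep only scheme and netloc, drop path/query/fragment
        base = urlunsplit(urlsplit(url)[:2] + ('', '', ''))
        outfile.write(base + '\n')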

How to extract some data from a URL using Python

I have a URL as follows:
https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400
What I need is to get the string after v2/, thus ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=
I use furl to extract the parameter value. I do it as follows:
furl(url).args['category']  # gives PASSENGER
But here I do not have the name of the parameter.
How can I do that?
If you don't need a generalized solution, but only one for the URL you have provided in the question, then you can do the following:
url="https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400"
answer=url.split('/')[5]
Use the following code:
l=url.split('/')
m=l[l.index('v2')+1]
print(m)
To get the desired output using re:
import re
url = "https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400"
re.findall(r'v2/(.*)/', url)
Resulting with ['ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0='].
But it's safer to use split() the way others mentioned, because when the API version changes to v3 this re code won't work anymore.
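If you do stay with re, one small variation (my own tweak, not from the answers above) is to match any version segment instead of the literal v2, so a bump to v3 would still work:
import re
re.findall(r'/v[0-9]+/([^/?]+)', url)
# ['ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=']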
The string that you are after is not a query parameter, it is part of the URL path.
In the general case you can use the urllib.parse module to parse the URL into its components, then access the path. Then extract the required part of the path:
import base64
from urllib.parse import urlparse, parse_qs
parsed_url = urlparse(url)
s = parsed_url.path.split('/')[-2] # second last component of path
>>> s
'ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0='
>>> base64.b64decode(s)
b'eLNfekw2jMDqWm0CDoFWzK+DAchq8p0eglRxD4+g2LxdArxObu3WZQ=='
The keys and values of the query string can also be processed into a dictionary and accessed by key:
params = parse_qs(parsed_url.query)
>>> params
{'category': ['PASSENGER'], 'make': ['30'], 'model': ['124'], 'regmonth': ['3'], 'regdate': ['2015-03'], 'body': ['443,4781'], 'facelift': ['252'], 'seats': ['4'], 'bodyHeight': ['443'], 'bodyLength': ['443'], 'weight': ['-1'], 'engine': ['1394'], 'wheeldrive': ['196'], 'transmission': ['400']}
>>> params['category']
['PASSENGER']
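Since the question already uses furl, the same path segment should also be reachable through it; a minimal sketch, assuming furl exposes the parsed path as f.path.segments (worth double-checking against the furl docs):
from furl import furl

f = furl(url)
# The path segments for this URL should be ['vivi', 'v2', '<token>', 'BE'],
# so the base64 token is the third segment (index 2)
token = f.path.segments[2]
print(token)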

Python url construction: escape characters other than regular letters

I am using the Wikipedia API with the following API request:
http://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser='$cammer'&guiprop=groups|merged|unattached&format=json
but the problem is that I am unable to escape the dollar sign and similar characters. I tried the following, but it didn't work:
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found this in the w3schools reference below, but checking it for every single character would be painful. What would be the best way to escape these in the string? http://www.w3schools.com/tags/ref_urlencode.asp
You should take a look at using urlencode.
from urllib import urlencode
base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
                 meta="globaluserinfo",
                 guiuser="$cammer",
                 guiprop="groups|merged|unattached",
                 format="json")
url = base_url + urlencode(arguments)
If you don't need to build a complete url you can just use the quote function for a single string:
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')
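Both snippets above are Python 2; in Python 3 the same functions live in urllib.parse, so rough equivalents (reusing base_url, arguments, and r from above) would be:
from urllib.parse import urlencode, quote

url = base_url + urlencode(arguments)
r['guiuser'] = quote('$cammer')  # '%24cammer'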

Parsing hostname and port from string or url

I can be given a string in any of these formats:
url: e.g. http://www.acme.com:456
string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and, if present, a port. If the port value is not present, I would like it to default to 80.
I have tried urlparse, which works fine for the URL, but not for the other formats. When I use urlparse on hostname:port, for example, it puts the hostname in the scheme rather than the netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
You can use urlparse to get the hostname from a URL string:
from urlparse import urlparse
print urlparse("http://www.website.com/abc/xyz.html").hostname # prints www.website.com
>>> from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
>>>
I'm not that familiar with urlparse, but using regex you'd do something like:
import re
p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match 'www.abc.com 123'
The reason it fails for:
www.acme.com 456
is that it is not a valid URI. Why don't you just:
Replace the space with a :
Parse the resulting string using the standard urlparse method
Try to make use of default functionality as much as possible, especially when it comes to parsing well-known formats like URIs.
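A minimal sketch of that idea, combined with the default port of 80 the question asks for (the helper name and the bare-hostname handling are my own assumptions, not from this answer); prefixing a scheme-less string with // makes urlparse put the host into netloc instead of scheme:
from urllib.parse import urlparse

def host_and_port(s, default_port=80):
    s = s.strip().replace(' ', ':')   # "www.acme.com 456" -> "www.acme.com:456"
    if '//' not in s:
        s = '//' + s                  # force urlparse to treat it as a netloc
    parsed = urlparse(s)
    return parsed.hostname, parsed.port or default_port

print(host_and_port('http://www.acme.com:456'))  # ('www.acme.com', 456)
print(host_and_port('www.acme.com 456'))         # ('www.acme.com', 456)
print(host_and_port('www.acme.com'))             # ('www.acme.com', 80)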
Method using urllib -
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output -
ParseResult(scheme='https', netloc='stackoverflow.com',
path='/questions', params='', query='', fragment='')
Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
