Python url construction: escape characters other than regular letters - python

I am using wikipedia api and using following api request,
http://en.wikipedia.org/w/api.php?`action=query&meta=globaluserinfo&guiuser='$cammer'&guiprop=groups|merged|unattached&format=json`
but the problem is I am unable to escape Dollar Sign and similar characters like that, I tried the following but it didn't work,
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found it this in w3school but checking this for every single character would a pain full, what would be the best way to escape this in the strip.http://www.w3schools.com/tags/ref_urlencode.asp

You should take a look at using urlencode.
from urllib import urlencode
base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
meta="globaluserinfo",
guiuser="$cammer",
guiprop="groups|merged|unattached",
format="json")
url = base_url + urlencode(arguments)
If you don't need to build a complete url you can just use the quote function for a single string:
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')

Related

'ascii' codec can't encode characters in position 10-12: ordinal not in range(128) [duplicate]

I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed by a URL containing non-ascii characters using Python?
edit: In other words, can / how urlopen open a URL like:
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
import re, urlparse
def urlEncodeNonAscii(b):
return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)
def iriToUri(iri):
parts= urlparse.urlparse(iri)
return urlparse.urlunparse(
part.encode('idna') if parti==1 else urlEncodeNonAscii(part.encode('utf-8'))
for parti, part in enumerate(parts)
)
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass# prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
In python3, use the urllib.parse.quote function on the non-ascii string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use
urllib.parse.urlsplit to split the URL into its components, and
urllib.parse.quote to properly quote/escape the unicode characters
and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than the accepted #bobince's answer suggests:
netloc should be encoded using IDNA;
non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
non-ascii query parameters should be encoded to the encoding of a page URL was extracted from (or to the encoding server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check if a URL escaping implementation is incorrect/incomplete is to check if it provides 'page encoding' argument or not.
Based on #darkfeline answer:
from urllib.parse import urlsplit, urlunsplit, quote
def iri2uri(iri):
"""
Convert an IRI to a URI (Python 3).
"""
uri = ''
if isinstance(iri, str):
(scheme, netloc, path, query, fragment) = urlsplit(iri)
scheme = quote(scheme)
netloc = netloc.encode('idna').decode('utf-8')
path = quote(path)
query = quote(query)
fragment = quote(fragment)
uri = urlunsplit((scheme, netloc, path, query, fragment))
return uri
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
Use iri2uri method of httplib2. It makes the same thing as by bobin (is he/she the author of that?)
Another option to convert an IRI to an ASCII URI is to use furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
works! finally
I could not avoid from this strange characters, but at the end I come through it.
import urllib.request
import os
url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
file.write(str(html.decode('utf-8')))
os.system("marketingturismo.html")

Delete only a specific query from an URL

SO I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I'd need a function that can strip just query2 for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
I think maybe urllib.parse or furl can do this in an easy and clean way?
You should use urllib.parse as it's designed exactly for these purposes. I'm unclear the reason for anyone reinventing the wheel here.
Basically 3 steps:
Use urlparse to parse the url into it's component parts
Use parse_qs to parse the query string part of that keeping blanks (if relevant intact)
Remove the erroneous query2 and re-encode the query string and url back
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del(query_string['query2'])
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>>['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>>{'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>>https://foo.bar?query1=value1&query3=value3
If you want by position:
url="https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1=url.find("&")
findindex2=url.find("&",findindex1+1)
url=url[0:findindex1]+url[findindex2:len(url)]
if you want by the name:
url="https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1=url.find("query2")
findindex2=url.find("&",findindex1+1)
if findindex2==-1:
url=url[0:findindex1-1]
else:
url=url[0:findindex1-1]+url[findindex2:len(url)]
Hi you could try it with regular expressions.
re.sub("ThePatternOfTheURL","ThePatternYouWantToHave", "TheInput")
so it could look something like that
pattern = "'(https\:\/\/)([a-zA-Z.?0-9=]+)([&]query2=value2)([&][a-zA-Z0-9=]+)'"
#filters the third group out with query2
filter = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL=re.sub(pattern, filter, yourUrl)
I think this should work for you

How to print only a specific link in Python

I'm still a newbie in Python but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) istead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it's always the same model just changing the stream id represented as X:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation:
here in the expression portion .*? means to consider everything and whatever enclosed in \b(expr)\b needs to be present there mandatorily.
For e.g.:
import re
link="https://website.tv/live/streamidXXXXXXXXX.m3u8"
p=re.findall(r'.*?\b.m3u8\b',link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
There are a few ways to go about this, one that springs to mind which others have touched upon is using regex with findall that returns back a list of matched urls from our url_list.
Another option could also be BeautifulSoup but without more information regarding the html structure it may not be the best tool here.
Using Regex
from re import findall
from requests import get
def check_link(response):
result = findall(
r'.*?\b.m3u8\b',
str(response.content),
)
return result
def main(url):
response = get(url)
if response.ok:
link_found = check_link(response)
if link_found:
print('link {} found at {}'.format(
link_found,
url,
),
)
if __name__ == '__main__':
url_list = [
'http://www.test_1.com',
'http://www.test_2.com',
'http://www.test_3.com',
]
for url in url_list:
main(url)
print("All finished")
If I understand your question correctly I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8" then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
The calling .split('/') on the string like that will return a list of strings where each item in the list is a different part of the string (first part being "https:", etc.). The last one of these (index [-1]) will be the file extension you want.
This will extract all URLs from webpage and filter only those which contain your required keyword ".m3u8"
import requests
import re
def get_desired_url(data):
urls = []
for url in re.findall(r'(https?://\S+)', data):
if ".m3u8" in url:
urls.append(url)
return urls
channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1 )
Try this, I think this will be robust
import re
links=[re.sub('^<[ ]*a[ ]+.*href[ ]*=[ ]*', '', re.sub('.*>$', '', link) for link in re.findall(r'<[ ]*a[ ]+.*href[ ]*=[]*"http[s]*://.+\.m3u8".*>',channel2.content)]

Python- regular expression to print word within link

I am using Jupyter Notebook to get docid=PE209374738 as my output using reg ex. It is currently stored in a dictionary in this format:
{'Url': 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'}.
This is my code:
results= xmldoc.getElementsByTagName("result")
dict= {}
for a in results:
url= 'Url'
dict[url] = a.getElementsByTagName("url")[0].childNodes[0].nodeValue
docid= re.search(r'\?(.*?)&')
Does anyone have any suggestions on how to print that id?
The standard library already has methods for parsing URLs properly, no need for regex.
In Python 3:
from urllib.parse import urlparse, parse_qs
url = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
print(parse_qs(urlparse(url).query)['docid'][0]) # PE209374738
In Python 2 the first line is:
from urlparse import urlparse, parse_qs
#alex-hall is correct, you probably should better parse this using a proper URL parser.
That said, your original question was about doing it with using regexps, so here is the solution (which you nearly nailed already):
s = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
m = re.search(r'\?docid=(.*?)&', s)
print m.groups()[0]
This will print the desired PE209374738.

Url decode UTF-8 in Python

In Python 2.7, given a URL like example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, how can I decode it to the expected result, example.com?title==правовая+защита?
I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.
The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode, with urllib.parse.unquote(), which handles decoding from percent-encoded data to UTF-8 bytes and then to text, transparently:
from urllib.parse import unquote
url = unquote(url)
Demo:
>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:
from urllib import unquote
url = unquote(url).decode('utf8')
If you are using Python 3, you can use urllib.parse.unquote:
url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""
import urllib.parse
urllib.parse.unquote(url)
gives:
'example.com?title=правовая+защита'
You can achieve an expected result with requests library as well:
import requests
url = "http://www.mywebsite.org/Data%20Set.zip"
print(f"Before: {url}")
print(f"After: {requests.utils.unquote(url)}")
Output:
$ python3 test_url_unquote.py
Before: http://www.mywebsite.org/Data%20Set.zip
After: http://www.mywebsite.org/Data Set.zip
Might be handy if you are already using requests, without using another library for this job.
In HTML the URLs can contain html entities.
This replaces them, too.
#from urllib import unquote #earlier python version
from urllib.request import unquote
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
I know this is an old question, but I stumbled upon this via Google search and found that no one has proposed a solution with only built-in features.
So I quickly wrote my own.
Basically a url string can only contain these characters: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], #, !, $, &, ', (, ), *, +, ,, ;, %, and =, everything else are url encoded.
URL encoding is pretty straight forward, just a percent sign followed by the hexadecimal digits of the byte values corresponding to the codepoints of illegal characters.
So basically using a simple while loop to iterate the characters, add any character's byte as is if it is not a percent sign, increment index by one, else add the byte following the percent sign and increment index by three, accumulate the bytes and decoding them should work perfectly.
Here is the code:
def url_parse(url):
l = len(url)
data = bytearray()
i = 0
while i < l:
if url[i] != '%':
d = ord(url[i])
i += 1
else:
d = int(url[i+1:i+3], 16)
i += 3
data.append(d)
return data.decode('utf8')
I have tested it and it works perfectly.

Categories