Using urllib to open a url with an accent - python

I am trying to open a url using urlopen in urllib, but am getting an error due to an accent mark in the URL:
import urllib.request
import ssl
context = ssl._create_unverified_context()
url = 'https://en.wikipedia.org/wiki/Raúl_Grijalva'
page = urllib.request.urlopen(url, context=context)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfa' in position 12: ordinal not in range(128)
I found this answer suggesting adding a u to the string and encoding, but this gives me a different error:
import urllib.request
import ssl
context = ssl._create_unverified_context()
url = u'https://en.wikipedia.org/wiki/Raúl_Grijalva'
page = urllib.request.urlopen(url.encode('UTF-8'), context=context)
AttributeError: 'bytes' object has no attribute 'timeout'
I did notice that the answer uses urllib.urlopen instead of urllib.request.urlopen, and I'm not exactly sure what the difference between these is, but the former throws an error that urllib doesn't have that attribute.
How can I properly handle this character in the url?

Using parse.quote() to escape the text with the accent character seems to work:
from urllib import request, parse
import ssl
context = ssl._create_unverified_context()
url = 'https://en.wikipedia.org/'
path = parse.quote('wiki/Raúl_Grijalva')
page = request.urlopen(url + path, context=context)
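For reference, parse.quote() percent-encodes the UTF-8 bytes of any non-ASCII characters; a quick REPL check shows what it produces here:
>>> from urllib import parse
>>> parse.quote('wiki/Raúl_Grijalva')
'wiki/Ra%C3%BAl_Grijalva'
Note that quote() leaves '/' unescaped by default (its safe parameter defaults to '/'), which is why the whole path can be quoted in one call.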

Related

Encoding error whilst asynchronously scraping website

I have the following code here:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())
But it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the line await response.text(). I believe the problem is that the url ends in .htm instead of .com.
Is there any way to decode it?
Note: I would not like to use response.read().
The website's headers indicate that the page should be encoded as UTF-8, but evidently it isn't:
$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm | grep -i charset
content-type: text/html; charset=UTF-8
Let's inspect the content:
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'
It looks like this should be "Fußball", which would be b'Fu\xc3\x9fball' if encoded with UTF-8.
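A quick REPL check confirms that:
>>> 'Fußball'.encode('utf-8')
b'Fu\xc3\x9fball'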
If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings we find that it represents "ß" in any of these encodings:
cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
Without any other information, I would choose latin-1 as the encoding; however, it might be simpler to pass r.content to Beautiful Soup and let it handle decoding.
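Applied to the asker's aiohttp code, that would look something like this (a minimal sketch; latin-1 is the assumption discussed above):
soup = BeautifulSoup(await response.text(encoding='latin-1'), features="lxml")
aiohttp's ClientResponse.text() accepts an encoding override, so this keeps the shape of the original code without falling back to response.read(). Alternatively, with requests, BeautifulSoup(r.content, features="lxml") lets Beautiful Soup sniff the encoding from the raw bytes itself.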
Curious as to why not just use pandas here?
import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
df = pd.read_html(url, header=1)[0]    # first table on the page, second header row as columns
df = df[df['Rk'].ne('Rk')]             # drop the header rows repeated inside the table body

Atom is giving an error when requesting data from a website

I'm requesting HTML data from a URL using the requests module in Python.
Here is my code
import requests
source = requests.get('http://coreyms.com')
print(source.text)
When I run this in Atom it gives me an error:
File "/Users/isaacrichardson/Desktop/Python/Web Scraping/wiki.py", line 7, in <module>
print(source.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 34807: ordinal not in range(128)
But when I run it in Treehouse workspaces it works fine and shows me the html data.
What's wrong with Atom or my code?
The requests library is not installed correctly for Atom, or is not usable by it. Installing it correctly will solve the issue.
If that doesn't work I would try to use the beautiful soup module:
from bs4 import BeautifulSoup
doc = BeautifulSoup(source.text, "html.parser")
print(doc.text)
requests guesses the encoding when you access the .text attribute of the response object.
If you know the encoding of the response beforehand you should explicitly set it before accessing the .text attribute:
import requests
source = requests.get('http://coreyms.com')
source.encoding = 'utf-8' # or whatever the encoding is
print(source.text)
Alternatively, you can work with .content to access the binary response content and decode it yourself.
You may want to verify if the encodings are indeed guessed differently in your IDEs by simply printing source.encoding.
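requests also exposes the encoding it would guess from the body as apparent_encoding, so the comparison can be done directly (a small sketch):
import requests

source = requests.get('http://coreyms.com')
print(source.encoding)            # encoding derived from the HTTP headers
print(source.apparent_encoding)   # encoding guessed from the body bytes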

Python 3 requests.get().text returns unencoded string

If I execute:
import requests
request = requests.get('https://google.com/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
Кто является презид
I've tried to change google.com to google.ru
If I execute:
import requests
request = requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f+%d0%bf%d1%80%d0%b5%d0%b7%d0%b8%d0%b4%d0%b5%d0%bd%d1%82%d0%be%d0%bc+%d0%a0%d0%be%d1%81%d1%81%d0%b8%d0
I need to get a normal, properly decoded string.
You were getting this because requests was not able to identify the correct encoding of the response. If you are sure about the response encoding, you can set it explicitly:
response = requests.get(url)
print(response.encoding)       # check the encoding requests guessed
response.encoding = "utf-8"    # or whatever the correct encoding is
Then get the content with the .text attribute.
I fixed it with the urllib.parse.unquote() method:
import requests
from urllib.parse import unquote
request = unquote(requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower())
print(request)
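For illustration, here is what unquote() does to the start of the percent-encoded output above (a quick REPL check):
>>> from urllib.parse import unquote
>>> unquote('%d0%9a%d1%82%d0%be')
'Кто'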

urlopen doesn't appear to work depending on how input text is generated

For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work (the pasted string ends with an invisible character, as the EDIT below shows):
from urllib import request

url = "http://www.google.com/"    # pasted from the address bar; ends with an invisible U+FEFF

response = request.urlopen(url)
print(response)

#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request

#url = "http://www.google.com/"    # the pasted string with the trailing invisible U+FEFF
#
#response = request.urlopen(url)
#print(response)

url1 = "http://www.google.com/"    # typed by hand

response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could try stripping the character out:
from urllib import request

url = "http://www.google.com/\ufeff"    # the escape makes the invisible pasted character explicit
response = request.urlopen(url.replace("\ufeff", ""))
print(response)
That Unicode character is the byte order mark (BOM). The utf-8-sig codec strips a leading BOM when decoding bytes, but this string is already decoded and the character sits at the end, so replacing it directly is the simplest fix.
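For contrast, a quick REPL check shows that utf-8-sig only strips a leading BOM when decoding bytes, which is why it does not help with this trailing character:
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
>>> 'http://www.google.com/\ufeff'.encode('utf-8').decode('utf-8-sig')
'http://www.google.com/\ufeff'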

UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER

I am using Python + bs4 + PySide in the code; please look at the part of the code below:
#coding:gb2312
import urllib2
import sys
import urllib
import urlparse
import random
import time
from datetime import datetime, timedelta
import socket
from bs4 import BeautifulSoup
import lxml.html
from PySide.QtGui import *
from PySide.QtCore import *
from PySide.QtWebKit import *

def download(self, url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers or {})
    opener = self.opener or urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except Exception as e:
        print 'Download error:', str(e)
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                return self._get(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return {'html': html, 'code': code}

def crawling_hdf(openfile):
    filename = open(openfile, 'r')
    namelist = filename.readlines()
    app = QApplication(sys.argv)
    for name in namelist:
        url = "http://so.haodf.com/index/search?type=doctor&kw=" + urllib.quote(name)
        # get doctor's home page
        D = Downloader(delay=DEFAULT_DELAY, user_agent=DEFAULT_AGENT, proxies=None, num_retries=DEFAULT_RETRIES, cache=None)
        html = D(url)
        soup = BeautifulSoup(html)
        tr = soup.find(attrs={'class': 'docInfo'})
        td = tr.find(attrs={'class': 'docName font_16'}).get('href')
        print td
        # get doctor's detail information page
        loadPage_bs4(td)
    filename.close()

if __name__ == '__main__':
    crawling_hdf("name_list.txt")
After I run the program, it shows a warning message:
Warning (from warnings module):
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 231
"Some characters could not be decoded, and were "
UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I have used print str(html) and found that all the Chinese text inside the tags is garbled.
I have tried the "decode or encode" and "gzip" solutions found on this site, but they don't work in my case.
Thank you very much for your help!
It looks like that page is encoded in gbk. The problem is that there is no direct conversion between utf-8 and gbk (that I am aware of).
I've seen this workaround used before, try:
html.encode('latin-1').decode('gbk').encode('utf-8')
GBK is one of the encodings built into Python's codecs module.
That means that anywhere you have a string of raw bytes, you can call its decode method with the appropriate codec name (or its alias) to convert it to a native Unicode string.
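For example, in the Python 2 used by the question (a minimal sketch; the byte values are the GBK encoding of 你好):
raw = '\xc4\xe3\xba\xc3'      # GBK-encoded bytes for the characters
text = raw.decode('gbk')      # -> u'\u4f60\u597d', a native unicode string
print text.encode('utf-8')    # re-encode as UTF-8 for output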
The following works (adapted from https://stackoverflow.com/q/36530578/2564301), insofar as the returned text does not contain 'garbage' or 'unknown' characters and is indeed encoded differently than the source page (as verified by saving it as a new file and comparing the values of the Chinese characters).
from urllib import request

def scratch(url, encoding='gbk'):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = request.Request(url, headers=headers)
    result = request.urlopen(req)
    page = result.read()
    u_page = page.decode(encoding)    # the page is GBK-encoded, whatever its headers claim
    result.close()
    print(u_page)
    return u_page

page = scratch('http://so.haodf.com/index/search')
print(page)
