Encoding error whilst asynchronously scraping website - python

I have the following code here:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())
But it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the line await response.text(). I believe the problem is that the URL ends in .htm instead of .com.
Is there any way to decode it?
Note: I would not like to use response.read().

The website's headers indicate that the page should be encoded as UTF-8, but evidently it isn't:
$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm | grep -i charset
content-type: text/html; charset=UTF-8
Let's inspect the content:
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'
It looks like this should be "Fußball", which would be b'Fu\xc3\x9fball' if encoded with UTF-8.
If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings we find that it represents "ß" in any of these encodings:
cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
Without any other information, I would choose latin-1 as the encoding; however, it might be simpler to pass the raw response bytes (r.content) to Beautiful Soup and let it handle decoding.
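In the asker's aiohttp code, one way to apply this (a sketch, assuming latin-1 really is the right guess) is to pass the encoding explicitly to response.text(), which accepts an encoding keyword argument:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            # Override the incorrect UTF-8 charset declared by the server;
            # latin-1 is a guess based on the 0xdf byte found above.
            text = await response.text(encoding='latin-1')
            soup = BeautifulSoup(text, features="lxml")
            print(soup.title)

asyncio.run(main())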

Curious as to why not just use pandas here?
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
df = pd.read_html(url, header=1)[0]
df = df[df['Rk'].ne('Rk')]  # drop the repeated header rows embedded in the table

Related

Atom is giving an error when requesting data from a website

I'm requesting HTML data from a URL using the requests module in Python.
Here is my code
import requests
source = requests.get('http://coreyms.com')
print(source.text)
When I run this in Atom it gives me an error:
File "/Users/isaacrichardson/Desktop/Python/Web Scraping/wiki.py", line 7, in <module>
print(source.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 34807: ordinal not in range(128)
But when I run it in Treehouse workspaces it works fine and shows me the html data.
What's wrong with Atom or my code?
The requests library is not installed correctly for Atom, or it is not usable by it. Installing it correctly will solve the issue.
If that doesn't work I would try to use the beautiful soup module:
from bs4 import BeautifulSoup
doc = BeautifulSoup(source.text, "html.parser")
print(doc.text)
requests guesses the encoding when you access the .text attribute of the response object.
If you know the encoding of the response beforehand you should explicitly set it before accessing the .text attribute:
import requests
source = requests.get('http://coreyms.com')
source.encoding = 'utf-8' # or whatever the encoding is
print(source.text)
Alternatively, you can also work with .content to access the binary response content and decode it yourself.
You may want to verify if the encodings are indeed guessed differently in your IDEs by simply printing source.encoding.
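A quick way to make that check (a sketch; apparent_encoding is requests' guess from the response body, as opposed to the header-based guess in encoding):
import requests

source = requests.get('http://coreyms.com')
print(source.encoding)           # encoding taken from the HTTP headers
print(source.apparent_encoding)  # encoding detected from the response body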

Python 3 requests.get().text returns unencoded string

Python 3 requests.get().text returns unencoded string.
If I execute:
import requests
request = requests.get('https://google.com/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
Кто является презид
I've tried changing google.com to google.ru.
If I execute:
import requests
request = requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f+%d0%bf%d1%80%d0%b5%d0%b7%d0%b8%d0%b4%d0%b5%d0%bd%d1%82%d0%be%d0%bc+%d0%a0%d0%be%d1%81%d1%81%d0%b8%d0
I need to get a normal, properly decoded string.
You were getting this because requests was not able to identify the correct encoding of the response. So if you are sure about the response encoding, you can set it like the following:
response = requests.get(url)
print(response.encoding)     # check the encoding requests guessed
response.encoding = "utf-8"  # or any other encoding
Then get the content from the .text attribute.
I fixed it with urllib.parse.unquote() method:
import requests
from urllib.parse import unquote
request = unquote(requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower())
print(request)
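For illustration, here is a small sketch using the start of the percent-encoded output above; unquote simply decodes the %xx escapes back into UTF-8 text:
from urllib.parse import unquote

print(unquote('%d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f'))
# prints: Кто+является  (unquote_plus would also turn the '+' into a space)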

urlopen doesn't appear to work depending on how input text is generated

For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request
#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)
url1 = "http://www.google.com/"
response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could try
from urllib import request

url = "http://www.google.com/\ufeff"   # the pasted URL ends with an invisible BOM
response = request.urlopen(url.replace("\ufeff", ""))
print(response)
That Unicode character is the byte-order mark, or BOM. Decoding bytes with the utf-8-sig codec strips a leading BOM; here the stray character sits at the end of the string, so replacing it directly is the simplest fix.
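A small sketch of both approaches (the byte string and the trailing-BOM URL are made up for illustration):
# utf-8-sig drops a BOM at the start of a byte string while decoding
print(b'\xef\xbb\xbfhello'.decode('utf-8-sig'))               # hello
# for a BOM anywhere in an already-decoded string, just replace it
print('http://www.google.com/\ufeff'.replace('\ufeff', ''))   # http://www.google.com/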

'charmap' codec can't encode character error in Python while parsing HTML

Here is my code:
dataFile = open('dataFile.html', 'w')
res = requests.get('site/pm=' + str(i))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('#content')
dataFile.write(str(linkElems[0]))
I have some other code too, but this is the code that I think is problematic. I have also tried using:
dataFile.write(str(linkElems[0].decode('utf-8')))
but that does not work and gives an error.
Using dataFile = open('dataFile.html', 'wb') gives me the error:
a bytes-like object is required, not 'str'
You opened your text file without specifying an encoding:
dataFile = open('dataFile.html', 'w')
This tells Python to use the default codec for your system. Every Unicode string you try to write to it will be encoded to that codec, and your Windows system is not set up with UTF-8 as the default.
Explicitly specify the encoding:
dataFile = open('dataFile.html', 'w', encoding='utf8')
Next, you are trusting the HTTP server to know what encoding the HTML data is using. This is usually not set at all, so don't use response.text! It is not BeautifulSoup at fault here, you are re-encoding a Mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Bold emphasis mine.
Pass in the response.content raw data instead:
soup = bs4.BeautifulSoup(res.content, 'html.parser')
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = res.encoding if 'charset' in res.headers.get('content-type', '').lower() else None
soup = bs4.BeautifulSoup(res.content, 'html.parser', from_encoding=encoding)
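Putting the answer's pieces together, a minimal sketch (the URL is a hypothetical stand-in for the one elided in the question):
import bs4
import requests

res = requests.get('https://example.com/page')  # stand-in URL; substitute the real site
res.raise_for_status()

# Let BeautifulSoup detect the encoding from the raw bytes,
# and write the result out explicitly as UTF-8.
soup = bs4.BeautifulSoup(res.content, 'html.parser')
linkElems = soup.select('#content')
with open('dataFile.html', 'w', encoding='utf8') as dataFile:
    dataFile.write(str(linkElems[0]))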

Python Sometimes Returns Strange Result When Reading HTML from URL

I created a function to read HTML content from a specific URL. Here is the code:
def __retrieve_html(self, address):
    html = urllib.request.urlopen(address).read()
    Helper.log('HTML length', len(html))
    Helper.log('HTML content', html)
    return str(html)
However, the function does not always return the correct string. In some cases it returns a very long, weird string.
For example, if I use the URL http://www.merdeka.com, sometimes it gives the correct HTML string, but sometimes it returns a result like:
HTML content: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xfdyW\x1c\xb7\xd28\x8e\xffm\x9f\x93\xf7\xa0;y>\xc1\xbeA\xcc\xc2b\x03\x86\x1cl\xb0\x8d1\x86\x038yr\... (truncated; many more characters)
It seems to happen only on pages that have a lot of content. For simple pages like the Facebook.com login page and the Google.com index it never happens. What is this? Where is my mistake, and how do I handle it?
It appears the response from http://www.merdeka.com is gzip-compressed.
Give this a try:
import gzip
import urllib.request
def __retrieve_html(self, address):
with urllib.request.urlopen(address) as resp:
html = resp.read()
Helper.log('HTML length', len(html))
Helper.log('HTML content', html)
if resp.info().get('Content-Encoding') == 'gzip':
html = gzip.decompress(html)
return html
How to decode your html object, I leave as an exercise to you.
Alternatively, you could just use the Requests module: http://docs.python-requests.org/en/latest/
Install it with:
pip install requests
Then execute like:
import requests
r = requests.get('http://www.merdeka.com')
r.text
Requests didn't appear to have any trouble with the response from http://www.merdeka.com, since it transparently decompresses gzip-encoded responses.
You've got bytes instead of a string, because urllib can't decode the response for you. This could be because some sites omit the encoding declaration in their Content-Type header.
For example, google.com has:
Content-Type: text/html; charset=UTF-8
and that http://www.merdeka.com website has just:
Content-Type: text/html
So you need to manually decode the response, for example with the utf-8 encoding:
html = urllib.request.urlopen(address).read().decode('utf-8')
The problem is that you need to set the correct encoding, and if it is not in the server headers, you need to guess it somehow.
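One way to attempt that guess is to read the charset from the Content-Type header and fall back to an assumption when it is missing; a sketch (get_content_charset() returns None when, as here, the server omits the charset):
from urllib import request

with request.urlopen('http://www.merdeka.com') as resp:
    charset = resp.headers.get_content_charset() or 'utf-8'  # assumed fallback
    html = resp.read().decode(charset, errors='replace')
Note that this still ignores the gzip issue covered in the previous answer.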
See this question for more information: How to handle response encoding from urllib.request.urlopen()
PS: Consider moving from the somewhat deprecated urllib to the requests lib. It's simpler, trendier and sexier at this time :)
