requests-HTML encoding error - python

Hello guys,
When I run this code:
from requests_html import HTMLSession
url = 'http://www.spell.org.br/documentos/resultadobusca/?eou%5B%5D=&tipo_busca=simples&campo%5B%5D=RESUMO&texto%5B%5D='\
+ parsekeyword(keyword) +\
'&eou%5B%5D=E&campo%5B%5D=TITULO&texto%5B%5D=&eou%5B%5D=E&campo%5B%5D=TITULO&texto%5B%5D=&mes_inicio=&ano_inicio=&mes_fim=&ano_fim=&qtd_reg_pagina=20&pagina=2'
session = HTMLSession()
link = session.get(url)
linkslist = list(link.html.absolute_links)
I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 91835: invalid continuation byte
I think it's because of non utf-8 characters in some links.
Since it's happening inside the method, is there a way to handle this?
I'm a beginner, so I'm sorry if I missed something obvious.

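In Python 3 with requests, you can decode the body yourself with response.content.decode('utf-8'), where response is your link variable. If the body contains bytes that are not valid UTF-8, a strict decode raises the same error, so pass errors='replace' or use the page's actual encoding.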
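A minimal sketch of that idea, assuming you already have the raw body as bytes (for example from response.content). The helper name decode_body and the sample bytes are made up for illustration; only bytes.decode() and its errors argument come from the standard library.

```python
def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a response body, replacing bytes that are invalid in `encoding`."""
    try:
        # Strict decoding first, so well-formed pages are returned unchanged.
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fall back to a lossy decode: each bad byte becomes U+FFFD.
        return raw.decode(encoding, errors="replace")


# Hypothetical sample: valid UTF-8 for "café" followed by a stray 0xc3 byte.
sample = b"caf\xc3\xa9 and a stray \xc3 byte"
text = decode_body(sample)
print(text)
```

The replacement character marks where data was lost, so this is best suited to cases where only the links or the surrounding markup matter.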

I had the same problem.
I ran the following command and it solved the problem.
pip uninstall requests-html
pip install requests-html

Related

Encoding error whilst asynchronously scraping website

I have the following code here:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
async def main():
async with aiohttp.ClientSession() as session:
async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
soup = BeautifulSoup(await response.text(), features="lxml")
print(soup)
asyncio.run(main())
But, it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte for the line await response.text(). I believe the problem is that the url ends in a .htm instead of a .com.
Is there any way to decode it?
Note: I would not like to use response.read().
The website's headers indicate that the page should be encoded as UTF-8, but evidently it isn't:
$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm | grep -i charset
content-type: text/html; charset=UTF-8
Let's inspect the content:
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'
It looks like this should be "Fußball", which would be b'Fu\xc3\x9fball' if encoded with UTF-8.
If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings we find that it represents "ß" in any of these encodings:
cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
Without any other information, I would choose latin-1 as the encoding; however, it might be simpler to pass the raw bytes (r.content here) to Beautiful Soup and let it handle decoding.
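The table lookup can be approximated in a few lines. This is only a sketch: the candidate list is a small hand-picked subset chosen for illustration (it includes two codecs, cp437 and koi8_r, that do not appear in the list above), and the raw bytes are the slice shown earlier.

```python
# The bytes around the failing position, as shown in the slice above.
raw = b'/">Fu\xdfball'

# Hand-picked candidate codecs (an assumption, not an exhaustive list).
candidates = ["cp1250", "cp1252", "latin_1", "iso8859_15", "cp437", "koi8_r"]

# Keep the codecs in which the single byte 0xdf decodes to "ß".
matches = [
    name for name in candidates
    if b"\xdf".decode(name, errors="replace") == "ß"
]
print(matches)

# Decoding the slice with one of the matching codecs gives readable text.
decoded = raw.decode("latin_1")
print(decoded)
```

Several codecs agree on this one byte, so a single character cannot tell them apart; more sample text is needed to choose between them.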
Curious as to why not just use pandas here?
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
# header=1 takes the column names from the second header row
df = pd.read_html(url, header=1)[0]
# drop the header rows that are repeated inside the table body
df = df[df['Rk'].ne('Rk')]

Using urllib to open a url with an accent

I am trying to open a url using urlopen in urllib, but am getting an error due to an accent mark in the URL:
import urllib
import ssl
context = ssl._create_unverified_context()
url = 'https://en.wikipedia.org/wiki/Raúl_Grijalva'
page = urllib.request.urlopen(url, context=context)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfa' in position 12: ordinal not in range(128)
I found this answer suggesting adding a u to the string and encoding, but this gives me a different error:
import urllib
import ssl
context = ssl._create_unverified_context()
url = u'https://en.wikipedia.org/wiki/Raúl_Grijalva'
page = urllib.request.urlopen(url.encode('UTF-8'), context=context)
AttributeError: 'bytes' object has no attribute 'timeout'
I did notice in that answer they use urllib.urlopen instead of urllib.request.urlopen and I'm not exactly sure what the difference between these is, but the former throws an error that urllib doesn't have that attribute.
How can I properly handle this character in the url?
Using parse.quote() to escape the text with accent character seems to work:
from urllib import request, parse
import ssl
context = ssl._create_unverified_context()
url = 'https://en.wikipedia.org/'
path = parse.quote('wiki/Raúl_Grijalva')
page = request.urlopen(url + path, context=context)
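If you start from a complete URL rather than a separate base and path, a small helper can split it, percent-encode only the path, and put it back together. This is a sketch under that assumption; the name quote_url_path is hypothetical, and it does not handle non-ASCII host names or query strings.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path component of `url`."""
    parts = urlsplit(url)
    # quote() leaves "/" unescaped by default, so the path structure is kept.
    return urlunsplit(parts._replace(path=quote(parts.path)))


safe_url = quote_url_path("https://en.wikipedia.org/wiki/Raúl_Grijalva")
print(safe_url)
```

The result can then be passed to request.urlopen() in place of url + path.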

Atom is giving an error when requesting data from a website

I'm requesting HTML data from a URL using the requests module in Python.
Here is my code
import requests
source = requests.get('http://coreyms.com')
print(source.text)
When I run this in Atom, I get an error:
File "/Users/isaacrichardson/Desktop/Python/Web Scraping/wiki.py", line 7, in <module>
print(source.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 34807: ordinal not in range(128)
But when I run it in Treehouse workspaces it works fine and shows me the html data.
What's wrong with Atom or my code?
Either the requests library is not installed correctly for Atom, or it is not usable from it. Installing it correctly should solve the issue.
If that doesn't work I would try to use the beautiful soup module:
from bs4 import BeautifulSoup
doc = BeautifulSoup(source.text, "html.parser")
print(doc.text)
requests guesses the encoding when you access the .text attribute of the response object.
If you know the encoding of the response beforehand you should explicitly set it before accessing the .text attribute:
import requests
source = requests.get('http://coreyms.com')
source.encoding = 'utf-8' # or whatever the encoding is
print(source.text)
Alternatively, you can work with .content to access the binary response content and decode it yourself.
You may want to verify if the encodings are indeed guessed differently in your IDEs by simply printing source.encoding.
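The traceback names the 'ascii' codec at the print call, so alongside source.encoding it may also be worth printing the encoding of the output stream. The sketch below reproduces the failure without any network access. The helper name printable and the sample string are made up for illustration, and whether Atom's runner really uses an ASCII stdout is an assumption to verify on your machine.

```python
import sys


def printable(text: str, encoding: str) -> str:
    """Return `text` with characters that `encoding` cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample containing U+2026 (horizontal ellipsis), as in the traceback.
sample = "Read more\u2026"

try:
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    # Same failure mode as the traceback: the character has no ASCII form.
    print("ascii cannot encode:", ascii(exc.object[exc.start:exc.end]))

print(printable(sample, "ascii"))
print("stdout encoding:", sys.stdout.encoding)
```

If the last line reports an ASCII encoding in one environment and UTF-8 in the other, that would explain why the same script behaves differently.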

urlopen doesn't appear to work depending on how input text is generated

For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request
#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)
url1 = "http://www.google.com/"
response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could strip it before calling urlopen:
from urllib import request
url = "http://www.google.com/\ufeff"  # the pasted URL, with the invisible character written out
response = request.urlopen(url.replace("\ufeff", ""))
print(response)
That Unicode character is the byte-order mark (BOM). In Python 3 a str has no decode() method, so remove the character with str.replace(). When you read bytes or a file instead, decoding with utf-8-sig drops a leading BOM for you.
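A minimal sketch of cleaning a pasted URL before using it. The helper name strip_bom is hypothetical; ascii() is the same built-in used in the question to reveal the hidden character.

```python
def strip_bom(url: str) -> str:
    """Remove U+FEFF characters and surrounding whitespace from a pasted URL."""
    return url.replace("\ufeff", "").strip()


pasted = "http://www.google.com/\ufeff"  # invisible character written out explicitly
print(ascii(pasted))    # shows the escape sequence, so the problem is visible

cleaned = strip_bom(pasted)
print(ascii(cleaned))
```

Calling ascii() on any string that behaves oddly is a quick way to spot invisible characters before they reach urlopen.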

Error using urllib using Python 3.6.1 and Python 2.7

I was trying to install urllib for Python 3.6.1 using pip, but I cannot fix the error it produces.
The error looks like this:
I first searched online and found that one possible reason is that Python 3 is unable to identify 0, so I need to change the last digit to something else. I therefore tried to open the setup.py file in the folder.
I tried to access the hidden folders on my Mac by following the path listed in the error, but I could not find any pip-build-zur37k_r folder, even after making all hidden files visible.
I want to extract information using the urllib.request library and BeautifulSoup. When I run the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)
The error looks like this:
The code should return the following:
<h1> An Interesting Title </h1>
Your error says certificate verification failed. So it is a problem with the website, not your code. The call to urlopen() works for me, but maybe you have a proxy server that is fussier about certificates.
The URL you are requesting does not have a valid SSL certificate, so to fetch such a site you need to skip the SSL check, as below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
html = urlopen("https://www.pythonscraping.com/pages/page1.html",context=ctx)
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)
With that, you should get the expected result.
