I'm requesting HTML data from a URL using the requests module in Python.
Here is my code:
import requests
source = requests.get('http://coreyms.com')
print(source.text)
When I run this in Atom it gives me an error:
File "/Users/isaacrichardson/Desktop/Python/Web Scraping/wiki.py", line 7, in <module>
print(source.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 34807: ordinal not in range(128)
But when I run it in Treehouse workspaces it works fine and shows me the html data.
What's wrong with Atom or my code?
The requests library may not be installed correctly for Atom, or may not be usable from it; installing it correctly should solve the issue.
If that doesn't work, I would try the Beautiful Soup module:
import requests
from bs4 import BeautifulSoup

source = requests.get('http://coreyms.com')
doc = BeautifulSoup(source.text, "html.parser")
print(doc.text)
requests guesses the encoding when you access the .text attribute of the response object.
If you know the encoding of the response beforehand you should explicitly set it before accessing the .text attribute:
import requests
source = requests.get('http://coreyms.com')
source.encoding = 'utf-8' # or whatever the encoding is
print(source.text)
Alternatively, you can work with .content to access the raw binary response content and decode it yourself.
You may want to verify whether the encodings are indeed guessed differently in your IDEs by simply printing source.encoding.
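For example, here is a minimal sketch of both ideas (same URL as in the question; the errors='replace' flag is my addition, so undecodable bytes are substituted instead of raising):

import requests

source = requests.get('http://coreyms.com')

# What requests guessed from the headers, and what the raw bytes suggest.
print(source.encoding)
print(source.apparent_encoding)

# Decode the raw bytes yourself instead of relying on .text.
html = source.content.decode('utf-8', errors='replace')
print(html)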
Related
A website I am trying to scrape seems to have an encoding problem. The pages state that they are encoded in UTF-8, but if I scrape them and fetch the HTML source using requests, the redirect address contains an encoding that is not UTF-8.
Browsers seem to be tolerant and fix this automatically, but the Python requests package runs into an exception.
My code looks like this:
import requests as rq

res = rq.get(url, allow_redirects=True)
This runs into an exception when trying to decode the redirect string in the following code (hidden somewhere inside the requests package):
string.decode(encoding)
where string is the redirect string and encoding is 'utf8':
string = b'/aktien/herm\xe8s-aktie'
I found out that the redirect string is in fact encoded in something like Windows-1252. The redirect should actually go to '/aktien/herm%C3%A8s-aktie'.
Now my question: how can I either get requests to be more tolerant of such encoding bugs (like browsers are), or alternatively pass an encoding explicitly?
I searched for encoding settings, but from what I have seen so far, requests always determines the encoding automatically based on the response.
By the way, the result page of the redirect really does claim to be UTF-8; it starts with:
<!DOCTYPE html><html lang="de" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8">
You can use the hooks= parameter of requests.get() and explicitly percent-encode the Location HTTP header. For example:
import requests
import urllib.parse

url = "<YOUR URL FROM EXAMPLE>"

def response_hook(hook_data, **kwargs):
    # Percent-encode the redirect target before requests tries to follow it.
    if "Location" in hook_data.headers:
        hook_data.headers["Location"] = urllib.parse.quote(
            hook_data.headers["Location"]
        )

res = requests.get(url, allow_redirects=True, hooks={"response": response_hook})
print(res.url)
Prints:
https://.../herm%C3%A8s-aktie
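If you'd rather not hook every response, a hedged alternative sketch is to stop the automatic redirect and follow the quoted Location yourself (this assumes a single redirect hop; url is the same variable as above):

import requests
import urllib.parse

res = requests.get(url, allow_redirects=False)
if res.is_redirect:
    # Quote the non-ASCII bytes in the raw Location header, then follow it.
    target = urllib.parse.quote(res.headers["Location"])
    res = requests.get(urllib.parse.urljoin(res.url, target))
print(res.url)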
I am working on a scraping project using requests.get. The HTML file contains a relative path of the form href="../_static/new_page.html" for the CSS file.
I am using the code below to get the HTML file:
import requests
url = "http://www.example.com"  # requests needs the scheme, not just www.example.com
req = requests.get(url)
req.content
All the href values containing "../_static" become "_static/...". I tried req.text and changed the encoding to UTF-8, which is the encoding of the page, but I always get the same result. I also tried urllib.request, and I got the same problem.
Any suggestions?
Adam
Yes, this is related to the encoding used when you write the response content to an HTML file, but you just have to consider the encoding of the response content itself.
First check which encoding the requests library detected:
response = requests.get("url")
print(response.encoding)
Then you just need to set the right encoding, for example:
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
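A minimal sketch putting this together (the URL and file name are placeholders):

import requests

response = requests.get("http://www.example.com")  # placeholder URL
print(response.encoding)     # what requests guessed from the headers
response.encoding = "utf-8"  # override it if the guess is wrong

# Write the decoded text out with a matching encoding.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)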
Hope my answer helps you.
Regards
I'm new to Python.
I'm trying to parse data from a website using BeautifulSoup, and I have used BeautifulSoup successfully before. However, for this particular website the data returned has spaces between every character and lots of ">" characters as well.
The weird thing is that if I copy the page source, add it to my local Apache instance, and make a request to my local copy, the output is perfect. I should mention the differences between my local copy and the website:
my local does not use HTTPS
my local does not require authentication, whereas the website requires Active Directory auth, for which I am using requests_ntlm
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)
It looks like the local server returns content encoded as UTF-8 while the main website uses UTF-16, which suggests the main website is not configured correctly. However, it's possible to work around this issue in code.
requests picks the encoding based on the response headers (I believe). The response also has a property called apparent_encoding, which reads the body and detects the most likely encoding using chardet; however, apparent_encoding is not used unless you ask for it.
Therefore, by setting r.encoding = r.apparent_encoding, the request should decode the text correctly across both environments.
Code should look something like:
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup

r = requests.get("http://WEBSITE/CONTEXT/", auth=HttpNtlmAuth('DOMAIN\\USER', 'PASS'))
r.raise_for_status()              # Check for server errors before consuming the data.
r.encoding = r.apparent_encoding  # Override the header-based encoding with the detected one.
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())            # Should match print(content) (minus indentation)
Hello guys,
When I run this code:
from requests_html import HTMLSession
url = 'http://www.spell.org.br/documentos/resultadobusca/?eou%5B%5D=&tipo_busca=simples&campo%5B%5D=RESUMO&texto%5B%5D='\
+ parsekeyword(keyword) +\
'&eou%5B%5D=E&campo%5B%5D=TITULO&texto%5B%5D=&eou%5B%5D=E&campo%5B%5D=TITULO&texto%5B%5D=&mes_inicio=&ano_inicio=&mes_fim=&ano_fim=&qtd_reg_pagina=20&pagina=2'
session = HTMLSession()
link = session.get(url)
linkslist = list(link.html.absolute_links)
I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 91835: invalid continuation byte
I think it's because of non-UTF-8 characters in some links.
Since it's happening inside the method, is there a way to handle this?
I'm a beginner; I'm sorry if I missed something obvious.
In Python 3 with requests, you can use response.content.decode('utf-8'); here response corresponds to your link variable.
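A hedged sketch of that idea applied to the code from the question, assuming requests_html's standalone HTML class (which accepts html= and url=) and using errors='replace' so invalid bytes are substituted instead of raising:

from requests_html import HTMLSession, HTML

session = HTMLSession()
link = session.get(url)  # url as built in the question

# Decode the raw bytes ourselves, replacing invalid UTF-8 sequences.
text = link.content.decode('utf-8', errors='replace')

# Re-parse the cleaned text; passing url= lets absolute_links resolve
# relative links against the page URL.
html = HTML(html=text, url=link.url)
linkslist = list(html.absolute_links)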
I had the same problem.
I ran the following command and it solved the problem.
pip uninstall requests-html
pip install requests-html
Here is my code:
import requests
import bs4

dataFile = open('dataFile.html', 'w')
res = requests.get('site/pm=' + str(i))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('#content')
dataFile.write(str(linkElems[0]))
I have some other code too, but this is the part that I think is problematic. I have also tried using:
dataFile.write(str(linkElems[0].decode('utf-8')))
but that does not work and gives an error.
Using dataFile = open('dataFile.html', 'wb') gives me the error:
a bytes-like object is required, not 'str'
You opened your text file without specifying an encoding:
dataFile = open('dataFile.html', 'w')
This tells Python to use the default codec for your system. Every Unicode string you try to write to it will be encoded to that codec, and your Windows system is not set up with UTF-8 as the default.
Explicitly specify the encoding:
dataFile = open('dataFile.html', 'w', encoding='utf8')
Next, you are trusting the HTTP server to know what encoding the HTML data uses. This is usually not set at all, so don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Bold emphasis mine.
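A quick way to check whether requests fell back to that RFC default for a given response (a hedged sketch; the URL is a placeholder):

import requests

res = requests.get('http://www.example.com')  # placeholder URL
print(res.headers.get('content-type'))  # e.g. 'text/html' with no charset
print(res.encoding)                     # 'ISO-8859-1' when the RFC default applied
print(res.apparent_encoding)            # what the raw bytes themselves suggest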
Pass in the response.content raw data instead:
soup = bs4.BeautifulSoup(res.content, 'html.parser')
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or by statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = res.encoding if 'charset' in res.headers.get('content-type', '').lower() else None
soup = bs4.BeautifulSoup(res.content, 'html.parser', from_encoding=encoding)
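Putting the whole answer together (a sketch; the URL pattern and loop variable i from the question are kept as placeholders):

import bs4
import requests

res = requests.get('site/pm=' + str(i))  # placeholder URL from the question
res.raise_for_status()

# Only trust the server's encoding if it actually sent a charset.
encoding = res.encoding if 'charset' in res.headers.get('content-type', '').lower() else None
soup = bs4.BeautifulSoup(res.content, 'html.parser', from_encoding=encoding)

# Write the result as explicit UTF-8 so the system default codec never applies.
with open('dataFile.html', 'w', encoding='utf8') as dataFile:
    dataFile.write(str(soup.select('#content')[0]))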