I am working on a scraping project using requests.get. The HTML file contains a relative path of the form href="../_static/new_page.html" for the CSS file.
I am using the code below to get the HTML file:
import requests
url = "www.example.com"
req = requests.get(url)
req.content
All the hrefs containing "../_static" become "_static/...". I tried req.text and changed the encoding to utf-8, which is the encoding of the page, but I always get the same result. I also tried urllib.request, and got the same problem.
Any suggestions?
Adam.
Yes, this will be related to the encoding used when you write the response content to an HTML file, but you also have to consider the encoding of the response content itself.
First check which encoding the requests library detected:
response = requests.get("url")
print(response.encoding)
Then set the right encoding explicitly, for example:
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
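Once the encoding is set, write the file out with that same encoding so the markup round-trips unchanged. A minimal sketch, assuming the URL and the output filename page.html are placeholders:
import requests

url = "https://www.example.com"  # placeholder URL
response = requests.get(url)
response.encoding = response.apparent_encoding  # or set one of the values above explicitly
with open("page.html", "w", encoding=response.encoding) as f:
    f.write(response.text)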
Hope my answer helps you.
Regards
A website I try to scrape seems to have an encoding problem. The pages state that they are encoded in utf-8, but when I try to scrape them and fetch the HTML source using requests, the redirect address contains an encoding that is not utf-8.
Browsers seem to be tolerant and fix this automatically, but the Python requests package raises an exception.
My code looks like this:
import requests as rq
res = rq.get(url, allow_redirects=True)
This runs into an exception when trying to decode the redirect string in the following code (hidden somewhere in the requests package):
string.decode(encoding)
where string is the redirect string and encoding is 'utf8':
string= b'/aktien/herm\xe8s-aktie'
I found out that the redirect string is in fact encoded in something like 'Windows-1252'. The redirect should actually go to '/aktien/herm%C3%A8s-aktie'.
Now my question: how can I either get requests to be more tolerant of such encoding bugs (like browsers are), or alternatively, how can I pass an encoding explicitly?
I searched for encoding settings, but from what I have seen so far, requests always determines the encoding automatically based on the response.
By the way, the result page of the redirect starts with the following (it really does declare utf-8):
<!DOCTYPE html><html lang="de" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8">
You can use the hooks= parameter of requests.get() and explicitly percent-encode the Location HTTP header. For example:
import requests
import urllib.parse

url = "<YOUR URL FROM EXAMPLE>"

def response_hook(hook_data, **kwargs):
    if "Location" in hook_data.headers:
        hook_data.headers["Location"] = urllib.parse.quote(
            hook_data.headers["Location"]
        )

res = requests.get(url, allow_redirects=True, hooks={"response": response_hook})
print(res.url)
Prints:
https://.../herm%C3%A8s-aktie
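Note that urllib.parse.quote() leaves "/" unescaped by default, so only the non-ASCII characters of the path get percent-encoded. If you would rather not use a hook, a rough alternative sketch (assuming the first response is a 3xx with a plain path in its Location header, no query string) is to follow the redirect manually:
import requests
import urllib.parse

first = requests.get(url, allow_redirects=False)           # stop at the redirect
location = urllib.parse.quote(first.headers["Location"])   # percent-encode the path
res = requests.get(urllib.parse.urljoin(first.url, location))
print(res.url)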
I'm new to Python.
I'm trying to parse data from a website using BeautifulSoup, and I have successfully used BeautifulSoup before. However, for this particular website the data returned has spaces between every character and lots of ">" characters as well.
The weird thing is that if I copy the page source, add it to my local Apache instance and make a request to my local copy, the output is perfect. I should mention the differences between my local copy and the website:
my local does not use https
my local does not require authentication, however the website requires Active Directory auth and I am using requests_ntlm
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
r = requests.get("http://WEBSITE/CONTEXT/",auth=HttpNtlmAuth('DOMAIN\USER','PASS'))
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)
It looks like the local server returns content encoded as UTF-8 while the main website uses UTF-16, which suggests the main website is not configured correctly. However, it is possible to work around this issue in code.
requests picks the encoding based on the response headers, falling back to ISO-8859-1 when no charset is given for text content. The response also has an apparent_encoding property, which reads the body and detects the likely encoding using chardet; however, apparent_encoding is not used unless you assign it yourself.
Therefore, by setting r.encoding = r.apparent_encoding, the text should decode correctly in both environments.
Code should look something like:
r = requests.get("http://WEBSITE/CONTEXT/",auth=HttpNtlmAuth('DOMAIN\USER','PASS'))
r.encoding = r.apparent_encoding # Override the default encoding
content = r.text
r.raise_for_status() # Always check for server errors before consuming the data.
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify()) # Should match print(content) (minus indentation)
I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code, but the output that comes back does not seem to be properly decoded:
link= 'http://www.pdf995.com/samples/pdf.pdf'
import requests
r = requests.get(link)
r.text
The output looks like below:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get the HTML. I also tried with BeautifulSoup, but it does not decode it either. I hope someone can help. Thank you, BR
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the binary data.
PDF files are not easy to deal with programmatically, but you might (for example) save the data to a file:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
f.write(r.content)
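If you actually need the text inside the PDF (there is no HTML to extract), you need a PDF library on top of requests. A minimal sketch, assuming the third-party pypdf package is installed:
import requests
from pypdf import PdfReader  # third-party PDF library, not part of requests

link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)

reader = PdfReader('pdf.pdf')
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])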
Here is my code:
dataFile = open('dataFile.html', 'w')
res = requests.get('site/pm=' + str(i))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('#content')
dataFile.write(str(linkElems[0]))
I have some other code too, but this is the code that I think is problematic. I have also tried using:
dataFile.write(str(linkElems[0].decode('utf-8')))
but that does not work and gives an error.
Using dataFile = open('dataFile.html', 'wb') gives me the error:
a bytes-like object is required, not 'str'
You opened your text file without specifying an encoding:
dataFile = open('dataFile.html', 'w')
This tells Python to use the default codec for your system. Every Unicode string you try to write to it will be encoded to that codec, and your Windows system is not set up with UTF-8 as the default.
Explicitly specify the encoding:
dataFile = open('dataFile.html', 'w', encoding='utf8')
Next, you are trusting the HTTP server to know what encoding the HTML data is using. The encoding is usually not set at all, so don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Bold emphasis mine.
Pass in the response.content raw data instead:
soup = bs4.BeautifulSoup(res.content, 'html.parser')
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = res.encoding if 'charset' in res.headers.get('content-type', '').lower() else None
soup = bs4.BeautifulSoup(res.content, 'html.parser', from_encoding=encoding)
Looking at the requests documentation, I know that I can use response.content for binary content (such as a .jpg file) and response.text for a regular HTML page. However, when the source is an image and I try to access r.text, the script hangs. How can I determine in advance whether the response contains HTML?
I have considered checking the URL for an image extension, but that does not seem fool-proof.
The content type is sent as a response header. See this page in the documentation.
Example code:
r = requests.get(url)
if r.headers['content-type'] == 'text/html':
    data = r.text
elif r.headers['content-type'] == 'application/ogg':
    data = r.content
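In practice the Content-Type header often includes a charset suffix (for example "text/html; charset=utf-8"), so an exact string comparison can miss HTML responses. A slightly more tolerant sketch:
import requests

r = requests.get(url)
content_type = r.headers.get('content-type', '')

if content_type.startswith('text/html'):
    data = r.text      # decoded text for HTML pages
else:
    data = r.content   # raw bytes for images, audio, PDFs, etc.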