I am trying to retrieve the HTML code from a site using the code below:
import requests

url = 'http://www.somesite.com'
obj = requests.get(url, timeout=60, verify=True, allow_redirects=True)
print(obj.encoding)
print(obj.text.encode('utf-8'))
but the result I get back is strangely encoded, like the text below:
\xb72\xc2\xacBD\xc3\xb70\xc2\xacAN\xc3\xb7n\xc2\xac~AA\xc3\xb7M1FX7q3K\xc2\xacAD\xc3\xb71414690200\xc2\xacAB\xc3\xb73\xc2\xacCR\xc3\xb73\xc2\xacAC\xc3\xb73\xc
Any ideas how I can decode the text?
I want to read this webpage:
http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html
If I use pd.read_html, the content usually loads properly, but recently I have started getting an HTTP Error 400: Bad Request.
So I tried to use:
import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)
df = pd.read_html(r.text, encoding='utf-8')[1]
which gets over the 400 error, but the Chinese characters aren't readable, as the screenshot shows.
Why does this encoding problem occur with requests vs. pd.read_html, and how can I solve it? Thanks
[Screenshot: the parsed table shows garbled characters in place of the Chinese text]
I think I've solved it: use r.content rather than r.text.
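Most likely the server's Content-Type header omits the charset, so r.text falls back to requests' default guess and mis-decodes the page, while the raw bytes of r.content let the parser pick up the charset declared in the page itself. A minimal sketch of the fix, assuming your pandas version accepts raw bytes:

import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)

# r.content is raw bytes, so the HTML parser can detect the encoding
# from the page's own charset declaration.
df = pd.read_html(r.content)[1]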
I am working on a scraping project using requests.get. The html file contains a relative path of the form href="../_static/new_page.html" for the css file.
I am using the code below to get the HTML file:
import requests

url = "http://www.example.com"  # full URL including the scheme
req = requests.get(url)
print(req.content)
All the hrefs containing "../_static" become "_static/...". I tried req.text and changed the encoding to utf-8, which is the encoding of the page, but I always get the same result. I also tried urllib.request, and I got the same problem.
Any suggestions?
Adam.
Yes, this will be related to the encoding format when you write the response content to an HTML file, but you first have to consider the encoding of the response content itself. Check the encoding that the requests library detected:
response = requests.get("https://www.example.com")  # placeholder URL
print(response.encoding)
Then set the right encoding type explicitly, for example:
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
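If you don't know the right encoding in advance, requests can also guess it from the response body itself. A minimal sketch, assuming the goal is to save the page to disk without mangling non-ASCII text (apparent_encoding runs a character detector over the body, which can be slow on large responses):

import requests

response = requests.get("https://www.example.com")  # placeholder URL

# Override the header-derived encoding with one detected from the body.
response.encoding = response.apparent_encoding

# Write the decoded text out with an explicit encoding.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)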
Hope my answer helps you.
Regards
My goal is to scrape the macys.com website, but I cannot get access. The following code is my initial attempt.
Attempt 1
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.macys.com').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
This resulted in the following error.
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access the requested URL on this server.
<p>Reference: 18.c503d417.1587673952.4f27a98</p>
</body>
</html>
After finding similar issues on Stack Overflow, I see that the most common solution is to add a header. Here is the main code from that attempt.
Attempt 2
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml')
print(soup)
Here is the last error message I have received. After researching the site, I am still unsure how to proceed.
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 586833: character maps to <undefined>
I am very intro level, so I appreciate any insight. I am also just genuinely curious why I don't have permission for the Macy's site when testing other sites works fine.
I tried your Attempt 2 code, and it works fine for me.
Try setting BeautifulSoup's from_encoding argument to utf-8, like so:
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml', from_encoding='utf-8')
print(soup)
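If the UnicodeEncodeError persists, note that it is raised by print() when the console's codec (often cp1252 on Windows) can't represent a character, not by the parsing itself. A minimal sketch of one workaround, writing the markup to a UTF-8 file instead of printing it:

# Writing to a file with an explicit encoding sidesteps the console's codec.
with open('macys.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())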
I am also just genuinely curious why I don't have permission for the Macy's site when testing other sites works fine.
This is something the administrators for Macy's have done to prevent bots from accessing their website. It's an extremely trivial form of protection, though, since you only need to change the user-agent header to something typical.
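For example, a fuller browser-style User-Agent string (the exact value below is illustrative; any recent browser's string should work):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36'
}
res = requests.get('https://www.macys.com', headers=headers)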
I am trying to use the requests library in Python to post the text content of a text file to a website, submit the text for analysis on that website, and pull the results back into Python. I have read through a number of responses here and on other websites, but have not yet figured out how to adapt the code to a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's submitting the data that I don't understand.
My code currently is:
import requests

fileName = "texttoAnalyze.txt"
with open(fileName, 'r') as fileHandle:
    url_text = fileHandle.read()

url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value': url_text}
r = requests.post(url, data=payload)
print(r.text)
This code comes back with the HTML of the website, but doesn't recognize the fact that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website itself sends; you can usually capture these with web debugging tools (like the Chrome/Firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print(r.text)
Good luck!
There are two post parameters, tab and directInput:
import requests

post = "http://www.webpagefx.com/tools/read-able/check.php"

with open("in.txt") as f:
    data = {"tab": "Test by Direct Link",
            "directInput": f.read()}

r = requests.post(post, data=data)
print(r.content)
Looking at the requests documentation, I know that I can use response.content for binary content (such as a .jpg file) and response.text for a regular HTML page. However, when the source is an image and I try to access r.text, the script hangs. How can I determine in advance if the response contains HTML?
I have considered checking the URL for an image extension, but that does not seem fool-proof.
The content type is reported in the response headers; see the requests documentation on response headers.
Example code:
import requests

r = requests.get(url)
content_type = r.headers.get('content-type', '')
# Match the prefix: the header may carry a charset, e.g. "text/html; charset=utf-8".
if content_type.startswith('text/html'):
    data = r.text
elif content_type.startswith('application/ogg'):
    data = r.content
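To decide before downloading the body at all, which addresses the "in advance" part of the question, one option is a HEAD request; a minimal sketch, with the caveat that not every server answers HEAD requests correctly:

import requests

url = "https://www.example.com/picture.jpg"  # hypothetical URL

# HEAD fetches only the headers, so a large binary body is never downloaded.
head = requests.head(url, allow_redirects=True)
if head.headers.get('content-type', '').startswith('text/html'):
    text = requests.get(url).text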