I want to read this webpage:
http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html
If I use pd.read_html the content usually loads properly, but recently, I have started getting an HTTP Error 400: Bad Request.
So I tried to use:
import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)
df = pd.read_html(r.text, encoding='utf-8')[1]
which gets past the 400 error, but the Chinese characters aren't readable; they come out garbled (mojibake).
Why does this encoding problem occur with requests vs. pd.read_html, and how can I solve it? Thanks
I think I've solved it: use r.content rather than r.text.
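The likely reason this works: when the server's Content-Type header carries no charset, requests falls back to ISO-8859-1 when decoding r.text, so the page's UTF-8 Chinese characters are mis-decoded before pd.read_html ever sees them. Passing the raw bytes of r.content instead lets the underlying HTML parser pick up the charset declared in the page itself. A minimal sketch (the table index [1] is carried over from the question and may differ for other pages):

import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)

# r.content keeps the raw bytes, so the parser can honor the page's own
# <meta charset> declaration instead of requests' ISO-8859-1 fallback.
df = pd.read_html(r.content, encoding='utf-8')[1]
print(df.head())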
I am working on a scraping project using requests.get. The HTML file contains a relative path of the form href="../_static/new_page.html" for the CSS file.
I am using the code below to get the HTML file:
import requests

url = "https://www.example.com"  # a scheme is required, or requests raises MissingSchema
req = requests.get(url)
print(req.content)
All the hrefs containing "../_static" become "_static/...". I tried req.text and changed the encoding to utf-8, which is the encoding of the page, but I always get the same result. I also tried urllib.request.urlopen and hit the same problem.
Any suggestions?
Adam.
Yes, it will be related to the encoding used when you write the response content to an HTML file, but you just have to consider the encoding of the response content itself.
First check the encoding that the requests library has detected:
import requests

response = requests.get("url")
print(response.encoding)
If it is wrong, just set the right encoding before reading response.text, for example:
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
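A fuller sketch of that idea, assuming the goal is to save the page to disk (the URL and filename here are placeholders): compare the header-based guess with the content-based guess, override it if needed, and write the file out with the same encoding.

import requests

url = "https://www.example.com/page.html"  # placeholder
response = requests.get(url)

print(response.encoding)           # guessed from the Content-Type header
print(response.apparent_encoding)  # guessed from the body bytes themselves

# If the header-based guess is wrong, override it before touching response.text.
response.encoding = response.apparent_encoding

# Write with the same encoding so the text survives the round trip intact.
with open("page.html", "w", encoding=response.encoding) as f:
    f.write(response.text)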
Hope my answer helps you.
Regards
import requests
from bs4 import BeautifulSoup
url = "https://www.sahibinden.com/hyundai/"
req = requests.get(url)
context = req.content
soup = BeautifulSoup(context, "html.parser")
print(soup.prettify())
I am getting an error with the above code. If I try to parse another website it works, but there is a problem with sahibinden.com. When I run the program, it waits for about a minute and then throws an error. I have to parse this website. Could you please help me by explaining what the issue is?
Your problem is that the server expects a User-Agent header and won't serve the request without one.
Is the error you're getting a timeout?
Add the following to your code:
user_agent = 'Mozilla/5.0'  # any browser-like string will do
headers_dict = {'User-Agent': user_agent}
req = requests.get(url, headers=headers_dict)
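Applied to the original code, a minimal sketch (the User-Agent string is just an example, and the timeout turns the minute-long hang the question describes into a clear error):

import requests
from bs4 import BeautifulSoup

url = "https://www.sahibinden.com/hyundai/"
headers_dict = {'User-Agent': 'Mozilla/5.0'}

# Without a User-Agent the server stalls and eventually rejects the request.
req = requests.get(url, headers=headers_dict, timeout=30)
soup = BeautifulSoup(req.content, "html.parser")
print(soup.prettify())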
My goal is to scrape the macys.com website, and I cannot get access. The following code is my initial attempt.
Attempt 1
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.macys.com').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
This resulted in the following error.
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access the requested URL on this server.
<p>Reference: 18.c503d417.1587673952.4f27a98</p>
</body>
</html>
After finding similar issues on Stack Overflow, I saw the most common solution is to add a header. Here is the main code from that attempt.
Attempt 2
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml')
print(soup)
Here is the last error message I have received. After researching the site, I am still unsure how to proceed.
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 586833: character maps to <undefined>
I am very intro level, so I appreciate any insight. I am also just genuinely curious why I don't have permission for Macy's site when testing other sites works fine.
I tried your Attempt 2 code, and it works fine for me.
Try setting BeautifulSoup's from_encoding argument to utf-8, like so:
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml', from_encoding='utf-8')
print(soup)
I am also just genuinely curious why I don't have permissions for macys site as testing other sites works fine.
This is something the administrators for Macy's have done to prevent bots from accessing their website. It's an extremely trivial form of protection, though, since you only need to change the user-agent header to something typical.
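For what it's worth, the UnicodeEncodeError in Attempt 2 most likely comes from print() writing to a console whose codec can't represent '\x92', not from the request itself; writing the output to a UTF-8 file sidesteps that. And as for "something typical", a sketch with a fuller browser-style User-Agent (the string below is just an example copied from a desktop browser):

import requests
from bs4 import BeautifulSoup

url = 'https://www.macys.com'
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/90.0.4430.93 Safari/537.36')
}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml', from_encoding='utf-8')

# Writing to a UTF-8 file avoids console codecs that choke on characters
# like '\x92' when printing.
with open('macys.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())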
I am trying to retrieve the HTML code from a site using the code below:
import requests

url = 'http://www.somesite.com'
obj = requests.get(url, timeout=60, verify=True, allow_redirects=True)
print(obj.encoding)
print(obj.text.encode('utf-8'))
but the result I got is strangely encoded, like the text below:
\xb72\xc2\xacBD\xc3\xb70\xc2\xacAN\xc3\xb7n\xc2\xac~AA\xc3\xb7M1FX7q3K\xc2\xacAD\xc3\xb71414690200\xc2\xacAB\xc3\xb73\xc2\xacCR\xc3\xb73\xc2\xacAC\xc3\xb73\xc
Any ideas how I can decode the text?
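A guess at what is happening here: obj.text.encode('utf-8') turns the already-decoded text back into a bytes object, and printing bytes renders every non-ASCII character as a \x.. escape, so the \xc2\xac pairs are just the UTF-8 bytes of characters such as '¬'. Printing the text directly avoids the escapes; whether the payload is then human-readable is a separate question, since it may be a delimiter-separated data feed rather than plain HTML. A minimal sketch:

import requests

url = 'http://www.somesite.com'  # placeholder from the question
obj = requests.get(url, timeout=60, verify=True, allow_redirects=True)

# Print the decoded text, not re-encoded bytes; a bytes object shows
# non-ASCII characters as \x.. escapes, which is the "strange encoding".
print(obj.encoding)
print(obj.text)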
Possible Duplicate:
How to use Python to login to a webpage and retrieve cookies for later usage?
I want to download the whole webpage source from a service that handles cookies in an unusual way. I wrote a script that actually works and seems to be fine; however, at some point it returned this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
My script works in a loop and changes the link to whichever subpage's content I am interested in downloading.
I get a cookie, send a packet of data, and then I am able to get to the proper link and download the HTML.
The script looks like this:
import urllib2  # Python 2
data = 'some_string'
url = "http://example/index.php"
url2 = "http://example/source"
req1 = urllib2.Request(url)
response = urllib2.urlopen(req1)
cookie = response.info().getheader('Set-Cookie')
## Use the cookie in subsequent requests
req2 = urllib2.Request(url, data)
req2.add_header('cookie', cookie)
response = urllib2.urlopen(req2)
## reuse again
req3 = urllib2.Request(url2)
req3.add_header('cookie', cookie)
response = urllib2.urlopen(req3)
html = response.read()
I've been reading about http.cookiejar/cookielib, since using this library I am supposed to get rid of the error mentioned above; however, I have no clue how to rework my code to use http.cookiejar and urllib.request.
I tried something like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener( urllib.request.HTTPCookieProcessor(cj) )
r = opener.open(url) # now cookies are stored in cj
r1 = opener.open(url, data)  # TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
r2 = opener.open(url2)
print( r2.read() )
But it's not working like my first script.
P.S. Sorry for my English, I am not a native speaker.
@Piotr Dobrogost, thanks for the link, it solved the issue.
The TypeError was solved by using data=b"string" instead of data="string".
I've still got some issues due to porting to Python 3, but this issue can be closed.
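For completeness, a sketch of the whole script ported to Python 3 with http.cookiejar, using the same placeholder URLs and data as the question; the opener stores and resends cookies automatically, which also replaces the manual Set-Cookie handling:

import http.cookiejar
import urllib.request

data = b'some_string'  # POST data must be bytes in Python 3
url = "http://example/index.php"
url2 = "http://example/source"

# The cookie jar captures Set-Cookie headers and replays them on every
# request made through this opener.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

opener.open(url)               # GET: cookies are now stored in cj
opener.open(url, data)         # POST: cookie is sent automatically
response = opener.open(url2)   # reuse again
html = response.read()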