I want to parse a webpage that contains symbols and Cyrillic letters. The GET request returns a response with the wrong encoding, so these letters are not displayed correctly. What do you recommend to fix this?
import requests
url = "http://www.cawater-info.net/karadarya/1991/veg1991.htm"
response = requests.get(url)
print(response.encoding)
print(response.text[:100])
I tried to encode this text, but it did not help
print(response.text.encode('utf-8')[:100])
print(response.text.encode('cp852')[:100])
Since the response contains Cyrillic text, you need cp1251 to decode the content:
print(response.content.decode("cp1251")[:100]) # or windows-1251
#<HTML><HEAD><TITLE>Оперативные данные по водозаборам бассейна реки Карадарья на период вегетации 199
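Alternatively, you can tell requests which encoding to use so that response.text decodes correctly from then on. A minimal sketch based on the same request:
import requests
url = "http://www.cawater-info.net/karadarya/1991/veg1991.htm"
response = requests.get(url)
response.encoding = "cp1251"  # override the incorrectly guessed encoding
print(response.text[:100])    # the Cyrillic title now renders correctly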
I have the following code here:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.pro-football-reference.com/years/2021/defense.htm') as response:
            soup = BeautifulSoup(await response.text(), features="lxml")
            print(soup)

asyncio.run(main())
But it gives me the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2170985: invalid continuation byte on the line await response.text(). I believe the problem is that the URL ends in .htm instead of .com.
Is there any way to decode it?
Note: I would not like to use response.read().
The website's headers indicate that the page should be encoded as UTF-8, but evidently it isn't:
$ curl --head --silent https://www.pro-football-reference.com/years/2021/defense.htm | grep -i charset
content-type: text/html; charset=UTF-8
Let's inspect the content:
>>> r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
>>> r.content[2170980:2170990]
b'/">Fu\xdfball'
It looks like this should be "Fußball", which would be b'Fu\xc3\x9fball' if encoded with UTF-8.
If we look up 0xdf in Triplee's Table of Legacy 8-bit Encodings we find that it represents "ß" in any of these encodings:
cp1250, cp1252, cp1254, cp1257, cp1258, iso8859_10, iso8859_13, iso8859_14, iso8859_15, iso8859_16, iso8859_2, iso8859_3, iso8859_4, iso8859_9, latin_1, palmos
Without any other information, I would choose latin-1 as the encoding; however, it might be simpler to pass r.content to Beautiful Soup and let it handle the decoding.
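For example, a minimal sketch of both options with requests, as in the snippet above (picking latin-1 is an assumption about the page's real encoding):
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.pro-football-reference.com/years/2021/defense.htm')
# Option 1: override the declared encoding before reading r.text
r.encoding = 'latin-1'
soup = BeautifulSoup(r.text, features="lxml")
# Option 2: hand the raw bytes to Beautiful Soup and let it detect the encoding itself
soup = BeautifulSoup(r.content, features="lxml")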
Curious as to why not just use pandas here?
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/defense.htm'
df = pd.read_html(url, header=1)[0]   # first table on the page
df = df[df['Rk'].ne('Rk')]            # drop the repeated header rows that pandas reads as data
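As a quick sanity check that the table parsed as expected (purely illustrative):
print(df.shape)
print(df.head())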
I am working on a scraping project using requests.get. The html file contains a relative path of the form href="../_static/new_page.html" for the css file.
I am using the below code to get the html file
import requests
url = "www.example.com"
req = requests.get(url)
req.content
All the hrefs containing "../_static" become "_static/...". I tried req.text and changed the encoding to utf-8, which is the encoding of the page, but I always get the same result. I also tried urllib.request, and I got the same problem.
Any suggestions?
Adam.
Yes, it is related to the encoding used when you write the response content to an HTML file, but you just have to consider the encoding of the response content itself.
First, check the encoding that the requests library has detected:
response = requests.get("url")
print(response.encoding)
Then just set the right encoding explicitly, for example:
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
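For instance, a minimal sketch that sets the encoding before saving the page; the URL and file name are just placeholders:
import requests
response = requests.get("https://www.example.com")
print(response.encoding)             # what requests guessed
response.encoding = "utf-8"          # override it if the guess is wrong
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)           # response.text is decoded using response.encoding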
Hope my answer helps you.
Regards
For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request
#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)
url1 = "http://www.google.com/"
response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could try
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url.replace("\ufeff", ""))
print(response)
That Unicode character is the byte-order mark, or BOM (more info here). Stripping it from the URL string before calling urlopen, as above, fixes the error.
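If this comes up often, a small helper that cleans pasted URLs before opening them may be handy; a sketch (the helper name is made up for illustration):
from urllib import request
def open_clean(raw_url):
    # strip surrounding whitespace and any invisible BOM picked up while copy-pasting
    cleaned = raw_url.strip().replace("\ufeff", "")
    return request.urlopen(cleaned)
response = open_clean("http://www.google.com/\ufeff")
print(response)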
I am trying to retrieve the HTML code of a site using the code below
url = 'http://www.somesite.com'
obj = requests.get(url, timeout=60, verify=True, allow_redirects=True)
print(obj.encoding)
print(obj.text.encode('utf-8'))
but the result I get is strangely encoded text like this:
\xb72\xc2\xacBD\xc3\xb70\xc2\xacAN\xc3\xb7n\xc2\xac~AA\xc3\xb7M1FX7q3K\xc2\xacAD\xc3\xb71414690200\xc2\xacAB\xc3\xb73\xc2\xacCR\xc3\xb73\xc2\xacAC\xc3\xb73\xc
Any ideas how I can decode the text?
Everyone, I am trying to send SMS with Python. I can send it, but I need to send it in Chinese, which is Big5, so I have to decode the UTF-8 text and encode it to Big5. Here is my SMS Python code
trydecode.py
import urllib
import urllib2

def sendsms(phonenumber, textcontent):
    textcontent.decode('utf8').encode('big5')
    url = "https://url?username=myname&password=mypassword&dstaddr=" + phonenumber + "&smbody=" + textcontent
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
This code (Python 2.7) can send SMS in English, but with Chinese (Big5) there is a problem. How can I fix it? Thank you.
I think you forgot to assign the result back to the variable.
textcontent = textcontent.decode('utf8').encode('big5')
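A corrected sketch of the whole function with that assignment in place; as an extra assumption, the Big5 bytes are percent-encoded with urllib.quote, since raw bytes generally should not be concatenated into a URL as-is:
import urllib
import urllib2

def sendsms(phonenumber, textcontent):
    # convert the UTF-8 input to Big5 and keep the result
    textcontent = textcontent.decode('utf8').encode('big5')
    # percent-encode the Big5 bytes so they are safe inside the query string (assumption about the gateway)
    smbody = urllib.quote(textcontent)
    url = ("https://url?username=myname&password=mypassword"
           "&dstaddr=" + phonenumber + "&smbody=" + smbody)
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()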