decode utf8 to big5 - python

Everyone, I am trying to send an SMS with Python. I can send it, but I need to send it in Chinese, which is Big5, so I have to decode from UTF-8 and re-encode to Big5. Here is my SMS Python code:
trydecode.py
import urllib
import urllib2

def sendsms(phonenumber, textcontent):
    textcontent.decode('utf8').encode('big5')
    url = "https://url?username=myname&password=mypassword&dstaddr=" + phonenumber + "&smbody=" + textcontent
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
With this code (Python 2.7) I can send an SMS in English, but with Chinese (Big5) I get a problem. How can I fix it? Thank you.

I think you forgot to save the result back into the variable — decode()/encode() return new strings rather than changing the variable in place:
textcontent = textcontent.decode('utf8').encode('big5')
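As a self-contained Python 3 sketch of the same idea (the URL, username, and password below are just the placeholders from the question): re-encode the text to Big5 bytes, then percent-encode those bytes so they are safe to put in a URL.

```python
from urllib.parse import quote

def build_sms_url(phonenumber, textcontent):
    # In Python 3 strings are already Unicode, so no .decode('utf8') step
    # is needed; just encode straight to Big5 bytes.
    big5_bytes = textcontent.encode('big5')
    # quote() percent-encodes the raw bytes for use in the URL.
    return ("https://url?username=myname&password=mypassword"
            "&dstaddr=" + phonenumber + "&smbody=" + quote(big5_bytes))

print(build_sms_url("0912345678", "你好"))
```

Note that percent-encoding the Big5 bytes also avoids putting raw non-ASCII bytes into the URL, which many SMS gateways reject.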

Related

Python 3 requests.get().text returns unencoded string

If I execute:
import requests
request = requests.get('https://google.com/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
Кто является презид
I've tried to change google.com to google.ru
If I execute:
import requests
request = requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower()
print(request)
I get something like this:
d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f+%d0%bf%d1%80%d0%b5%d0%b7%d0%b8%d0%b4%d0%b5%d0%bd%d1%82%d0%be%d0%bc+%d0%a0%d0%be%d1%81%d1%81%d0%b8%d0
I need to get an encoded normal string.
You were getting this because requests was not able to identify the correct encoding of the response. If you are sure about the response encoding, you can set it explicitly:
response = requests.get(url)
response.encoding            # check the detected encoding
response.encoding = "utf-8"  # or any other encoding
Then get the content via the .text attribute.
I fixed it with urllib.parse.unquote() method:
import requests
from urllib.parse import unquote
request = unquote(requests.get('https://google.ru/search?q=Кто является президентом России?').text.lower())
print(request)
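To see what unquote is doing, here is a standalone example on the percent-encoded fragment from the question (no network needed). Note that unquote leaves + alone, while unquote_plus also turns it into a space:

```python
from urllib.parse import unquote, unquote_plus

# Percent-encoded UTF-8 bytes, as returned by google.ru in the question.
encoded = "%d0%9a%d1%82%d0%be+%d1%8f%d0%b2%d0%bb%d1%8f%d0%b5%d1%82%d1%81%d1%8f"

print(unquote(encoded))       # Кто+является  (+ is kept)
print(unquote_plus(encoded))  # Кто является  (+ becomes a space)
```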

urlopen doesn't appear to work depending on how input text is generated

For some reason or another, it appears that depending on where I copy and paste a url from, urllib.request.urlopen won't work. For example when I copy http://www.google.com/ from the address bar and run the following script:
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
I get the following error at the call to urlopen:
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)
But if I copy the url string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url)
print(response)
#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)
but this does:
from urllib import request
#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)
url1 = "http://www.google.com/"
response1 = request.urlopen(url1)
print(response1)
My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.
EDIT: As requested...
print(ascii(url))
'http://www.google.com/\ufeff'
print(ascii(url1))
'http://www.google.com/'
Indeed the strings are different.
\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
You could try
from urllib import request
url = "http://www.google.com/"
response = request.urlopen(url.replace("\ufeff", ""))  # strip the invisible BOM
print(response)
That Unicode character is the byte-order mark, or BOM (more info here). Stripping that character from the string fixes the URL.
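A quick self-contained check of the stripping step (the URL string here deliberately includes the BOM, as in the question):

```python
url = "http://www.google.com/\ufeff"   # the invisible BOM is at the end
clean = url.replace("\ufeff", "")      # remove every occurrence of the BOM

print(ascii(clean))  # 'http://www.google.com/'
```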

Python Sometimes Returns Strange Result When Reading HTML from URL

I created a function to read HTML content from specific url. Here is the code:
def __retrieve_html(self, address):
    html = urllib.request.urlopen(address).read()
    Helper.log('HTML length', len(html))
    Helper.log('HTML content', html)
    return str(html)
However, the function does not always return the correct string. In some cases it returns a very long, weird string.
For example, with the URL http://www.merdeka.com, sometimes it gives the correct HTML string, but sometimes it returns a result like:
HTML content: b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xfdyW\x1c\xb7\xd28\x8e\xffm\x9f\x93\xf7\xa0;y>\xc1\xbeA\xcc\xc2b\x03\x86\x1cl\xb0\x8d1\x86\x038yr\......Very long and much more characters.
It seems to happen only on pages that have a lot of content. For simple pages like the Facebook.com login page and the Google.com index, it never happens. What is this? Where is my mistake, and how do I handle it?
It appears the response from http://www.merdeka.com is gzipped compressed.
Give this a try:
import gzip
import urllib.request
def __retrieve_html(self, address):
    with urllib.request.urlopen(address) as resp:
        html = resp.read()
        Helper.log('HTML length', len(html))
        Helper.log('HTML content', html)
        if resp.info().get('Content-Encoding') == 'gzip':
            html = gzip.decompress(html)
        return html
How to decode your html object, I leave as an exercise to you.
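The \x1f\x8b prefix in the question's dump is the gzip magic number, which is how a compressed body can be recognized even without looking at headers. A minimal offline sketch of the decompression step (the sample payload is made up):

```python
import gzip

original = b"<html>very long page</html>"   # stand-in for a real response body
compressed = gzip.compress(original)

# Every gzip stream starts with the magic bytes \x1f\x8b -- exactly what
# the HTML dump in the question starts with.
assert compressed[:2] == b"\x1f\x8b"

print(gzip.decompress(compressed) == original)  # True
```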
Alternatively, you could just use the Requests module: http://docs.python-requests.org/en/latest/
Install it with:
pip install requests
Then execute like:
import requests
r = requests.get('http://www.merdeka.com')
r.text
Requests didn't appear to have any trouble with the response from http://www.merdeka.com
You've got bytes instead of a string, because urllib can't decode the response for you. This could be because some sites omit the encoding declaration in their Content-Type header.
For example, google.com has:
Content-Type: text/html; charset=UTF-8
and that http://www.merdeka.com website has just:
Content-Type: text/html
So, you need to manually decode the response, for example with utf-8 encoding
html = urllib.request.urlopen(address).read().decode('utf-8')
The problem is that you need to set the correct encoding, and if it is not in the server headers, you need to guess it somehow.
See this question for more information How to handle response encoding from urllib.request.urlopen()
PS: Consider moving from the somewhat dated urllib to the requests lib. It's simpler, trendier, and sexier at this time :)
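A minimal sketch of that guessing step: parse the charset out of a Content-Type header string when one is present, and otherwise fall back to a guess (the utf-8 fallback here is an assumption, not something the server tells you):

```python
def charset_from_content_type(content_type, fallback="utf-8"):
    # Naive parse: look for a "charset=" parameter among the
    # semicolon-separated parts of the header value.
    for part in content_type.split(";"):
        part = part.strip()
        if part.lower().startswith("charset="):
            return part.split("=", 1)[1].strip().lower()
    return fallback  # no declaration -- we have to guess

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # utf-8 (fallback)
```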

urllib in python reading

I am supposed to write a function using the urllib library to open a URL, then read and decode the file and return its string form. So far I have:
def request(url):
    urllib.request.Request(url)
    # open URL
    urllib.request.urlopen(url)
    # read URL and decode the content string
    # return string form of URL content
I really don't know how to read and decode the information once I download it. I would appreciate any help. Since it reads, I'm assuming you would use
urllib.request.read(url)
but I am not sure, and I don't know what file will come back or how I am supposed to decode it.
>>> from urllib import request
>>> data = request.urlopen("http://www.google.com").read()
By calling the read() function on the result of request.urlopen, you will get the source of the URL you passed.
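Putting the pieces together, a sketch of the requested function (the UTF-8 encoding is an assumption; the decode step is split out into its own hypothetical helper so it can be tested without a network connection):

```python
from urllib import request

def decode_body(raw, encoding="utf-8"):
    # The decode step on its own: the bytes from read() become a str.
    return raw.decode(encoding)

def fetch_text(url, encoding="utf-8"):
    # Open the URL, read the raw bytes, and decode them to a string.
    with request.urlopen(url) as resp:
        return decode_body(resp.read(), encoding)

print(decode_body(b"<html>hello</html>"))  # <html>hello</html>
```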

python urllib2.urlopen - html text is garbled - why?

The printed HTML is garbled text... instead of what I expect to see in "view source" in the browser.
Why is that? How to fix it easily?
Thank you for your help.
Same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html
I got the same garbled text using curl
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped. So this shows the correct HTML for me.
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solutions on doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised answer after reading above is:
import urllib
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
html now holds the HTML (Print it to see)
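For completeness, a Python 3 equivalent of the StringIO/GzipFile dance above, run on sample gzipped bytes instead of a live response (the sample payload is made up):

```python
import gzip
import io

# Stand-in for response.read() on a gzipped response.
compressed = gzip.compress(b"<html>textbook</html>")

data = io.BytesIO(compressed)           # io.BytesIO replaces StringIO.StringIO
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()

print(html)  # b'<html>textbook</html>'
```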
Try requests. Python Requests.
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is that the site uses gzip encoding. To my knowledge, urllib doesn't decompress responses for you, so you end up with compressed HTML from certain sites that use that encoding. You can confirm this by printing the content headers from the response, like so:
print response.headers
There you will see that the "Content-Encoding" is gzip format. In order to get around this using the standard urllib library you'd need to use the gzip module. Mechanize also does this because it uses the same urllib library. Requests will handle this encoding and format it nicely for you.
