I want to send zlib compressed data of file to server using POST request. Following is the code I am trying to use
orig = open('fileName', 'r').read()
comp = zlib.compress(orig, 9)
req = Request(url, comp)
urlopen(req)
But I get the following error UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 2: invalid start byte
I tried following comp.encode('utf-8') but this also doesn't work. I get the same UnicodeDecodeError at some position. How can I resolve my problem?
The docs for urllib2.Request's data parameter state:
data should be a buffer in the standard application/x-www-form-urlencoded format.
You can encode your buffer using urllib.quote:
>>> orig = 'aaaaabbbccddxddaaabb'
>>> comp = zlib.compress(orig, 9)
>>> comp
'x\xdaKL\x04\x82\xa4\xa4\xa4\xe4\xe4\x94\x94\x8a\x94\x140\x07\x00Q\x19\x07\xc1'
>>> quoted = quote(comp)
>>> quoted
'x%DAKL%04%82%A4%A4%A4%E4%E4%94%94%8A%94%140%07%00Q%19%07%C1'
>>> req = Request('http://example.com', quoted)
Related
I receive the following byte message via socket connection and I want to convert into string and do further processing I am using python3.7
below is the code i tried so far
import codecs
a = b'0400F224648188E0801200000040000000001941678904000010237890000000000000222220418151856038556051259950760020806002468060046010403319 HSBCBSB8001101234567890MC 100 WITH ORDERIN FO AU009006Q\x00\x00\x00\x83\x00007\xa0\x00\x00\x00\x00%\x02010003855604181518562468000000000460100000'
b= codecs.decode(a, 'utf-8')
print(b)
Iam getting the error as below
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position > 208: invalid start byte
how can I convert the data to string and process further
Thanks in advance
Your data is not utf-8 encoded. You can use BeautifulSoup to decode unknown encodings:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(b'0400F224648188E0801200000040000000001941678904000010237890000000000000222220418151856038556051259950760020806002468060046010403319 HSBCBSB8001101234567890MC 100 WITH ORDERIN FO AU009006Q\x00\x00\x00\x83\x00007\xa0\x00\x00\x00\x00%\x02010003855604181518562468000000000460100000'
)
print(soup.contents[0])
print(soup.originalEncoding)
to get
0400F224648188E0801200000040000 ... # etc
and
windows-1252
You can use the bs4-detector seperately as well: UnicodeDammit and also provide it with suggestions which encodings to try first / not to try to finetune it.
More info on SO:
How to determine the encoding of text?
I want to use python requests module to get data from server,but I always get bytes data,even if I had set headers={'content-type':'application/json;charset=utf-8'} .
My code:
import requests
from io import BytesIO
headers={'content-type':'application/json;charset=utf-8'}
#response=requests.get("https://api-dev.creams.io/buildings/2/contract- templates",headers=headers)
r = requests.get('https://developer.github.com/v3/timeline.json',headers=headers)
print(r.headers)
# response = urlopen("https://beta.creams.io/")
when I print headers,content-type still be text/html;charset-utf-8
and I always get bytes data. when I use r.text, I got an error:UnicodeEncodeError: 'ascii' codec can't encode character '\u2022' in position 382: ordinal not in range(128). And I used r.content method,I always get bytes data(start with b'),I just want to get utf-8 encoding string. How can I resolve it?
This should work just fine:
import requests as req
r = req.get('https://developer.github.com/v3/timeline.json')
print(r.text)
I'm trying to get a response from urllib and decode it
to a readable format. The text is in Hebrew and also contains characters like { and /
top page coding is:
# -*- coding: utf-8 -*-
raw string is:
b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
Now I'm trying to decode it using:
data = data.decode()
and I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Your problem is that that is not UTF-8. You have UTF-16 encoded data, decode it as such:
>>> data = b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
>>> data.decode('utf16')
'{ \r\n"id" : "1404830064696",\r\n"title" : "פיקוד העורף התרעה במרחב ",\r\n"data" : []\r\n}\r\n\r\n'
>>> import json
>>> json.loads(data.decode('utf16'))
{'title': 'פיקוד העורף התרעה במרחב ', 'id': '1404830064696', 'data': []}
If you loaded this from a website with urllib.request, the Content-Type header should contain a charset parameter telling you this; if response is the returned urllib.request response object, then use:
codec = response.info().get_content_charset('utf-8')
This defaults to UTF-8 when no charset parameter has been set, which is the appropriate default for JSON data.
Alternatively, use the requests library to load the JSON response, it handles decoding automatically (including UTF-codec autodetection specific to JSON responses).
One further note: the PEP 263 source code codec comment is used only to interpret your source code, including string literals. It has nothing to do with encodings of external sources (files, network data, etc.).
I got this error in Django with Python 3.4. I was trying to get this to work with django-rest-framework.
This was my code that fixed the error UnicodeDecodeError: 'utf-8' codec can't decode byte error.
This is the passing test:
import os
from os.path import join, dirname
import uuid
from rest_framework.test import APITestCase
class AttachmentTests(APITestCase):
def setUp(self):
self.base_dir = dirname(dirname(dirname(__file__)))
self.image = join(self.base_dir, "source/test_in/aaron.jpeg")
self.image_filename = os.path.split(self.image)[1]
def test_create_image(self):
id = str(uuid.uuid4())
with open(self.image, 'rb') as data:
# data = data.read()
post_data = {
'id': id,
'filename': self.image_filename,
'file': data
}
response = self.client.post("/api/admin/attachments/", post_data)
self.assertEqual(response.status_code, 201)
I am teaching myself how to parse google results with json, but when I run this code ( which shoud work ), I am getting this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2014' in position 5: character maps to <undefined>. Can someone help me?
import urllib
import simplejson
query = urllib.urlencode({'q' : 'site:example.com'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s&start=50' \
% (query)
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
results = json['responseData']['results']
for i in results:
print i['title'] + ": " + i['url']
This error may be caused by the encoding that your console application uses when sending unicode data to stdout. There's an article that talks about it.
Check stdout's encoding:
>>> import sys
>>> sys.stdout.encoding # On my machine I get this result:
'UTF-8'
Use unicode literals.
print i[u'title'] + u": " + i[u'url']
Also:
jsondata = simplejson.load(search_results)
My guess is that the error is in simplejson.loads(search_results.read()) line, possibly because the default encoding your python is picking up is not utf-8 and google is returning utf-8.
Try: simplejson.loads(unicode(search_results.read(), "utf8").
I just want to build simple UI translation built in GAE (using python SDK).
def add_translation(self, pid=None):
trans = Translation()
trans.tlang = db.Key("agtwaW1kZXNpZ25lcnITCxILQXBwTGFuZ3VhZ2UY8aIEDA")
trans.ttype = "UI"
trans.transid = "ui-about"
trans.content = "关于我们"
trans.put()
this is resulting encoding error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
How to encode the correct insert content with unicode(utf-8) character?
using the u notation:
>>> s=u"关于我们"
>>> print s
关于我们
Or explicitly, stating the encoding:
>>> s=unicode('אדם מתן', 'utf8')
>>> print s
אדם מתן
Read more at the Unicode HOWTO page in the python documentation site.