UnicodeDecodeError: 'utf-8' codec can't decode byte error - python

I'm trying to get a response from urllib and decode it to a readable format. The text is in Hebrew and also contains characters like { and /.
The coding declaration at the top of my source file is:
# -*- coding: utf-8 -*-
The raw response is:
b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
Now I'm trying to decode it using:
data = data.decode()
and I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Your problem is that the data is not UTF-8. The leading bytes \xff\xfe are a UTF-16 (little-endian) byte order mark, so you have UTF-16 encoded data; decode it as such:
>>> data = b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
>>> data.decode('utf16')
'{ \r\n"id" : "1404830064696",\r\n"title" : "פיקוד העורף התרעה במרחב ",\r\n"data" : []\r\n}\r\n\r\n'
>>> import json
>>> json.loads(data.decode('utf16'))
{'title': 'פיקוד העורף התרעה במרחב ', 'id': '1404830064696', 'data': []}
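The leading bytes \xff\xfe in the data above are a byte order mark (BOM). As a rough sketch of how one might check for a BOM before choosing a codec, the helper below (sniff_bom is a hypothetical name, not part of any library) compares the start of the data against the BOM constants in the standard codecs module. It only detects a BOM; data without one returns None and needs another way to determine the encoding.

```python
import codecs

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    # UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
    # (ff fe 00 00) begins with the UTF-16 LE BOM (ff fe).
    candidates = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for bom, codec_name in candidates:
        if data.startswith(bom):
            return codec_name
    return None

sample = b'\xff\xfe{\x00}\x00'       # UTF-16 LE BOM followed by "{}"
codec_name = sniff_bom(sample)       # codec chosen from the BOM
decoded = sample.decode(codec_name)  # the 'utf-16' codec consumes the BOM
```

The 'utf-16' and 'utf-32' codecs read the BOM themselves to pick the byte order, so the helper does not need to distinguish little-endian from big-endian.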
If you loaded this from a website with urllib.request, the Content-Type header should contain a charset parameter telling you this; if response is the returned urllib.request response object, then use:
codec = response.info().get_content_charset('utf-8')
This defaults to UTF-8 when no charset parameter has been set, which is the appropriate default for JSON data.
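A minimal sketch of putting that charset to use, assuming the headers object behaves like the one returned by response.info() (an email.message.Message subclass). The function name decode_json_body and the sample header values are made up for illustration; no network request is made here.

```python
import json
from email.message import Message

def decode_json_body(body, headers):
    """Decode body with the charset from Content-Type, defaulting to UTF-8."""
    charset = headers.get_content_charset('utf-8')
    return json.loads(body.decode(charset))

# Simulated response headers and body (assumed values, for illustration).
headers = Message()
headers['Content-Type'] = 'application/json; charset=utf-16'
body = '{"title": "שלום"}'.encode('utf-16')

parsed = decode_json_body(body, headers)

# With no charset parameter, the UTF-8 default is used.
plain_headers = Message()
plain_headers['Content-Type'] = 'application/json'
parsed_default = decode_json_body(b'{"id": "1"}', plain_headers)
```

With a real response this would be called as decode_json_body(response.read(), response.info()).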
Alternatively, use the requests library to load the JSON response; it handles decoding automatically (including UTF-codec autodetection specific to JSON responses).
One further note: the PEP 263 source code codec comment is used only to interpret your source code, including string literals. It has nothing to do with encodings of external sources (files, network data, etc.).

I got this error in Django with Python 3.4. I was trying to get this to work with django-rest-framework.
The code below fixed the UnicodeDecodeError: 'utf-8' codec can't decode byte error for me. Note that the image is opened in binary mode ('rb') and the open file object itself is posted.
This is the passing test:
import os
from os.path import join, dirname
import uuid
from rest_framework.test import APITestCase

class AttachmentTests(APITestCase):
    def setUp(self):
        self.base_dir = dirname(dirname(dirname(__file__)))
        self.image = join(self.base_dir, "source/test_in/aaron.jpeg")
        self.image_filename = os.path.split(self.image)[1]

    def test_create_image(self):
        id = str(uuid.uuid4())
        with open(self.image, 'rb') as data:
            # data = data.read()
            post_data = {
                'id': id,
                'filename': self.image_filename,
                'file': data
            }
            response = self.client.post("/api/admin/attachments/", post_data)
            self.assertEqual(response.status_code, 201)
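The answer does not say what triggered the error, so the following is only an assumption: one common cause is reading a binary file such as a JPEG in text mode. The sketch below uses only the standard library (no Django) and a made-up temporary file that starts with the JPEG marker bytes to show the difference between text mode and binary mode.

```python
import os
import tempfile

# Create a small stand-in file that begins with JPEG marker bytes.
path = os.path.join(tempfile.mkdtemp(), 'sample.jpeg')
with open(path, 'wb') as f:
    f.write(b'\xff\xd8\xff\xe0' + b'\x00' * 8)

# Text mode tries to decode the bytes and fails on the first one (0xff).
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode returns the raw bytes untouched.
with open(path, 'rb') as f:
    binary_data = f.read()
```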

Related

How to handle the network message with unicode that is not decodeable to utf-8

I receive the following byte message over a socket connection and want to convert it to a string for further processing. I am using Python 3.7.
Below is the code I have tried so far:
import codecs
a = b'0400F224648188E0801200000040000000001941678904000010237890000000000000222220418151856038556051259950760020806002468060046010403319 HSBCBSB8001101234567890MC 100 WITH ORDERIN FO AU009006Q\x00\x00\x00\x83\x00007\xa0\x00\x00\x00\x00%\x02010003855604181518562468000000000460100000'
b= codecs.decode(a, 'utf-8')
print(b)
I am getting the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 208: invalid start byte
How can I convert the data to a string and process it further?
Thanks in advance.
Your data is not UTF-8 encoded. You can use BeautifulSoup to guess an unknown encoding (shown here with the bs4 package, because the older BeautifulSoup 3 import does not run on Python 3):
from bs4 import BeautifulSoup
soup = BeautifulSoup(b'0400F224648188E0801200000040000000001941678904000010237890000000000000222220418151856038556051259950760020806002468060046010403319 HSBCBSB8001101234567890MC 100 WITH ORDERIN FO AU009006Q\x00\x00\x00\x83\x00007\xa0\x00\x00\x00\x00%\x02010003855604181518562468000000000460100000',
    'html.parser')
print(soup.contents[0])
print(soup.original_encoding)
to get
0400F224648188E0801200000040000 ... # etc
and
windows-1252
You can also use the bs4 detector separately: UnicodeDammit. You can give it suggestions for which encodings to try first, or which not to try, to fine-tune it.
More info on SO:
How to determine the encoding of text?
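The idea of suggesting which encodings to try first can be sketched with the standard library alone. This is not how UnicodeDammit works internally; it is a simplified stand-in, and the function name decode_with_candidates and the default candidate list are assumptions chosen for illustration. The sample bytes are a shortened excerpt of the message from the question.

```python
def decode_with_candidates(data, candidates=('utf-8', 'windows-1252', 'latin-1')):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')

# Shortened excerpt of the socket message, including the bytes 0x83 and 0xa0.
message = b'AU009006Q\x00\x00\x00\x83\x00007\xa0'
text, used = decode_with_candidates(message)
```

A successful decode only means the bytes are valid in that encoding, not that the result is meaningful text. If the message is a binary protocol frame, parsing the fields as bytes may be the better route.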

Can't decode correctly JSON URL from Python

I want to read some data from a JSON URL. However, with my code I don't get a JSON structure; instead I get an undecoded string.
I've also tried reading the data directly from the URL using pandas read_json, but I get this error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 136430: character maps to <undefined>.
This is the main code I'm using:
from urllib.request import urlopen
import json
response = urlopen("https://data.nasa.gov/resource/y77d-th95.json")
json_data = response.read().decode('utf8','replace')
And this is what I get in json_data:
'\x1f�\x08\x00\x00\x00\x00\x00\x00\x00̽K��ȕ&��_ልR����\n
i�MOfd*3+#��RI5]�F��:�:��%Ct�\x1b�ѨE�\x06�3jP��l\x06�\x02\n�A�\n
\x02��ƌ��<f4�c��Ũ^�P����^�����__u�S��o^}Y���\x15y�n��Gտ������\x0f�T��
\x1f�WCs\x7f��\x0f\x07�Gor��?��=�\x7f�m�߫\x7f��F�\x1f�ꥩ\x07�G�l�Q��\x7f�e
\x7f3��&˲��}}T\x7f%�6e�O\x7f�w\x0f�O�M&9��\x0f\x1f�~���Ƕ�^��
\x7f}u�I?�mwT��}�\x0f۶����%��"F�J��?���B\x00�aw:\x18�\x0cE�]1!,Yv\x13b�S
\x0cb��\x06�\x04�f\x1b�\x130\x1a9b�:0�ƀ,P��|\'&�4+Ͽ�\x16P�\x01\x15
\x1bF�����)��o���\x12�Z�\x18�\x0e����i\x7f�_wl�b5"��\x01�+*n#.
\x0b\x041-6r����TI�� 1J]Ļv��\x1b��8�7p\x0b���f�ʮ9ߨE�-m!6U�\x02t\x14$W�
\x0e���]��\x1f\'�ղ�,\x18�n��\x15���\r5�7�-�F�,��#Z�\x0b�î]\x7f�?l��
\x17�c�5���"��\r_��h`y\x05\x06X���m\\�Ӛ/\x01l�Q�\x00\x7f�\x1e\x1a^E\\���Y
膒T�#=7T�)h\x02\u038b\x18\x11�\x1b��To�\t�\\t\\i\x11zr8���9�\x14�-
�.�G�*HF�3���^}����+`AK\x1c0�+D\x00/��f\x1b鿟���)���E�
\x03��Oݪٯ�<\x0e��p��յ�b�����Wo���\x13|zC\x0fo�mk�b�\r�\x15�3��ˌ���Y�
\x18�\x0e����0!\x16��5\x13A�,ǀV�a8\x01�\x1b�b-^ĈQ�.�Ь\x0f�a���o^�
\x1a�(�?v�]�����,�\x0bY\x19��|;�\rO9�\x171b�:8\x1f���2�f���2�\u03a2�
\x0eS\x1bm~\x1b�|����Q�\x18�.Ȼzxw\x02�\x16��,�X��\x1d.q��\x1chY�\x19O�
\x1c1:]����aZO\x1e������̱���j�RnDʦ����N\x17���~�P����Cߑ��\x7f�rB�
\x07��Jb~\x1d�|x\x05��\x14k��\x11�ց�_��n�kH�\x07���ٺ�M\x11Z�\x12��b#<�
\x15\x1c�E��"G��\x19�\x7f���q7�0\'`P����,˰�U\x1ekQd�;��f,\x12\x0e�(G�N
\x17�������{:�,��p����r��\x1d�\x16|��\x11��ũn��xu�E���7�y�\x06\\�,
\x04�,"�\x16��$�t��l`#GF�3��p�.�<�H\x02k:Z\x18�o�\x14j\x01\x1b�M�Ģ�{�#l
\x06�e�l����ǿ\x0eu7!���7��A��OO��\x11,n�\x02�\x15�O�|#
\x19u�k��(O�"j1b��ط���p���""\\\x1ay5)[\x023�s�)=!\'��Blg9bԺ`�w�3[��\x04�
\x0e9\x0b!-9��\x16��\n^�rCS\x0c\n#G��\x19�á�C;��\x02��`p���m�
.
.
.
Any idea what I'm doing wrong?
Calling json.loads directly on the result of read() with the default decode gives me a valid list.
Please try this:
from urllib.request import urlopen
import json
response = urlopen("https://data.nasa.gov/resource/y77d-th95.json")
json_data = json.loads(response.read().decode())
print(json_data)
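The output quoted in the question starts with \x1f, an undecodable byte, and \x08, which is consistent with the gzip header bytes 1f 8b 08. If the server returned a gzip-compressed body in that case (an assumption, since the response headers are not shown), it would need to be decompressed before decoding. A minimal sketch, where load_json_body is a hypothetical helper and the sample record is made up:

```python
import gzip
import json

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream

def load_json_body(body, charset='utf-8'):
    """Decompress body if it looks like gzip, then decode and parse as JSON."""
    if body.startswith(GZIP_MAGIC):
        body = gzip.decompress(body)
    return json.loads(body.decode(charset))

# Made-up sample data standing in for the server response.
records = [{"name": "Aachen", "mass": "21"}]
plain_body = json.dumps(records).encode('utf-8')
gzip_body = gzip.compress(plain_body)

plain_result = load_json_body(plain_body)
gzip_result = load_json_body(gzip_body)
```

With urlopen this would be used as load_json_body(response.read()).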

sending zlib compress data to server using POST

I want to send zlib compressed data of file to server using POST request. Following is the code I am trying to use
orig = open('fileName', 'r').read()
comp = zlib.compress(orig, 9)
req = Request(url, comp)
urlopen(req)
But I get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 2: invalid start byte
I also tried comp.encode('utf-8'), but that doesn't work either; I get the same UnicodeDecodeError at some position. How can I resolve this?
The docs for urllib2.Request's data parameter state:
data should be a buffer in the standard application/x-www-form-urlencoded format.
You can encode your buffer using urllib.quote (Python 2):
>>> import zlib
>>> from urllib import quote
>>> from urllib2 import Request
>>> orig = 'aaaaabbbccddxddaaabb'
>>> comp = zlib.compress(orig, 9)
>>> comp
'x\xdaKL\x04\x82\xa4\xa4\xa4\xe4\xe4\x94\x94\x8a\x94\x140\x07\x00Q\x19\x07\xc1'
>>> quoted = quote(comp)
>>> quoted
'x%DAKL%04%82%A4%A4%A4%E4%E4%94%94%8A%94%140%07%00Q%19%07%C1'
>>> req = Request('http://example.com', quoted)
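The session above is Python 2. A rough Python 3 equivalent, assuming the same percent-encoding approach is wanted, could use urllib.parse.quote_from_bytes, since zlib.compress returns bytes there. The URL is a placeholder and no request is actually sent in this sketch.

```python
import zlib
from urllib.parse import quote_from_bytes, unquote_to_bytes
from urllib.request import Request

original = b'aaaaabbbccddxddaaabb'
compressed = zlib.compress(original, 9)

# Percent-encode the compressed bytes so the payload is plain ASCII.
quoted = quote_from_bytes(compressed)
payload = quoted.encode('ascii')

# Supplying data makes this a POST request; urlopen(req) would send it.
req = Request('http://example.com', data=payload)

# What a receiver would do to recover the original bytes.
restored = zlib.decompress(unquote_to_bytes(quoted))
```

In Python 3 the file should also be opened in binary mode ('rb') so that zlib.compress receives bytes.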

Why do I always get bytes data from the server when using the Python requests module?

I want to use the Python requests module to get data from a server, but I always get bytes data, even though I set headers={'content-type':'application/json;charset=utf-8'}.
My code:
import requests
from io import BytesIO
headers={'content-type':'application/json;charset=utf-8'}
#response=requests.get("https://api-dev.creams.io/buildings/2/contract- templates",headers=headers)
r = requests.get('https://developer.github.com/v3/timeline.json',headers=headers)
print(r.headers)
# response = urlopen("https://beta.creams.io/")
When I print the headers, the content-type is still text/html;charset-utf-8, and I always get bytes data. When I use r.text, I get an error: UnicodeEncodeError: 'ascii' codec can't encode character '\u2022' in position 382: ordinal not in range(128). When I use r.content, I always get bytes data (starting with b'). I just want a UTF-8 decoded string. How can I resolve this?
This should work just fine:
import requests as req
r = req.get('https://developer.github.com/v3/timeline.json')
print(r.text)
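The error in the question is a UnicodeEncodeError, which points at encoding a str for output rather than at decoding the response. The likely cause (an assumption, since the environment is not shown) is printing to a console whose encoding is ASCII. The sketch below uses only the standard library; the content variable stands in for r.content and text for r.text.

```python
# Stand-in for r.content: UTF-8 bytes containing a bullet character (U+2022).
content = 'bullet: \u2022'.encode('utf-8')

# Stand-in for r.text: the decoded str.
text = content.decode('utf-8')

# Encoding to ASCII fails, which is what print() runs into on an
# ASCII-only console.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# One workaround: escape the characters the console cannot represent.
safe = text.encode('ascii', errors='backslashreplace').decode('ascii')
```

Separately, a content-type header set on the request describes the request body. It does not change what the server sends back.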

Python 2.7: 'ascii' codec can't encode character u'\xe9' error while writing in file

I know this question has been asked many times, but somehow I am not getting results.
I am fetching data from the web that contains the string Elzéar. When I write it to a CSV file, I get the error mentioned in the question title.
While producing the data I did the following:
address = str(address).strip()
address = address.encode('utf8')
return name+','+address+','+city+','+state+','+phone+','+fax+','+pumps+','+parking+','+general+','+entertainment+','+fuel+','+resturants+','+services+','+technology+','+fuel_cards+','+credit_cards+','+permits+','+money_services+','+security+','+medical+','+longit+','+latit
and writing it as:
with open('records.csv', 'a') as csv_file:
    print(type(data)) #prints <unicode>
    data = data.encode('utf8')
    csv_file.write(id+','+data+'\n')
    status = 'OK'
    the_file.write(ts+'\t'+url+'\t'+status+'\n')
Generates error as:
'ascii' codec can't encode character u'\xe9' in position 55: ordinal
not in range(128)
You could try something like this (Python 2.7):
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
...
with codecs.open('records.csv', 'a', encoding="utf8") as csv_file:
    print(type(data)) #prints <unicode>
    # because data is unicode
    csv_file.write(unicode(id)+u','+data+u'\n')
    status = u'OK'
    the_file.write(unicode(ts, encoding="utf8")+u'\t'+unicode(url, encoding="utf8")+u'\t'+status+u'\n')
The main idea is to work with unicode objects as much as possible and convert to str only when writing output (it is better not to operate on str at all).
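For comparison, here is a sketch of the same principle in Python 3, where str is already Unicode and the encoding is declared once when the file is opened. The row values and the temporary file path are made up for illustration, and the csv module is used instead of joining fields with commas by hand.

```python
import csv
import os
import tempfile

# Made-up row containing non-ASCII text.
row = ['42', 'Elzéar', 'Québec']

path = os.path.join(tempfile.mkdtemp(), 'records.csv')

# Text stays as str in memory; encoding happens only at the file boundary.
# newline='' lets the csv module control line endings itself.
with open(path, 'a', encoding='utf-8', newline='') as csv_file:
    csv.writer(csv_file).writerow(row)

# Read the raw bytes back to see what was written.
with open(path, 'rb') as f:
    raw = f.read()
```

Using csv.writer also takes care of quoting fields that contain commas, which manual string concatenation does not.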
