I input some Cyrillic text from the console, and when I try to dump it to JSON I get exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte. I can't figure out why, because it doesn't always happen and the text is always Cyrillic.
Here's the part of the code where I input text:
item['title'] = raw_input('Title: ')
item['description'] = raw_input('Description: ')
And here's the line where I dump the dictionary to json:
line = json.dumps(dict(item), encoding='utf8') + "\n"
The item is not a dictionary, it's an object, so I need to convert it to dictionary first.
Here's the full traceback:
Traceback (most recent call last):
File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
return process_chain(self.methods[methodname], obj, *args)
File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
d.callback(input)
File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
self._startRunCallbacks(result)
File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/dmitry/Dropbox/coding/python/scrapy/videos_parser/videos_parser/pipelines.py", line 94, in process_item
line = json.dumps(dict(item), encoding='utf8') + "\n"
File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python2.7/json/encoder.py", line 233, in _encoder
o = o.decode(_encoding)
File "/home/dmitry/.virtualenvs/test_scrapy/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 15: invalid continuation byte
sys.getdefaultencoding() says I'm using ascii. I've tried to change it to utf8 with sys.setdefaultencoding('utf8'), but it didn't work.
UPDATE
Here's the code I use to see how strings looked before decoding:
# Read raw bytes from the console with raw_input(), then decode them using
# the terminal's reported encoding; on failure, dump the raw byte string so
# the offending bytes can be inspected.  (Python 2 code: raw_input and the
# print statement do not exist in Python 3.)
try:
    item['title'] = raw_input('Title: ')
    # raw_input returns a byte string; decode with the terminal's encoding
    item['title'] = item['title'].decode(sys.stdin.encoding)
except UnicodeDecodeError:
    # Show the undecodable bytes verbatim for debugging
    print repr(item['title'])
try:
    item['description'] = raw_input('Description: ')
    item['description'] = item['description'].decode(sys.stdin.encoding)
except UnicodeDecodeError:
    print repr(item['description'])
And here's the output from the console:
Title: На работе платят бабло, но работать надо на ней
'\xd0\x9d\xd0\xb0 \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xd0\xb5 \xd0\xbf\xd0\xbb\xd0\xb0\xd1\x82\xd1\x8f\xd1\x82 \xd0\xb1\xd0\xb0\xd0\xb1\xd0\xbb\xd0\xbe, \xd0\xbd\xd0\xd0\xbe \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbd\xd0\xb0\xd0\xb4\xd0\xbe \xd0\xbd\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb9'
Description: Я не против первого, но без второго мне веселей
'\xd0\xaf \xd0\xbd\xd0\xb5 \xd0\xbf\xd1\x80\xd0\xbe\xd1\x82\xd0\xb8\xd0\xb2 \xd0\xbf\xd0\xb5\xd1\x80\xd0\xb2\xd0\xbe\xd0\xb3\xd0\xbe \xd0, \xd0\xbd\xd0\xbe \xd0\xb1\xd0\xb5\xd0\xb7 \xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xbc\xd0\xbd\xd0\xb5 \xd0\xb2\xd0\xb5\xd1\x81\xd0\xb5\xd0\xbb\xd0\xb5\xd0\xb9'
Your terminal appears to be botching UTF-8 input; extra \xd0 bytes have been inserted:
>>> import difflib
>>> given = '\xd0\x9d\xd0\xb0 \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xd0\xb5 \xd0\xbf\xd0\xbb\xd0\xb0\xd1\x82\xd1\x8f\xd1\x82 \xd0\xb1\xd0\xb0\xd0\xb1\xd0\xbb\xd0\xbe, \xd0\xbd\xd0\xd0\xbe \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbd\xd0\xb0\xd0\xb4\xd0\xbe \xd0\xbd\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb9'
>>> expected = 'На работе платят бабло, но работать надо на ней' # requires UTF-8 terminal
>>> for opcode in difflib.SequenceMatcher(a=expected, b=given).get_opcodes():
... print "%6s a[%d:%d] b[%d:%d]" % opcode
... if opcode[0] == 'insert': print 'Inserted:', repr(given[opcode[3]:opcode[4]])
...
equal a[0:15] b[0:15]
insert a[15:15] b[15:16]
Inserted: '\xd0'
equal a[15:45] b[16:46]
insert a[45:45] b[46:47]
Inserted: '\xd0'
equal a[45:85] b[47:87]
>>> expected[14:17]
'\x82\xd0\xb5'
>>> given[14:18]
'\x82\xd0\xd0\xb5'
>>> expected[44:47]
'\xbd\xd0\xbe'
>>> given[45:49]
'\xbd\xd0\xd0\xbe'
>>> given = '\xd0\xaf \xd0\xbd\xd0\xb5 \xd0\xbf\xd1\x80\xd0\xbe\xd1\x82\xd0\xb8\xd0\xb2 \xd0\xbf\xd0\xb5\xd1\x80\xd0\xb2\xd0\xbe\xd0\xb3\xd0\xbe \xd0, \xd0\xbd\xd0\xbe \xd0\xb1\xd0\xb5\xd0\xb7 \xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xbc\xd0\xbd\xd0\xb5 \xd0\xb2\xd0\xb5\xd1\x81\xd0\xb5\xd0\xbb\xd0\xb5\xd0\xb9'
>>> expected = 'Я не против первого, но без второго мне веселей' # requires UTF-8 terminal
>>> for opcode in difflib.SequenceMatcher(a=expected, b=given).get_opcodes():
... print "%6s a[%d:%d] b[%d:%d]" % opcode
... if opcode[0] == 'insert': print 'Inserted:', repr(given[opcode[3]:opcode[4]])
...
equal a[0:35] b[0:35]
insert a[35:35] b[35:37]
Inserted: ' \xd0'
equal a[35:85] b[37:87]
>>> expected[34:38]
'\xbe, \xd0'
>>> given[34:40]
'\xbe \xd0, \xd0'
In the title, two extra \xd0 bytes were inserted where there already was a \xd0 byte. In the description, a space and \xd0 was inserted before a comma, space, \xd0 sequence.
This is a failure of your terminal, not Python. Why this happens is not clear, however.
If you use plain raw_input() you get only bytes:
>>> raw_input('Input: ')
Input: фыв
'\xd1\x84\xd1\x8b\xd0\xb2'
Use unicode() to convert input string to unicode
>>> unicode(raw_input('Input: '), encoding='utf-8')
Input: фыв
u'\u0444\u044b\u0432'
and then try working with json
Related
def process_file(self):
    """Read the log file named by ``self.file_name`` and return its lines.

    The file is opened with an explicit UTF-8 encoding and undecodable
    bytes replaced, instead of relying on the platform default text
    encoding.  On Windows the default is cp1252, which raises
    UnicodeDecodeError on bytes such as 0x90 (the crash in the quoted
    traceback).

    Returns:
        list[str]: all lines of the file, newline characters preserved.
    """
    error_flag = 0   # NOTE(review): unused in the visible snippet; kept
    line_count = 0   # for compatibility with any truncated continuation
    log_file = self.file_name
    # Strip stray whitespace that would otherwise break the open() call
    pure_name = log_file.strip()
    # Explicit encoding + errors='replace' prevents the cp1252 decode
    # crash; the context manager guarantees the handle is closed even if
    # reading fails.
    with open(pure_name, 'r', encoding='utf-8', errors='replace') as logfile_in:
        lines = logfile_in.readlines()
    return lines
Error Message
Traceback (most recent call last):
File "C:\Users\admin\PycharmProjects\BackupLogCheck\main.py", line 49, in <module>
backupLogs.process_file()
File "C:\Users\admin\PycharmProjects\BackupLogCheck\main.py", line 20, in process_file
lines = logfile_in.readlines()
File "C:\Users\admin\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 350: character maps to <undefined>
Process finished with exit code 1
Line 49 is where I call above method. But I have traced that it crashes at reading the file. I have checked the file; it has just text in it. I don't know if there are some characters which it doesn't like on reading entries. I am running on Windows 10.
I am new to Python, any suggestion how to find/correct the issue?
Try the file name in string format
# NOTE(review): 'pure_name' here is a literal file name, not the pure_name
# variable from the question -- presumably a placeholder; verify intent.
logfile_in = open('pure_name', 'r')  # Read file
lines = logfile_in.readlines()
# Print the whole list of lines at once (repr form, with \n visible)
print(lines)
output
['test line one\n', 'test line two']
or
# NOTE(review): 'pure_name' is again a literal placeholder file name.
logfile_in = open('pure_name', 'r')  # Read file
lines = logfile_in.readlines()
# Echo the file line by line instead of printing the raw list
for line in lines:
    print(line)
output
test line one
test line two
this is my code:
# Draw the classifier's decision boundary to F:/test.png, then re-read the
# saved PNG bytes and emit them as a base64/JSON payload.
# NOTE(review): prettyPicture, clf, features_test, labels_test are defined
# elsewhere in the project.
prettyPicture(clf, features_test, labels_test)
output_image("F:/test.png", "png", open("F:/test.png", "rb").read())
def output_image(name, format, bytes):
    """Print image data wrapped in sentinel markers as a JSON payload.

    Args:
        name: file name to record in the payload.
        format: image format label (e.g. "png").
        bytes: raw image bytes to embed (base64-encoded).  NOTE: the
            parameter name shadows the builtin ``bytes``; kept for
            interface compatibility.
    """
    image_start = "BEGIN_IMAGE_f9825uweof8jw9fj4r8"
    image_end = "END_IMAGE_0238jfw08fjsiufhw8frs"
    data = {}
    data['name'] = name
    data['format'] = format
    # base64.encodebytes() returns a bytes object, which json.dumps cannot
    # serialize (the TypeError in the traceback below); decode the ASCII
    # base64 text to str first.  encodebytes replaces encodestring, which
    # was removed in Python 3.9.
    data['bytes'] = base64.encodebytes(bytes).decode('ascii')
    print(image_start + json.dumps(data) + image_end)
this is the error:
Traceback (most recent call last):
File "studentMain.py", line 41, in <module>
output_image("F:/test.png", "png", open("F:/test.png", "rb").read())
File "F:\Demo\class_vis.py", line 69, in output_image
print(image_start + json.dumps(data) + image_end)
File "C:\Users\Tony\AppData\Local\Programs\Python\Python36-
32\lib\json\__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "C:\Users\Tony\AppData\Local\Programs\Python\Python36-
32\lib\json\encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Users\Tony\AppData\Local\Programs\Python\Python36-
32\lib\json\encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "C:\Users\Tony\AppData\Local\Programs\Python\Python36-
32\lib\json\encoder.py", line 180, in default
o.__class__.__name__)
TypeError: Object of type 'bytes' is not JSON serializable
The issue here is that base64.encodestring() returns a bytes object, not a string.
Try:
data['bytes'] = base64.encodestring(bytes).decode('ascii')
Check out this question and answer for a good explanation of why this is:
Why does base64.b64encode() return a bytes object?
Also see: How to encode bytes in JSON? json.dumps() throwing a TypeError
You're only missing one aspect here: When you use .encodestring, you have a bytes object as result, and bytes are not json serializable in python 3.
You can solve it just encoding your data["bytes"]:
data['bytes'] = base64.encodestring(bytes).decode("utf-8")
I'm assuming you'll always receive a bytes object at the "bytes" variable, otherwise you should add a checker for the type of the object, and not encoding when it's already a string.
This question already has answers here:
Unsupported operation :not writeable python
(2 answers)
Closed 5 years ago.
I have strings as follows in my python list (taken from command prompt):
>>> o['records'][5790]
(5790, 'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ', 60,
True, '40141613')
>>>
I have tried suggestions as mentioned here: Changing default encoding of Python?
Further, I changed the default encoding to utf-16 too. But still json.dumps() threw an exception, as follows:
>>> write(o)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "okapi_create_master.py", line 49, in write
o = json.dumps(output)
File "C:\Python27\lib\json\__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 201, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Python27\lib\json\encoder.py", line 264, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 25: invalid
continuation byte
Can't figure what kind of transformation is required for such strings so that json.dumps() works.
\xe1 is not decodable using utf-8, utf-16 encoding.
>>> '\xe1'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>>> '\xe1'.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xe1 in position 0: truncated data
Try latin-1 encoding:
>>> record = (5790, 'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ',
... 60, True, '40141613')
>>> json.dumps(record, encoding='latin1')
'[5790, "Vlv-Gate-Assy-Mdl-\\u00e1M1-2-\\u00e19/16-10K-BB Credit Memo ", 60, true, "40141613"]'
Or, pass ensure_ascii=False to json.dumps to make it not try to decode the string.
>>> json.dumps(record, ensure_ascii=False)
'[5790, "Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ", 60, true, "40141613"]'
I had a similar problem, and came up with the following approach to either guarantee unicodes or byte strings, from either input. In short, include and use the following lambdas:
# NOTE(review): Python 2 only -- here `str` is the byte-string type and
# `unicode` does not exist in Python 3.
# guarantee unicode string: decode byte strings, pass unicode through
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
# apply _u to every positional argument, returning a tuple
_uu = lambda *tt: tuple(_u(t) for t in tt)
# guarantee byte string in UTF8 encoding: encode unicode, pass bytes through
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
# apply _u8 to every positional argument, returning a tuple
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)
Applied to your question:
import json
# Force every element to a UTF-8 byte string before serialising, so
# json.dumps does not trip over the ASCII default codec; round-trip the
# result through loads to show the values survive.
o = (5790, u"Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ", 60,
     True, '40141613')
as_json = json.dumps(_uu8(*o))
as_obj = json.loads(as_json)
# Python 2 print statements (not valid Python 3 syntax)
print "object\n ", o
print "json (type %s)\n %s " % (type(as_json), as_json)
print "object again\n ", as_obj
=>
object
(5790, u'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ', 60, True, '40141613')
json (type <type 'str'>)
[5790, "Vlv-Gate-Assy-Mdl-\u00e1M1-2-\u00e19/16-10K-BB Credit Memo ", 60, true, "40141613"]
object again
[5790, u'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ', 60, True, u'40141613']
Here's some more reasoning about this.
This question already has answers here:
Unsupported operation :not writeable python
(2 answers)
Closed 5 years ago.
I have strings as follows in my python list (taken from command prompt):
>>> o['records'][5790]
(5790, 'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ', 60,
True, '40141613')
>>>
I have tried suggestions as mentioned here: Changing default encoding of Python?
Further, I changed the default encoding to utf-16 too. But still json.dumps() threw an exception, as follows:
>>> write(o)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "okapi_create_master.py", line 49, in write
o = json.dumps(output)
File "C:\Python27\lib\json\__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 201, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Python27\lib\json\encoder.py", line 264, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 25: invalid
continuation byte
Can't figure what kind of transformation is required for such strings so that json.dumps() works.
\xe1 is not decodable using utf-8, utf-16 encoding.
>>> '\xe1'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>>> '\xe1'.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xe1 in position 0: truncated data
Try latin-1 encoding:
>>> record = (5790, 'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ',
... 60, True, '40141613')
>>> json.dumps(record, encoding='latin1')
'[5790, "Vlv-Gate-Assy-Mdl-\\u00e1M1-2-\\u00e19/16-10K-BB Credit Memo ", 60, true, "40141613"]'
Or, pass ensure_ascii=False to json.dumps to make it not try to decode the string.
>>> json.dumps(record, ensure_ascii=False)
'[5790, "Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ", 60, true, "40141613"]'
I had a similar problem, and came up with the following approach to either guarantee unicodes or byte strings, from either input. In short, include and use the following lambdas:
# NOTE(review): Python 2 only -- here `str` is the byte-string type and
# `unicode` does not exist in Python 3.
# guarantee unicode string: decode byte strings, pass unicode through
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
# apply _u to every positional argument, returning a tuple
_uu = lambda *tt: tuple(_u(t) for t in tt)
# guarantee byte string in UTF8 encoding: encode unicode, pass bytes through
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
# apply _u8 to every positional argument, returning a tuple
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)
Applied to your question:
import json
# Force every element to a UTF-8 byte string before serialising, so
# json.dumps does not trip over the ASCII default codec; round-trip the
# result through loads to show the values survive.
o = (5790, u"Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ", 60,
     True, '40141613')
as_json = json.dumps(_uu8(*o))
as_obj = json.loads(as_json)
# Python 2 print statements (not valid Python 3 syntax)
print "object\n ", o
print "json (type %s)\n %s " % (type(as_json), as_json)
print "object again\n ", as_obj
=>
object
(5790, u'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ', 60, True, '40141613')
json (type <type 'str'>)
[5790, "Vlv-Gate-Assy-Mdl-\u00e1M1-2-\u00e19/16-10K-BB Credit Memo ", 60, true, "40141613"]
object again
[5790, u'Vlv-Gate-Assy-Mdl-\xe1M1-2-\xe19/16-10K-BB Credit Memo ', 60, True, u'40141613']
Here's some more reasoning about this.
I'm using the IMAPClient library, but I'm getting UnicodeEncodeError when doing a search. Below is a snippet and the stack trace:
# Connect over SSL and search the Sent folder with non-ASCII criteria.
imap_client = imapclient.IMAPClient('imap.gmail.com', use_uid=True, ssl=True)
imap_client.oauth2_login('john#example.com', 'xxx')

subject = u'Test \u0153\u2211\u00b4\u00e5\u00df\u2202'
from_email = u'john#example.com'
to_emails = [u'foo#example.com']
cc_emails = []
approx_date_sent = '05-Aug-2013'

imap_client.select_folder(r'\Sent')
# Build each criterion as unicode FIRST, then encode it to UTF-8 bytes.
# The original encoded the format string before %-interpolation
# (u'SUBJECT %s'.encode('utf-8') % subject), which yields a unicode result
# again, and imaplib's socket write then falls back to ASCII -- the
# UnicodeEncodeError in the traceback below.  The join results are also
# parenthesised with `or ''` so the fallback applies to the joined value,
# not (vacuously) to the already-truthy formatted string.
search_criteria = [
    (u'FROM %s' % from_email).encode('utf-8'),
    (u'SUBJECT %s' % subject).encode('utf-8'),
    (u'TO %s' % (';'.join(to_emails) or '')).encode('utf-8'),
    (u'CC %s' % (';'.join(cc_emails) or '')).encode('utf-8'),
    (u'SENTON %s' % approx_date_sent).encode('utf-8'),
]
msg_ids = imap_client.search(search_criteria, charset='utf-8')
'ascii' codec can't encode characters in position 77-82: ordinal not in range(128)
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~app/dev.369284735686497536/imapapi.py", line 269, in post
to_emails=to_emails, cc_emails=cc_emails, approx_date_sent=approx_date_sent
File "/base/data/home/apps/s~app/dev.369284735686497536/utils/imap.py", line 123, in search_message
msg_ids = imap_client.search(search_criteria, charset='utf-8')
File "/base/data/home/apps/s~app/dev.369284735686497536/imapclient/imapclient.py", line 569, in search
typ, data = self._imap.search(charset, *criteria)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/imaplib.py", line 625, in search
typ, dat = self._simple_command(name, 'CHARSET', charset, *criteria)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/imaplib.py", line 1070, in _simple_command
return self._command_complete(name, self._command(name, *args))
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/imaplib.py", line 857, in _command
self.send('%s%s' % (data, CRLF))
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/imaplib.py", line 1178, in send
sent = self.sslobj.write(data)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/ssl.py", line 232, in write
return self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 77-82: ordinal not in range(128)
It seems the issue happens in ssl.py? And ssl=True is needed for oauth2_login.
Try this out
Python IMAP search using a subject encoded with iso-8859-1
It covers utf-8 as well as iso-8859-1