I have the following data container which is constantly being updated:
data = []
for val, track_id in zip(values, list(track_ids)):
    # below
    if val < threshold:
        # structure data as dictionary
        pre_data = {"artist": sp.track(track_id)['artists'][0]['name'], "track": sp.track(track_id)['name'], "feature": filter_name, "value": val}
        data.append(pre_data)
# write to file
with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
but I am getting a lot of errors like this:
json.dump(data,f, ensure_ascii=False, indent=4, sort_keys=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)
Is there a way I can get rid of this encoding problem once and for all?
I was told that this would do it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but many people do not recommend it.
I use Python 2.7.10. Any clues?
When you write to a file that was opened in text mode, Python encodes the string for you. The default encoding is ascii, which generates the error you see; there are a lot of characters that can't be encoded to ASCII.
The solution is to open the file in a different encoding. In Python 2 you must use the codecs module, in Python 3 you can add the encoding= parameter directly to open. utf-8 is a popular choice since it can handle all of the Unicode characters, and for JSON specifically it's the standard; see https://en.wikipedia.org/wiki/JSON#Data_portability_issues.
import codecs

with codecs.open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
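On Python 3 the codecs module isn't needed, since open accepts encoding directly. A minimal sketch, with placeholder data and an illustrative filename:

```python
import json

# Placeholder data standing in for the question's list of dicts.
data = [{"artist": "Björk", "track": "Jóga", "feature": "valence", "value": 0.3}]

# Python 3's open() takes encoding= directly, so no codecs.open is required.
with open('example.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
```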
Your object has unicode strings, and Python 2.x's support for Unicode can be a bit spotty. First, let's make a short example that demonstrates the problem:
>>> obj = {"artist":u"Björk"}
>>> import json
>>> with open('deleteme', 'w') as f:
...     json.dump(obj, f, ensure_ascii=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
From the json.dump help text:
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
Ah! There is the solution. Either use the default ensure_ascii=True and get ASCII-escaped Unicode characters, or use the codecs module to open the file with the encoding you want. This works:
>>> import codecs
>>> with codecs.open('deleteme', 'w', encoding='utf-8') as f:
...     json.dump(obj, f, ensure_ascii=False)
...
>>>
Why not encode the specific string instead? Try the .encode('utf-8') method on the string that raises the exception.
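As a sketch of that idea (with illustrative data and filename): serialize to a string first, encode it yourself, and write the bytes in binary mode so the ASCII default never comes into play.

```python
import json

obj = {"artist": u"Björk"}

# Serialize to a text string, encode it to UTF-8 explicitly, and write
# in binary mode -- no implicit ASCII encoding happens anywhere.
payload = json.dumps(obj, ensure_ascii=False).encode('utf-8')
with open('deleteme.json', 'wb') as f:
    f.write(payload)
```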
Related
Could you tell me where I'm going wrong with my current way of thinking? This is my function:
def replace_line(file_name, line_num, text):
    lines = open(f"realfolder/files/{item}.html", "r").readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
This is an example of it being called:
replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')
I need to encode the text input as utf-8. I'm not sure why I haven't been able to do this already. I also need to retain the f-string value.
I've been doing things like adding:
str.encode(text)
#or
text.encode(encoding = 'utf-8')
to the top of my replace_line function. This hasn't worked; I have tried dozens of different methods, but each leaves me with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2982: character maps to <undefined>
You need to set the encoding to utf-8 for both opening the file to read from
lines = open(f"realfolder/files/{item}.html", "r", encoding="utf-8").readlines()
and opening the file to write to
out = open(file_name, 'w', encoding="utf-8")
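Putting both changes together, the function might look like this. (Note that the original reads from a hard-coded path instead of its file_name parameter, which looks like a separate bug; this sketch uses the parameter throughout.)

```python
def replace_line(file_name, line_num, text):
    # Read and write with an explicit encoding so the platform default
    # (e.g. cp1252 on Windows) is never used.
    with open(file_name, "r", encoding="utf-8") as f:
        lines = f.readlines()
    lines[line_num] = text
    with open(file_name, "w", encoding="utf-8") as f:
        f.writelines(lines)
```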
I am trying to load a utf-8 encoded json file using python's json module. The file contains several right quotation marks, encoded as E2 80 9D. When I call
json.load(f, encoding='utf-8')
I receive the message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 212068: character maps to <undefined>
How can I convince the json module to decode this properly?
EDIT: Here's a minimal example:
[
    {
        "aQuote": "“A quote”"
    }
]
json.load takes no encoding argument; the encoding is a property of the file object, so pass it to open instead. The solution should be simply:
with open(filename, encoding='utf-8') as f:
    x = json.load(f)
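An end-to-end sketch, first writing the question's minimal sample to an illustrative filename and then reading it back:

```python
import json

# The question's minimal example, containing curly quotes (U+201C/U+201D).
sample = '[\n  {\n    "aQuote": "\u201cA quote\u201d"\n  }\n]'

with open('sample.json', 'w', encoding='utf-8') as f:
    f.write(sample)

# The encoding belongs to open(), not to json.load.
with open('sample.json', encoding='utf-8') as f:
    x = json.load(f)

print(x[0]['aQuote'])  # -> “A quote”
```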
I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried decoding with utf-16 and utf-8:
content.decode('utf-16')
and getting error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or my approach is wrong.
Edit: a screenshot was requested.
The string is encoded as UTF-16-BE (big-endian); this works:
content.decode("utf-16-be")
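A small demonstration with a hand-encoded byte string shows the difference:

```python
# "hi" encoded as UTF-16 big-endian, with no byte-order mark.
raw = b'\x00h\x00i'

# Without a BOM the generic 'utf-16' codec assumes little-endian and
# yields the wrong characters; the explicit '-be' variant decodes correctly.
print(raw.decode('utf-16-be'))  # -> hi
```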
As I understand it, you are using Python 2.x, and the encoding parameter of open was only added in Python 3. On Python 2 you can use io.open instead, for example:

import io

file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
    text = f.read()
Your string needs to be decoded with the utf-8 codec; you can pass the encoding when opening the file:

f = open('text.ucs', 'r', encoding='utf-8')
print(f.read())
I'm trying to write HTML into a text file and running into problems. At first I couldn't even print it, but setting up a translation map helped:
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
(found on this site).
It is annoying that printing and writing text can be 'different deals' with Python; here is code that fails:
#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: writing tau = 'τ'
#--------*---------*---------*---------*---------*---------*---------*---------*
import os, sys

while True:
    tau = 'τ'
    os.chdir("C:\\Users\\Mike\\Desktop")
    dd = open('tmp.txt', 'w')
    dd.write(tau + '\n')
    dd.close()
    sys.exit()
OUTPUT:
Traceback (most recent call last):
  File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\board.py", line 11, in <module>
    dd.write(tau + '\n')
  File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03c4' in position 0: character maps to <undefined>
You’ll need to specify an encoding when opening the file. UTF-8 is pretty much always the right choice here.
import os

tau = "τ"
os.chdir(r"C:\Users\Mike\Desktop")
with open("tmp.txt", "w", encoding='utf-8') as dd:
    print(tau, file=dd)
This is something a lot of people forget when using open, but it’s important.
It sounds like you need to specify an encoding when you open your file. From the open documentation:
The default encoding is platform dependent (whatever locale.getpreferredencoding() returns).
If your default system encoding doesn't include the τ character, then you will get this error. Try specifying UTF-8. Assuming you are on Python 3:
dd = open('tmp.txt', 'w', encoding = 'utf-8')
I am just trying to load this JSON file (with non-ASCII characters) as a Python dictionary, but I still get this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content:

"tooltip": {
    "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
}
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8','replace')))
You have several problems, as near as I can tell. First is the file encoding. When you open a file without specifying an encoding, it is opened with whatever locale.getpreferredencoding() returns. Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Because of your error message, I suspect that the file was opened with an ASCII encoding.
Next, the file is decoded from utf-8 into Python strings as it is read by the file object. Each utf-8 line has already been decoded to a string and is ready for json to read. When you do line.encode('utf-8','replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") can't handle.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid JSON on its own, but it does look like one line of a pretty-printed JSON file containing a single JSON object. My guess is that you should read the entire file as one JSON object.
Putting it all together you get:
import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)
From its name, it's possible that this file is encoded as a Windows Portuguese code page. If so, the "cp860" encoding may work better.
I had the same problem; what worked for me was creating a regular expression and parsing every line from the json file:

import re

REGEXP = r"[^A-Za-z0-9':.;\-?!]+"
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
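Be aware that this doesn't fix the decoding; it silently discards the accented characters. A quick check with a line like the one from the question:

```python
import re

REGEXP = "[^A-Za-z0-9':.;\\-?!]+"
line = '"navbar": "Operações de grupo"'

# Quotes, spaces, and every non-ASCII letter are collapsed to single spaces.
print(re.sub(REGEXP, ' ', line).strip())  # -> navbar : Opera es de grupo
```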
Having a file with content similar to yours, I can read it in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.