Here's a simple script; I'm confused about writing non-ASCII text. I want to write some characters to a file. As far as I know, Python's write() takes a plain str by default; for instance, chr(128) has type str, so I concluded that writing it to a file should be fine, since its type is str, not bytes, and open() defaults to text mode, not binary.
# Python 3
# v.1
print(type(chr(128)))  # the type is str
f = open('tep.txt', 'w')
f.write('\n')
for i in range(128, 1000):
    f.write(chr(i))
f.close()
The code above fails, but it works when rewritten like this:
# v.2
f = open('tep.txt', 'wb')
f.write('\n'.encode('utf-8'))
for i in range(128, 1000):
    f.write(chr(i).encode('utf-8'))
f.close()
I don't understand why writing to a binary file fixes it. A character above 128 still has type str, so writing it with the default text-mode open() should be fine, yet it fails. What's happening here?
Python 3 strings are Unicode and must be encoded when written to a file. The default encoding used by open() is not UTF-8 on every OS, so it is best to be explicit. As the open() documentation says, the default is locale.getpreferredencoding(False); on my Windows system that is:
>>> import locale
>>> locale.getpreferredencoding(False)
'cp1252'
>>> chr(128).encode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python38\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>
That is probably the error you see (post it next time so we don't have to guess!).
It is best to be explicit about encoding whenever opening a file for reading and writing, since it varies by OS, and some encodings don't support every Unicode code point.
Not Explicit - Note the error complains about using CP1252 and not supporting that character:
>>> with open('tep.txt','w') as f: # NOT explicit
... f.write(chr(128))
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>
Explicit - UTF-8 supports every valid Unicode code point:
>>> with open('tep.txt','w',encoding='utf8') as f: # Explicit!
... f.write(chr(128))
...
1
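Applied to your v.1 script, the minimal fix keeps text mode and just makes the encoding explicit:
# v.1, fixed: same text-mode writes, with an explicit encoding
with open('tep.txt', 'w', encoding='utf-8') as f:
    f.write('\n')
    for i in range(128, 1000):
        f.write(chr(i))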
Further Reading:
The Absolute Minimum Every Software Developer...Must Know About Unicode....
Pragmatic Unicode
Related
I'm getting a UnicodeEncodeError with the following code for a simple web scraper.
print 'JSON scraper initializing'
from bs4 import BeautifulSoup
import json
import requests
import geocoder

# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range(1, 3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)

# Build response container
responseBucket = []
for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)

# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)

# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]
    eventsJSON = json.loads(script.text)
    allSanFranciscoEvents.append(eventsJSON)

with open("allSanFranciscoEvents.json", "w") as writeJSON:
    json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')
The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2, 4) for the range, it works fine. If I change it to (1, 3), it reads:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)
Can anyone tell me how to fix this issue in my code? If I print allSanFranciscoEvents, it is reading in all the data, so I believe the issue is happening in the final piece of code, with the JSON dump. Thanks so much.
Best Fix
Use Python 3! Python 2 reaches end of life very soon. New code written in legacy Python today will have a very short shelf life.
The only thing I had to change to make your code work in Python 3 was to call the print() function instead of using the print statement. Your example code then ran without any error.
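In other words, the only required edit is the first line (the final print already uses parentheses):
# Python 3: print is a function, not a statement
print('JSON scraper initializing')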
Persisting with Python 2
The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2, 4) for the range, it works fine.
That is because you are requesting different pages with those different ranges, and not every page has a character that can't be converted to str using the ascii codec. I had to go to page 5 of the response to get the same error that you did. In my case it was the artist name, u'Mø', that caused the issue. So here's a one-liner that reproduces it:
>>> str(u'Mø')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)
Your error explicitly singles out the character u'\xe9':
>>> str(u'\xe9')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
Same issue, just a different character. Your character is LATIN SMALL LETTER E WITH ACUTE (é). Python is trying to use the default encoding, 'ascii', to convert the Unicode string to str, but 'ascii' doesn't support that code point.
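You can confirm Python 2's default encoding in an interactive session:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'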
I believe the issue is happening in the final piece of code, with the JSON dump.
Yes, it is:
>>> with open('tmp.json', 'w') as f:
... json.dump(u'\xe9', f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
And from the traceback, you can see that it's actually coming from writing to the file (fp.write(chunk)).
file.write() writes a str to a file, but u'\xe9' is a unicode object. The error message, 'ascii' codec can't encode character..., tells us that Python is trying to encode that unicode object to turn it into a str so it can write it to the file. That implicit conversion uses the default string encoding, which in Python 2 is 'ascii'.
To fix, don't leave it up to python to use the default encoding:
>>> with open('tmp.json', 'w') as f:
... json.dump(u'\xe9'.encode('utf-8'), f, ensure_ascii=False)
...
# No error :)
In your specific example, you can fix the intermittent error by changing this:
allSanFranciscoEvents.append(eventsJSON)
to this:
allSanFranciscoEvents.append(eventsJSON.encode('utf-8'))
That way, you are explicitly using the 'utf-8' codec to convert the Unicode strings to str, so Python doesn't try to apply the default 'ascii' encoding when writing to the file.
eventsJSON is a parsed object, so you can't call eventsJSON.encode('utf-8') on it. For Python 2.7 to write the file as UTF-8/Unicode, you can use the codecs module, or open the file in binary mode with the 'wb' flag.
with open("allSanFranciscoEvents.json", "wb") as writeJSON:
jsStr = json.dumps(allSanFranciscoEvents)
# the decode() needed because we need to convert it to binary
writeJSON.write(jsStr.decode('utf-8'))
print ('end')
# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
data = json.load(readJson)
print(data[0][0]["startDate"])
# 2019-02-04
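For completeness, the codecs route mentioned above might look like this sketch; since json.loads produced unicode strings, json.dumps(..., ensure_ascii=False) returns unicode, which the codecs writer encodes for you:
# Alternative sketch: let codecs handle the UTF-8 encoding
import codecs
with codecs.open("allSanFranciscoEvents.json", "w", encoding="utf-8") as writeJSON:
    writeJSON.write(json.dumps(allSanFranciscoEvents, ensure_ascii=False))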
I am trying to see if I can transfer the output of urllib.request.urlopen() to a text file just to look at it. I tried decoding the output into a string so I can write into a file, but apparently the original output included some Korean characters that are not translating properly into the string.
So far I have:
from urllib.request import urlopen
openU = urlopen(myUrl)
pageH = openU.read()
openU.close()
stringU = pageH.decode("utf-8")
f=open("test.txt", "w+")
f.write(stringU)
I do not get any errors until the last step at which point it says:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Chae\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 85-89: character maps to <undefined>
Is there a way to get the string to also include the Korean, or if not, how do I skip the characters causing problems and write the rest of the string into the file?
Does it matter to you what the file encoding is? If not, then use utf-8 encoding:
f=open("test.txt", "w+", encoding="utf-8")
f.write(stringU)
If you want the file to be cp1252-encoded, which apparently is the default on your system, and to ignore unencodable values, add errors="ignore":
f=open("test.txt", "w+", errors="ignore")
f.write(stringU)
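If you would rather keep a visible marker than silently drop characters, the same open() parameter accepts errors="replace", which substitutes "?" for each unencodable character:
f = open("test.txt", "w+", errors="replace")  # unencodable characters become '?'
f.write(stringU)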
I am trying to access a table on a SQL Server using the Python 2.7 module adodbapi, and print certain information to the command prompt (Windows). Here is my original code snippet:
query_str = "SELECT id, headline, state, severity FROM GPS3.Defect ORDER BY id"
cur.execute(query_str)
dr_data = cur.fetchall()
con.close()
for i in dr_data:
    print i
It will print out about 30 rows, all correctly formatted, but then it will stop and give me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 52: ordinal not in range(128)
So I looked this up online and went through a presentation explaining Unicode in Python, and I thought I understood: I should explicitly tell the Python interpreter that it was dealing with Unicode and encode it into UTF-8. This is what I came up with:
for i in dr_data:
    print (u"%s" % i).encode('utf-8')
However, I suppose I don't actually understand Unicode, because I get the exact same error when I run this. I know this question is asked a lot, but could someone explain to me, simply, what is going on here? Thanks in advance.
Your error message does not agree with your statement that you are printing at a Windows command prompt: the console does not default to the ascii codec. On US Windows, it defaults to cp437.
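You can check which encoding Python is using for your console; on my US Windows console:
>>> import sys
>>> sys.stdout.encoding
'cp437'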
You can just print Unicode to the console without trying to encode it. Python will encode the Unicode strings to the console encoding. Here is an example. Note the source file is saved in UTF-8 encoding, and the encoding is declared with the special #coding:utf8 comment. This allows any Unicode character to be in the source code.
#coding:utf8
s1 = u'αßΓπΣσµτ' # cp437-supported
s2 = u'ÀÁÂÃÄÅ' # cp1252-supported
s3 = u'我是美国人。' # unsupported by cp437 or cp1252.
Since my US Windows console defaults to cp437, only s1 will display without error:
C:\>chcp
Active code page: 437
C:\>py -2 -i test.py
>>> print s1
αßΓπΣσµτ
>>> print s2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
>>> print s3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
Note the error message indicates what encoding it tried to use: cp437.
If I change the console encoding, now s2 will work correctly:
C:\>chcp 1252
Active code page: 1252
C:\>py -2 -i test.py
>>> print s1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b1' in position 0: character maps to <undefined>
>>> print s2
ÀÁÂÃÄÅ
>>> print s3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
Now s3 contains characters that common Western encodings don't support. You can go into Control Panel and change the system locale to Chinese, and the console will then support Chinese encodings, but a better solution is to use a Python IDE that supports UTF-8, an encoding that supports all Unicode characters (subject to font support, of course). Below is the output of PythonWin, an editor that comes with the pywin32 Python extension:
>>> print s1
αßΓπΣσµτ
>>> print s2
ÀÁÂÃÄÅ
>>> print s3
我是美国人。
In summary: just use Unicode strings, ideally in a terminal that supports UTF-8, and it will "just work". Convert text data to Unicode as soon as it is read from a file, user input, a network socket, etc. Process and print in Unicode, and encode it only when it leaves the program (written to a file, sent over a network socket, etc.).
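As a minimal Python 2 sketch of that pattern (the file names are hypothetical):
# Decode at the input boundary: bytes -> unicode
with open('input.txt', 'rb') as f:    # hypothetical input file
    text = f.read().decode('utf-8')
# Process in unicode
text = text.upper()
# Encode at the output boundary: unicode -> bytes
with open('output.txt', 'wb') as f:   # hypothetical output file
    f.write(text.encode('utf-8'))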
If a Unicode character (code point) that is unsupported by the Windows cmd code page, e.g. EN DASH "–", is printed with Python 3 in a Windows cmd terminal using:
print('\u2013')
Then an exception is raised:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 0: character maps to <undefined>
Is there a way to make print convert unsupported characters to e.g. "?", or otherwise handle them so that execution can continue?
Update
There is a better way... see below.
There must be a better way, but this is all I can think of at the moment:
print('\u2013'.encode(errors='replace').decode())
This uses encode() to encode the string with whatever your default encoding is, replacing characters that are not valid for that encoding with ?. That produces a bytes object, which decode() then converts back to a str, preserving the replacement characters.
Here is an example using a code point that is not valid in GBK encoding:
>>> s = 'abc\u3020def'
>>> print(s)
abc〠def
>>> s.encode(encoding='gbk')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character '\u3020' in position 3: illegal multibyte sequence
>>> s.encode(encoding='gbk', errors='replace')
b'abc?def'
>>> s.encode(encoding='gbk', errors='replace').decode()
'abc?def'
>>> print(s.encode(encoding='gbk', errors='replace').decode())
abc?def
Update
So there is a better way, as mentioned by @eryksun in the comments. Once it is set up, there is no need to change any code to get unsupported-character replacement. The code below demonstrates before-and-after behaviour (I have set my preferred encoding to GBK):
>>> import os, sys
>>> print('\u3030')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character '\u3030' in position 0: illegal multibyte sequence
>>> old_stdout = sys.stdout
>>> fd = os.dup(sys.stdout.fileno())
>>> sys.stdout = open(fd, mode='w', errors='replace')
>>> old_stdout.close()
>>> print('\u3030')
?
@eryksun's comment also mentions assigning a Windows environment variable:
PYTHONIOENCODING=:replace
Note the ":" before "replace". This looks like a usable answer that does not require any changes in Python scripts using print.
With that variable set, print('\u2013') results in:
?
and print('Hello\u2013world!') results in:
Hello?world!
I'm trying to run the command u'\xe1'.decode("utf-8") in python and I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
Why does it say I'm trying to decode ascii when I'm passing utf-8 as the first argument? In addition to this, is there any way I can get the character á from u'\xe1' and save it in a string?
decode() takes a byte string and converts it to unicode (e.g. "\xb0".decode("utf8") ==> u"\xb0").
encode() takes unicode and converts it to a byte string (e.g. u"\xb0".encode("utf8") ==> "\xb0").
You get a UnicodeEncodeError from a decode() call because u'\xe1' is already unicode: before decoding, Python 2 first implicitly encodes it to bytes using the default 'ascii' codec, and that implicit encode step is what fails.
Neither call has much to do with the rendering of a string; it is mostly an internal representation.
Try:
print u"\xe1"
(Your terminal will need to support Unicode: IDLE will work; a DOS terminal, not so much.)
>>> print u"\xe1"
á
>>> print repr(u"\xe1".encode("utf8"))
'\xc3\xa1'
>>> print repr("\xc3\xa1".decode("utf8"))
u'\xe1'