python: writing ★ in a file

I am trying to use:
text = "★"
file.write(text)
in Python 3, but I get this error message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0: ordinal not in range(128)
How can I write the symbol ★ to a file in Python? It is the symbol used for star ratings.

By default open uses the platform default encoding (see docs):
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
As you noticed, this might not be an encoding that supports non-ASCII characters. If you know that you want UTF-8, it is always a good idea to specify it explicitly:
with open(filename, encoding='utf-8', mode='w') as file:
    file.write(text)
Using the with statement also ensures that no file handle is left open if you forget to close it or if an exception is raised before you close it.
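As a minimal sketch of the advice above, the following round-trips the star through a UTF-8 file. The helper names write_text_utf8 and read_text_utf8 are hypothetical, chosen only for this illustration.

```python
def write_text_utf8(path, text):
    """Write text to path, always encoding it as UTF-8."""
    # An explicit encoding means the locale default (e.g. ASCII) is never used.
    with open(path, mode="w", encoding="utf-8") as file:
        file.write(text)


def read_text_utf8(path):
    """Read the whole file back, decoding it as UTF-8."""
    with open(path, mode="r", encoding="utf-8") as file:
        return file.read()
```

Calling write_text_utf8("rating.txt", "★") and then read_text_utf8("rating.txt") should give back the same one-character string, whatever the platform default encoding is.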

Related

UnicodeEncodeError: 'gbk' codec can't encode character '\ue13b' in position 25: illegal multibyte sequence

Error:
UnicodeEncodeError: 'gbk' codec can't encode character '\ue13b' in position 25: illegal multibyte sequence
The file is encoded as UTF-8, and it contains a word that is not recognized when it is read: ‘左足趾麻木’
Code:
for line in open(label_filepath, encoding='utf-8'):
    print(line)

Change the encoding of standard output to UTF-8:
import sys
import io
sys.stdout = io.TextIOWrapper(buffer=sys.stdout.buffer, encoding='utf8')

The error happens when Python tries to print. When printing, that is, writing to sys.stdout, Python encodes the text with the encoding expected by the terminal. In this case the system encoding is gbk, but gbk is unable to encode the third character in the string ('\ue13b'), so a UnicodeEncodeError is raised.
One solution would be to set the PYTHONIOENCODING environment variable to UTF-8 when you call Python:
PYTHONIOENCODING=utf-8 python myscript.py
If you are using a unix-like operating system you could change your locale from a gbk locale to a utf-8 locale, for example from zh_CN.gbk to zh_CN.utf8 (this will affect how all programs read and write from files, so this may not be a good idea if you have a lot of gbk-encoded data).
If you are using Windows, see the answers to this question for information about working with unicode in the Windows terminal.
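The failure can be reproduced without a terminal. The sketch below wraps an in-memory byte buffer in a text stream, the same way sys.stdout wraps its underlying buffer, and encodes a string containing '\ue13b'. The function name encode_like_stream and the sample string are assumptions made for this illustration, and the errors="backslashreplace" option is shown as an alternative that the answers above do not mention.

```python
import io


def encode_like_stream(text, encoding, errors="strict"):
    """Return the bytes a text stream with this encoding would emit for text."""
    buffer = io.BytesIO()
    # newline="" disables newline translation so the bytes are comparable.
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors, newline="")
    try:
        stream.write(text)  # encoding happens here, as it does in print()
        stream.flush()
        return buffer.getvalue()
    finally:
        # Detach so that discarding the wrapper does not close the buffer.
        stream.detach()
```

With encoding="gbk" and the default errors="strict", writing a string that contains '\ue13b' raises UnicodeEncodeError; with encoding="utf-8" it succeeds, and with errors="backslashreplace" the gbk stream emits an escaped placeholder instead of raising.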

If the error is caused by the editor you are using (Sublime Text, for example), try adding a line to the file "Python.sublime-build". It worked for me.

UnicodeDecodeError while processing Accented words

I have a Python script that reads a YAML file (it runs on an embedded system). Without accents, the script runs normally on both my development machine and the embedded system. But accented words make it crash with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
only in the embedded environment.
The YAML sample:
data: ã
The snippet which reads the YAML:
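with open(YAML_FILE, 'r') as stream:
    try:
        data = yaml.load(stream)
    except yaml.YAMLError as exc:  # assumed handler; the original snippet is cut off after the try body
        print(exc)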
I tried a bunch of solutions without success.
Versions: Python 3.6, PyYAML 3.12

The codec that is reading your bytes has been set to ASCII. This restricts you to byte values between 0 and 127.
The UTF-8 representation of accented characters uses byte values outside this range, so you get a decoding error.
A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is, by design, a (very small) subset of UTF-8.
If you can change your codec to a UTF-8 decoder, it should work.
In general, you should always specify how you will decode a byte stream to text, otherwise, your stream could be ambiguous.
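To illustrate the point, here is a small sketch that decodes the YAML sample from the question with both codecs. The helper name try_decode is hypothetical; the bytes are simply the UTF-8 encoding of the line "data: ã".

```python
# The YAML sample from the question, as the bytes stored on disk (UTF-8).
RAW_SAMPLE = "data: ã\n".encode("utf-8")  # b'data: \xc3\xa3\n'


def try_decode(raw, encoding):
    """Decode raw with the given codec, or describe where decoding failed."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not decode.
        return f"failed at byte {exc.start}: {exc.reason}"
```

Decoding RAW_SAMPLE as 'utf-8' returns the original text, while decoding it as 'ascii' fails at byte 6 (the 0xc3 byte), matching the position in the error message from the question.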

You can specify the codec that should be used when dumping data with PyYAML, but there is no way to specify the codec in PyYAML when you load. However, PyYAML will handle unicode as input, and you can explicitly specify which codec to use when opening the file for reading; that codec is then used to return the text (you open the file as a text file with 'r', which is the default for open()).
import yaml
YAML_FILE = 'input.yaml'
with open(YAML_FILE, encoding='utf-8') as stream:
    data = yaml.safe_load(stream)
Please note that you should almost never have to use yaml.load(), which is documented to be unsafe; use yaml.safe_load() instead.
To dump data in the same format you loaded it use:
import sys
yaml.safe_dump(data, sys.stdout, allow_unicode=True, encoding='utf-8',
               default_flow_style=False)
The default_flow_style is needed in order not to get the flow-style curly braces, and the allow_unicode is necessary or else you get data: "\xE3" (i.e. escape sequences for unicode characters).

How do I fix this cp950 "illegal multibyte sequence" UnicodeDecodeError when reading a text file?

My teacher taught us how to use "exec", but I got an error:
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe6 in position 1814: illegal multibyte sequence
I use:
exec(open("somefile.py").read())
How can I fix this problem?

Given this is presumably Python 3 source code, the likely encoding is UTF-8 (it's the standard encoding for Python 3 source code).
If that's the case, changing open("somefile.py") to open("somefile.py", encoding="utf-8") would specify the encoding explicitly, overriding the locale default, which should allow you to read it in correctly.
For idiomatic code, you'd also want to use a with statement (to guarantee deterministic closing of the file), making it:
with open("somefile.py", encoding="utf-8") as f:
    exec(f.read())
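If the source file might declare a different encoding, one alternative (not part of the answer above) is the standard library's tokenize.open(), which opens a file using the encoding found in its coding declaration or BOM and falls back to UTF-8. The wrapper name run_source_file below is hypothetical, and the sketch assumes the file is trusted, since exec runs arbitrary code.

```python
import tokenize


def run_source_file(path):
    """Execute a Python source file and return the namespace it populated."""
    # tokenize.open() detects the source encoding the way the interpreter
    # does, so the locale default (cp950, ASCII, ...) is never consulted.
    with tokenize.open(path) as f:
        source = f.read()
    namespace = {}
    # Compiling with the real path gives tracebacks a useful filename.
    exec(compile(source, path, "exec"), namespace)
    return namespace
```

Any variable assigned at the top level of the file can then be read from the returned dictionary.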

Python 3: Solution for "'ascii' codec can't decode byte" while reading from a text file [duplicate]

I've just added Python3 interpreter to Sublime, and the following code stopped working:
for directory in directoryList:
    fileList = os.listdir(directory)
    for filename in fileList:
        filename = os.path.join(directory, filename)
        currentFile = open(filename, 'rt')
        for line in currentFile:  ##Here comes the exception.
            currentLine = line.split(' ')
            for word in currentLine:
                if word.lower() not in bigBagOfWords:
                    bigBagOfWords.append(word.lower())
        currentFile.close()
I get the following exception:
  File "/Users/Kuba/Desktop/DictionaryCreator.py", line 11, in <module>
    for line in currentFile:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 305: ordinal not in range(128)
I found this rather strange, because as far as I know Python 3 is supposed to support UTF-8 everywhere. What's more, the exact same code works with no problems on Python 2.7. I've read about adding the environment variable PYTHONIOENCODING, and I tried it, but to no avail (however, it appears that adding an environment variable in OS X Mavericks is not that easy, so maybe I did something wrong? I modified /etc/launchd.conf).

Python 3 decodes text files when reading and encodes them when writing. The default encoding is taken from locale.getpreferredencoding(False), which evidently returns 'ASCII' for your setup. See the open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Instead of relying on a system setting, you should open your text files using an explicit codec:
currentFile = open(filename, 'rt', encoding='latin1')
where you set the encoding parameter to match the file you are reading.
Python 3 supports UTF-8 as the default for source code.
The same applies to writing to a writable text file: data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodeError exceptions unless you explicitly set a suitable codec. Which codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both about source code encoding and reading and writing Unicode data.
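Applied to the loop from the question, the advice might look like the sketch below. It is an illustration, not the asker's code: the function name collect_words and its parameters are assumptions, it uses a set instead of a list, and it splits on any whitespace rather than on single spaces. The right value for encoding depends on how the files were actually saved.

```python
import os


def collect_words(directories, encoding="utf-8", errors="strict"):
    """Return the set of lowercased words found in all files of the directories."""
    words = set()
    for directory in directories:
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories
            # The explicit encoding replaces the locale default; the with
            # statement closes the file even if decoding raises.
            with open(path, "rt", encoding=encoding, errors=errors) as current_file:
                for line in current_file:
                    words.update(word.lower() for word in line.split())
    return words
```

If some files contain bytes that are invalid in the chosen encoding, passing errors="replace" substitutes a replacement character instead of raising, at the cost of losing the original bytes.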

"as far as I know Python3 is supposed to support utf-8 everywhere ..."
Not true. I have python 3.6 and my default encoding is NOT utf-8.
To change it to utf-8 in my code I use:
import locale
def getpreferredencoding(do_setlocale = True):
    return "utf-8"
locale.getpreferredencoding = getpreferredencoding
as explained in
Changing the “locale preferred encoding” in Python 3 in Windows
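Before overriding anything, it can help to check what the interpreter actually uses. The following sketch gathers the relevant defaults in one place; the function name describe_default_encodings is hypothetical. The utf8_mode flag exists only on Python 3.7 and later, where UTF-8 mode can be enabled with the -X utf8 option or the PYTHONUTF8=1 environment variable.

```python
import locale
import sys


def describe_default_encodings():
    """Collect the encodings Python falls back to when none is given."""
    return {
        # Used by open() in text mode when no encoding argument is passed.
        "locale_preferred": locale.getpreferredencoding(False),
        # Used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
        # Used by print(); may be None if stdout was replaced by another object.
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when the interpreter runs in UTF-8 mode (Python 3.7+).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }
```

Printing the returned dictionary on the affected machine shows whether the locale default really is ASCII or a legacy code page.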

In general, I found 3 ways to fix Unicode-related errors in Python 3:
1. Specify the encoding explicitly, like currentFile = open(filename, 'rt', encoding='utf-8')
2. Because bytes carry no encoding, encode the string to bytes yourself and write them to a file opened in binary mode, like data = 'string'.encode('utf-8')
3. Especially on Linux, check $LANG. This issue usually arises when LANG=C, which makes the default encoding 'ascii' instead of 'utf-8'. You can change it to another appropriate value, like LANG='en_IN'
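The first two approaches produce the same bytes on disk, as this sketch shows. The function names write_as_text and write_as_bytes are hypothetical; newline="" is passed in text mode only so that line endings are not translated and the two files stay byte-for-byte comparable on every platform.

```python
def write_as_text(path, text):
    """Approach 1: text mode with an explicit encoding."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


def write_as_bytes(path, text):
    """Approach 2: encode manually and write in binary mode."""
    data = text.encode("utf-8")
    # Binary mode writes the bytes as-is; no default encoding is involved.
    with open(path, "wb") as f:
        f.write(data)
```

Either way the locale default never comes into play, so the result does not depend on $LANG.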
