I will have to run some python code across platforms.
What is the safest option in terms of encoding for the source files?
I have noticed that:
#!/bin/env python
# -*- coding: iso-8859-1 -*-
"""
Created on Wed Feb 22 09:40:16 2017
"""
pycode
does not raise errors in Linux, while it does in Windows.
The following seems safer, why is that the case?
#!/bin/env python
#-*- coding: utf-8 -*-
"""
Created on Wed Feb 22 09:40:16 2017
"""
pycode
Python 3 source code is expected to be encoded as UTF-8 by default. Therefore UTF-8 is the safest encoding to use for Python 3 code because:
- it's the default, so developers do not need to remember to declare an explicit encoding
- it can encode any Unicode codepoint, so there is (theoretically) no risk that a developer might use a different encoding in a particular source file to include a particular character
However, if the source code is edited on systems where UTF-8 is not the default encoding, developers must take care to ensure that the source is saved as UTF-8.
The same applies to Python 2, save that declaring the encoding is required if UTF-8 is to be used.
Having a reasonably comprehensive test suite would greatly reduce the risk of wrongly-encoded source files, as importing such a file will raise a SyntaxError. If the code lacks tests, it would not be difficult to write a script that searches for .py files, tries to open them specifying UTF-8 as the encoding, and reports any that raise UnicodeDecodeError.
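As a sketch of that idea (find_badly_encoded is a name invented here, not an existing tool):

```python
import pathlib

def find_badly_encoded(root="."):
    """Return the .py files under root that are not valid UTF-8."""
    bad = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        try:
            # read_text decodes the whole file; invalid UTF-8 raises
            path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad

if __name__ == "__main__":
    for path in find_badly_encoded():
        print(path)
```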
Related
The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu-based docker container results in:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable utf-8 decoding?
It seems Ubuntu - depending on the version - uses one encoding or another as the default, and it may vary between the shell and Python as well. Adapted from this posting and also this blog:
Thus the recommended way seems to be to tell your Python instance to use utf-8 as the default encoding:
Set the encoding of Python's standard streams via an environment variable (note: this affects stdin/stdout/stderr, not how source files are parsed):
export PYTHONIOENCODING=utf8
Also, in your source files you can state the encoding you prefer to be used explicitly, so it should work irrespective of the environment settings (see this question + answer, the Python docs and PEP 263):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of the encoding of files read by Python, you can specify it explicitly in the open() call:
with open(fname, "rt", encoding="utf-8") as f:
...
and there's a more hackish way (Python 2 only) with some side effects, but it saves you from having to specify the encoding explicitly each time:
import sys
# sys.setdefaultencoding() is removed from sys by site.py at startup
reload(sys)  # reload restores it - that's the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.
The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which contain all characters).
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on these (at least partially). In my experience, this is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script of the shell your terminal runs, eg. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, eg. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
There might be other options, but I doubt that there are nicer ones.
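A minimal Python 3 sketch of the two stream-level options above (sys.stdout.reconfigure() exists only in Python 3.7+):

```python
import sys

data = "é"

# Option: bypass print() and write pre-encoded bytes to the binary
# buffer underlying sys.stdout.
sys.stdout.buffer.write(data.encode("utf-8") + b"\n")
sys.stdout.buffer.flush()

# Option (Python 3.7+): reconfigure stdout once, then print() normally.
sys.stdout.reconfigure(encoding="utf-8")
print(data)
```

As the answer notes, the encoding you pick still has to match what the terminal expects.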
I am using Python for the first time and am running into an encoding error that I can't seem to get around. Here is the code:
#!/usr/bin/python
#-*- coding: utf -*-
import pandas as pd
a = "C:\Users"
print(a)
When I do this, I get:
File "C:\Users\Public\Documents\Python Scripts\ImportExcel.py", line 5
    a = "C:\Users"
             ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
In Notepad++ I have tried all of the encoding options. Nothing seems to work.
Any suggestions?
Specifically, the problem is that the '\' is an escape character.
If you want to print the string
"C:\Users"
then you have to do it thus:
a = "C:\\Users"
Hope this helps.
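To spell out both standard fixes (a raw string is often the nicer one for Windows paths, though note that a raw string cannot end in a backslash):

```python
# Two ways to get a literal backslash into the string:
a = "C:\\Users"   # escape the backslash
b = r"C:\Users"   # raw string: backslashes are not escape characters
assert a == b
print(a)
```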
The error message suggests you're on a Windows machine, but you're using *nix notation for #!/usr/bin/python. Windows itself ignores the shebang line (though the py launcher understands it); if you do want a Windows-style one, it would look something like #!C:\Python33\python.exe, depending on where you've installed Python.
Use this: # -*- coding: utf-8 -*- instead of #-*- coding: utf -*-
You can set the encoding in Notepad++, but you also need to tell Python about it.
In legacy Python (2.7), source code is ASCII unless specified otherwise. In Python 3, source code is UTF-8 unless otherwise specified.
You should use the following as the first or second line of the file to specify the encoding of the source code. The documentation gives:
# -*- coding: <encoding> -*-
This is the format originally from the Emacs editor, but according to PEP263 you can also use:
# vim: set fileencoding=<encoding>:
or even:
# coding=<encoding>
Where <encoding> can be any encoding that Python supports, but utf-8 is generally a good choice for portable code.
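One way to see the declaration take effect, sketched here with compile(), which applies the PEP 263 detection rules when handed bytes:

```python
# The byte 0xE9 means 'é' only because the comment declares iso-8859-1;
# without the declaration, Python 3 would reject it as invalid UTF-8.
src = b"# -*- coding: iso-8859-1 -*-\nname = 'caf\xe9'\n"
ns = {}
exec(compile(src, "<demo>", "exec"), ns)
assert ns["name"] == "café"
```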
I’m having problems getting python 2.7 to read scripts containing utf-8 strings; setting the default encoding to utf-8 in sitecustomize.py doesn’t seem to take.
Here’s my sitecustomize.py:
import sys
sys.setdefaultencoding("utf-8")
I can verify that the default encoding has been changed from the command line:
$ /usr/bin/python -c 'import sys; print(sys.getdefaultencoding())'
utf-8
However, when I try to run a script containing a utf-8 string, as in test.py below (containing · at code point U+00b7)…
filename = 'utf-8·filename.txt'
print(filename)
…the default encoding seems to be ignored:
$ /usr/bin/python test.py
File "test.py", line 1
SyntaxError: Non-ASCII character '\xc2' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Using an encoding declaration, as in test-coding.py below…
# coding=utf-8
filename = 'utf-8·filename.txt'
print(filename)
…does work:
$ /usr/bin/python test-coding.py
utf-8·filename.txt
Unfortunately, the problem has come up with scripts that are generated and run by another program (the catkin build system's catkin_make). I can't manually add encoding declarations to these scripts before catkin_make runs them, so they fail with the same SyntaxError pointing at PEP 263. Changing the default encoding seems like the only solution short of going deep under catkin's hood, or eliminating all non-ascii paths on my system… and setting it in sitecustomize.py should work, but doesn't.
Any ideas or insights greatly appreciated!
sys.setdefaultencoding("utf-8") is not doing what you think it is doing. It has no effect on how Python parses source files. That's why you are still seeing SyntaxErrors when the source files use non-ascii characters. To eliminate those errors you need to add an encoding declaration at the beginning of the source file, such as
# -*- encoding: utf-8 -*-
Regarding sys.setdefaultencoding:
Do not try to change the default encoding. The default encoding is used when Python does silent conversion between str
and unicode. For example,
Expected Python2 behavior:
In [1]: '€' + u'€'
should raise UnicodeDecodeError because Python tries to convert '€' to unicode by
computing '€'.decode(sys.getdefaultencoding())
If you change the default encoding, you get different behavior:
In [2]: import sys; reload(sys); sys.setdefaultencoding('utf-8')
<module 'sys' (built-in)>
In [3]: '€' + u'€'
u'\u20ac\xe2\x82\xac'
If you change the default encoding, then your Python's behavior will be different from just about everybody else's expectation of how Python2 should behave.
You cannot set the default encoding for source files. That default is hardcoded, as part of the language specification.
Set the PEP 263 header instead, as the interpreter is instructing you to do. You'll have to fix the Catkin build system, or rewrite the files it produces to include the header. Simply add a first or second line to those files with # coding=utf8, a task easily accomplished with Python.
The system default encoding is only used for implicit encoding and decoding of Unicode and byte string objects in running code. You should not try to change that, as other code often relies on the value not changing. The ability to set it was removed entirely in Python 3.
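A sketch of the rewrite approach suggested above (ensure_coding_header is a hypothetical helper name; adjust the header text to taste):

```python
import pathlib

HEADER = "# coding=utf8\n"

def ensure_coding_header(path):
    """Insert a PEP 263 header as line 1, or line 2 if line 1 is a shebang."""
    p = pathlib.Path(path)
    lines = p.read_text(encoding="utf-8").splitlines(keepends=True)
    if any("coding" in line for line in lines[:2]):
        return  # an encoding is already declared; PEP 263 only scans lines 1-2
    pos = 1 if lines and lines[0].startswith("#!") else 0
    lines.insert(pos, HEADER)
    p.write_text("".join(lines), encoding="utf-8")
```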
Some people use the following to declare the encoding method for the text of their Python source code:
# -*- coding: utf-8 -*-
Back in 2001, the default encoding assumed by the Python interpreter was said to be ASCII. I have dealt with strings using non-ASCII characters in my Python code, without declaring the encoding of my code, and I don't remember bumping into an encoding error before. What is the default encoding for code assumed by the Python interpreter now?
I am not sure if this is relevant.
My OS is Ubuntu, and I am using the default Python interpreter, and gedit or emacs for editing.
Will the default encoding assumed by the Python interpreter change if the above changes?
Thanks.
Without any explicit encoding declaration, the assumed encoding for your source code will be
ascii for Python 2.x
utf-8 for Python 3.x
See PEP 0263 and Using source code encoding for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.
So the default encoding assumed for source code is directly dependent on the version of the Python interpreter, and it is not configurable.
Note that the source code encoding is something entirely different than dealing with non-ASCII characters as part of your data in strings.
There are two distinct cases where you may encounter non-ASCII characters:
As part of your programs data, during runtime
As part of your source code (and since you can't have non-ASCII characters in identifiers, that usually means hard coded string data in your source code or comments).
The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.
So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:
with open('data.txt') as f:
for line in f:
# do something with `line`
But it doesn't contain any non-ASCII characters in the source code, therefore it doesn't need an encoding declaration at the top of the file. It will however need to properly decode line if it wants to turn it into unicode. Simply doing unicode(line) will use the system default encoding, which is ascii (different from the default source encoding, but it also happens to be ascii). So to explicitly decode the string using utf-8 you'd need to do line.decode('utf-8').
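For contrast, in Python 3 you'd normally let open() do the decoding by passing the encoding explicitly (the file here is just a stand-in for data.txt, created so the example is self-contained):

```python
import pathlib
import tempfile

# create a small UTF-8 file to read back (stand-in for data.txt)
path = pathlib.Path(tempfile.mkdtemp()) / "data.txt"
path.write_text("Bär\n", encoding="utf-8")

with open(path, encoding="utf-8") as f:
    for line in f:
        # each line is already a decoded str; no .decode() needed
        assert line == "Bär\n"
```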
This code however does contain non-ASCII characters directly in its source code:
TEST_DATA = 'Bär' # <--- non-ASCII character on this line
print TEST_DATA
And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:
SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So assuming your text editor is configured to save files in utf-8, you'd need to put the line
# -*- coding: utf-8 -*-
at the top of the file for Python to interpret the source code correctly.
My advice however would be to generally avoid putting non-ASCII characters in your source code, exactly because whether it will be written and read correctly depends on your and your co-workers' editor and terminal settings.
Instead you can use escaped strings to safely enter non-ASCII characters in your code:
TEST_DATA = 'B\xc3\xa4r'
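Note that this byte-valued escape is Python 2 specific ('\xc3\xa4' is the UTF-8 byte sequence for ä). In Python 3, string literals are Unicode, so you escape the codepoint itself; a Python 3 equivalent would be:

```python
# In a Python 3 str literal, \xe4 is the codepoint U+00E4 ('ä'),
# not a UTF-8 byte, so all three spellings are the same string:
TEST_DATA = 'B\xe4r'
assert TEST_DATA == 'B\u00e4r'
assert TEST_DATA == 'B\N{LATIN SMALL LETTER A WITH DIAERESIS}r'
```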
By default, Python source files are treated as encoded in UTF-8. In that encoding, characters of most languages in the world can be used simultaneously in string literals, identifiers and comments — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow. To display all these characters properly, the editor must recognize that the file is UTF-8, and it must use a font that supports all the characters in the file.
It is also possible to specify a different encoding for source files. To do so, put a special comment line at the top of the source file:
# -*- coding: encoding -*-
https://docs.python.org/dev/tutorial/interpreter.html
I have a python script that starts with:
#!/usr/bin/env python
# -*- coding: ASCII -*-
and prior to saving, it always splits my window, and asks:
Warning (mule): Invalid coding system `ASCII' is specified
for the current buffer/file by the :coding tag.
It is highly recommended to fix it before writing to a file.
and I need to say yes; is there a way to disable this? Sorry for asking but I had no luck on Google.
Gabriel
A solution that doesn't involve changing the script is to tell Emacs what ASCII means as a coding system. (By default, Emacs calls it US-ASCII instead.) Add this to your .emacs file:
(define-coding-system-alias 'ascii 'us-ascii)
Then Emacs should be able to understand # -*- coding: ASCII -*-.
The Python Enhancement Proposal (PEP) 263, Defining Python Source Code Encodings, discusses a number of ways of defining the source code encoding. Two particular points are relevant here:
Without encoding comment, Python's parser will assume ASCII
So you don't need this at all in your file. Still, if you do want to be explicit about the file encoding:
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
# coding=<encoding name>
(note that the = can be replaced by a :). So you can use # coding: ascii instead of the more verbose # -*- coding: ASCII -*-, as suggested by this answer. This seems to keep emacs happy.
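If you want to check programmatically which encoding Python sees for a given header, the stdlib's tokenize.detect_encoding applies the PEP 263 rules to a byte stream:

```python
import io
import tokenize

# Both the terse and the Emacs-style forms match the PEP 263 pattern.
for header in (b"# coding: ascii\n", b"# -*- coding: ASCII -*-\n"):
    enc, _ = tokenize.detect_encoding(io.BytesIO(header + b"x = 1\n").readline)
    print(enc.lower())  # prints "ascii" for both
```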