Python Encoding Comment Format - python

Originally, I've learned to specify the source code encoding in Python 2.7 this way:
# -*- coding: utf-8 -*-
Now I just noticed, that PEP263 also allows this:
# coding=utf-8
Is there any differences between these? What about editor compatiblity, cross-platform etc.?
What about Python 3? Is this comment still needed for python 3 or is any code in python 3 expected to be utf-8 by default?

Take a look at PEP3120 which changed the default encoding of python source code to be UTF-8
For python 3.x one therefore finds in the docs:
If a comment in the first or second line of the Python script matches
the regular expression coding[=:]\s*([-\w.]+), this comment is
processed as an encoding declaration [...] The recommended forms of an
encoding expression are:
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM.
If no encoding declaration is found, the default encoding is UTF-8
The take home message is therefore:
python 3.x does not neccessarily need to have utf-8 specified, since it is the default
The way the coding line is written is to some degree personal choice (only a recommendation in the docs), it only has to match the regex.

Since Python 3 the default encoding is utf-8. You can still change the encoding using the special-formatted comment # -*- coding: <encoding name> -*-.
The docs recommend to use this coding expression as it is recognized also by GNU Emacs.
As python checks whether the first two lines are matching the regex coding[=:]\s*([-\w.]+), # coding=utf-8 works also to ensure utf-8 encoding but it is not recognized by GNU Emacs.

Related

Print Urdu/Arabic Language in Console (Python)

I am a newbie and i don't know how to set my console to print urdu / arabic characters i am using Wing IDE when i run this code
print "طجکسعبکبطکسبطب"
i get this on my console
طجکسعبکبطکسبطب
You should encode your string arguments as unicode UTF-8 or later. Wrap the whole code in unicode, and/or mark individual string args as unicode (u'your text') too.
Additionally, you should make sure that unicode is enabled in your terminal/prompt window too.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
arabic_words = u'لغت العربیه'
print arabic_words

What's the difference between 'coding=utf8' and '-*- coding: utf-8 -*-'?

Is there any difference between using
#coding=utf8
and
# -*- coding: utf-8 -*-
What about
# encoding: utf-8
There is no difference; Python recognizes all 3. It looks for the pattern:
coding[:=]\s*([-\w.]+)
on the first two lines of the file (which also must start with a #).
That's the literal text 'coding', followed by either a colon or an equals sign, followed by optional whitespace. Any word, dash or dot characters following that pattern are read as the codec.
The -*- is an Emacs-specific syntax; letting the text editor know what encoding to use. It makes the comment useful to two tools. VIM supports similar syntax.
See PEP 263: Defining Python Source Code Encodings.

Define shebang and encoding in python [duplicate]

PEP 263 defines how to declare Python source code encoding.
Normally, the first 2 lines of a Python file should start with:
#!/usr/bin/python
# -*- coding: <encoding name> -*-
But I have seen a lot of files starting with:
#!/usr/bin/python
# -*- encoding: <encoding name> -*-
=> encoding instead of coding.
So what is the correct way of declaring the file encoding?
Is encoding permitted because the regex used is lazy? Or is it just another form of declaring the file encoding?
I'm asking this question because the PEP does not talk about encoding, it just talks about coding.
Check the docs here:
"If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration"
"The recommended forms of this expression are
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM."
So, you can put pretty much anything before the "coding" part, but stick to "coding" (with no prefix) if you want to be 100% python-docs-recommendation-compatible.
More specifically, you need to use whatever is recognized by Python and the specific editing software you use (if it needs/accepts anything at all). E.g. the coding form is recognized (out of the box) by GNU Emacs but not Vim (yes, without a universal agreement, it's essentially a turf war).
Just copy paste below statement on the top of your program.It will solve character encoding problems
#!/usr/bin/env python
# -*- coding: utf-8 -*-
PEP 263:
the first or second line must match
the regular
expression "coding[:=]\s*([-\w.]+)"
So, "encoding: UTF-8" matches.
PEP provides some examples:
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
# This Python file uses the following encoding: utf-8
import os, sys
As of today — June 2018
PEP 263 itself mentions the regex it follows:
To define a source code encoding, a magic comment must be placed into
the source files either as first or second line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors):
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or:
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the following regular expression:
^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
So, as already summed up by other answers, it'll match coding with any prefix, but if you'd like to be as PEP-compliant as it gets (even though, as far as I can tell, using encoding instead of coding does not violate PEP 263 in any way) — stick with 'plain' coding, with no prefixes.
If I'm not mistaken, the original proposal for source file encodings was to use a regular expression for the first couple of lines, which would allow both.
I think the regex was something along the lines of coding: followed by something.
I found this: http://www.python.org/dev/peps/pep-0263/
Which is the original proposal, but I can't seem to find the final spec stating exactly what they did.
I've certainly used encoding: to great effect, so obviously that works.
Try changing to something completely different, like duhcoding: ... to see if that works just as well.
I suspect it is similar to Ruby - either method is okay.
This is largely because different text editors use different methods (ie, these two) of marking encoding.
With Ruby, as long as the first, or second if there is a shebang line contains a string that matches:
coding: encoding-name
and ignoring any whitespace and other fluff on those lines. (It can often be a = instead of :, too).

SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'

Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !
You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.

Python special character unicode

I have a python script in which one I specify an argument :
parser = optparse.OptionParser()
parser.add_option("-D", "--departure", dest="departure",default="", type="string",help="specify departure")
and in my script i have to to a few things with the string entered.
When I type : -D "Düsseldorf"
the string is not recognized properly in the script
somebody told me to do u"Düsseldorf" but I need to stock "Düsseldorf" in a variable
something like variable = u+"Düsseldorf" .... hmm I really don;t know how to do that.
Thank you for your help.
Regards.
PEP-0264 explains you how to use Unicode in python scripts.
Or, for lazy ones, start your script with:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print u"Düsseldorf"
And do not forget to solve it as UTF-8 without BOM.
Not only do you need to specify a character encoding for your Python source that can represent the ü character:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
But you also need to keep in mind that command line arguments (in Unix at least, I can't speak for Windows) are bytes. So you should specify the option as a byte-string not a character (Unicode) string.
For example:
parser.add_option("-D", "--departure", dest="departure",
default=u"Düsseldorf".encode('UTF-8'),
type="string",help="specify departure")
Now the default argument is a byte-string, just like all the other arguments you have passed to the add_option method.
Additionally you must ensure that if someone enters this string into their terminal, they do so with a terminal character encoding of UTF-8. If they use a different terminal character encoding, a different byte-string will show up in the command line. This is simply how Unix works, and Python has no power to change it.

Categories