Python special character unicode

I have a Python script in which I specify an argument:
parser = optparse.OptionParser()
parser.add_option("-D", "--departure", dest="departure", default="", type="string", help="specify departure")
In my script I have to do a few things with the string entered.
When I type: -D "Düsseldorf"
the string is not recognized properly in the script.
Somebody told me to use u"Düsseldorf", but I need to store "Düsseldorf" in a variable,
something like variable = u+"Düsseldorf". I really don't know how to do that.
Thank you for your help.
Regards.

PEP 263 explains how to use Unicode in Python scripts.
Or, for the lazy ones, start your script with:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print u"Düsseldorf"
And do not forget to save it as UTF-8 without BOM.

Not only do you need to specify a character encoding for your Python source that can represent the ü character:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
But you also need to keep in mind that command line arguments (in Unix at least, I can't speak for Windows) are bytes. So you should specify the option as a byte-string not a character (Unicode) string.
For example:
parser.add_option("-D", "--departure", dest="departure",
                  default=u"Düsseldorf".encode('UTF-8'),
                  type="string", help="specify departure")
Now the default argument is a byte-string, just like all the other arguments you have passed to the add_option method.
Additionally you must ensure that if someone enters this string into their terminal, they do so with a terminal character encoding of UTF-8. If they use a different terminal character encoding, a different byte-string will show up in the command line. This is simply how Unix works, and Python has no power to change it.
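A minimal sketch of the point above, assuming the raw argument arrives as bytes (as sys.argv entries do on Python 2 under Unix). The helper name to_text is hypothetical and not part of optparse; it only shows that the same city name yields different bytes under different terminal encodings, and that the script has to decode with the encoding the terminal actually used:

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Decode a byte-string argument to Unicode text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# The same name produces different bytes under different terminal encodings.
utf8_arg = u"Düsseldorf".encode("utf-8")      # what a UTF-8 terminal would send
latin1_arg = u"Düsseldorf".encode("latin-1")  # what a Latin-1 terminal would send

# Decode once, right after parsing, and work with text from then on.
departure = to_text(utf8_arg)
```

Decoding latin1_arg with the default "utf-8" would raise UnicodeDecodeError, which is the failure mode described above when the terminal encoding does not match.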

Related

Python Encoding Comment Format

Originally, I learned to specify the source code encoding in Python 2.7 this way:
# -*- coding: utf-8 -*-
Now I just noticed that PEP 263 also allows this:
# coding=utf-8
Are there any differences between these? What about editor compatibility, cross-platform use, etc.?
What about Python 3? Is this comment still needed for Python 3, or is any code in Python 3 expected to be UTF-8 by default?

Take a look at PEP 3120, which changed the default encoding of Python source code to UTF-8.
For Python 3.x one therefore finds in the docs:
If a comment in the first or second line of the Python script matches
the regular expression coding[=:]\s*([-\w.]+), this comment is
processed as an encoding declaration [...] The recommended forms of an
encoding expression are:
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM.
If no encoding declaration is found, the default encoding is UTF-8
The take-home message is therefore:
Python 3.x does not necessarily need to have utf-8 specified, since it is the default.
The way the coding line is written is to some degree a personal choice (the docs only give a recommendation); it only has to match the regex.
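To see which comment forms the quoted regular expression accepts, here is a small sketch. It applies only the simplified pattern from the documentation excerpt above, not the interpreter's full detection logic, and the function name declared_encoding is made up for illustration:

```python
import re

# Pattern copied from the documentation excerpt quoted above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding name declared in a comment line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

emacs_style = declared_encoding("# -*- coding: utf-8 -*-")
short_style = declared_encoding("# coding=utf-8")
vim_style = declared_encoding("# vim:fileencoding=latin-1")
plain_comment = declared_encoding("# just a comment")
```

All three declaration styles match because each contains "coding" followed by ":" or "=", which is why the choice between them is mostly about editor support.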

Since Python 3, the default source encoding is UTF-8. You can still change the encoding using the specially formatted comment # -*- coding: <encoding name> -*-.
The docs recommend this form because it is also recognized by GNU Emacs.
Since Python checks whether one of the first two lines matches the regex coding[=:]\s*([-\w.]+), # coding=utf-8 also works to declare UTF-8, but it is not recognized by GNU Emacs.

Encoding in python 2.7, from IDE to string

So here I am. I have read about encodings all day, and now I need some clarification.
First off, I'm using Eclipse Mars with PyDev.
Unicode is a (character set + code points), basically a table of symbols associated with numerical values.
The way those values are stored at the binary level is defined by the encoding, let's say UTF-8.
1 : shebang
What is the shebang for? When I put # -*- coding: utf-8 -*-, does it do something, or does it just indicate that my file is encoded in UTF-8 (and since it's just an indication, it could be a lie :o)?
2 : Eclipse file encoding
After I wrote my shebang and saved, I went into the properties of the file, and it said encoding: ISO-8859-1, so my guess is that the shebang does nothing besides indicating which encoding my file is in.
Do I need to manually set every file to UTF-8, or is there a way to teach Eclipse to read the shebang and act accordingly?
3 : Why does the shebang only specify the encoding?
My shebang says utf-8, OK, so what? It does not tell me which character set is used.
Since UTF-8 is just an encoding, I could use UTF-8 with any character set, no?
I could encode ASCII in UTF-8 if I wanted, since an encoding is just a way to convert and store/read code points.
What if my character set encoded in UTF-8 does not have the same code points as Unicode? (Is this possible?)
4 : maybe a solution?
I often read that UTF-8 is an implementation of Unicode. Does that mean that each time you read encoding = UTF-8 you can be 100%, and I say 100%, sure that the character set + code points is Unicode?
I'm lost.

There are multiple misconceptions in your question.
Unicode is a standard that is commonly used for working with text. It is not just "character set + code points"; e.g., the Unicode standard also defines how to find word boundaries or how to compare Unicode strings.
# -*- coding: utf-8 -*- is an encoding declaration. It is not a shebang. A shebang (as its name suggests) starts with #!, e.g., #! /usr/bin/env python.
You might need the encoding declaration if there are non-ASCII literal characters in your Python source code. E.g., you don't need an encoding declaration if you write:
#!/usr/bin/env python2
print u"\N{SNOWMAN}"
But you need it if you use literal non-ascii characters:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print u"☃"
Both scripts produce the same output if the second script is saved using utf-8 encoding. The encoding declaration says how to interpret bytes that constitute the Python source code to get the program text.
"Is there a way to teach Eclipse to read the encoding declaration and act accordingly?" is a good separate question. If the IDE has explicit Python support, it should do so automatically.
"My encoding declaration says utf-8, OK, so what? It does not tell me which character set is used."
"character encoding", codepage, and charset may be used interchangeably in many contexts. See What's the difference between encoding and charset? The distinctions are irrelevant for the task of converting from bytes to text and back in Python:
unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)
A bytestring is an immutable sequence of bytes in Python (roughly speaking numbers in 0..255 range) that is used to represent arbitrary binary data e.g., images, zip-archives, encrypted data, and text encoded using some character encoding. A Unicode string is an immutable sequence of Unicode codepoints (roughly speaking, numbers in 0..sys.maxunicode range) that is used to represent text in Python.
Some character encodings such as cp437 support only a few Unicode characters. Others such as utf-8 support the full range of Unicode codepoints.
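As a short illustration of the encode/decode round trip and of the difference between cp437 and utf-8 described above, consider the following sketch. The helper can_encode is a hypothetical name, not a standard library function:

```python
def can_encode(text, encoding):
    """Return True if every character of `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

snowman = u"\N{SNOWMAN}"

# Text -> bytes -> text, using the same character encoding both ways.
utf8_bytes = snowman.encode("utf-8")       # three bytes: E2 98 83
round_trip = utf8_bytes.decode("utf-8")    # the original one-character text

# utf-8 covers all of Unicode; cp437 covers only a small subset.
snowman_in_utf8 = can_encode(snowman, "utf-8")
snowman_in_cp437 = can_encode(snowman, "cp437")
```

The round trip is lossless only because the same encoding name is used for both directions; the encoding is the only piece of information needed to move between bytes and text.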

The right way to add the encoding declaration is:
# -*- coding: utf-8 -*-
It tells Python to interpret the source of the current script as UTF-8; it has nothing to do with the user.

OK, I think I found an answer to all those questions.
1/
Thanks to J.Dev: the shebang only tells the Python interpreter which encoding the file is in, but YOU have to save the file in the encoding you put in the shebang.
2/
Apparently I have to do it manually.
3/
Because an encoding is associated with a charset, if you say encoding=utf-8 then it will always be the Unicode charset.
Some old 1-byte charsets don't have a separate encoding; you don't need one since the characters are all stored in 1 byte, and the natural binary translation is the encoding.
So when you say ASCII, for instance, you mean both the charset and the encoding = ASCII.
But this leaves me wondering: are there other charsets out there with multiple encoding implementations (like Unicode can be encoded in UTF-8/16/32)?

Unsupported characters in input In Python IDLE

suffixes = {
    1: ["ो", "े", "ू", "ु", "ी", "ि", "ा"]}
When I run this, the message given by IDLE is:
Unsupported characters in input
I also do not see the proper font in the MS-DOS console.

What encoding is your source file in?
If it is UTF-8, put the comment
# -*- coding: utf-8 -*-
at the top of the file.

If you don't declare an encoding in the first or second line of your Python source file, the Python 2 interpreter uses ASCII to decode the characters in the file. Since the characters you used cannot be decoded as ASCII, errors occurred.
The solution is as @RemcoGerlich said. Here is what the docs say:
The encoding is used for all lexical analysis, in particular to find the end of a string, and to interpret the contents of Unicode literals. String literals are converted to Unicode for syntactical analysis, then converted back to their original encoding before interpretation starts. The encoding declaration must appear on a line of its own.
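When it is unclear which character triggers the error, a quick check like the following can locate the first byte outside the ASCII range. This is only a sketch: first_non_ascii is a hypothetical helper, and it inspects raw bytes rather than parsing Python source. The sample assumes the file was saved as UTF-8.

```python
def first_non_ascii(source_bytes):
    """Return (line_number, column) of the first byte above 0x7F, or None.

    Line numbers start at 1; columns are zero-based byte offsets.
    """
    for line_number, line in enumerate(source_bytes.splitlines(), 1):
        for column, byte in enumerate(bytearray(line)):
            if byte > 0x7F:
                return line_number, column
    return None

# UTF-8 bytes of a two-line source containing one Devanagari vowel sign.
sample = b'suffixes = {\n    1: ["\xe0\xa4\xbe"]}\n'
position = first_non_ascii(sample)           # (2, 9)
clean = first_non_ascii(b"x = 1\n")          # None: pure ASCII needs no declaration
```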

This seems to be a known bug in the 2.x IDLE console: http://bugs.python.org/issue15809. A fix was made for Python 3.x, but doesn't appear to be backported.
Instead, use an alternative console, such as IPython/Jupyter, or a fully-fledged IDE, such as PyCharm.

SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'

Say I have a function:
def NewFunction():
    return '£'
I want to print some stuff with a pound sign in front of it. When I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone tell me how I can include a pound sign in the value my function returns? I'm using it in a class, and the pound sign appears in the '__str__' method.

I'd recommend reading the PEP that the error points you to. The problem is that your code is being read as ASCII, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string-by-string basis in your code. However, if you are trying to put the pound sign literal into your code, you'll need an encoding that supports it for the entire file.

Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
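The caveat can be made concrete with a few decode calls. This is a sketch using only the standard codecs; the variable names are illustrative. It shows that the byte 0xA3 means different things in different legacy encodings, and that decoding UTF-8 data as Latin-1 succeeds silently but yields the wrong text:

```python
pound = u"\u00a3"                           # POUND SIGN

pound_utf8 = pound.encode("utf-8")          # two bytes: C2 A3
pound_latin1 = pound.encode("latin-1")      # one byte: A3

# Wrong declaration, no error: every byte is valid Latin-1.
wrong = pound_utf8.decode("latin-1")        # two characters, U+00C2 then U+00A3
right = pound_utf8.decode("utf-8")          # the single pound sign

# The same byte 0xA3 under another legacy 8-bit encoding.
other = pound_latin1.decode("cp437")        # U+00FA, not a pound sign
```

Going the other way is at least noisy: decoding the lone byte 0xA3 as UTF-8 raises UnicodeDecodeError, because it is not a valid UTF-8 sequence on its own.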

Adding the following two lines in the script solved the issue for me.
#!/usr/bin/python
# coding=utf-8
Hope it helps !

You're probably trying to run a Python 3 file with a Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you are indeed working on a Python 2 script, a solution not yet mentioned on this page is to resave the file as UTF-8 with a BOM. That adds three special bytes to the start of the file, which explicitly inform the Python interpreter (and your text editor) about the file encoding.
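For reference, the three bytes in question are available as codecs.BOM_UTF8, and the "utf-8-sig" codec writes and strips them. A minimal sketch follows; has_utf8_bom is a hypothetical helper name:

```python
import codecs

def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)

pound = u"\u00a3"
with_bom = pound.encode("utf-8-sig")     # EF BB BF, then C2 A3
without_bom = pound.encode("utf-8")      # C2 A3 only

# Decoding with "utf-8-sig" removes the mark again if it is present.
decoded = with_bom.decode("utf-8-sig")
```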

Unsupported characters in input

I want to assign a string of characters to a variable, but the interpreter rejects it.
There isn't much code to show.
I have a string that I want to assign to a variable:
d="stunning:/ËstÊnÉªÅ/"
Unsupported characters in input
or
word="stuning:/ˈstraɪkɪŋ/"
Unsupported characters in input
So basically the interpreter doesn't allow me to assign it to a variable, so I can't work with it in code.
How can I extract or delete those characters from the text, or is there anything I can do so that Python will support this kind of input?
I've tried converting it into other formats like ANSI, UTF, etc., but without success.
P.S.: I'm using python 2.7

Set the source file encoding according to the actual encoding of the file, so that the interpreter knows how to parse it.
For instance, if you use UTF-8, just add this line to the top of the file:
# -*- coding: utf8 -*-
It must be the first or the second line of the file. See PEP 0263: Defining Python Source Code Encodings.

Just a hint (waiting for the actual code): prepend u to the string to mark it as unicode.
u"/ˈstraɪkɪŋ/"
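If the goal is to delete the unsupported characters rather than keep them, one possible approach is to encode to ASCII while ignoring anything that does not fit. This is a sketch, assuming the source file is saved as UTF-8 and the string is a Unicode string; strip_non_ascii is a hypothetical helper name, and the sample word is made up for illustration:

```python
# -*- coding: utf-8 -*-

def strip_non_ascii(text):
    """Drop every character that cannot be represented in ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

word = u"striking:/ˈstraɪkɪŋ/"
cleaned = strip_non_ascii(word)   # the phonetic symbols are removed
```

Note that this discards information: the IPA symbols are simply dropped, so it only makes sense when the phonetic part is not needed afterwards.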
