Python Notepad++ encoding error

I am using Python for the first time and am running into an encoding error that I can't seem to get around. Here is the code:
#!/usr/bin/python
#-*- coding: utf -*-
import pandas as pd
a = "C:\Users"
print(a)
When I do this, I get:
File "C:\Users\Public\Documents\Python Scripts\ImportExcel.py", line
5
a = "C:\Users"
^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
In Notepad++ I have tried all of the encoding options. Nothing seems to work.
Any suggestions?

Specifically, the problem is that the '\' is an escape character.
If you want to print the string
"C:\Users"
then you have to do it thus:
a = "C:\\Users"
Hope this helps.
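For completeness, there are three common ways to write such a path; a quick sketch (all three print the same path):

```python
# Three equivalent ways to write a Windows path without triggering
# the truncated \UXXXXXXXX escape error:
a = "C:\\Users"   # escape the backslash
b = r"C:\Users"   # raw string literal: backslashes are taken literally
c = "C:/Users"    # forward slashes are accepted by most Windows APIs

print(a)
print(b)
print(c)
```

A raw string (the `r` prefix) is usually the most readable choice for Windows paths.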

The error message suggests you're on a Windows machine, but you're using *nix notation for #!/usr/bin/python. That line should look something like #!C:\Python33\python.exe on a Windows machine, depending on where you've installed Python.

Use this: # -*- coding: utf-8 -*- instead of #-*- coding: utf -*-

You can set the encoding in Notepad++, but you also need to tell Python about it.
In legacy Python (2.7), source code is ASCII unless specified otherwise. In Python 3, source code is UTF-8 unless otherwise specified.
You should use the following as the first or second line of the file to specify the encoding of the source code. The documentation gives:
# -*- coding: <encoding> -*-
This is the format originally from the Emacs editor, but according to PEP 263 you can also use:
# vim: set fileencoding=<encoding>:
or even:
# coding=<encoding>
Where <encoding> can be any encoding that Python supports, but utf-8 is generally a good choice for portable code.
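Python reads this magic comment with the same algorithm that tokenize.detect_encoding exposes, so you can check which encoding a given declaration selects; a small sketch using a made-up source snippet:

```python
import io
import tokenize

# A hypothetical source file declaring its encoding with the short form:
source = b"# coding=cp1251\nprint('hello')\n"

# detect_encoding applies PEP 263's rules to the first two lines.
encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # cp1251
```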

Related

Unicode with cp1251 and utf-8 on windows

I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified the encoding as cp1251, I expected it to be decoded correctly.
But instead the output contains garbage code points.
I understand that 'юникод' here is just bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But is there a way to get correct output in the terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.
I think I can see what happened to you. The last line gave me the hint, and your garbage code points confirmed it: you are trying to display cp1251 characters, but your editor is configured to use utf8.
The # -*- coding: cp1251 -*- line is only used by the Python interpreter to convert characters from source files that are outside of the ASCII range. And in any case it only matters for unicode literals, because bytes from the original source give, er... exactly the same bytes in byte strings. Some text editors are kind enough to automagically honor this line (the IDLE editor is), but I have little confidence in that and always switch manually to the proper encoding when I use gvim, for example. Short story: # -*- coding: cp1251 -*- is unused in your code and can only mislead a reader, since it is not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, the word юникод is composed of these bytes: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
Alternatively, if you are comfortable with your current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give the same results - but now the declared source encoding matches the actual encoding of your Python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode() or unicode() calls. If you are using a literal to create a string, you should use Unicode literals if the string contains text. It is the only option on Python 3, where you can't put non-ascii characters inside a bytes literal, and it is good practice (using Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
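For example, if an API hands you bytes that you happen to know are UTF-8 (an assumption here; the real encoding depends on the data source), you decode them explicitly at the boundary and work with the resulting Unicode string:

```python
# Bytes as received from some hypothetical API; we assume they are UTF-8,
# matching the byte sequence shown earlier in this thread.
data = b'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'

text = data.decode('utf-8')  # explicit decode: bytes in, Unicode out
print(text)                  # юникод
```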
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally¹ detect the terminal encoding and encode the print output correctly.
¹ Unless your terminal is misconfigured.
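You can ask Python which encoding it detected for your terminal; a small sketch (the printed value depends entirely on your system):

```python
import sys

# The encoding Python will use when print writes to this terminal;
# typically utf-8 on modern systems, or a Windows code page such as
# cp1251 or cp437, depending on the console configuration.
print(sys.stdout.encoding)
```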

Inserting a unicode text into pyx canvas

I have a set of UTF-8 characters that I would like to insert into a PyX generated pdf file.
I have included # -*- coding: utf-8 -*- to top of the file. The code is somewhat similar to the following:
# -*- coding: utf-8 -*-
c = canvas.canvas()
txt = "u'aあä'"
c.text(2, 2, "ID: %s"%txt)
c.writeEPSfile("filename.eps")
But I still can't get my head around this.
Error:
'ascii' codec can't encode character u'\xae' in position 47: ordinal not in range(128)
Try this:
# -*- coding: utf-8 -*-
c = canvas.canvas()
txt = u'aあä'.encode('utf-8')
c.text(1, 4, "UID: %s"%(txt))
c.writeEPSfile("filename.eps")
You can set up PyX to pass unicode characters to (La)TeX. Then it all becomes a matter of producing the characters in question within TeX/LaTeX. Here is a rather minimal solution to produce the output in question:
from pyx import *
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage[utf8]{inputenc}')
text.preamble(r'\usepackage{newunicodechar}')
text.preamble(r"\newunicodechar{あ}{{\usefont{U}{min}{m}{n}\symbol{'102}}}")
text.preamble(r'\DeclareFontFamily{U}{min}{}')
text.preamble(r'\DeclareFontShape{U}{min}{m}{n}{<-> udmj30}{}')
c = canvas.canvas()
c.text(0, 0, 'UID: aあä')
c.writeGSfile('utf8.png')
This directly produces the desired output (originally uploaded as a PNG).
Note that this was done using PyX 0.13 on Python 3 and a rather standard LaTeX installation. Also, I used some information from https://tex.stackexchange.com/questions/171611/how-to-write-a-single-hiragana-character-in-latex about creating those characters in LaTeX. There seem to be solutions like CJKutf8 to set up all kinds of characters for direct use as unicode characters within LaTeX, but this is way out of my experience. Anyway, it should all work fine from within PyX, like it does from LaTeX itself, if all the setup has been done properly. Good luck!
Maybe you can find a suitable setup in the babel package.
I ran into the same error when I tried to insert the German ä (a-umlaut). I simply added the German babel package:
text.preamble(r"\usepackage[ngerman]{babel}")
After that, this was possible without errors:
c.text(12, 34, "äöüßß")
I also used a utf8 input encoding; I think it is necessary as well.
Further reading:
https://en.wikibooks.org/wiki/LaTeX/Internationalization
https://en.wikibooks.org/wiki/LaTeX/Fonts

Trying to print check mark U+2713 in Python on Windows produces UnicodeEncodeError

I have read through Print the "approval" sign/check mark (✓) U+2713 in Python, but none of the answers work for me. I am running Python 2.7 on Windows.
print u'\u2713'
produces this error:
exceptions.UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>
This:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print '✓'
does not work because I am using Windows.
print u'\u2713'.encode('utf8')
Prints out Γ£ô, which is not the right character.
print('\N{check mark}')
Is just silly. This prints \N{check mark} literally.
Read http://www.joelonsoftware.com/articles/Unicode.html and you will understand what is going on.
The bad news is: you won't be able to print that character because it simply is not available in the default text encoding of your Windows terminal. Modify your terminal configuration to use utf-8 instead of the default cp-852 (or whatever the default for Windows' cmd is these days), and you should be good, but do so only after reading the above article, seriously.
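The Γ£ô output from the .encode('utf8') attempt is easy to reproduce: the three UTF-8 bytes of U+2713 each get rendered as a separate character through the console's legacy code page. A sketch (assuming cp437, a common cmd default; your code page may differ), runnable on Python 3:

```python
# U+2713 (CHECK MARK) encoded as UTF-8 is the three bytes e2 9c 93.
raw = u'\u2713'.encode('utf-8')
print(repr(raw))            # b'\xe2\x9c\x93'

# A cp437 console interprets each byte as its own character,
# producing the mojibake seen in the question:
print(raw.decode('cp437'))  # Γ£ô
```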

urllib.parse.quote won't take utf8

I am trying to use urllib.parse.quote as intended but can't get it to work. I even tried the example given in the documentation.
Example: quote('/El Niño/') yields '/El%20Ni%C3%B1o/'.
If I try this following happens.
quote('/El Niño/')
File "<stdin>", line 0
^
SyntaxError: 'utf-8' codec can't decode byte 0xf1 in position 13: invalid continuation byte
Anyone got a hint what is wrong? I am using Python 3.2.3
PS: Link to the docs http://docs.python.org/3.2/library/urllib.parse.html
\xf1 is a latin-1 encoded ñ
>>> print(b'\xf1'.decode("latin-1"))
ñ
...not a utf-8 encoded character, like Python 3 assumes by default:
>>> print(b'\xf1'.decode("utf-8"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: unexpected end of data
Meaning, there is an encoding issue either with the .py file you have written, or the terminal in which you are running the Python shell - it is supplying latin-1 encoded data to Python, not utf-8
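When the source file and terminal really are UTF-8, the documented example works as advertised under Python 3; a quick check:

```python
from urllib.parse import quote

# With a properly UTF-8-encoded source, ñ reaches quote() as real text
# and is percent-encoded as its two UTF-8 bytes (c3 b1):
print(quote('/El Niño/'))  # /El%20Ni%C3%B1o/
```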
try adding the following line at the beginning of your source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os, sys
…
By default, Python 2 assumes that your source code is written in ASCII, and thus it can't read unicode strings from your source file. Read PEP 263 on the topic.
Though, if you switch to Python 3, you don't need to place that coding: utf-8 comment after the shebang line, because utf-8 is the default.
Edit: Just noticed that you are actually trying to use Python 3, which should be utf-8-safe. Though looking at the error, it seems to me that you're actually running Python 2 while you think you are running Python 3.
Is the shebang line correctly set?
Are you calling the script with the right interpreter?
here's the right shebang line:
#!/usr/bin/env python3
or
#!/usr/bin/python3
and not just #!/usr/bin/python or #!/usr/bin/env python.
Can you add your full failing script, and the way you're calling it, to your question?

emacs constantly asking before saving python code with # -*- coding: ASCII -*-

I have a python script that starts with:
#!/usr/bin/env python
# -*- coding: ASCII -*-
and prior to saving, it always splits my window, and asks:
Warning (mule): Invalid coding system `ASCII' is specified
for the current buffer/file by the :coding tag.
It is highly recommended to fix it before writing to a file.
and I need to say yes. Is there a way to disable this? Sorry for asking, but I had no luck on Google.
Gabriel
A solution that doesn't involve changing the script is to tell Emacs what ASCII means as a coding system. (By default, Emacs calls it US-ASCII instead.) Add this to your .emacs file:
(define-coding-system-alias 'ascii 'us-ascii)
Then Emacs should be able to understand # -*- coding: ASCII -*-.
The Python Enhancement Proposal (PEP) 263, Defining Python Source Code Encodings, discusses a number of ways of defining the source code encoding. Two particular points are relevant here:
Without encoding comment, Python's parser will assume ASCII
So you don't need this at all in your file. Still, if you do want to be explicit about the file encoding:
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
# coding=<encoding name>
(note that the = can be replaced by a :). So you can use # coding: ascii instead of the more verbose # -*- coding: ASCII -*-, as suggested by this answer. This seems to keep emacs happy.
