Can't input (č ć š ž đ ) in python 2.7.x console - python

So , I've searching trough internet and it is so frustrating. When I try to search I get explanations on how to unicode decode and encode files. But I'm not interested in that. I know this is possible since I was able to do that. I don't know what happened. Also, I've tried reinstalling python. Changing the options under the configure IDLE etc. On my laptop there are no problems at all. I can do this:
>> a = 'ć'
>>
>> print a
>> ć
And on my PC I get:
>> a = 'ć'
>> Unsupported characters in input
I repeat, I'm not talking about encoding in the program. I'm talking about Python console, and it works on my laptop and worked on my previous machines. There's got to be a solution to this issue.
Also, take look at this:
>>> a = u'ç'
>>> a
u'\xe7'
>>> print a
ç
>>> a = u'ć'
Unsupported characters in input
>>>

The Windows console is limited in what it can display. You can change the code page using the old DOS CHCP command.
CHCP 65001
This will change the code page to UTF-8, and make the console more relaxed. You will probably see a square instead of the actual character, but at least you won't see an error.

Try to:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
...

Related

Python pipe cp1252 string from PowerShell to a python (2.7) script

After a few days of dwelling over stackoverflow and python 2.7 doc, I have come to no conclusion about this.
Basically I'm running a python script on a windows server that must have as input a block of text. This block of text (unfortunately) has to be passed by a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding and I really cannot change it due to administrative regulations and whatnot. And when I pipe the text to my python script, when I read it, it comes already with ? whereas characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", as in python gets the text perfectly and then I can decode it to unicode etc. But apparently they don't let me do it on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys
def main(argv=None):
if argv is None:
argv = sys.argv
if len(argv)>1:
for arg in argv[1:]:
print arg.decode('cp1252')
sys.stdin = codecs.getreader('cp1252')(sys.stdin)
text = sys.stdin.read().strip()
print text
return 0
if __name__=="__main__":
sys.exit(main())
Tried it with both the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
How about saving the data into a file and piping it to Python on a CMD session? Invoke Powershell and Python on CMD. Like so,
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
Edit
Another an idea: convert the data into base64 format in Powershell and decode it in Python. Base64 is simple in Powershell, I guess in Python it isn't hard either. Like so,
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë

How to work with UTF-8 in Python 2.7?

I just want to get UTF-8 working. I tried this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
t = "одобрение за"
print t
But when I run this program from the command line, output looks like: одобрение за
I've searched up and down the net, tried the whole sys.setdefaultencoding thing, tried calling encode() and decode(), tried placing the little "u" in front, tried unicode(), etc.
I'm about ready to explode from frustration. Is there a definitive answer for what the heck you're supposed to do?
Your code works for me (tm)
In [1]: t = u"одобрение за"
In [2]: print t
одобрение за
Make sure your terminal supports UTF-8. One way is to check the LANG env-variable:
$ echo $LANG
en_US.UTF-8
also, try the locale command.
$LANG/locale just tells you what your system will use when writing to stdout/stderr.
Best way to test if terminal supports UTF-8 is probably to print something to it and see if it looks correct. Something like this:
echo -e '\xe2\x82\xac'
You should get a €-sign.
If not, try a different shell...
Since you are using Windows cmd.exe, you have to follow two steps:
Make sure your console is using Lucidia console font family (other fonts cannot display UTF-8 properly).
Type chcp 65001 (that's change codepage) and hit enter.
Run your command.
For subsequent runs (once you close the cmd.exe window), you'll have to change the codepage again. The font should be permanent.

Python, windows console and encodings (cp 850 vs cp1252)

I thought I knew everything about encodings and Python, but today I came across a weird problem: although the console is set to code page 850 - and Python reports it correctly - parameters I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.
Am I missing something, or is this a bug in Python ? Windows ? Note: I am running Python 2.6.6 on Windows 7 EN, locale set to French (Switzerland).
In the test program below, I check that literals are correctly interpreted and can be printed - this works. But all values I pass on the command line seem to be encoded wrongly:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys
literal_mb = 'utf-8 literal: üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')
print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
arg = sys.argv[i]
print "arg",i,":",arg
for ch in arg:
print " ",ch,"->",ord(ch),
if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
else:
print ""
In a newly created console, when running
C:\dev>test-encoding.py abcé€
I get the following output
Testing literals
utf-8 literal: üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
a -> 97
b -> 98
c -> 99
Ú -> 233 <- é [assuming input was actually cp1252 ]
Ç -> 128 <- ? [assuming input was actually cp1252 ]
while I would expect the 4th character to have an ordinal value of 130 instead of 233 (see the code pages 850 and 1252).
Notes: the value of 128 for the euro symbol is a mystery - since cp850 does not have it. Otherwise, the '?' are expected - cp850 cannot print the characters and I have used 'replace' in the conversions.
If I change the code page of the console to 1252 by issuing chcp 1252 and run the same command, I (correctly) obtain
Testing literals
utf-8 literal: üèéÃÂç€ÈÚ
unicode literal: üèéÃÂç€ÈÚ
Testing arguments ( stdin/out encodings: cp1252 / cp1252 )
arg 1 : abcé€
a -> 97
b -> 98
c -> 99
é -> 233
€ -> 128
Any ideas what I'm missing ?
Edit 1: I've just tested by reading sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem is really for the command line only. So, is the command line treated differently than the standard input ?
Edit 2: It seems I had the wrong keywords. I found another very close topic on SO: Read Unicode characters from command-line arguments in Python 2.x on Windows. Still, if the command line is not encoded like sys.stdin, and since sys.getdefaultencoding() reports 'ascii', it seems there is no way to know its actual encoding. I find the answer using win32 extensions pretty hacky.
Replying to myself:
On Windows, the encoding used by the console (thus, that of sys.stdin/out) differs from the encoding of various OS-provided strings - obtained through e.g. os.getenv(), sys.argv, and certainly many more.
The encoding provided by sys.getdefaultencoding() is really that - a default, chosen by Python developers to match the "most reasonable encoding" the interpreter use in extreme cases. I get 'ascii' on my Python 2.6, and tried with portable Python 3.1, which yields 'utf-8'. Both are not what we are looking for - they are merely fallbacks for encoding conversion functions.
As this page seems to state, the encoding used by OS-provided strings is governed by the Active Code Page (ACP). Since Python does not have a native function to retrieve it, I had to use ctypes:
from ctypes import cdll
os_encoding = 'cp' + str(cdll.kernel32.GetACP())
Edit: But as Jacek suggests, there actually is a more robust and Pythonic way to do it (semantics would need validation, but until proven wrong, I'll use this)
import locale
os_encoding = locale.getpreferredencoding()
# This returns 'cp1252' on my system, yay!
and then
u_argv = [x.decode(os_encoding) for x in sys.argv]
u_env = os.getenv('myvar').decode(os_encoding)
On my system, os_encoding = 'cp1252', so it works. I am quite certain this would break on other platforms, so feel free to edit and make it more generic. We would certainly need some kind of translation table between the ACP reported by Windows and the Python encoding name - something better than just prepending 'cp'.
This is a unfortunately a hack, although I find it a bit less intrusive than the one suggested by this ActiveState Code Recipe (linked to by the SO question mentioned in Edit 2 of my question). The advantage I see here is that this can be applied to os.getenv(), and not only to sys.argv.
I tried the solutions. It may still have some encoding problems. We need to use true type fonts.
Fix:
Run chcp 65001 in cmd to change the encoding to UTF-8.
Change cmd font to a True-Type one like Lucida Console that supports the
preceding code pages before 65001
Here's my complete fix for the encoding error:
def fixCodePage():
import sys
import codecs
import ctypes
if sys.platform == 'win32':
if sys.stdout.encoding != 'cp65001':
os.system("echo off")
os.system("chcp 65001") # Change active page code
sys.stdout.write("\x1b[A") # Removes the output of chcp command
sys.stdout.flush()
LF_FACESIZE = 32
STD_OUTPUT_HANDLE = -11
class COORD(ctypes.Structure):
_fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]
class CONSOLE_FONT_INFOEX(ctypes.Structure):
_fields_ = [("cbSize", ctypes.c_ulong),
("nFont", ctypes.c_ulong),
("dwFontSize", COORD),
("FontFamily", ctypes.c_uint),
("FontWeight", ctypes.c_uint),
("FaceName", ctypes.c_wchar * LF_FACESIZE)]
font = CONSOLE_FONT_INFOEX()
font.cbSize = ctypes.sizeof(CONSOLE_FONT_INFOEX)
font.nFont = 12
font.dwFontSize.X = 7
font.dwFontSize.Y = 12
font.FontFamily = 54
font.FontWeight = 400
font.FaceName = "Lucida Console"
handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
ctypes.windll.kernel32.SetCurrentConsoleFontEx(handle, ctypes.c_long(False), ctypes.pointer(font))
Note: You can see a font change while executing the program.
Well what worked for me was using following code sniped:
# -*- coding: utf-8 -*-
import os
import sys
print (f"OS: {os.device_encoding(0)}, sys: {sys.stdout.encoding}")
comparing both on some windows systems with python 3.8, showed that os.device_encoding(0) always reflected code page setting in terminal. (Tested with Powershell and with old cmd-shell on Windows 10 and Windows 7)
This was even true after changing the terminals code page with shell command:
chcp 850
or e.g.:
chcp 1252
Now using os.device_encoding(0) for tasks like decoding a subprocess stdout result from bytes to string worked out even with Non-ASCII chars like é, ö, ³, ↓, ...
So as other already pointed out on windows local setting is really just a system information, about user preferences, but not what shell actually might currently use.

Lost with encodings (shell and accents)

I'm having trouble with encodings.
I'm using version
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
I have chars with accents like é à.
My scripts uses utf-8 encoding
#!/usr/bin/python
# -*- coding: utf-8 -*-
Users can type strings usings raw_input() with .
def rlinput(prompt, prefill=''):
readline.set_startup_hook(lambda: readline.insert_text( prefill))
try:
return raw_input(prompt)
finally:
readline.set_startup_hook()
called in the main loop 'pseudo' shell
while to_continue :
to_continue, feedback = action( unicode(rlinput(u'todo > '),'utf-8') )
os.system('clear')
print T, u"\n" + feedback
Data are stored as pickle in files.
I managed to have the app working but finaly get stupid things like
core file :
class Task()
...
def __str__(self):
r = (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()
return r.encode('utf-8')
and so in shell file :
feedback = jaune + str(t).decode('utf-8') + vert + u" supprimée"
That's where i realize that i might be totaly wrong with encoding/decoding.
So I tried to decode directly in rlinput but failed.
I read some post in stackoverflow, re-read http://docs.python.org/library/codecs.html
Waiting for my python book, i'm lost :/
I guess there is a lot of bad code but my question here is only related to encoding issus.
You can find the code here : (most comments in french, sorry that's for personnal use and i'm a beginner, you'll also need yapsy - http://yapsy.sourceforge.net/ ) (then configure paths, then in py_todo : ./todo_shell.py) : http://bit.ly/rzp9Jm
Standard input and output are byte-based on all Unix systems. That's why you have to call the unicode function to get character-strings for them. The decode error indicates that the bytes coming in are not valid UTF-8.
Basically, the problem is the assumption of UTF-8 encoding, which is not guaranteed. Confirm this by changing the encoding in your unicode call to 'ISO-8859-1', or by changing the character encoding of your terminal emulator to UTF-8. (Putty supports this, in the "Translation" menu.)
If the above experiment confirms this, your challenge is to support the locale of the user and deduce the correct encoding, or perhaps to make the user declare the encoding in a command line argument or configuration. The $LANG environment variable is about the best you can do without an explicit declaration, and I find it to be a poor indicator of the desired character encoding.
As #wberry suggested i checked the encodings : ok
$ file --mime-encoding todo_shell.py task.py todo.py
todo_shell.py: utf-8
task.py: utf-8
todo.py: utf-8
$ echo $LANG
fr_FR.UTF-8
$ python -c "import sys; print sys.stdin.encoding"
UTF-8
As #eryksun suggested as decoded the user inputs (+encode the strings submited before) (solved some problems if my memory is good) (will test deeply later) :
def rlinput(prompt, prefill=''):
readline.set_startup_hook(lambda: readline.insert_text( prefill.encode(sys.stdin.encoding) ))
try:
return raw_input( prompt ).decode( sys.stdin.encoding )
finally:
readline.set_startup_hook()
I still hav problems but my question was not well defined so i can't get clear answer.
I feel less lost now and have directions to search.
Thank you !
EDIT : i replaced the str methodes with unicode and it killed some (all?) probs.
Thanks #eryksun for the tips. (this links helped me : Python __str__ versus __unicode__ )

How to print japanese utf-8 on console in windows?

#coding=<utf8>
import os
os.popen('chcp 65001')
a = 'こんにちは世界'
print a.decode('utf8')
x = raw_input()
PYTHON 2.6 on Windows 7
It will run in IDLE with no errors.
However when run from the console, it errors and flashes very quickly and I can't read the error message.
How can it be done in windows console?
By the way, doing this with other languages like spanish or portuguese will work fine. It's languages like japanese, russian, greek, hebrew that have this error behavior in the windows console.
*EDIT
as requested I changed to this code:
#coding=<utf8>
import os, sys
os.popen('chcp 65001')
print(sys.stdout.encoding)
x = raw_input('press enter to continue')
a = 'こんにちは世界'
print a.decode('utf8')
x = raw_input()
It will print:
cp437
and then of course, continue on to flash and fail on the decoding bit...
It looks like the popen('chcp 65001') doesn't work in changing the codepage.
I still don't think this is the root of the problem, however it would be helpful to know an efficient way of changing this codepage.
Update
Never mind. The OP is using Windows.
Interestingly changing the encoding declaration to #encoding=<utf8> did not work in Ubuntu.
Original Answer
This worked for me (Ubuntu Jaunty, Python 2.6.2). The only change I made was to the first line declaring the encoding.
# encoding: utf-8
import os
os.popen('chcp 65001')
a = 'こんにちは世界'
print a.decode('utf8')
x = raw_input()

Categories