Lost with encodings (shell and accents) - python

I'm having trouble with encodings.
I'm using version
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
I have characters with accents like é and à.
My script uses UTF-8 encoding:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Users can type strings using raw_input(), wrapped like this:
import readline

def rlinput(prompt, prefill=''):
    readline.set_startup_hook(lambda: readline.insert_text(prefill))
    try:
        return raw_input(prompt)
    finally:
        readline.set_startup_hook()
which is called in the main loop of the 'pseudo' shell:
while to_continue:
    to_continue, feedback = action( unicode(rlinput(u'todo > '), 'utf-8') )
    os.system('clear')
    print T, u"\n" + feedback
Data are stored as pickle in files.
I managed to get the app working, but finally ended up with stupid things like this.
Core file:
class Task():
    ...
    def __str__(self):
        r = (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()
        return r.encode('utf-8')
and consequently, in the shell file:
feedback = jaune + str(t).decode('utf-8') + vert + u" supprimée"
That's where I realized that I might be totally wrong with encoding/decoding.
So I tried to decode directly in rlinput but failed.
I read some posts on Stack Overflow and re-read http://docs.python.org/library/codecs.html
While waiting for my Python book, I'm lost :/
I guess there is a lot of bad code, but my question here is only related to encoding issues.
You can find the code here (most comments are in French, sorry, it's for personal use and I'm a beginner; you'll also need yapsy - http://yapsy.sourceforge.net/ ). Configure the paths, then in py_todo run ./todo_shell.py: http://bit.ly/rzp9Jm

Standard input and output are byte-based on all Unix systems. That's why you have to call the unicode function to get character-strings for them. The decode error indicates that the bytes coming in are not valid UTF-8.
Basically, the problem is the assumption of UTF-8 encoding, which is not guaranteed. Confirm this by changing the encoding in your unicode call to 'ISO-8859-1', or by changing the character encoding of your terminal emulator to UTF-8. (PuTTY supports this, in the "Translation" menu.)
If the above experiment confirms this, your challenge is to support the locale of the user and deduce the correct encoding, or perhaps to make the user declare the encoding in a command line argument or configuration. The $LANG environment variable is about the best you can do without an explicit declaration, and I find it to be a poor indicator of the desired character encoding.
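For example, here is a minimal sketch (not part of the original answer) of deriving the input encoding from the user's locale or from $LANG instead of hard-coding 'utf-8'; it reuses the rlinput helper defined in the question:

import locale
import os

input_encoding = locale.getpreferredencoding() or 'utf-8'

# Or, more crudely, take it from $LANG directly, e.g. "fr_FR.UTF-8" -> "UTF-8":
lang = os.environ.get('LANG', '')
if '.' in lang:
    input_encoding = lang.split('.', 1)[1]

line = unicode(rlinput(u'todo > '), input_encoding)  # rlinput as defined in the question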

As @wberry suggested, I checked the encodings: OK.
$ file --mime-encoding todo_shell.py task.py todo.py
todo_shell.py: utf-8
task.py: utf-8
todo.py: utf-8
$ echo $LANG
fr_FR.UTF-8
$ python -c "import sys; print sys.stdin.encoding"
UTF-8
As @eryksun suggested, I decoded the user input (and encoded the prefill strings submitted to readline), which solved some problems if my memory is good (I will test more deeply later):
import readline
import sys

def rlinput(prompt, prefill=''):
    readline.set_startup_hook(lambda: readline.insert_text(prefill.encode(sys.stdin.encoding)))
    try:
        return raw_input(prompt).decode(sys.stdin.encoding)
    finally:
        readline.set_startup_hook()
I still have problems, but my question was not well defined, so I can't get a clear answer.
I feel less lost now and have directions to search.
Thank you !
EDIT: I replaced the __str__ methods with __unicode__ and it killed some (all?) of the problems.
Thanks @eryksun for the tips. (This link helped me: Python __str__ versus __unicode__.)
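A minimal sketch of that change (the attributes are guessed from the snippets above, not the actual project code): keep the class returning unicode and encode only at the output boundary.

class Task(object):
    def __init__(self, desc, done=False):
        self._desc = desc          # kept as unicode internally
        self._done = done

    def getDesc(self):
        return self._desc

    def __unicode__(self):
        return (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()

    def __str__(self):
        # bytes only when str() is really required
        return unicode(self).encode('utf-8')

The shell code can then build its feedback from unicode(t) directly instead of str(t).decode('utf-8').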

Related

ArcPy and Python encoding messing up?

I am faced with a strange behavior between ArcPy and Python encoding. I work with Visual Studio 2010 Shell with Python Tools for VS (PTVS) installed. I isolated my problem in a simple script file containing the commands below. In Visual Studio, I have set « Advanced Save Options... » to « UTF-8 without signature ». The script simply prints an accented string on the screen, then imports the arcpy module, then prints the same string again. Importing ArcPy seems to change the Python encoding setup, but I don't know why, and I would like to restore it correctly because it causes problems a little bit everywhere in the original script.
I checked the Python « encodings » folder and erased every .pyc file. Then I ran the script and it generated 3 .pyc files:
cp850.pyc (which corresponds to my stdout.encoding)
cp1252.pyc (which corresponds to my Windows environment encoding)
utf_8.pyc (which fits the encoding of my script)
When ArcPy is imported, something alters the encoding in a way that affects the existing variables.
Why?
Is it possible, with some Python command, to find where ArcPy sets cp1252 and to read it, so that I can write a function that deals with it?
# -*- coding: utf-8 -*-
import sys
print ('Loaded encoding : %(t)s'%{'t':sys.getdefaultencoding()})
reload(sys) # See stackoverflow question 2276200
sys.setdefaultencoding('utf-8')
print ('Set default encoding : %(t)s'%{'t':sys.getdefaultencoding()})
print ''
texte = u'Récuperation des données'
print ('Original type : %(t)s'%{'t':type(texte)})
print ('Original text : %(t)s'%{'t':texte})
print ''
import arcpy
print ('imported arcpy')
print ('Loaded encoding : %(t)s'%{'t':sys.getdefaultencoding()})
print ''
print ('arcpy mess up original type : %(t)s'%{'t':type(texte)})
print ('arcpy mess up original text : %(t)s'%{'t':texte})
print ''
print ('arcpy mess up reencoded with cp1252 type : %(t)s'%{'t':type(texte.encode('cp1252'))})
print ('arcpy mess up reencoded with cp1252 text : %(t)s'%{'t':texte.encode('cp1252')})
raw_input()
and when I run the script, I get these results :
Loaded encoding : ascii
Set encoding : utf-8
Original type : type 'unicode'
Original text : Récuperation des données <--- This is right
import arcpy
Loaded encoding : utf-8
arcpy mess up original type : type 'unicode'
arcpy mess up original text : R'cuperation des donn'es> <--- This is wrong
arcpy mess up ReEncode with cp1252 type : type 'str'
arcpy mess up ReEncode with cp1252 text : Récuperation des données> <--- This fits with the original unicode
Answering my question.
From ESRI support, I got this information :
By default, python in the command line is not going to change the code page to a UTF-8 based text for print statements to show up in Unicode. ArcGIS on the other hand specifically allows unicode values to be passed to it and has changed the code page within the command line so that the values you see printed are the values ArcGIS is using. This is why the command line should be the only environment where you see the import sys followed by import arcpy give you a different printed value.
Since my application runs scripts that do not always need arcpy, depending on what I want it to do, I solved my problem by writing a generic function that deals with the encoding, whether or not arcpy has been imported, using the information provided by:
import sys
import locale

Coding_CMD_Window = sys.stdout.encoding
Coding_OS = locale.getpreferredencoding()
Coding_Script = sys.getdefaultencoding()
Coding2Use = Coding_CMD_Window
if any('arcpy' in importedmodules for importedmodules in sys.modules):
    Coding2Use = Coding_OS
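A hypothetical sketch of what such a generic function could look like (the names console_encoding and safe_print are mine, not the original code):

import sys
import locale

def console_encoding():
    # Use the OS preferred encoding once arcpy is loaded, otherwise the console's.
    if 'arcpy' in sys.modules:
        return locale.getpreferredencoding()
    return sys.stdout.encoding or locale.getpreferredencoding()

def safe_print(text):
    if isinstance(text, unicode):
        text = text.encode(console_encoding(), 'replace')
    print text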
Also, I made sure that all of my scripts had the proper UTF-8 encoding without signature.
Hope this helps anyone.
For those in doubt, try something like the following (e.g., in a .py file):
import codecs
#import arcpy
f = codecs.open('utf.file.txt', encoding='utf-8-sig') #assuming a BOM present
l = f.readlines()
print u''.join(l)
Then run the same code once more, but first remove the hash comment from the arcpy line. It'll take about 6 seconds more time.
What I get is perfectly fine text when running the first version, and gibberish when allowing arcpy to load.
ArcGIS for Desktop version used: 10.2.1

Python pipe cp1252 string from PowerShell to a python (2.7) script

After a few days of dwelling on Stack Overflow and the Python 2.7 docs, I have come to no conclusion about this.
Basically I'm running a python script on a windows server that must have as input a block of text. This block of text (unfortunately) has to be passed by a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding and I really cannot change it due to administrative regulations and whatnot. And when I pipe the text to my Python script and read it, it already comes with ? where characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", in the sense that Python gets the text perfectly and then I can decode it to unicode etc. But apparently they won't let me do that on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys

def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv) > 1:
        for arg in argv[1:]:
            print arg.decode('cp1252')
    sys.stdin = codecs.getreader('cp1252')(sys.stdin)
    text = sys.stdin.read().strip()
    print text
    return 0

if __name__ == "__main__":
    sys.exit(main())
I tried it both with the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
How about saving the data into a file and piping it to Python in a CMD session? Invoke PowerShell and Python from CMD, like so:
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
Edit
Another idea: convert the data to base64 in PowerShell and decode it in Python. Base64 is simple in PowerShell, and I guess in Python it isn't hard either. Like so:
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë
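On the Python 2.7 side, the decoding step might look like this minimal sketch (it assumes the base64 text arrives on stdin, which the answer does not show):

import base64
import sys

b64_text = sys.stdin.read().strip()        # e.g. "w6nDqsOow6s="
raw_bytes = base64.b64decode(b64_text)     # back to the original UTF-8 bytes
text = raw_bytes.decode('utf-8')           # u'éêèë'
print text.encode(sys.stdout.encoding or 'cp1252', 'replace')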

Problems with Encoding in Eclipse Console and Python

I guess I need some help regarding encodings in Python (2.6) and Eclipse. I used Google and the SO search and tried a lot of things, but as a matter of fact I don't get it.
So, how do I get the output in the Eclipse console to show äöü etc.?
I tried:
Declaring the document encoding in the first line with
# -*- coding: utf-8 -*-
I changed the encoding settings in Window/Preferences/General/Workspace and Project/Properties to UTF-8
As nothing changed, I tried the following things alone and in combination, but nothing seemed to work out:
Changing the stdout as mentioned in the Python Cookbook:
sys.stdout = codecs.lookup("utf-8")[-1](sys.stdout)
Adding a unicode u:
print u"äöü".encode('UTF8')
reloading sys (I don't know what for but it doesn't work either ;-))
I am trying to do this in order to debug the encoding-problems I have in my programs... (argh)
Any ideas? Thanks in advance!
EDIT:
I work on Windows 7 and it is EasyEclipse
Got it! If you have the same problem, go to
Run/Run Configurations/Common and select UTF-8 (for example) as the console encoding.
So, finally, print "ö" results in "ö"
Even though this is a bit of an old question, I'm new to Stack Overflow and I'd like to contribute a bit. You can change the default encoding in Eclipse (currently Neon) for all text editors from the menu Window -> Preferences -> General -> Workspace : Text file encoding.

Python, windows console and encodings (cp 850 vs cp1252)

I thought I knew everything about encodings and Python, but today I came across a weird problem: although the console is set to code page 850 - and Python reports it correctly - parameters I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.
Am I missing something, or is this a bug in Python ? Windows ? Note: I am running Python 2.6.6 on Windows 7 EN, locale set to French (Switzerland).
In the test program below, I check that literals are correctly interpreted and can be printed - this works. But all values I pass on the command line seem to be encoded wrongly:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys
literal_mb = 'utf-8 literal: üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')
print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
    arg = sys.argv[i]
    print "arg",i,":",arg
    for ch in arg:
        print " ",ch,"->",ord(ch),
        if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
            print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
        else:
            print ""
In a newly created console, when running
C:\dev>test-encoding.py abcé€
I get the following output
Testing literals
utf-8 literal: üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
a -> 97
b -> 98
c -> 99
Ú -> 233 <- é [assuming input was actually cp1252 ]
Ç -> 128 <- ? [assuming input was actually cp1252 ]
while I would expect the 4th character to have an ordinal value of 130 instead of 233 (see the code pages 850 and 1252).
Notes: the value of 128 for the euro symbol is a mystery - since cp850 does not have it. Otherwise, the '?' are expected - cp850 cannot print the characters and I have used 'replace' in the conversions.
If I change the code page of the console to 1252 by issuing chcp 1252 and run the same command, I (correctly) obtain
Testing literals
utf-8 literal: üèéÃÂç€ÈÚ
unicode literal: üèéÃÂç€ÈÚ
Testing arguments ( stdin/out encodings: cp1252 / cp1252 )
arg 1 : abcé€
a -> 97
b -> 98
c -> 99
é -> 233
€ -> 128
Any ideas what I'm missing ?
Edit 1: I've just tested by reading sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem is really for the command line only. So, is the command line treated differently than the standard input ?
Edit 2: It seems I had the wrong keywords. I found another very close topic on SO: Read Unicode characters from command-line arguments in Python 2.x on Windows. Still, if the command line is not encoded like sys.stdin, and since sys.getdefaultencoding() reports 'ascii', it seems there is no way to know its actual encoding. I find the answer using win32 extensions pretty hacky.
Replying to myself:
On Windows, the encoding used by the console (thus, that of sys.stdin/out) differs from the encoding of various OS-provided strings - obtained through e.g. os.getenv(), sys.argv, and certainly many more.
The encoding provided by sys.getdefaultencoding() is really that - a default, chosen by the Python developers to match the "most reasonable encoding" the interpreter uses in extreme cases. I get 'ascii' on my Python 2.6, and tried with portable Python 3.1, which yields 'utf-8'. Neither is what we are looking for - they are merely fallbacks for encoding conversion functions.
As this page seems to state, the encoding used by OS-provided strings is governed by the Active Code Page (ACP). Since Python does not have a native function to retrieve it, I had to use ctypes:
from ctypes import cdll
os_encoding = 'cp' + str(cdll.kernel32.GetACP())
Edit: But as Jacek suggests, there actually is a more robust and Pythonic way to do it (semantics would need validation, but until proven wrong, I'll use this)
import locale
os_encoding = locale.getpreferredencoding()
# This returns 'cp1252' on my system, yay!
and then
u_argv = [x.decode(os_encoding) for x in sys.argv]
u_env = os.getenv('myvar').decode(os_encoding)
On my system, os_encoding = 'cp1252', so it works. I am quite certain this would break on other platforms, so feel free to edit and make it more generic. We would certainly need some kind of translation table between the ACP reported by Windows and the Python encoding name - something better than just prepending 'cp'.
This is unfortunately a hack, although I find it a bit less intrusive than the one suggested by this ActiveState Code Recipe (linked to by the SO question mentioned in Edit 2 of my question). The advantage I see here is that this can be applied to os.getenv(), and not only to sys.argv.
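A rough sketch of such a translation table (deliberately incomplete; only the special cases I am sure of are listed, everything else falls back to the 'cp' prefix):

from ctypes import cdll

# ACP values that do not map to a plain "cpNNNN" codec name
ACP_TO_CODEC = {
    65000: 'utf-7',
    65001: 'utf-8',
    20127: 'ascii',
}

def acp_to_codec(acp):
    return ACP_TO_CODEC.get(acp, 'cp%d' % acp)

os_encoding = acp_to_codec(cdll.kernel32.GetACP())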
I tried the solutions, but there may still be some encoding problems; we need to use TrueType fonts.
Fix:
Run chcp 65001 in cmd to change the encoding to UTF-8.
Change the cmd font to a TrueType one like Lucida Console that supports the code pages preceding 65001.
Here's my complete fix for the encoding error:
def fixCodePage():
    import os
    import sys
    import ctypes
    if sys.platform == 'win32':
        if sys.stdout.encoding != 'cp65001':
            os.system("echo off")
            os.system("chcp 65001")         # Change active code page
            sys.stdout.write("\x1b[A")      # Removes the output of the chcp command
            sys.stdout.flush()
        LF_FACESIZE = 32
        STD_OUTPUT_HANDLE = -11

        class COORD(ctypes.Structure):
            _fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]

        class CONSOLE_FONT_INFOEX(ctypes.Structure):
            _fields_ = [("cbSize", ctypes.c_ulong),
                        ("nFont", ctypes.c_ulong),
                        ("dwFontSize", COORD),
                        ("FontFamily", ctypes.c_uint),
                        ("FontWeight", ctypes.c_uint),
                        ("FaceName", ctypes.c_wchar * LF_FACESIZE)]

        font = CONSOLE_FONT_INFOEX()
        font.cbSize = ctypes.sizeof(CONSOLE_FONT_INFOEX)
        font.nFont = 12
        font.dwFontSize.X = 7
        font.dwFontSize.Y = 12
        font.FontFamily = 54
        font.FontWeight = 400
        font.FaceName = "Lucida Console"
        handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        ctypes.windll.kernel32.SetCurrentConsoleFontEx(handle, ctypes.c_long(False), ctypes.pointer(font))
Note: You can see a font change while executing the program.
Well, what worked for me was using the following code snippet:
# -*- coding: utf-8 -*-
import os
import sys
print (f"OS: {os.device_encoding(0)}, sys: {sys.stdout.encoding}")
Comparing both on some Windows systems with Python 3.8 showed that os.device_encoding(0) always reflected the code page setting of the terminal. (Tested with PowerShell and with the old cmd shell on Windows 10 and Windows 7.)
This was even true after changing the terminal's code page with the shell command:
chcp 850
or e.g.:
chcp 1252
Now using os.device_encoding(0) for tasks like decoding a subprocess stdout result from bytes to string worked out even with Non-ASCII chars like é, ö, ³, ↓, ...
So, as others have already pointed out, on Windows the locale setting is really just system information about user preferences, not what the shell might actually be using at the moment.
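A minimal sketch (Python 3.8 on Windows, matching the answer above) of decoding subprocess output with os.device_encoding(0); the dir command and the cp1252 fallback are my own assumptions:

import os
import subprocess

console_cp = os.device_encoding(0) or "cp1252"   # encoding of the attached console; None when stdin is not a tty
result = subprocess.run(["cmd", "/c", "dir"], stdout=subprocess.PIPE)
print(result.stdout.decode(console_cp, errors="replace"))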

Writing unicode strings via sys.stdout in Python

Assume for a moment that one cannot use print (and thus enjoy the benefit of automatic encoding detection). So that leaves us with sys.stdout. However, sys.stdout is so dumb as to not do any sensible encoding.
Now one reads the Python wiki page PrintFails and goes to try out the following code:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout);
However, this too does not work (at least on Mac). To see why:
>>> import locale
>>> locale.getpreferredencoding()
'mac-roman'
>>> sys.stdout.encoding
'UTF-8'
(UTF-8 is what one's terminal understands).
So one changes the above code to:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout);
And now unicode strings are properly sent to sys.stdout and hence printed properly on the terminal (sys.stdout is attached to the terminal).
Is this the correct way to write unicode strings in sys.stdout or should I be doing something else?
EDIT: at times, say when piping the output to less, sys.stdout.encoding will be None. In this case, the above code will fail.
export PYTHONIOENCODING=utf-8
will do the job, but it can't be set from within Python itself...
What we can do is check whether it is set, and tell the user to set it before calling the script:
if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout."
        exit(1)
The best idea is to check whether you are directly connected to a terminal. If you are, use the terminal's encoding. Otherwise, use the system's preferred encoding.
if sys.stdout.isatty():
    default_encoding = sys.stdout.encoding
else:
    default_encoding = locale.getpreferredencoding()
It's also very important to always allow the user to specify whichever encoding she wants. Usually I make it a command-line option (like -e ENCODING), and parse it with the optparse module.
Another good thing is to not overwrite sys.stdout with an automatic encoder. Create your encoder and use it, but leave sys.stdout alone. You could import 3rd party libraries that write encoded bytestrings directly to sys.stdout.
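A minimal sketch combining these suggestions (the -e option name and the fallback order are illustrative, not prescribed by the answer):

import sys
import codecs
import locale
from optparse import OptionParser

parser = OptionParser()
parser.add_option('-e', '--encoding', dest='encoding', default=None,
                  help='output encoding (overrides auto-detection)')
opts, args = parser.parse_args()

if opts.encoding:
    encoding = opts.encoding
elif sys.stdout.isatty():
    encoding = sys.stdout.encoding or locale.getpreferredencoding()
else:
    encoding = locale.getpreferredencoding()

out = codecs.getwriter(encoding)(sys.stdout)   # sys.stdout itself stays untouched
out.write(u'Récupération des données\n')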
There is an optional environment variable "PYTHONIOENCODING" which may be set to a desired default encoding. It would be one way of grabbing the user-desired encoding in a way consistent with all of Python. It is buried in the Python manual here.
This is what I am doing in my application:
sys.stdout.write(s.encode('utf-8'))
This is the exact opposite fix for reading UTF-8 names from argv:
for file in sys.argv[1:]:
    file = file.decode('utf-8')
This is very ugly (IMHO) as it forces you to work with UTF-8, which is the norm on Linux/Mac, but not on Windows... Works for me anyway :)
It's not clear to me why you wouldn't be able to use print; but assuming so, yes, the approach looks right to me.
