I am seeing strange behavior involving ArcPy and Python encodings. I work in Visual Studio 2010 Shell with Python Tools for Visual Studio (PTVS) installed. I isolated the problem in a simple script file, shown below. In Visual Studio, I set « Advanced Save Options... » to « UTF-8 without signature ». The script prints an accented string, imports the arcpy module, then prints the same string again. Importing arcpy seems to change Python's encoding setup, but I don't know why. I would like to restore it, because the change causes problems throughout the original script.
I checked the Python « encodings » folder and deleted every .pyc file. Then I ran the script, and it generated 3 .pyc files:
cp850.pyc (which corresponds to my stdout.encoding)
cp1252.pyc (which corresponds to my Windows environment encoding)
utf_8.pyc (which fits the encoding of my script)
When ArcPy is imported, something alters the encoding and affects the existing variables.
Why?
Is there a Python command that shows where ArcPy sets the cp1252 encoding, so that I can write a function that deals with it?
# -*- coding: utf-8 -*-
import sys
print ('Loaded encoding : %(t)s'%{'t':sys.getdefaultencoding()})
reload(sys) # See stackoverflow question 2276200
sys.setdefaultencoding('utf-8')
print ('Set default encoding : %(t)s'%{'t':sys.getdefaultencoding()})
print ''
texte = u'Récuperation des données'
print ('Original type : %(t)s'%{'t':type(texte)})
print ('Original text : %(t)s'%{'t':texte})
print ''
import arcpy
print ('imported arcpy')
print ('Loaded encoding : %(t)s'%{'t':sys.getdefaultencoding()})
print ''
print ('arcpy mess up original type : %(t)s'%{'t':type(texte)})
print ('arcpy mess up original text : %(t)s'%{'t':texte})
print ''
print ('arcpy mess up reencoded with cp1252 type : %(t)s'%{'t':type(texte.encode('cp1252'))})
print ('arcpy mess up reencoded with cp1252 text : %(t)s'%{'t':texte.encode('cp1252')})
raw_input()
and when I run the script, I get these results :
Loaded encoding : ascii
Set encoding : utf-8
Original type : type 'unicode'
Original text : Récuperation des données <--- This is right
import arcpy
Loaded encoding : utf-8
arcpy mess up original type : type 'unicode'
arcpy mess up original text : R'cuperation des donn'es> <--- This is wrong
arcpy mess up ReEncode with cp1252 type : type 'str'
arcpy mess up ReEncode with cp1252 text : Récuperation des données> <--- This matches the original unicode
Answering my question.
From ESRI support, I got this information :
By default, python in the command line is not going to change the code page to a UTF-8 based text for print statements to show up in Unicode. ArcGIS on the other hand specifically allows unicode values to be passed to it and has changed the code page within the command line so that the values you see printed are the values ArcGIS is using. This is why the command line should be the only environment where you see the import sys followed by import arcpy give you a different printed value.
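The mismatch described above can be reproduced without arcpy. The sketch below assumes, based only on the code pages mentioned in the question, that text is encoded for one code page (cp850) while the console interprets the bytes with the other (cp1252), or the reverse. It is one possible reading of the garbled output, not a confirmed description of what ArcGIS does internally.

```python
# -*- coding: utf-8 -*-
# Illustration only: the same character maps to different bytes in each code page.
text = u'é'

as_cp850 = text.encode('cp850')    # byte a cp850 console expects
as_cp1252 = text.encode('cp1252')  # byte the Windows ANSI code page expects

# Bytes produced for one code page but displayed with the other
# come out as a different character.
shown_as_cp1252 = as_cp850.decode('cp1252')
shown_as_cp850 = as_cp1252.decode('cp850')

print(repr(as_cp850), repr(as_cp1252))
print(repr(shown_as_cp1252), repr(shown_as_cp850))
```

If the console code page and sys.stdout.encoding disagree, every accented character goes through one of these two mistranslations.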
Since my application runs scripts that do not always need arcpy, I wrote a generic function that picks the encoding whether or not arcpy has been imported, using the following information:
import locale
import sys

Coding_CMD_Window = sys.stdout.encoding      # console (stdout) encoding, e.g. cp850
Coding_OS = locale.getpreferredencoding()    # Windows environment encoding, e.g. cp1252
Coding_Script = sys.getdefaultencoding()     # Python default encoding
Coding2Use = Coding_CMD_Window
if any('arcpy' in importedmodules for importedmodules in sys.modules):
    Coding2Use = Coding_OS
Also, I made sure that all of my scripts had the proper UTF-8 encoding without signature.
Hope this helps anyone.
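The selection logic above could be wrapped in a function along these lines. This is only a sketch: the names pick_output_encoding and encode_for_console are hypothetical, and the optional parameters exist just so the behavior can be tried without arcpy installed.

```python
# -*- coding: utf-8 -*-
import locale
import sys


def pick_output_encoding(modules=None, stdout_encoding=None, os_encoding=None):
    """Return the OS encoding if arcpy is loaded, else the console encoding."""
    modules = sys.modules if modules is None else modules
    stdout_encoding = stdout_encoding or getattr(sys.stdout, 'encoding', None)
    os_encoding = os_encoding or locale.getpreferredencoding()
    # Match the arcpy package itself or any of its submodules
    arcpy_loaded = any(name == 'arcpy' or name.startswith('arcpy.')
                       for name in modules)
    if arcpy_loaded:
        return os_encoding
    # stdout may have no encoding when output is redirected
    return stdout_encoding or os_encoding


def encode_for_console(text, encoding=None):
    """Encode unicode text for printing; unsupported characters become '?'."""
    encoding = encoding or pick_output_encoding()
    return text.encode(encoding, 'replace')
```

For example, pick_output_encoding(modules={'arcpy': None}, stdout_encoding='cp850', os_encoding='cp1252') returns 'cp1252'.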
For those in doubt, try something like the following (e.g., in a .py file):
import codecs
#import arcpy
f = codecs.open('utf.file.txt', encoding='utf-8-sig') #assuming a BOM present
l = f.readlines()
print u''.join(l)
Then run the same code once more, but first remove the hash comment from the arcpy line. It'll take about 6 seconds more time.
What I get is perfectly fine text running the first version, gibberish when allowing arcpy to load.
ArcGIS for Desktop version used: 10.2.1
I'm learning Python and tried to make a hanging game (literal translation - don't know the real name of the game in English. Sorry.). For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
    words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'] but if I try print('maçã, açaí, tucumã') in order to check the special characters, everything is printed correctly. Looks like the issue is in the encoding (or decoding... I'm still reading lots of articles about it to really understand) special characters from files.
The content of line 1 of my code is # coding: utf-8 because after some research I found out that I have to specify the Unicode format that is required for the text to be encoded/decoded. Before adding it, I was receiving the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
add encode='utf-8' as parameter in open('palavras.txt', 'r')
add decode='utf-8' as parameter in open('palavras.txt', 'r')
same as above but with latin1
substitute line 1 content for #coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what to search for or what to do anymore.
SOLUTION HERE
Thanks to the help from the people above, I found that the real problem was the combination of the VS Code extension Code Runner and the default python alias on Ubuntu 20.04 LTS.
In my setup, Code Runner runs code in the terminal, so when it called python, the alias apparently resolved to Python 2.7.x. To fix this, I used this thread to set Python 3 as the default.
It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
Thanks, everybody, for your time and your help =)
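A quick way to confirm which interpreter a runner actually launches is to have the script report it. This is a minimal sketch; the helper name interpreter_summary is hypothetical.

```python
import sys


def interpreter_summary(version_info=None, executable=None):
    """Describe the running interpreter (parameters allow testing with fake values)."""
    version_info = sys.version_info if version_info is None else version_info
    executable = sys.executable if executable is None else executable
    return 'Python %d.%d at %s' % (version_info[0], version_info[1], executable)


print(interpreter_summary())
if sys.version_info[0] < 3:
    print('Warning: this is Python 2; non-ASCII text is handled differently.')
```

Running it once from the terminal and once through Code Runner shows whether both resolve to the same python.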
This only happens when using Python 2.x.
The output looks wrong because you're printing the list itself rather than the items in the list.
When you call print(words) (words is a list), Python calls repr on the list object. The list builds its representation by calling repr on each item, then joins the results into a neat string.
In Python 2, repr of a byte string returns an ASCII representation (with escapes) rather than text suited to your terminal.
Instead, try:
for x in words:
    print(x)
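The same effect can be reproduced on Python 3 with bytes objects, which behave much like Python 2 str values here. A small sketch (the variable names are illustrative, and the expected output in the comment applies to Python 3 only):

```python
# -*- coding: utf-8 -*-
# UTF-8 encoded byte strings, similar to what Python 2 reads from the file
words = [u'maçã'.encode('utf-8'), u'açaí'.encode('utf-8')]

# The list's representation escapes every non-ASCII byte
list_view = repr(words)
print(list_view)  # [b'ma\xc3\xa7\xc3\xa3', b'a\xc3\xa7a\xc3\xad']

# Decoding each item gives back readable text
item_view = [w.decode('utf-8') for w in words]
```

The data is intact in both cases; only the representation differs.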
Note: the keyword argument for open is encoding, e.g.:
open('myfile.txt', encoding='utf-8')
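You should always pass the encoding option to open. Without it, Python 3 uses the locale's preferred encoding: on Linux and macOS that is UTF-8 for most people, while on Windows it is an 8-bit code page such as cp1252.
This still holds in Python 3.9. UTF-8 is used regardless of the locale only in UTF-8 mode (python -X utf8 or PYTHONUTF8=1, available since Python 3.7), which PEP 686 schedules to become the default in Python 3.15.
In Python 2, the built-in open has no encoding parameter; use io.open instead.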
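As a sketch of the explicit-encoding approach, the word list could be loaded with io.open, which accepts encoding on both Python 2 and Python 3. The load_words helper and the temporary demo file are assumptions for illustration; palavras.txt is the file name from the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def load_words(path):
    """Read one word per line, decoding the file explicitly as UTF-8."""
    with io.open(path, 'r', encoding='utf-8') as handle:
        return [line.strip().lower() for line in handle if line.strip()]


# Demo with a temporary stand-in for palavras.txt
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'palavras.txt')
with io.open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write(u'Maçã\naçaí\nTucumã\n')

words = load_words(demo_path)
```

The items in words are text (unicode) rather than raw bytes, so lower() and comparisons with typed letters behave as expected.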
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"
Here is the code to illustrate the problem:
# -*- coding:utf-8 -*-
text = u"严"
print text
If I run the code above in the VS Code debugger, it prints "涓" instead of "严". That is what you get when the first 2 bytes (\xe4\xb8) of the UTF-8 encoding of u"严" (\xe4\xb8\xa5) are decoded with the gbk codec: \xe4\xb8 in gbk is "涓".
However, if I run the same code in PyCharm, it prints "严" exactly as I expect. The same is true if I run the code in PowerShell.
It is weird that the VS Code Python debugger behaves differently from the Python interpreter. How can I get the correct printed result? I don't think adding a decode("gbk") to the end of every string would be a good idea.
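The byte-level explanation above can be checked directly. This short sketch only restates the values given in the question:

```python
# -*- coding: utf-8 -*-
text = u'严'

utf8_bytes = text.encode('utf-8')       # the three UTF-8 bytes of the character
misread = utf8_bytes[:2].decode('gbk')  # first two bytes read as one gbk character
roundtrip = utf8_bytes.decode('utf-8')  # the right codec restores the text

print(repr(utf8_bytes))
```

So the bytes leaving the script are valid UTF-8; the wrong character appears only because the receiving side decodes them as gbk.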
My Environment data
VS Code version: 1.21
VSCode Python Extension version : 2018.2.1
OS and version: Windows 10
Python version : 2.7.14
Type of virtual environment used : No
For Windows users: in your system variables, add a PYTHONIOENCODING variable, set its value to UTF-8, then restart VS Code. This worked on my PC.
Alternatively, modify the tasks.json file in VS Code; I am not sure whether this still works on version 2.0.
You can find it here: Changing the encoding for a task output
or here on GitHub:
Tasks should support specifying the output encoding
Add this at the start of your .py script (Python 3 only, because sys.stdout.buffer does not exist in Python 2):
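The effect of PYTHONIOENCODING can be observed without changing system settings by setting it only for a child process. A minimal sketch for Python 3, where the child command is just an example:

```python
import os
import subprocess
import sys

# Copy the current environment and force UTF-8 for the child's stdin/stdout/stderr
env = dict(os.environ, PYTHONIOENCODING='utf-8')

child = subprocess.run(
    [sys.executable, '-c', "print(u'\\u4e25')"],  # the child prints U+4E25
    env=env,
    stdout=subprocess.PIPE,
    check=True,
)
output = child.stdout  # raw bytes written by the child

print(repr(output))
```

With the variable set, the child writes the character as UTF-8 bytes regardless of the console code page.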
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')
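To see what the wrapper does, the sketch below applies io.TextIOWrapper to an in-memory byte stream instead of the real sys.stdout.buffer; the BytesIO object is a stand-in for illustration.

```python
# -*- coding: utf-8 -*-
import io

raw = io.BytesIO()                                 # stands in for sys.stdout.buffer
wrapped = io.TextIOWrapper(raw, encoding='utf-8')  # text layer that encodes as UTF-8

wrapped.write(u'严')
wrapped.flush()                                    # push the text through to the bytes
written = raw.getvalue()

print(repr(written))
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') achieves the same result without replacing the stream object.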
If you open your python file in VS2017 you can do the following:
Go to:
File->
Save selected item as ->
click the down-arrow next to the "Save" button
click "Save With Encoding..."
select the type of coding you need...
if .py already saved then overwrite file > select "yes"
select for example : "Chinese Simplified (GB18030) - Codepage 54936"
Also, add the following on line 2 of your .py file:
# -*- coding: gb18030 -*- or # -*- coding: gb2312 -*-
Those encodings accept your 严 character.
There is a nice encoder/decoder tester here.
I want to write a non-ASCII character, let's say →, to standard output. The tricky part seems to be that some of the data I want to concatenate with that character is read from JSON. Consider the following simple JSON document:
{"foo":"bar"}
I include this because if I just want to print → then it seems enough to simply write:
print("→")
and it will do the right thing in python2 and python3.
So I want to print the value of foo together with my non-ascii character →. The only way I found to do this such that it works in both, python2 and python3 is:
getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))
or
getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))
It is important to not miss the u in front of → because otherwise a UnicodeDecodeError will be thrown by python2.
Using the print function like this:
print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))
doesn't seem to work, because Python 3 complains: TypeError: 'str' does not support the buffer interface.
Did I find the best way or is there a better option? Can I make the print function work?
The most concise I could come up with is the following, which you may be able to make more concise with a few convenience functions (or even replacing/overriding the print function):
# -*- coding=utf-8 -*-
import codecs
import os
import sys
# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')
if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        sys.stdout.buffer.write(bytes(output.encode('utf-8')))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)
The best option is using the -*- encoding line, which allows you to use the actual character in the file. But if for some reason, you can't use the encoding line, it's still possible to accomplish without it.
This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1.
It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)
LANG=zh_CN python test.py
It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ'. (There may be something simple I'm missing there.)
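One possible convenience function, combining the getattr trick from the question with the JSON input, is sketched below. The name write_utf8 is hypothetical, and the in-memory stream only stands in for a terminal whose declared encoding is not UTF-8.

```python
# -*- coding: utf-8 -*-
import io
import json
import sys


def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes, bypassing the stream's own encoding."""
    stream = sys.stdout if stream is None else stream
    # Python 3 text streams expose .buffer; Python 2 stdout accepts bytes directly
    target = getattr(stream, 'buffer', stream)
    target.write(text.encode('utf-8'))
    target.flush()


data = json.loads(u'{"foo": "bar"}')

# Demo against an in-memory stream whose declared encoding is ASCII
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
write_utf8(data['foo'] + u'→', fake_stdout)
captured = fake_stdout.buffer.getvalue()
```

Whether the terminal then displays the bytes correctly still depends on the terminal itself, as the cmd result above shows.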
If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:
# -*- coding=utf-8 -*-
print("bar" + u"→")
I guess I need some help with encodings in Python (2.6) and Eclipse. I used Google and the SO search and tried a lot of things, but I still don't get it.
So, how do I get the Eclipse console to show äöü etc.?
I tried:
Declaring the document encoding in the first line with
# -*- coding: utf-8 -*-
I changed the encoding settings in Window/Preferences/General/Workspace and Project/Properties to UTF-8
As nothing changed I tried the following things alone and in combination but nothing seemed to work out:
Changing the stdout as mentioned in the Python Cookbook:
sys.stdout = codecs.lookup("utf-8")[-1](sys.stdout)
Adding an unicode u:
print u"äöü".encode('UTF8')
reloading sys (I don't know what for but it doesn't work either ;-))
I am trying to do this in order to debug the encoding-problems I have in my programs... (argh)
Any ideas? Thanks in advance!
EDIT:
I work on Windows 7 and it is EasyEclipse
Got it! If you have the same problem go to
Run/Run Configurations/Common and select the UTF-8 (e.g.) as console encoding.
So, finally, print "ö" results in "ö"
Even though this is a fairly old question, I'm new to Stack Overflow and would like to contribute. You can change the default encoding in Eclipse (currently Neon) for all text editors from the menu Window -> Preferences -> General -> Workspace : Text file encoding
I'm having trouble with encodings.
I'm using version
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
I have chars with accents like é à.
My scripts uses utf-8 encoding
#!/usr/bin/python
# -*- coding: utf-8 -*-
Users type strings through raw_input(), via this helper:
import readline

def rlinput(prompt, prefill=''):
    # Pre-fill the input line, then restore the default hook afterwards
    readline.set_startup_hook(lambda: readline.insert_text(prefill))
    try:
        return raw_input(prompt)
    finally:
        readline.set_startup_hook()
called in the main loop 'pseudo' shell
while to_continue :
    to_continue, feedback = action( unicode(rlinput(u'todo > '),'utf-8') )
    os.system('clear')
    print T, u"\n" + feedback
Data are stored as pickle in files.
I managed to get the app working, but I end up with silly things like this:
core file :
class Task():
    ...
    def __str__(self):
        r = (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()
        return r.encode('utf-8')
and so in shell file :
feedback = jaune + str(t).decode('utf-8') + vert + u" supprimée"
That's when I realized that I might be totally wrong about encoding/decoding.
So I tried to decode directly in rlinput, but failed.
I read some posts on Stack Overflow and re-read http://docs.python.org/library/codecs.html
While waiting for my Python book, I'm lost :/
I guess there is a lot of bad code, but my question here is only about the encoding issues.
You can find the code here (most comments are in French, sorry, it's for personal use and I'm a beginner; you'll also need yapsy - http://yapsy.sourceforge.net/ ; then configure the paths, then run ./todo_shell.py in py_todo): http://bit.ly/rzp9Jm
Standard input and output are byte-based on all Unix systems. That's why you have to call the unicode function to get character-strings for them. The decode error indicates that the bytes coming in are not valid UTF-8.
Basically, the problem is the assumption of UTF-8 encoding, which is not guaranteed. Confirm this by changing the encoding in your unicode call to 'ISO-8859-1', or by changing the character encoding of your terminal emulator to UTF-8. (Putty supports this, in the "Translation" menu.)
If the above experiment confirms this, your challenge is to support the locale of the user and deduce the correct encoding, or perhaps to make the user declare the encoding in a command line argument or configuration. The $LANG environment variable is about the best you can do without an explicit declaration, and I find it to be a poor indicator of the desired character encoding.
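A sketch of that idea: take the encoding from the input stream or the locale when available, and fall back to ISO-8859-1 when the bytes are not valid in it. Both function names are hypothetical, and the fallback is an assumption; ISO-8859-1 never fails to decode but may still be the wrong guess.

```python
# -*- coding: utf-8 -*-
import locale
import sys


def guess_input_encoding(stdin=None):
    """Best-effort guess: stream encoding, then locale, then UTF-8."""
    stdin = sys.stdin if stdin is None else stdin
    return (getattr(stdin, 'encoding', None)
            or locale.getpreferredencoding(False)
            or 'utf-8')


def decode_user_bytes(raw, encoding=None):
    """Decode bytes typed by the user, falling back to ISO-8859-1."""
    encoding = encoding or guess_input_encoding()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Every byte is valid ISO-8859-1, so this cannot raise
        return raw.decode('iso-8859-1')
```

An explicit command-line option or configuration entry for the encoding would take precedence over this guess.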
As @wberry suggested, I checked the encodings: they are OK.
$ file --mime-encoding todo_shell.py task.py todo.py
todo_shell.py: utf-8
task.py: utf-8
todo.py: utf-8
$ echo $LANG
fr_FR.UTF-8
$ python -c "import sys; print sys.stdin.encoding"
UTF-8
As @eryksun suggested, I decoded the user input (and encoded the prefill strings passed in). If I remember correctly this solved some problems; I will test more thoroughly later:
def rlinput(prompt, prefill=''):
    readline.set_startup_hook(lambda: readline.insert_text( prefill.encode(sys.stdin.encoding) ))
    try:
        return raw_input( prompt ).decode( sys.stdin.encoding )
    finally:
        readline.set_startup_hook()
I still have problems, but my question was not well defined, so I can't get a clear answer.
I feel less lost now and have directions to search.
Thank you!
EDIT: I replaced the __str__ methods with __unicode__ and it fixed some (all?) of the problems.
Thanks @eryksun for the tips. (This link helped me: Python __str__ versus __unicode__ )
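For reference, the __str__ to __unicode__ change could look roughly like this. It is a sketch, not the actual project code: the constructor arguments are assumptions, while _done and getDesc come from the Task excerpt in the question.

```python
# -*- coding: utf-8 -*-
import sys


class Task(object):
    def __init__(self, desc, done=False):
        self._desc = desc  # description stored as text (unicode)
        self._done = done

    def getDesc(self):
        return self._desc

    def __unicode__(self):
        # Build the display text as unicode; no encode/decode here
        return (u'OK' if self._done else u'A faire').ljust(8) + self.getDesc()

    if sys.version_info[0] >= 3:
        __str__ = __unicode__  # Python 3: str is already text
    else:
        def __str__(self):
            # Python 2: str() must return bytes
            return self.__unicode__().encode('utf-8')


label = Task(u'supprimée').__unicode__()
```

With this in place, the shell file can use unicode(t) on Python 2 instead of str(t).decode('utf-8'), so text is only encoded at the point of output.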