I'm trying to write the symbol ● to a text file in python. I think it has something to do with the encoding (utf-8). Here is the code:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write("●")
outFile.close()
Instead of the black "●" I get "â—". How can I fix this?
Open the file using the io package for this to work with both python2 and python3 with encoding set to utf8 for this to work. When printing, When writing, write as a unicode string.
import io
outFile = io.open('./myFile.txt', 'w', encoding='utf8')
outFile.write(u'●')
outFile.close()
Tested on Python 2.7.8 and Python 3.4.2
If you are using Python 2, use codecs.open instead of open and unicode instead of str:
# -*- coding: utf-8 -*-
import codecs
outFile = codecs.open('./myFile.txt', 'wb', 'utf-8')
outFile.write(u"●")
outFile.close()
In Python 3, pass the encoding keyword argument to open:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'w', encoding='utf-8')
outFile.write("●")
outFile.close()
>>> ec = u'\u25cf' # unicode("●", "UTF-8")
>>> open("/tmp/file.txt", "w").write(ec.encode('UTF-8'))
This should do the trick
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write(u"\u25CF".encode('utf-8'))
outFile.close()
have a look at this
What your program does is to produce an output file in the same encoding as your program editor (the coding at the top does not matter, unless your program editor uses it for saving the file). Thus, if you open myFile.txt with a program that uses the same encoding as your program editor, everything looks fine.
This does not mean that your program works for everybody.
For this, you must do two things. You must first indicate the encoding used for text files on your machine. This is a little hard to detect, but the following should often work:
# coding=utf-8 # Put your editor's encoding here
import codecs
import locale
import sys
# Selection of the first non-None, reasonable encoding:
out_encoding = (locale.getlocale()[1]
or locale.getpreferredencoding()
or sys.stdin.encoding or sys.stdout.encoding
# Default:
or "UTF8")
outFile = codecs.open('./myFile.txt', 'w', out_encoding)
Note that it is very important to specify the right coding on top of the file: this must be your program editor's encoding.
If you know the encoding you want for your output file, you can directly put it in open(). Otherwise, the more general and portable out_encoding expression above should work for most users on most computers (i.e., whatever their encoding of choice is, they should be able to read "●" in the resulting file—assuming their computer's encoding can represent it).
Then you must print a string, not bytes:
outFile.write(u"●")
(note the leading u, meaning "unicode string").
For a deeper understanding of the issues at hand, one of my previous answers should be very helpful: UnicodeDecodeError when redirecting to file.
I'm very sorry, but writing a symbol to a text file without saying what the encoding of the file should be is simply non-sense.
It may not be evident at first sight, but text files are indeed encoded and may be encoded in different ways. If you have only letters (upper and lower case, but not accented oned), digits and simple symbols (everything that has an ASCII code below 128), all should be fine, because ASCII 7 bits is now a standard and in fact those characters have same representation in major encodings.
But as soon as you get true symbols, or accented chars, their representation vary from one encoding to the other. For example, the symbol ● has a UTF-8 representation of (Python coding) : \xe2\x97\x8f. What is worse, it cannot be represented in latin1 (ISO-8859-1) encoding.
Another example is the french e accent aigu : é it is represented in UTF8 as \xc3\xa9 (note 2 bytes), but is represented in Latin1 as \x89 (one single byte)
So I tested your code in my Ubuntu box using a UTF8 encoding and the command
cat myFile.txt ... correctly showed the bullet !
sba#sba-ubuntu:~/stackoverflow$ cat myFile.txt
●sba#sba-ubuntu:~/stackoverflow$
(as you didn't add any newline after the bullet, the prompt immediately follows it)
In conclusion :
Your code correctly writes the bullet to the file in UTF8 encoding. If your system uses natively another encoding (ISO-8859-1 or its variant Windows-1252) you cannot natively convert it because this character simply does not exist in this encodings.
But you can always see it in a text editor that supports different encoding like the excellent vim that exists on all major systems.
Proof of above :
On a Windows 7 computer, I opened a vim window and instructed it to accept utf8 with :set encoding='utf8'. I then pasted original code from OP and saved it to a file foo.py.
I opened a cmd.exe window and executed python foo.py (using a Python 2.7) : it created a file myFile.txt containing the 3 bytes (hexa) : e2 97 8f that is the utf8 representation of the bullet ● (I could confirm it with vim Tools/Hexa convert).
I could even open myFile.txt in idle and actually saw the bullet. Even notepad.exe could show the bullet !
So even on a Windows 7 computer that does not natively accept utf-8, the code from OP correctly generates a text file that when opened with a text editor accepting UTF-8 contains the bullet ●.
Of course, if I try to open myFile.txt with vim in latin1 mode, I get : â—, on a cmd windows with codepage 850, type myFile.txt shows ÔùÅ, and with codepage 1252 (variant of latin1) : â—.
In conclusion original OP code creates a correct utf8 encoded file - it is up to the reading part to interpret correctly utf8.
Related
I'm doing a fairly simple code that transforms European Portuguese input into Brazilian Portuguese -- so there are a lot of accented characters such as á,é,À,ç, etc.
Basically, the goal is to find words in the text from a list and replace them with the BR words from a second list.
Here's the code:
#-*- coding: latin-1 -*-
listapt=["gestão","utilizador","telemóvel"]
listabr=["gerenciamento", "usuário", "celular"]
while True:
#this is all because I need to be able to input multiple lines of text, seems to be working fine
print ("Insert text")
lines = []
while True:
line = raw_input()
if line != "FIM":
lines.append(line)
else:
break
text = '\n'.join(lines)
for word in listapt:
if word in text:
num = listapt.index(word)
wordbr = listabr[num]
print(word + " --> " + wordbr) #just to show what changes were made
text = text.replace(word, wordbr)
print(text)
I run the code on Windows using IDLE and by double-clicking on the .py file.
The code works fine when using IDLE, but does not match and replace characters when double-clicking the .py file.
Here's why the code works as expected in IDLE but not from CMD or by doubleclicking:
Your code is UTF-8 encoded, not latin-1 encoded
IDLE always works in UTF-8 "input/output" mode.
On Windows, CMD/Doubleclicking will use a non-UTF-8 8bit locale.
When your code compares the input to the hardcoded strings it's doing so at a byte level. On IDLE, it's comparing UTF-8 to hardcoded UTF-8. On CMD, it's comparing non-UTF-8 8bit to hardcoded UTF-8 (If you were on a stock MacOS, it would also work).
The way to fix this is to make sure you're comparing "apples with apples". You could do this by converting everything to the same encoding. E.g. Convert the input read to UTF-8 so it matches the hardcoded strings. The better solution is to convert all [byte] strings to Unicode strings (Strings with no encoding). If you were on Python 3, this would be all automatic.
On Python 2.x, you need to do three things:
Prefix all sourcecode strings with u to make them Unicode strings:
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celula]
...
if line != u"FIM":
Alternatively, add from __future__ import unicode_literals to avoid changing all your code.
Use the correct coding header for the encoding of your file. I suspect your header should read utf-8. E.g.
#-*- coding: utf-8 -*-
Convert the result of raw_input to Unicode. This must be done with the detected encoding of the standard input:
import sys
line = raw_input().decode(sys.stdin.encoding)
By the way, the better way to model list of words to replace it to use a dict. The keys are the original word, the value is the replacement. E.g.
words = { u"telemóvel": u"celula"}
I don't see that problem over here.
Based on your use of raw_input, it seems like you're using Python 2.x
This may be because I'm copypasting off of stack overflow, and have a different dev environment to you.
Try running your script under the latest Python 3 interpreter, as well as removing the "#-*- coding:" line.
This should either hit UnicodeDecodeError issues a lot sooner in your code, or work fine.
The problem you have here is Python 2.x getting confused at some point while trying to translate between byte sequences (what Python 2.x strings contain, eg binary file contents), and human-meaningful text (unicode, eg for things like user informational display of chinese characters), because it makes incorrect assumptions about how human-readable text was encoded into the byte sequence seen in the Python strings.
It's a detail that Python 3 attempts to address a lot better/less ambiguously.
First try executing the code below, it should resolve the issue:
# -*- coding: latin-1 -*-
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celular"]
lines=[]
line = raw_input()
line = line.decode('latin-1')
if line != "FIM":
lines.append(line)
text = u'\n'.join(lines)
for word in listapt:
if word in text:
print("Hello")
num = listapt.index(word)
print(num)
wordbr = listabr[num]
print(wordbr)
I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified encoding to cp1251 I expect that it would be decoded correctly.
But as result there is some trash code points were interpreted.
I am understand that 'юникод' is just a bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But there is a way to get correct output in terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.
I think I can understand what happened to you. The last line gave me the hint, that your trash codepoints confirmed. You try to display cp1251 characters but your editor is configured to use utf8.
The # -*- coding: cp1251 -*- is only used by the Python interpretor to convert characters from source python files that are outside of the ASCII range. And anyway it it is only used for unicode litterals because bytes from original source give er... exactly same bytes in byte strings. Some text editors are kind enough to automagically use this line (IDLE editor is), but I'm little confident in that and allways switch manually to the proper encoding when I use gvim for example. Short story: # -*- coding: cp1251 -*- in unused in your code and can only mislead a reader since it it not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, this word юникод is composed by those characters: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
Alternatively, if you find comfortable with you current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give same results - but now the declared source encoding should be the actual encoding of your python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode(), unicode(). If you are using a literal to create a string; you should use Unicode literals if the string contains text. It is the only option on Python 3 where you can't put non-ascii characters inside a bytes literal and it is a good practice (to use Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally1 detect the terminal encoding and encode the print output correctly.
1Unless your terminal is misconfigured.
Some people use the following to declare the encoding method for the text of their Python source code:
# -*- coding: utf-8 -*-
Back in 2001, it is said the default encoding method that Python interpreter assumes is ASCII. I have dealt with strings using non-ASCII characters in my Python code, without declaring encoding method of my code, and I don't remember I have bumped into encoding error before. What is the default encoding for code assumed by Python interpreter now?
I am not sure if this is relevant.
My OS is Ubuntu, and I am using the default Python interpreter, and gedit or emacs for editing.
Will the default encoding method by Python interpreter changes if the above changes?
Thanks.
Without any explicit encoding declaration, the assumed encoding for your source code will be
ascii for Python 2.x
utf-8 for Python 3.x
See PEP 0263 and Using source code encoding for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.
So the default encoding assumened for source code will be directly dependent of the version of the Python interpreter, and it is not configurable.
Note that the source code encoding is something entirely different than dealing with non-ASCII characters as part of your data in strings.
There are two distinct cases where you may encounter non-ASCII characters:
As part of your programs data, during runtime
As part of your source code (and since you can't have non-ASCII characters in identifiers, that usually means hard coded string data in your source code or comments).
The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.
So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:
with open('data.txt') as f:
for line in f:
# do something with `line`
But it doesn't contain any non-ASCII characters in the source code, therefore it doesn't need an encoding declaration at the top of the file. It will however need to properly decode line if it wants to turn it into unicode. Simply doing unicode(line) will use the system default encoding, which is ascii (different from the default source encoding, but happens to also be ascii). So to explicitely decode the string using utf-8 you'd need to do line.decode('utf-8').
This code however does contain non-ASCII characters directly in its source code:
TEST_DATA = 'Bär' # <--- non-ASCII character on this line
print TEST_DATA
And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:
SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So assuming your text editor is configured to save files in utf-8, you'd need to put the line
# -*- coding: utf-8 -*-
at the top of the file for Python to interpret the source code correctly.
My advice however would be to generally avoid putting non-ASCII characters in your source code, exactly because if it depends on your and your co-workers editor and terminal settings wheter it will be written and read correctly.
Instead you can use escaped strings to safely enter non-ASCII characters in your code:
TEST_DATA = 'B\xc3\xa4r'
By default, Python source files are treated as encoded in UTF-8. In that encoding, — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow. To display all these characters properly, the editor must recognize that the file is UTF-8, and it must use a font that supports all the characters in the file.
It is also possible to specify a different encoding for source files. In order to do this, we put the below code on top of our code !
# -*- coding: encoding -*-
https://docs.python.org/dev/tutorial/interpreter.html
Attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap') // Note the strangely shaped apostrophe
However, in output.txt, I get Smith’s BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
In Mac OS Roman, those three bytes display as ’.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: Use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string withu"'Smith’s BaseBall Cap'"`.
I wrote a simple file parser and writer, but then I came across an article talking about the importance of unicode and then it occurred to me that I'm assuming the input file is ascii encoded, which may not be the case all the time, though it would be rare in my situation.
In those rare cases, I would expect UTF-8 encoded files.
Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read them, store them, and write them properly.
Furthermore, would I have to treat ascii and UTF-8 files separately and write different functions for each? I have not worked with anything other than ascii files yet and only read about handling unicode.
Python natively supports Unicode. If you directly read and write from the first file to the second, then no data is lost as it copies the bytes verbatim. However, if you decode the string and then re-encode it, you'll need to make sure you use the right encoding.
If you are using Python 2, you can simply change all your str objects to unicode objects. Unicode objects have all the same methods as strings but are encoded in a unicode format instead of ASCII. See http://docs.python.org/library/functions.html#unicode .
If you are using Python 3, strings are encoded in UTF-8 by default.
If you are using Python 2.6 or later, you can use the io library and its io.open method to open the files you want. It has an encoding argument which should be set to 'utf-8' in your case. When you read or write the returned file objects, string are automatically en-/decoded.
Anyway, you don't need to do something special for ASCII, because UTF-8 is a superset of ASCII.
So long as you are only reading and writing to files and not expecting any other type of encoded input, then you should not have to do anything special.
% cat /tmp/u
π is 3.14.
% file /tmp/u
/tmp/u: UTF-8 Unicode text
% cat f.py
f = open('/tmp/u', 'r')
d = f.read()
print d.split()
f.close()
% python f.py
['\xcf\x80', 'is', '3.14.']
This changes when you declare or accept standard input using UTF-8.
% cat g.py
s = 'π is 3.14.'
print s.split()
% python g.py
File "g.py", line 1
SyntaxError: Non-ASCII character '\xcf' in file g.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
To handle this properly, declare the encoding for the Python program at the beginning per PEP 263 (referenced by the SyntaxError exception above).
% cat h.py
# -*- coding: utf-8 -*-
s = 'π is 3.14.'
print s.split()
% python h.py
['\xcf\x80', 'is', '3.14.']