I have a set of UTF-8 characters that I would like to insert into a PyX-generated PDF file.
I have included # -*- coding: utf-8 -*- at the top of the file. The code is somewhat similar to the following:
# -*- coding: utf-8 -*-
from pyx import canvas

c = canvas.canvas()
txt = u'aあä'
c.text(2, 2, "ID: %s" % txt)
c.writeEPSfile("filename.eps")
But I still can't get my head around this.
Error:
'ascii' codec can't encode character u'\xae' in position 47: ordinal not in range(128)
Try this:
# -*- coding: utf-8 -*-
from pyx import canvas

c = canvas.canvas()
txt = u'aあä'.encode('utf-8')
c.text(1, 4, "UID: %s" % txt)
c.writeEPSfile("filename.eps")
You can set up PyX to pass unicode characters to (La)TeX. Then it becomes a matter of producing the characters in question within TeX/LaTeX. Here is a rather minimal solution to produce the output in question:
from pyx import *

# Pass the text to LaTeX as UTF-8:
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage[utf8]{inputenc}')
text.preamble(r'\usepackage{newunicodechar}')
# Map あ to glyph '102 of the U/min font declared below:
text.preamble(r"\newunicodechar{あ}{{\usefont{U}{min}{m}{n}\symbol{'102}}}")
text.preamble(r'\DeclareFontFamily{U}{min}{}')
text.preamble(r'\DeclareFontShape{U}{min}{m}{n}{<-> udmj30}{}')

c = canvas.canvas()
c.text(0, 0, 'UID: aあä')
c.writeGSfile('utf8.png')
This directly results in the output in question (uploaded here as a PNG).
Note that this was done using PyX 0.13 on Python 3 and a rather standard LaTeX installation. I also used some information from https://tex.stackexchange.com/questions/171611/how-to-write-a-single-hiragana-character-in-latex about creating those characters in LaTeX. There seem to be solutions like CJKutf8 to set up all kinds of characters for direct use as unicode characters within LaTeX, but this is well outside my experience. Anyway, it should all work fine from within PyX, as it does from LaTeX itself, once all the setup has been done properly. Good luck!
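For completeness, here is an untested sketch of the CJKutf8 route mentioned above (it assumes the CJKutf8 LaTeX package is available; CJK characters have to be wrapped in the package's CJK environment, here with the min font family):
from pyx import *

# Pass the text to LaTeX as UTF-8, and load CJKutf8 in the preamble:
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage{CJKutf8}')

c = canvas.canvas()
# CJK characters must appear inside a CJK environment:
c.text(0, 0, r'\begin{CJK}{UTF8}{min}UID: aあä\end{CJK}')
c.writeGSfile('utf8_cjk.png')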
Maybe you can find a suitable setup in the babel package.
I ran into the same error when I tried to insert the German ä (a umlaut). I simply added the German babel package:
text.preamble(r"\usepackage[ngerman]{babel}")
After that, this was possible without errors:
c.text(12, 34, "äöüßß")
I also used a UTF-8 input encoding; I think it is necessary as well.
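Putting it together, a minimal sketch based on the snippets above (the output filename is arbitrary):
from pyx import *

# Pass the text to LaTeX as UTF-8, with the utf8 input encoding and German babel:
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage[utf8]{inputenc}')
text.preamble(r'\usepackage[ngerman]{babel}')

c = canvas.canvas()
c.text(12, 34, "äöüßß")
c.writeEPSfile("umlauts.eps")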
Further reading:
https://en.wikibooks.org/wiki/LaTeX/Internationalization
https://en.wikibooks.org/wiki/LaTeX/Fonts
Related
I am using Python for the first time and am running into an encoding error that I can't seem to get around. Here is the code:
#!/usr/bin/python
#-*- coding: utf -*-
import pandas as pd
a = "C:\Users"
print(a)
When I do this, I get:
File "C:\Users\Public\Documents\Python Scripts\ImportExcel.py", line
5
a = "C:\Users"
^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in positio n 2-3: truncated \UXXXXXXXX escape
In Notepad++ I have tried all of the encoding options. Nothing seems to work.
Any suggestions?
Specifically, the problem is that the '\' is an escape character.
If you want to print the string
"C:\Users"
then you have to do it thus:
a = "C:\\Users"
Hope this helps.
The error message suggests you're on a Windows machine, but you're using *nix notation for #!/usr/bin/python. That line should look something like #!C:\Python33\python.exe on a Windows machine, depending on where you've installed Python.
Use this: # -*- coding: utf-8 -*- instead of # -*- coding: utf -*-.
You can set the encoding in Notepad++, but you also need to tell Python about it.
In legacy Python (2.7), source code is ASCII unless specified otherwise. In Python 3, source code is UTF-8 unless otherwise specified.
You should use the following as the first or second line of the file to specify the encoding of the source code. The documentation gives:
# -*- coding: <encoding> -*-
This is the format originally from the Emacs editor, but according to PEP 263 you can also use:
# vim: set fileencoding=<encoding>:
or even:
# coding=<encoding>
Where <encoding> can be any encoding that Python supports, but utf-8 is generally a good choice for portable code.
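For example, a minimal sketch of a file relying on such a declaration (the greeting text is an arbitrary example):
# -*- coding: utf-8 -*-
# The declaration above tells the parser how to decode this source file,
# so the non-ASCII characters in the literal below are read correctly:
greeting = u'grüße'
print(greeting)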
I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched the console to code page 1251 (Active code page: 1251).
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused. Since I've specified the encoding as cp1251, I expected it to be decoded correctly.
But as a result some garbage code points were printed instead.
I understand that 'юникод' here is just bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But is there a way to get correct output in a terminal with cp1251?
Should I build the byte string manually?
It seems like I misunderstood something.
I think I can understand what happened to you. The last line gave me the hint, and your garbage code points confirmed it: you are trying to display cp1251 characters, but your editor is configured to use UTF-8.
The # -*- coding: cp1251 -*- line is only used by the Python interpreter to convert characters in source files that are outside of the ASCII range. And anyway it is only used for unicode literals, because bytes from the original source give, er... exactly the same bytes in byte strings. Some text editors are kind enough to automagically honor this line (the IDLE editor is), but I have little confidence in that and always switch manually to the proper encoding when I use gvim, for example. Short story: # -*- coding: cp1251 -*- is unused in your code and can only mislead a reader, since it is not the actual encoding.
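To see what the declaration does and does not affect, here is a Python 2 sketch (it only behaves as shown if the file really is saved in cp1251):
# -*- coding: cp1251 -*-
b = 'юникод'     # byte string: the raw cp1251 bytes are kept as-is
u = u'юникод'    # unicode literal: decoded from cp1251 using the declaration
print repr(b)    # '\xfe\xed\xe8\xea\xee\xe4'
print repr(u)    # u'\u044e\u043d\u0438\u043a\u043e\u0434'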
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, the word юникод is composed of these bytes: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use the CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid UTF-8.
Alternatively, if you are comfortable with your current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give the same results - but now the declared source encoding is the actual encoding of your Python source.
BTW, a Python 3.5 IDLE, which natively uses unicode, confirmed it:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode(), no unicode(). If you are using a literal to create a string, you should use Unicode literals if the string contains text. It is the only option on Python 3, where you can't put non-ascii characters inside a bytes literal, and it is good practice (to use Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
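For example (a Python 2 sketch; the byte string below is the UTF-8 encoding of юникод shown earlier):
data = '\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'  # bytes from some API
text = data.decode('utf-8')  # decode using the encoding of the data's source
print(text)                  # юникод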
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding: Python will normally detect the terminal encoding and encode the print output correctly (unless your terminal is misconfigured).
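You can check what Python detected for your console (a quick Python 2 sketch; the printed value depends on your configuration):
import sys

print sys.stdout.encoding  # e.g. 'cp1251' in a Russian Windows console
print u'юникод'            # encoded for the console using that detected encoding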
I'm trying to write the symbol ● to a text file in python. I think it has something to do with the encoding (utf-8). Here is the code:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write("●")
outFile.close()
Instead of the black "●" I get "â—". How can I fix this?
Open the file using the io module with the encoding set to utf8; this works with both Python 2 and Python 3. When writing, write a unicode string:
import io
outFile = io.open('./myFile.txt', 'w', encoding='utf8')
outFile.write(u'●')
outFile.close()
Tested on Python 2.7.8 and Python 3.4.2
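To verify the result, you can read the file back with the same encoding (again working on both Python 2 and 3):
import io

with io.open('./myFile.txt', 'r', encoding='utf8') as inFile:
    print(inFile.read())  # prints the bullet, if the terminal can display it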
If you are using Python 2, use codecs.open instead of open and unicode instead of str:
# -*- coding: utf-8 -*-
import codecs
outFile = codecs.open('./myFile.txt', 'wb', 'utf-8')
outFile.write(u"●")
outFile.close()
In Python 3, pass the encoding keyword argument to open:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'w', encoding='utf-8')
outFile.write("●")
outFile.close()
>>> ec = u'\u25cf' # unicode("●", "UTF-8")
>>> open("/tmp/file.txt", "w").write(ec.encode('UTF-8'))
This should do the trick
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write(u"\u25CF".encode('utf-8'))
outFile.close()
What your program does is to produce an output file in the same encoding as your program editor (the coding at the top does not matter, unless your program editor uses it for saving the file). Thus, if you open myFile.txt with a program that uses the same encoding as your program editor, everything looks fine.
This does not mean that your program works for everybody.
For this, you must do two things. You must first indicate the encoding used for text files on your machine. This is a little hard to detect, but the following should often work:
# coding=utf-8 # Put your editor's encoding here
import codecs
import locale
import sys
# Selection of the first non-None, reasonable encoding:
out_encoding = (locale.getlocale()[1]
                or locale.getpreferredencoding()
                or sys.stdin.encoding or sys.stdout.encoding
                # Default:
                or "UTF8")
outFile = codecs.open('./myFile.txt', 'w', out_encoding)
Note that it is very important to specify the right coding on top of the file: this must be your program editor's encoding.
If you know the encoding you want for your output file, you can directly put it in open(). Otherwise, the more general and portable out_encoding expression above should work for most users on most computers (i.e., whatever their encoding of choice is, they should be able to read "●" in the resulting file—assuming their computer's encoding can represent it).
Then you must print a string, not bytes:
outFile.write(u"●")
(note the leading u, meaning "unicode string").
For a deeper understanding of the issues at hand, one of my previous answers should be very helpful: UnicodeDecodeError when redirecting to file.
I'm very sorry, but writing a symbol to a text file without saying what the encoding of the file should be is simply nonsense.
It may not be evident at first sight, but text files are indeed encoded, and may be encoded in different ways. If you have only letters (upper and lower case, but not accented ones), digits and simple symbols (everything that has an ASCII code below 128), all should be fine, because 7-bit ASCII is now a standard and in fact those characters have the same representation in all major encodings.
But as soon as you get true symbols or accented chars, their representation varies from one encoding to another. For example, the symbol ● has the UTF-8 representation (in Python notation) '\xe2\x97\x8f'. What is worse, it cannot be represented at all in the latin1 (ISO-8859-1) encoding.
Another example is the French e with acute accent, é: it is represented in UTF-8 as '\xc3\xa9' (note: 2 bytes), but in Latin1 as '\xe9' (one single byte).
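A quick Python 2 sketch illustrating those representations (the byte values are facts of the encodings themselves):
# -*- coding: utf-8 -*-
print repr(u'é'.encode('utf-8'))    # '\xc3\xa9'  (two bytes)
print repr(u'é'.encode('latin-1'))  # '\xe9'      (one byte)
print repr(u'●'.encode('utf-8'))    # '\xe2\x97\x8f'
try:
    u'●'.encode('latin-1')
except UnicodeEncodeError as e:
    print e  # the bullet simply has no latin1 representation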
So I tested your code on my Ubuntu box using a UTF-8 encoding, and the command cat myFile.txt correctly showed the bullet:
sba#sba-ubuntu:~/stackoverflow$ cat myFile.txt
●sba#sba-ubuntu:~/stackoverflow$
(As you didn't add any newline after the bullet, the prompt immediately follows it.)
In conclusion:
Your code correctly writes the bullet to the file in the UTF-8 encoding. If your system natively uses another encoding (ISO-8859-1 or its variant Windows-1252), you cannot natively convert it, because this character simply does not exist in those encodings.
But you can always see it in a text editor that supports different encodings, like the excellent vim that exists on all major systems.
Proof of the above:
On a Windows 7 computer, I opened a vim window and instructed it to accept utf8 with :set encoding=utf8. I then pasted the original code from the OP and saved it to a file foo.py.
I opened a cmd.exe window and executed python foo.py (using Python 2.7): it created a file myFile.txt containing the 3 bytes (in hex) e2 97 8f, which is the UTF-8 representation of the bullet ● (I could confirm it with vim's Tools/Hexa convert).
I could even open myFile.txt in IDLE and actually see the bullet. Even notepad.exe could show the bullet!
So even on a Windows 7 computer that does not natively accept UTF-8, the code from the OP correctly generates a text file that, when opened with a text editor accepting UTF-8, contains the bullet ●.
Of course, if I try to open myFile.txt with vim in latin1 mode, I get â—; in a cmd window with codepage 850, type myFile.txt shows ÔùÅ; and with codepage 1252 (a variant of latin1), â— again.
In conclusion, the original OP code creates a correct UTF-8 encoded file - it is up to the reading part to interpret the UTF-8 correctly.
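In other words, the reading side has to decode the bytes itself, e.g. (a Python 2 sketch):
with open('myFile.txt', 'rb') as f:
    data = f.read()
print repr(data)            # '\xe2\x97\x8f'
print data.decode('utf-8')  # the bullet, if the terminal's encoding can show it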
Huh, ok, so I have this massive problem with encodings and I just do not know how to deal with it. After two days of Google searches I think I've just run out of options :)
What I want to do is the following:
1. Place text in a textbox on a website.
2. Send the text to the backend (written in Python).
3. Use the text to create:
   a. An image in PIL.
   b. An entry in MySQL.
Now all of this works smoothly when we're talking about regular characters. But when I try to use Korean, Polish, or Japanese characters, I get very weird-looking characters inserted in both the image and the database. In the examples below I'll use a three-character string of Polish characters: "ąść".
Here's what I have done after Googling.
Inserted the following in .htaccess:
AddCharset UTF-8 .py .css .js .html
My python file now starts with:
#!/usr/bin/python
# -*- coding: utf-8 -*-
All of my MySQL databases are encoded in "utf8_unicode_ci".
Now, here's an example of what I'm trying to do. Whenever I parse "ąść" (three Polish characters), it gets saved in the database and rendered on the image as:
ąść
Now, a few debugging notes. I go directly into Python and assign the following to the variable (value_text1) that usually receives the parsed text (so: no text parsing, simply fixed text to generate the image with and put into the database):
A) If I go with value_text1 = 'ąść' I get …ść as a result.
B) If I go with value_text1 = u'ąść' I get the following error message:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
C) If I go with value_text1 = u'ąść'.encode('UTF-8') I get …ść as a result.
D) If I go with value_text1 = u'\u0105\u015B\u0107'.encode('UTF-8'), where "\u0105\u015B\u0107" is the actual unicode for "ąść" I get …ść as a result.
I really have no clue what I'm doing wrong - server settings, Python file settings, a wrong command? I will appreciate any thoughts; huge thank you in advance.
If I try it in an interactive shell or from a .py file
#!/usr/bin/python
# -*- coding: utf-8 -*-
value_text1 = u'ąść'
print value_text1
it works perfectly well for me, so I guess it's something with your server configuration.
BTW, make sure to use charset="utf-8" when connecting to the server.
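For example, with the MySQLdb driver the connection could look like this (a sketch; the host, credentials and table layout are placeholders):
# -*- coding: utf-8 -*-
import MySQLdb

# charset='utf8' makes the connection use UTF-8 end to end:
conn = MySQLdb.connect(host='localhost', user='user', passwd='password',
                       db='mydb', charset='utf8', use_unicode=True)
cur = conn.cursor()
# Pass the text as a unicode string; the driver encodes it as UTF-8:
cur.execute("INSERT INTO entries (value_text1) VALUES (%s)", (u'ąść',))
conn.commit()
conn.close()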
I have a python script that starts with:
#!/usr/bin/env python
# -*- coding: ASCII -*-
and prior to saving, it always splits my window, and asks:
Warning (mule): Invalid coding system `ASCII' is specified
for the current buffer/file by the :coding tag.
It is highly recommended to fix it before writing to a file.
and I need to say yes. Is there a way to disable this? Sorry for asking, but I had no luck on Google.
Gabriel
A solution that doesn't involve changing the script is to tell Emacs what ASCII means as a coding system. (By default, Emacs calls it US-ASCII instead.) Add this to your .emacs file:
(define-coding-system-alias 'ascii 'us-ascii)
Then Emacs should be able to understand # -*- coding: ASCII -*-.
The Python Enhancement Proposal (PEP) 263, Defining Python Source Code Encodings, discusses a number of ways of defining the source code encoding. Two particular points are relevant here:
Without encoding comment, Python's parser will assume ASCII
So you don't need this at all in your file. Still, if you do want to be explicit about the file encoding:
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
# coding=<encoding name>
(note that the = can be replaced by a :). So you can use # coding: ascii instead of the more verbose # -*- coding: ASCII -*-, as suggested by this answer. This seems to keep Emacs happy.