Unicode problems in Python

Unicode problems in Python - python

Huh, ok, so I have this massive problem with encodings and I just do not know how to deal with it. After two days of Google searches I think I just run out of options :)
What I want to do is the following.
Place text in a textbox on a website
Send the text to the backend (written in Python)
Use the text to create:
a. An image in PIL.
b. An entry in MySQL.
Now all of this works smoothly when we're talking about regular characters. But when I try to use Korean, Polish, Japanese characters I get very weird looking characters inserted in both the image and the database. In the examples below I'll use a three character string of Polish characters - "ąść".
Here's what I have done after Googling.
Inserted the following in .htaccess:
AddCharset UTF-8 .py .css .js .html
My python file now starts with:
#!/usr/bin/python
# -*- coding: utf-8 -*-
All of my MySQL databases are encoded in "utf8_unicode_ci".
Now, here's an example of what I'm trying to do... Whenever I parse "ąść" (three Polish characters) it gets saved in the database and generated on the image as:
Ä…Å›Ä‡
Now, a few debugging issues. I go directly to Python and assign the following to the variable (value_text1) that usually has its text parsed (so - no text parsing, simply set fixed text to generate the image with and put into the database):
A) If I go with value_text1 = 'ąść' I get …Å›Ä‡ as a result.
B) If I go with value_text1 = u'ąść' I get the following error message:
UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 0-1: ordinal not in range(256)
C) If I go with value_text1 = u'ąść'.encode('UTF-8') I get …Å›Ä‡ as a result.
D) If I go with value_text1 = u'\u0105\u015B\u0107'.encode('UTF-8'), where "\u0105\u015B\u0107" is the actual unicode for "ąść" I get …Å›Ä‡ as a result.
Really no clue what I'm doing wrong - server settings, python file settings, wrong command? Will appreciate any thoughts, huge thank you in advance.

If I try it in an interactive shell or from a .py file
#!/usr/bin/python
# -*- coding: utf-8 -*-
value_text1 = u'ąść'
print value_text1
it works perfectly well for me, so I guess it's something with your server configuration.
BTW, make sure to use charset="utf-8" when connecting to the server.

Related

Problems decoding German "Umlaute" of QR-Codes with "pyzbar"

I wrote a Python (V3.9.9) program (Windows 10) to decode QR-Codes of type "EPC QR" -> see Wikipedia
Everything is working fine, except, if there are German "Umlaute" (ÄÖÜäöü) within the text of the QR-Code. Here is a sample program, to demonstrate/isolate the problem:
import cv2 # Read image / camera/video input
from pyzbar.pyzbar import decode
img = cv2.imread ("GiroCodeUmlaute.PNG")
print (decode (img))
for code in decode (img):
print (code.type)
print (code.data.decode ("UTF-8"))
And here is the QR-Code for testing:
GiroCodeUmlaute.PNG -> see QR-Code generator
The 6th line of the encoded QR-Code text contains "Ärzte ohne Grenzen".
But when it's decoded with "UTF-8" (which is the correct character set), then "ﾃвzte ohne Grenzen" is displayed.
I think, also the decoded read hex data are looking a bit strange:
[Decoded(data=b'BCD\n002\n1\nSCT\nRLNWATWW\n\xef\xbe\x83\xd0\xb2zte ohne Grenzen...
From where are the 4(!) hex bytes coming? \xef\xbe\x83\xd0\xb2zte
Where is the 'r' of the original text?
The same problem occurs, if this test-program is running under a Raspberry computer.
If this sample QR-Code is scanned by an android mobile app, the "Umlaut" is correct displayed.
From my point of view it looks like a problem of the "pyzbar" module. But maybe I'm doing something wrong?
Every help and tip is appreciated!
Thanks

I faced the same Problem with Swiss Bill QR-Codes on Linux using the python3-qrtools - guess this module is using the zbar-library as well.
A workaround would be to replace the wrong chars in the decoded string. Depending on the output you get out with print (code.data.decode ("UTF-8")) for your Umlaute, I got the following symbols which I need to replace with ä, ö resp. ü:
string = string.replace('瓣', 'ä')
string = string.replace('繹', 'ö')
string = string.replace('羹', 'ü')
It's not a very elegant solution because it does not deal with all special characters which get wrongly decoded.
As I'm calling the python program from a Pascal/Delphi/lazarus Program, I hope to find an other tool somewhere around which reads QR-codes from the webcam and correctly decode it...

Python: Unable to write slanted apostrophe to file

I'm using Python 2.7 and unable to upgrade to 3.x just yet.
I need to read data from the database and write to a file.
Using the database query tool, I see the string I need to retrieve contains the following slanted apostrophe:
I’ve
When I read the database from python and simply print it to the console, I see the following. It has converted the slanted apostrophe to a normal apostrophe. This is actually my preferred behavior:
print(message)
I've
But when I try to write the string to a file, it instead writes the apostrophe as a question mark:
I?ve
My original code just does this:
file = open(path,"w")
file.write(message)
file.close()
To try to fix it, I did the following, but it did not help. The question mark is still showing up:
# -*- coding: utf-8 -*-
import codecs
file = codecs.open(path,"w", "utf-8")
file.write(message)
file.close()

Inserting a unicode text into pyx canvas

I have a set of UTF-8 characters that I would like to insert into a PyX generated pdf file.
I have included # -*- coding: utf-8 -*- to top of the file. The code is somewhat similar to the following:
# -*- coding: utf-8 -*-
c = canvas.canvas()
txt = "u'aあä'"
c.text(2, 2, "ID: %s"%txt)
c.writeEPSfile("filename.eps")
But I still can't get my head around this.
Error:
'ascii' codec can't encode character u'\xae' in position 47: ordinal not in range(128)

Try this:
# -*- coding: utf-8 -*-
c = canvas.canvas()
txt = u'aあä'.encode('utf-8')
c.text(1, 4, "UID: %s"%(txt))
c.writeEPSfile("filename.eps")

You can setup PyX to pass unicode characters to (La)TeX. Then it all becomes a problem to produce the characters in question within TeX/LaTeX. Here is a rather minimal solution to produce the output in question:
from pyx import *
text.set(cls=text.LatexRunner, texenc='utf-8')
text.preamble(r'\usepackage[utf8]{inputenc}')
text.preamble(r'\usepackage{newunicodechar}')
text.preamble(r"\newunicodechar{あ}{{\usefont{U}{min}{m}{n}\symbol{'102}}}")
text.preamble(r'\DeclareFontFamily{U}{min}{}')
text.preamble(r'\DeclareFontShape{U}{min}{m}{n}{<-> udmj30}{}')
c = canvas.canvas()
c.text(0, 0, 'UID: aあä')
c.writeGSfile('utf8.png')
This directly results in the output (as PNG as uploaded here):
Note that this was done using PyX 0.13 on Python 3 and a rather standard LaTeX installation. Also, I used some information from https://tex.stackexchange.com/questions/171611/how-to-write-a-single-hiragana-character-in-latex about creating those characters in LaTeX. There seem to be solutions like CJKutf8 to setup all kind of characters for direct use as unicode characters within LaTeX, but this is way out of my experience. Anyway, it should all work fine from within PyX like it does from LaTeX itself if all the setup has been done properly. Good luck!

Maybe you can find an according set from the babel package
I ran into the same error when I tried to insert the german ä (a umlaut). I simply added the german babel package:
text.preamble(r"\usepackage[ngerman]{babel}")
After that, this was possible without errors:
c.text(12, 34, "äöüßß")
I also used an utf8 input encoding, I think this it is necessary as well.
Further reading:
https://en.wikibooks.org/wiki/LaTeX/Internationalization
https://en.wikibooks.org/wiki/LaTeX/Fonts

Python ─ UTF-8 filename from HTML form via CherryPy

Python Header: # ! /usr/bin/env python
# -*- coding: utf-8 -*-
# image_upload.py
Cherrypy Config: cherrypy.config.update(
{'tools.encode.on': True,
'tools.encode.encoding': 'utf-8',
'tools.decode.on': True,
},)
HTML Header: <head><meta http-equiv="Content-Type"
content="text/html;charset=ISO-8859-1"></head>
""" Python 2.7.3
Cherrypy 3.2.2
Ubuntu 12.04
"""
With an HTML form, I'm uploading an image file to a database. That works so far without problems. However, if the filename ist not 100% in ASCII, there seems to be no way to retrieve it in UTF-8. This is weird, because with the HTML text input fields it works without problems, from saving until showing. Therefore I assume that it's an encoding or decoding problem with the web application framework CherryPy, because the upload is handeld by it, like here.
How it works:
The HTML form POSTs the uploaded file to another Python function, which receives the file in the standard dictionary **kwargs. From here you get the filename with extention, like this: filename = kwargs['file'].filename. But that's already with the wrong encoding. Until now the image hasn't been processed, stored or used in any way.
I'm asking for a solution, which would prevent it, to just parse the filename and change it back "manually". I guess the result already is in UTF-8, which makes it cumbersome to get it right. That's why getting CherryPy to do it, might be the best way. But maybe it's even an HTML issue, because the file comes from a form.
Here are the wrong decoded umlauts.
What I need is the input as result.
input → result input → result
ä → Ã¤ Ä → Ã„
ö → Ã¶ Ö → Ã–
ü → Ã¼ Ü → Ãœ
Following are the failed attempts to get the right result, which would be: "Würfel"
NOTE: img_file = kwargs['file']
original attempt:
result = img_file.filename.rsplit('.',1)[0]
result: "WÃ¼rfel"
change system encoding:
reload(sys)
sys.setdefaultencoding('utf-8')
result: "WÃ¼rfel"
encoding attempt 1:
result = img_file.filename.rsplit('.',1)[0].encode('utf-8')
result: "WÃ¼rfel"
encoding attempt 2:
result = unicode(img_file.filename.rsplit('.',1)[0], 'urf-8')
Error Message:
TypeError: decoding Unicode is not supported
decoding attempt:
result = img_file.filename.rsplit('.',1)[0].decode('utf-8')
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
cast attempt:
result = str(img_file.filename.rsplit('.',1)[0])
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

Trying with your string it seems I can get the filename using latin1 encoding.
>>> s = u'W\xc3\xbcrfel.jpg'
>>> print s.encode('latin1')
Würfel.jpg
>>>
You simply need to use that .encode('latin1') before splitting.
But the problem here is broader. You really need to figure out why your web encoding is latin1 instead of utf8. I don't know cherrypy but try to ensure to use utf8 or you could get in other glitches when serving your application through a webserver like apache or nginx.

The problem is that you serve your HTML with charset ISO-8859-1; this makes the browsers confused and they use the charset also when sending to server. Serve all your HTML always with UTF-8, code in UTF-8, and set your terminal to UTF-8, and you shouldn't have problems.

Translate special character ½

I am reading a source that contains the special character ½. How do I convert this to 1/2? The character is part of a sentence and I still need to be able to use this string "normally". I am reading webpage sources, so I'm not sure that I will always know the encoding??
Edit: I have tried looking at other answers, but they don't work for me. They always seem to start with something like:
s= u'£10"
but I get an error already there: "no encoding declared". But do I know what encoding I'm getting in, or does that not matter? Do I just pick one?

This is really two questions.
#1. To interpret ½: Use the unicodedata module. You can ask for the numeric value of the character or you can normalize using a canonical normalization form it and parse it yourself.
>>> import unicodedata
>>> unicodedata.numeric(u'½')
0.5
>>> unicodedata.normalize('NFKC', u'½')
'1⁄2'
#2. Encoding problems: If you're working with the terminal, make sure Python knows the terminal encoding. If you're writing source files, make sure Python knows the file encoding. You can't just "pick" an encoding to set for Python, you must inform Python about the encoding that your terminal / text editor already uses.
Python lets you set the encoding of files with Vim/Emacs style comments. Put a comment at the top of the file like this if you use Vim:
# coding=UTF-8
Or this, if you use Emacs:
# -*- coding: UTF-8 -*-
If you use neither Vim nor Emacs, then it doesn't matter which one. Obviously, if you don't use UTF-8 you should substitute the encoding you actually use. (UTF-8 is the only encoding I can recommend.)

Dietrich beat me to the punch, but here is some more detail about setting the encoding for your source file:
Because you want to search for a literal unicode ½, you need to be able to write it in your source file. Unfortunately, the Python interpreter chokes on any unicode input, unless you specify the encoding of that source file with a comment in the first couple of lines, like so:
# coding=utf8
# ... do stuff here ...
This assumes your editor is saving the file as UTF-8. If it's using a different encoding specify that instead. See PEP-0263 for more details.
Once you've specified the encoding you should be able to write something this in your code:
text = text.replace('½', '1/2')
Encoding of the webpage
Depending on how you are downloading the page, you probably don't need to worry about this at all, most HTTP libraries handle choosing the encoding for you automatically.

Did you try using codecs to read your file? [docs]
import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
You can check the whole guide here.
Also a good ref: http://docs.python.org/howto/unicode

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.