I'm having issues reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:
kMDItemAuthors = (
"To\U0304ny\U0308 Sta\U030ark"
)
I see this when I run mdls -name kMDItemAuthors path/to/the/file
I am attempting to get this data into usable form within a Python script. However, I cannot turn the escaped Unicode text into actual Unicode characters in Python.
Here's what I am currently doing:
import unicodedata
import subprocess
import os
os.environ['LANG'] = 'en_US.UTF-8'
cmd = 'mdls -name kMDItemAuthors path/to/the/file'
proc = subprocess.Popen(cmd,
                        shell=True,
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
(stdout, stderr) = proc.communicate()
u = unicode(stdout, 'utf8')
a = unicodedata.normalize('NFC', u)
Now, when I print(a), I get the exact same string representation as above. I have tried normalizing with all of the options (NFC, NFD, NFKC, NFKD), all with the same result.
The weirder thing is, when I try this code:
print('To\U0304ny\U0308 Sta\U030ark')
I get the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-7: truncated \UXXXXXXXX escape
So, when that substring is inside the variable there's no problem, but as a string literal it creates an issue.
I had felt pretty strong in my understanding of Python and Unicode, but now the shell has broken me. Any help would be greatly appreciated.
PS. I am running all this in Python 2.7.X
You have multiple problems here.
Like all escape sequences, Python only interprets the \U sequence in string literals in your source code. If a file actually has a \ followed by a U in it, Python isn't going to treat that as anything other than a \ and a U, any more than it'll treat a \ followed by an n as a newline. If you want to unescape them manually, you can, by using the unicode_escape codec. (But note that this will treat your file as ASCII, not UTF-8. If you actually have both UTF-8 text and \U sequences, you will have to decode it as UTF-8 first, re-encode it to ASCII with the backslashreplace error handler, and then decode that with unicode_escape; encoding with unicode_escape itself would double up the backslashes of your existing escapes and leave them uninterpreted.)
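A rough sketch of that dance in Python 2, with a made-up byte string standing in for the file contents (the raw value here is purely illustrative):
# Hypothetical input: UTF-8 bytes with a literal \u escape mixed in
raw = 'caf\xc3\xa9 \\u00e9'                      # "café \u00e9" as UTF-8
text = raw.decode('utf-8')                       # u'café \\u00e9'
text = text.encode('ascii', 'backslashreplace')  # 'caf\\xe9 \\u00e9'
text = text.decode('unicode_escape')             # u'café é'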
A Python \U sequence requires 8 digits, not 4. If you only have 4, you have to use \u. So, whatever program generated this string, it can't be parsed with unicode_escape as-is. You might be able to hack it into shape with a quick-and-dirty workaround like s.replace(r'\U', r'\U0000') or s.replace(r'\U', r'\u'), or you may have to write a simple parser for it.
In your test, you're trying to use \U escapes in a string literal. You can only do that in Unicode string literals, like print(u'To\U0304ny\U0308 Sta\U030ark'). (If you do that, of course, you'll get the previous error again.)
Also, since this appears to be a Mac, you probably shouldn't be doing os.environ['LANG'] = 'en_US.UTF-8'. If Python sees that it's on OS X, it assumes everything is UTF-8. Anything you do to try to force UTF-8 will probably do nothing, and could in theory confuse it so it doesn't notice it's on OS X. Unless you're trying to work around a driver program that intentionally sets the locale to "C" before calling your script, you're usually better off not doing this.
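Putting the three points together, a hedged end-to-end sketch for your mdls output (Python 2; it reuses the stdout variable from your snippet, and authors is just an illustrative name):
out = stdout.decode('utf-8')       # mdls output as a unicode string
out = out.replace(u'\\U', u'\\u')  # the 4-digit \U escapes become Python's \u
authors = out.encode('ascii', 'backslashreplace').decode('unicode_escape')
print(authors)                     # the escapes are now real combining characters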
As mentioned in the other answers, here's a slightly more direct code example:
>>> s="To\U0304ny\U0308 Sta\U030ark"
>>> s
'To\\U0304ny\\U0308 Sta\\U030ark'
>>> s.replace("\\U","\\u").decode("unicode-escape")
u'To\u0304ny\u0308 Sta\u030ark'
>>> print s.replace("\\U","\\u").decode("unicode-escape")
Tōnÿ Stårk
\U is for characters outside the BMP, i.e. it takes 8 hex digits. For characters within the BMP use \u.
>>> print u'To\u0304ny\u0308 Sta\u030ark'
Tōnÿ Stårk
In Python 3:
>>> print('To\u0304ny\u0308 Sta\u030ark')
Tōnÿ Stårk
Related
How can I correctly print Vietnamese CP1258-encoded characters in Python 3? My terminal doesn't seem to be the issue, as it will print the first print statement in my code correctly. I am trying to decode hex characters to Vietnamese.
Code:
import binascii
data = 'tạm biệt'
print(data) # tạm biệt
a = binascii.hexlify(data.encode('cp1258', errors='backslashreplace'))
print(a) # b'745c75316561316d2062695c753165633774'
# if I don't use the error handler here, then I get a UnicodeEncodeError for \u1ea1
print(
    binascii.unhexlify(a).decode('cp1258')  # t\u1ea1m bi\u1ec7t
)
There seems to be an omission in Python's support for code page 1258. The legacy codec does support Vietnamese by way of combining diacritics, but Python doesn't know how to convert Unicode to these combinations. I guess you will have to perform your own conversion.
As a first step, observe that unicodedata.normalize('NFD', data) splits the representation into a base character and a sequence of combining diacritics.
>>> unicodedata.normalize('NFD', data).encode('utf-8')
b'ta\xcc\xa3m bie\xcc\xa3\xcc\x82t'
>>> '{0:04x}'.format(ord(b'\xcc\xa3'.decode('utf-8')))
'0323'
So U+0323 is the combining Unicode diacritic for dot-under, and this correspondence should be known to the codec (the Wikipedia page on code page 1258 shows the same Unicode character code for the CP1258 code point 0xF2).
I don't know enough about the target codec to tell you how to map these to CP1258, but if you are lucky, there is already some sort of mapping of these in the Python codec.
iconv on my Mojave MacOS seems to convert this without a hitch:
$ iconv -t cp1258 <<<'tạm biệt' | xxd
00000000: 7461 f26d 2062 69ea f274 0a ta.m bi..t.
From this, it looks like the diacritic applies straightforwardly as a suffix -- 61 is a, and as noted above, f2 is the combining diacritic to place a dot under the main glyph.
If you have a working iconv, a quick and dirty workaround might be to run it as a subprocess.
import subprocess
converted = subprocess.run(['iconv', '-t', 'cp1258'],
                           input=data.encode('utf-8'),
                           stdout=subprocess.PIPE).stdout
If my understanding is correct, this should really be reported as a bug in Python. It should definitely know how to round-trip between this codec and Unicode if it wants to claim that it supports it.
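If iconv is not an option, here is a rough sketch of such a manual conversion (Python 3). The U+0323 -> 0xF2 correspondence is confirmed by the iconv dump above; the other table entries are read off the CP1258 code chart and are assumptions to double-check:
import unicodedata

# Assumed mapping from combining marks to CP1258's standalone diacritic bytes
MARK_BYTES = {
    u'\u0300': 0xCC,  # combining grave accent
    u'\u0301': 0xEC,  # combining acute accent
    u'\u0303': 0xDE,  # combining tilde
    u'\u0309': 0xD2,  # combining hook above
    u'\u0323': 0xF2,  # combining dot below (confirmed above)
}

def clusters(s):
    """Split an NFD string into groups of one base character plus its combining marks."""
    group = []
    for ch in s:
        if group and not unicodedata.combining(ch):
            yield group
            group = []
        group.append(ch)
    if group:
        yield group

def to_cp1258(text):
    out = bytearray()
    for cluster in clusters(unicodedata.normalize('NFD', text)):
        base = [c for c in cluster if c not in MARK_BYTES]
        marks = [c for c in cluster if c in MARK_BYTES]
        # Re-compose what CP1258 has precomposed (e.g. e + circumflex -> ê),
        # then append the leftover marks as standalone diacritic bytes.
        out += unicodedata.normalize('NFC', ''.join(base)).encode('cp1258')
        out += bytes(MARK_BYTES[m] for m in marks)
    return bytes(out)

print(to_cp1258('tạm biệt').hex())  # 7461f26d206269eaf274, matching iconv minus its trailing 0a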
I figured it out. Decoding with unicode-escape does the trick.
import binascii
data = u'tạm biệt'
print(data) # tạm biệt
a = binascii.hexlify(data.encode('cp1258', errors='backslashreplace'))
print(a) # b'745c75316561316d2062695c753165633774'
# if I don't use the error handler here, then I get a UnicodeEncodeError for \u1ea1
print(
    binascii.unhexlify(a).decode('unicode-escape')  # tạm biệt
)
I use Python 2.7.10.
While dealing with character encoding, and after reading a lot of Stack Overflow posts etc. on the subject, I encountered this behaviour, which looks strange to me. The Python interpreter input
>>>u'\u00b0'
results in the following output:
u'\xb0'
I could reproduce this behaviour using a DOS window, the IDLE console, and the Wing IDE Python shell.
My assumptions (correct me if I am wrong):
The "degree symbol" has unicode 0x00b0, utf-8 code 0xc2b0, latin-1 code 0xb0.
The Python docs say a string literal with a u prefix is encoded using Unicode.
Question: why is the result converted to a Unicode string literal with a byte-escape sequence that matches the Latin-1 encoding, instead of preserving the Unicode escape sequence?
Thanks in advance for any help.
Python uses some rules for determining what to output from repr for each character. The rule for Unicode character codepoints in the 0x0080 to 0x00ff range is to use the sequence \xdd where dd is the hex code, at least in Python 2. There's no way to change it. In Python 3, all printable characters will be displayed without converting to a hex code.
As for why it looks like Latin-1 encoding, it's because Unicode started with Latin-1 as the base. All the codepoints up to 0xff match their Latin-1 counterpart.
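A quick interactive check of that cutoff in Python 2:
>>> u'\u00b0'    # codepoint in the 0x80-0xff range: repr uses \x
u'\xb0'
>>> u'\u0100'    # codepoint above 0xff: repr keeps \u
u'\u0100'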
When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post, the rest just shows further (ranting) thoughts about it
Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?:
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because U+00FC is the codepoint for LATIN SMALL LETTER U WITH DIAERESIS.
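You can verify the round-trip at the prompt:
>>> u'Fortuna D\xfcsseldorf' == u'Fortuna Düsseldorf'
True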
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function gives you the Python 2 repr() behaviour still.
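For example, in Python 3:
>>> 'Fortuna Düsseldorf'          # repr() leaves printable non-ASCII alone
'Fortuna Düsseldorf'
>>> ascii('Fortuna Düsseldorf')   # the old Python 2 repr() behaviour
"'Fortuna D\\xfcsseldorf'"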
I'm on an OS X machine running Python 2.7. I'm trying to do an os.walk on an SMB share.
for root, dirnames, filenames in os.walk("./test"):
    for filename in filenames:
        print filename
        matchObj = re.match(r".*ö.*", filename, re.UNICODE)
If I use the above code, it works as long as the filenames do not contain umlauts.
In my shell the umlauts are printed fine, but when I copy them back into a UTF-8 formatted text editor (in my case Sublime), I get:
screenshot
Expected:
filename.jpeg
filename_ö.jpg
Of course the regex fails with that.
If I hardcode the filename like:
re.match( r".*ö.*",'filename_ö',re.UNICODE)
it works fine.
I tried:
os.walk(u"./test")
filename.decode('utf8')
but gives me:
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0308' in position 10: ordinal not in range(128)
u'\u0308' are the dots above the umlauts.
I'm overlooking something stupid, I guess?
Unicode characters can be represented in various forms; there's "ö", but then there's also the possibility to represent that same character using an "o" and separate combining diacritics. OS X generally prefers the separated variant, and your editor doesn't seem to handle that very gracefully, nor do these two separate characters match your regex.
You need to normalize your Unicode data if you require one way or the other in particular. See unicodedata.normalize. You want the NFC normalized form.
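A minimal sketch of that fix, assuming Python 2 with a # -*- coding: utf-8 -*- declaration at the top and the same ./test directory:
import os
import re
import unicodedata

pattern = re.compile(u".*ö.*", re.UNICODE)
for root, dirnames, filenames in os.walk(u"./test"):
    for filename in filenames:
        # HFS+ hands back decomposed (NFD) names; fold them to NFC before matching
        if pattern.match(unicodedata.normalize('NFC', filename)):
            print(filename)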
There are several issues:
The screenshot, as @deceze explained, is due to Unicode normalization. Note: the codepoints need not look different, e.g., ö (U+00F6) and ö (U+006F U+0308) look the same in my browser.
r".*ö.*" is a bytestring in Python 2 and the value depends on the encoding declaration at the top of your Python source file (something like: # -*- coding: utf-8 -*-) e.g., if the declared encoding is utf-8 then 'ö' bytestring is a sequence of two bytes: '\xc3\xb6'.
There is no way for the regex engine to know the actual encoding that should be used to interpret input bytestrings.
You should not use bytestrings to represent text; use Unicode instead (either use u'' literals or add from __future__ import unicode_literals at the top).
filename.decode('utf8') raises UnicodeEncodeError if you use os.walk(u"./test") because filename is Unicode already. Python 2 tries to encode filename implicitly using the default encoding that is 'ascii'. Do not decode Unicode: drop .decode('utf-8')
btw, the last two issues are impossible in Python 3: r".*ö.*" is a Unicode literal, and you can't create a bytestring with literal non-ascii characters there, and there is no .decode() method (you would get AttributeError if you try to decode Unicode). You could run your script on Python 3, to detect Unicode-related bugs.
I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).
I can use "print" to display them, but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How can I parse this?
If I type 'python unicode' into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what's going on.
You really do need to read and understand these sorts of overviews, however long they seem. There really isn't any getting around it. Text is hard. There is no such thing as "plain text", there hasn't been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.
You also should read http://www.joelonsoftware.com/articles/Unicode.html .
This error occurs when you pass a Unicode string containing non-English characters (Unicode characters beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters beyond 128 produces the error.
The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.
The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors. For example:
s = u'La Pe\xf1a'
print s.encode('latin-1')
or
write(s.encode('latin-1'))
will encode using latin-1
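The decode side of the same coin, at the interactive prompt (traceback abridged):
>>> unicode('La Pe\xf1a')              # no encoding given: ASCII is assumed
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 5: ordinal not in range(128)
>>> unicode('La Pe\xf1a', 'latin-1')   # explicit encoding succeeds
u'La Pe\xf1a'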
The answer to your question is "use codecs". The appended code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization
import codecs
import gettext
import wx

localedir = './locale'
langid = wx.LANGUAGE_DEFAULT  # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)
translater = gettext.translation(domain, localedir,
                                 [mylocale.GetCanonicalName()], fallback=True)
translater.install(unicode=True)
# translater.install() installs the gettext _() translater function into our namespace...
msg = _("A message that gettext will translate, probably putting Unicode in here")
# use codecs.open() to convert Unicode strings to UTF-8
Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
Logfile.write(msg + '\n')
Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).
So ... HTH...
GaJ