I'm on an OS X machine running Python 2.7. I'm trying to do an os.walk on an SMB share.
import os
import re

for root, dirnames, filenames in os.walk("./test"):
    for filename in filenames:
        print filename
        matchObj = re.match(r".*ö.*", filename, re.UNICODE)
If I use the above code, it works as long as the filenames do not contain umlauts.
In my shell the umlauts print fine, but when I copy them back into a UTF-8 formatted text editor (in my case Sublime), I get:
[screenshot of the pasted filenames]
Expected:
filename.jpeg
filename_ö.jpg
Of course the regex fails with that.
If I hardcode the filename like:
re.match(r".*ö.*", 'filename_ö', re.UNICODE)
it works fine.
I tried:
os.walk(u"./test")
filename.decode('utf8')
but that gives me:
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0308' in position 10: ordinal not in range(128)
u'\u0308' is the combining diaeresis - the dots above the umlaut.
I'm overlooking something stupid, I guess?
Unicode characters can be represented in various forms; there's "ö", but then there's also the possibility to represent that same character using an "o" and separate combining diacritics. OS X generally prefers the separated variant, and your editor doesn't seem to handle that very gracefully, nor do these two separate characters match your regex.
You need to normalize your Unicode data if you require one particular form. See unicodedata.normalize; here you want the NFC normalized form.
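A minimal sketch of how that could look in your loop (my assumptions: a utf-8 source encoding declaration, and a Unicode start path so os.walk returns Unicode filenames):

# -*- coding: utf-8 -*-
import os
import re
import unicodedata

for root, dirnames, filenames in os.walk(u"./test"):
    for filename in filenames:
        # OS X hands back decomposed (NFD) names; fold them to NFC so the
        # precomposed "ö" in the pattern can match.
        filename = unicodedata.normalize('NFC', filename)
        if re.match(u".*ö.*", filename, re.UNICODE):
            print filename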
There are several issues:
The screenshot, as @deceze explained, is due to Unicode normalization. Note that the code points do not even need to look different: ö (U+00F6) and ö (U+006F U+0308) look the same in my browser.
r".*ö.*" is a bytestring in Python 2 and the value depends on the encoding declaration at the top of your Python source file (something like: # -*- coding: utf-8 -*-) e.g., if the declared encoding is utf-8 then 'ö' bytestring is a sequence of two bytes: '\xc3\xb6'.
There is no way for the regex engine to know the actual encoding that should be used to interpret input bytestrings.
You should not use bytestrings to represent text; use Unicode instead (either use u'' literals or add from __future__ import unicode_literals at the top).
filename.decode('utf8') raises UnicodeEncodeError if you use os.walk(u"./test") because filename is already Unicode; Python 2 tries to encode filename implicitly, using the default encoding, which is 'ascii'. Do not decode Unicode: drop .decode('utf-8') (see the sketch after this list).
By the way, the last two issues are impossible in Python 3: r".*ö.*" is a Unicode literal there, you can't create a bytestring with literal non-ASCII characters, and Unicode strings have no .decode() method (you would get AttributeError if you tried). You could run your script on Python 3 to detect Unicode-related bugs.
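A short Python 2 sketch of those last two pitfalls and their fixes (assuming a utf-8 source declaration; the names are made up for illustration):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # bare '...' literals become unicode
import re

pattern = ".*ö.*"        # a unicode pattern, thanks to the future import
name = "filename_ö"      # likewise unicode

print re.match(pattern, name, re.UNICODE) is not None  # True

# name.decode('utf-8')   # would raise UnicodeEncodeError here: Python 2
#                        # first encodes the unicode object with the default
#                        # 'ascii' codec before decoding it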
Related
I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?
Now, encoding is a minefield, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years, Python 2 assumes ASCII unless you define an encoding at the top of your script - mainly because either it's compiled that way, or the OS/terminal uses ASCII as its primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
is the ASCII representation of a Unicode string. Somehow Python needs to tell you there's an ö in there - but it can't with ASCII, because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find an ASCII character/string called \xf6, which is outside the accepted byte range of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character, which will cause the same exception.
Hence the "'ascii' codec can't decode byte..." message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output will be the same as in the example above, because \xf6 is ö in Unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of to convert your representation into an ASCII ö, because there really isn't such a thing. There are just different ways Python does this encoding magic for you, to make you believe it's possible to "just write ö".
In Python 3, the strings you encounter are Unicode by default and raw bytes are a separate type, so strings are treated a bit differently than in Python 2. And for the most part it's a lot easier.
There are numerous ways to change the encoding that are not part of standard practice, but there are ways to do it.
The closest to "good" practice would be the locale:
import locale
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assumes/uses ASCII as its encoder/decoder.
ö is not part of ASCII, therefore you'll always see \xf6 if you've somehow gotten Unicode characters in there. Normally, if you print u'Ganztags ge\xf6ffnet', it will be shown with an ö because of automatic encoding. If you need to verify whether input matches that string, you have to compare Unicode with Unicode: u'ö' == u'ö'. If other systems depend on this data, encode it with something they understand: .encode('UTF-8'). But replacing \xf6 with ö is the same thing - ö just doesn't exist in ASCII, so you need to write u'ö', which results in the same data in the end.
As you are using the German language, you should be aware of non-ASCII characters. You need to know whether your system prefers Latin-1 (Windows console and some Unixes), UTF-8 (most Linux variants), or native Unicode (Windows GUI).
If you can process everything as native Unicode, things are cleaner, and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independent of the Python source file charset.
If you have to output byte strings or store them in files, you should encode them in UTF-8 (which can represent any Unicode character, but code points above 127 take more than one byte) or Latin-1 (one byte per character, but it only supports code points below 256).
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.
Although the title is a question, the short answer is apparently no; I've tried it in the shell. The real question is: why?
PS: string contains some non-ASCII characters (like Chinese), and XXX is the current encoding of string.
>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4' while the second one is u'\u4e2d\u6587'
The example is above. I am using Windows, Chinese Simplified; the default encoding is GBK, and so is the Python shell's. And I got two unequal unicode objects.
UPDATES
>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文
>>> b = u'中文'
>>> print b
ÖÐÎÄ
Yes, str.decode() usually returns a unicode string, if the codec can successfully decode the bytes. But the values only represent the same text if the correct codec is used.
Your sample text is not using the right codec; you have text that is GBK encoded, decoded as Latin1:
>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'
The values are indeed not equal, because they are not the same text.
Again, it is important that you use the right codec; a different codec will result in very different results:
>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ
I encoded the sample text to Latin-1, not GBK or UTF-8. Decoding may have succeeded, but the resulting text is not readable.
Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python has asked the terminal what codec was used, it was able to decode back again from the u'....' Unicode literal value. When printing the encoded.decode('utf8') unicode result, Python once more auto-encodes the data to fit my terminal encoding.
To see what codec Python detected, print sys.stdin.encoding:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
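For example, a source file with an explicit codec notation (a sketch; any Unicode encoding could be declared):

# -*- coding: utf-8 -*-
# Python 2 decodes the u'...' literal below using the declared codec, so
# it yields u'\u4e2d\u6587' regardless of the terminal's encoding.
text = u'中文'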
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
to gain a more complete understanding on how Unicode works, and how Python handles Unicode.
Assuming Python 2.7, going by the title.
The answer is no - because when you issue string.decode(XXX), you'll get a unicode object whose value depends on the codec you pass as an argument.
When you use u'string' in the shell, the codec is inferred from the shell's current encoding; in a file, the default is ASCII, or whatever # coding: utf-8 special comment you insert at the beginning of the script.
Just to clarify: if codec XXX is guaranteed to always be the same codec used for the script's input (either the shell or the file), then both approaches behave pretty much the same, as in the sketch below.
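A hypothetical sketch of that equivalence, assuming the source file is actually saved in GBK:

# coding: gbk
# Both spellings decode the same GBK source bytes, so they compare equal.
assert '中文'.decode('gbk') == u'中文'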
Hope this helps!
I'm trying to read some files using Python 3.2; some of the files may contain Unicode characters while others do not.
When I try:
file = open(item_path + item, encoding="utf-8")
for line in file:
    print(repr(line))
I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)
I am following the documentation here: http://docs.python.org/release/3.0.1/howto/unicode.html
Why would Python be trying to encode to ascii at any point in this code?
The problem is that repr(line) in Python 3 also returns a Unicode string. It does not convert the characters above 128 to ASCII escape sequences.
Use ascii(line) instead if you want to see the escape sequences.
Actually, repr(line) is expected to return a string that, if placed in source code, would produce an object with the same value. This way, the Python 3 behaviour is just fine, as there is no need for ASCII escape sequences in source files to express a string with more than ASCII characters. It is quite natural to use UTF-8 or some other Unicode encoding these days. The truth is that Python 2 produced the escape sequences for such characters.
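For example, in a Python 3 session:

>>> s = 'Grüße'
>>> repr(s)   # keeps the non-ASCII characters as-is
"'Grüße'"
>>> ascii(s)  # escapes everything above ASCII
"'Gr\\xfc\\xdfe'"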
What's your output encoding? If you remove the call to print(), does it start working?
I suspect you've got a non-UTF-8 locale, so Python is trying to encode repr(line) as ASCII as part of printing it.
To resolve the issue, you must either encode the string and print the byte array, or set your default encoding to something that can handle your strings (UTF-8 being the obvious choice).
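A sketch of the first option, reusing the names from the question (Python 3; alternatively, setting the PYTHONIOENCODING=utf-8 environment variable changes what print() encodes to):

import sys

with open(item_path + item, encoding="utf-8") as f:
    for line in f:
        # Bypass print()'s locale-dependent encoding and write UTF-8
        # bytes directly to the underlying binary stream.
        sys.stdout.buffer.write((repr(line) + "\n").encode("utf-8"))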
I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).
I can use "print" to display them, but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How can I parse this?
If I type 'python unicode' into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what's going on.
You really do need to read and understand these sorts of overviews, however long they seem. There really isn't any getting around it. Text is hard. There is no such thing as "plain text", there hasn't been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.
You also should read http://www.joelonsoftware.com/articles/Unicode.html .
This error occurs when you pass a Unicode string containing non-English characters (Unicode code points beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python 2 bytestring is ASCII, which handles exactly 128 (English) characters. This is why trying to convert Unicode characters beyond 128 produces the error.
The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.
The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors.
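To illustrate (a Python 2 session):

>>> unicode('La Pe\xf1a', 'latin-1')   # explicit codec: works
u'La Pe\xf1a'
>>> unicode('La Pe\xf1a')              # no codec: ASCII is assumed
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 5: ordinal not in range(128)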
For example, to write the string out in a specific encoding:
s = u'La Pe\xf1a'
print s.encode('latin-1')
or
write(s.encode('latin-1'))
will encode the string using Latin-1.
The answer to your question is "use codecs". The appended code also shows some gettext magic, FWIW: http://wiki.wxpython.org/Internationalization
import codecs
import gettext
import wx

localedir = './locale'
langid = wx.LANGUAGE_DEFAULT  # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)

translater = gettext.translation(domain, localedir,
                                 [mylocale.GetCanonicalName()], fallback=True)
translater.install(unicode=True)
# translater.install() installs the gettext _() translater function into our namespace...

msg = _("A message that gettext will translate, probably putting Unicode in here")

# use codecs.open() to get a file object that encodes Unicode strings to UTF-8 on write
Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
Logfile.write(msg + '\n')
Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).
So ... HTH...
GaJ
I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
# message = unicode(unicodedata.normalize(
#     'NFKD', contact.message).encode('ascii', 'ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
    dmessage,
])
Does anyone have any advice on removing Unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like ’ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage = contact.message.encode('cp1252', 'ignore')
Alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage = contact.message.encode('ascii', 'ignore')
Encoding is a pain, but if you're working in Django, have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019: u"'",
    # etc
}

smashed = input_string.translate(smashcii)
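Usage, assuming the smashcii table above (u'\u2019' is the right single quotation mark):

>>> input_string = u'It\u2019s fine'
>>> input_string.translate(smashcii)
u"It's fine"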