Python NLTK snowball stemmer UnicodeDecodeError in terminal but not Eclipse PyDev - python

I'm using the Snowball stemmer to stem words in documents, as shown in the code snippet below.
stemmer = EnglishStemmer()
# Stem, lowercase, substitute all punctuations, remove stopwords.
attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
When I run this on documents using PyDev in Eclipse, I receive no errors. When I run it in the Terminal (Mac OSX) I receive the error below. Can someone please help?
File "data_processing.py", line 171, in __filter__
attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

This works in PyDev because it configures Python itself to work in the encoding of the console (which is usually UTF-8).
You can reproduce the same error in PyDev if you go to the run configuration (run > run configurations) then on the 'common' tab say that you want the encoding to be ascii.
This happens because your word is a byte string (str) and you're replacing with unicode chars, so Python 2 first tries to decode the byte string using the default encoding (ascii).
I hope the code below sheds some light for you:
This is all considering ascii as the default encoding:
>>> 'íã'.replace(u"\u2019", u"\x27")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 0: ordinal not in range(128)
But if you do it all in unicode, it works (you may need to encode it back afterwards to the encoding you expect if you expect to deal with strings and not unicode).
>>> u'íã'.replace(u"\u2019", u"\x27")
u'\xed\xe3'
So, you can make your string unicode before the replace
>>> 'íã'.decode('cp850').replace(u"\u2019", u"\x27")
u'\xed\xe3'
Or you can encode the replace chars
>>> 'íã'.replace(u"\u2019".encode('utf-8'), u"\x27".encode('utf-8'))
'\xa1\xc6'
Note, however, that you must know the actual encoding you're working with at each point (so, although I'm using cp850 and utf-8 in the examples, the encodings you have to use may be different).
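The "decode first, then work in unicode" approach described above can be sketched as follows. This is only an illustration: the helper name to_text and the choice of UTF-8 as the input encoding are assumptions, not part of NLTK, and the sample bytes stand in for a token read from a document. It is written to run on both Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) string, decoding bytes if needed.

    `to_text` is a hypothetical helper; `encoding` must match how the
    input was actually stored.
    """
    if isinstance(value, bytes):  # on Python 2, `bytes` is an alias of `str`
        return value.decode(encoding)
    return value


# UTF-8 bytes for "it?s" where "?" is U+2019 (encoded as e2 80 99).
raw = b"it\xe2\x80\x99s"

# Decode once at the boundary, then do all string work in unicode.
word = to_text(raw).replace(u"\u2019", u"\x27")
```

With this pattern the stemmer only ever sees unicode input, so the implicit ascii decode that raised the error never happens.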

As Fabio stated, this happens because Pydev changes Python's default encoding. Once you know that, there are three possible solutions:
Test your code outside Pydev
Pydev will hide encoding issues from you, until you run your code outside of Eclipse. So instead of using Eclipse's "run" button, test your code from a shell.
I wouldn't recommend this, though : it means your development environment will be different from your running environment, which can only lead to mistakes being made.
Change Python's default encoding
You could change Python's environment to fit Pydev's. It is discussed in this question ( How to set the default encoding to UTF-8 in Python? ).
This answer will tell you how to do it, and this one will tell you why you shouldn't.
Long story short, don't.
Stop Pydev from changing Python's default encoding
If you're using Python 2, Python's default encoding should be ascii. So instead of making your environment fit Pydev's through a hack, you'd be better off forcing Pydev to "behave". How to do that is discussed here.
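Whichever option you pick, it helps to see what each environment is actually using. Here is a small sketch that can be run once from Pydev and once from a shell to compare the two; the function name encoding_report is made up for this example.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is using in the current environment."""
    return {
        "default": sys.getdefaultencoding(),
        # sys.stdout may be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%-8s %s" % (name, value))
```

If the "default" value differs between the two runs, the environments are not equivalent and implicit str/unicode conversions will behave differently.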


Python opening files with utf-8 file names

In my code I used something like file = open(path +'/'+filename, 'wb') to write the file
but in my attempt to support non-ascii filenames, I encode it as such
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path:
data = open(file.as_posix(), 'rb')
...
I keep getting this error 'ascii' codec can't encode characters in position..
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on ubuntu linux
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to utf, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters in surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called ´ñ´ with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
... open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
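The difference between \x escapes in a text literal and in a byte literal can be checked directly. This is a Python 3 sketch; the variable names are only for illustration.

```python
text_literal = "\xd8\xb9"    # two code points: U+00D8 and U+00B9
byte_literal = b"\xd8\xb9"   # two bytes: the UTF-8 encoding of U+0639

assert len(text_literal) == 2
assert text_literal == "\u00d8\u00b9"
assert byte_literal.decode("utf-8") == "\u0639"
assert "\u0639".encode("utf-8") == byte_literal

# If a string was built from such escapes by mistake, encoding it as
# Latin-1 recovers the intended bytes, which can then be decoded as UTF-8.
repaired = text_literal.encode("latin-1").decode("utf-8")
```

The Latin-1 round trip works because Latin-1 maps code points 0 to 255 onto the identical byte values.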
So after being in a rabbit hole for the past few days, I figured the issue isn't with python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the Linux locale to C.UTF-8. I then discovered that, since I was running WSGI on Apache2, I had to add the locale to my WSGI configuration, i.e. WSGIDaemonProcess my_app locale='C.UTF-8' in the Apache configuration file, thanks to this post.
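A slightly fuller version of that check, as a sketch: it reports the filesystem encoding together with its error handler, and tests whether a given name can be converted to bytes for the OS at all. The function names are hypothetical; the sys and os calls are standard library (Python 3.6+).

```python
import os
import sys


def describe_filesystem_encoding():
    """Return the encoding and error handler Python uses for file names."""
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


def can_encode_filename(name):
    """Report whether `name` can be converted to bytes for the OS."""
    try:
        os.fsencode(name)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(describe_filesystem_encoding())
    print(can_encode_filename("\u0639\u0631\u0628\u064a.png"))
```

Running this inside the web process (rather than in an interactive shell) shows the values the application really gets, which is where the two differed here.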

Using unicode character u201c

I'm new to Python and am having problems understanding Unicode. I'm using Python 3.4.
I've spent an entire day trying to figure this out by reading about unicode including http://www.fileformat.info/info/unicode/char/201C/index.htm and
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
I need to refer to special quotes because they are used in the text I'm analyzing. I did test that the W7 command window can read and write the 2 special quote characters.
To make things simple, I wrote a one line script:
print ('“') # that's the special quote mark in between normal single quotes
and get this output:
Traceback (most recent call last):
File "C:\Users\David\Documents\Python34\Scripts\wordCount3.py", line 1, in <module>
print ('\u201c')
File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 0: character maps to <undefined>
So how do I write something to refer to these two characters u201C and u201D?
Is this the correct encoding choice in the file open statement?
with open(fileIn, mode='r', encoding='utf-8', errors='replace') as f:
The reason is that in Python 3.x you can't just mix unicode strings with byte strings. You've probably read manuals dealing with Python 2.x, where such things are possible as long as the byte string contains convertible characters.
print('\u201c', '\u201d')
works fine for me, so the only reason is that you're using the wrong encoding for the source file or the terminal.
You may also explicitly tell Python which encoding your source file uses by putting the following line at the top of your source:
# -*- coding: utf-8 -*-
Added: it seems that you're working on a Windows machine; if so, you could change your console codepage to UTF-8 by running
chcp 65001
before you fire up your Python interpreter. That change is temporary; if you want it to be permanent, run the following .reg file:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Console]
"CodePage"=dword:fde9
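To see which console code pages can represent the two quote characters at all, you can try encoding them yourself. A minimal Python 3 sketch; can_encode is a helper name invented for this example.

```python
def can_encode(text, encoding):
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


quotes = "\u201c\u201d"  # left and right double quotation marks

results = {enc: can_encode(quotes, enc) for enc in ("cp437", "cp1252", "utf-8")}
```

Here results shows that cp437 (the code page in the traceback) cannot represent the quotes, while cp1252 and UTF-8 can.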

UnicodeEncodeError in Python on Windows Console

I'm having the following error while recursing the files in a directory and printing file names in the console:
Traceback (most recent call last):
File "C:\Program Files\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
53: character maps to <undefined>
According to the error, one of the characters in the file name string is \u2013 which is an EN DASH – character different from the commonly seen - minus character.
I have checked my Windows encoding, which is set to 437. Now, I see that I have two options to work around this: either changing the encoding of Windows console or convert the characters I get from the file names to suit the console encoding. How would I go about doing that in Python 3.3?
Windows console is using cp437 encoding and there is a character \u2013 that isn't supported by that encoding. Try adding this to your code:
import io, sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, 'cp437', 'backslashreplace')
or convert the characters I get from the file names to suit the console encoding
Probably the console encoding is already correct (can't tell from the error message though). Code page 437 simply doesn't include that character so you won't be able to print it.
You can reopen stdout with a text encoder that has a fallback encoding, as demonstrated in iamsudip's answer which uses backslashreplace, to at least get readable (if not reliably recoverable) output instead of an error.
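To see what backslashreplace produces without touching the real console, the same wrapper can be pointed at an in-memory buffer. A Python 3 sketch; the buffer stands in for the console and the file name is made up.

```python
import io

# Stand-in for the console: an in-memory byte buffer (assumption for the demo).
fake_console = io.BytesIO()
writer = io.TextIOWrapper(fake_console, encoding="cp437", errors="backslashreplace")

# U+2013 (EN DASH) is not in cp437, so it is written as an escape sequence.
writer.write("report \u2013 final.txt")
writer.flush()

output = fake_console.getvalue()
```

The unencodable character comes out as the six ASCII characters \u2013, so the name stays readable and no exception is raised.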
changing the encoding of Windows console
You can do this by executing the console command chcp 1252 before running Python, but that will still only give you a different limited repertoire of printable characters - including U+2013, but not many other Unicode characters.
In theory you can chcp to 65001 to get UTF-8 which would allow you to print any character. Unfortunately there are serious bugs in the C runtime's standard IO implementation, which usually make this unusable in practice.
This sorry state of affairs affects all applications that use the MS C runtime's stdio library calls, including Python and most other languages, with the result that Unicode on the Windows console just doesn't work in most cases.
If you really have to get Unicode out to the Windows console you can use the Win32 WriteConsoleW API directly using ctypes, but it's not much fun.

Can't print character '\u2019' in Python from JSON object

As a project to help me learn Python, I'm making a CMD viewer of Reddit using the json data (for example www.reddit.com/all/.json). When certain posts show up and I attempt to print them (that's what I assume is causing the error), I get this error:
Traceback (most recent call last):
File "C:\Users\nsaba\Desktop\reddit_viewer.py", line 33, in <module>
print ( "%d. (%d) %s\n" % (i+1, obj['data']['score'], obj['data']['title']))
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position
32: character maps to <undefined>
Here is where I handle the data:
request = urllib.request.urlopen(url)
content = request.read().decode('utf-8')
jstuff = json.loads(content)
The line I use to print the data as listed in the error above:
print ( "%d. (%d) %s\n" % (i+1, obj['data']['score'], obj['data']['title']))
Can anyone suggest where I might be going wrong?
It's almost certain that your problem has nothing to do with the code you've shown, and can be reproduced in one line:
print(u'\2019')
If your terminal's character set can't handle U+2019 (or if Python is confused about what character set your terminal uses), there's no way to print it out. It doesn't matter whether it comes from JSON or anywhere else.
The Windows terminal (aka "DOS prompt" or "cmd window") is usually configured for a character set like cp1252 that only knows about 256 of the 110000 characters, and there's nothing Python can do about this without a major change to the language implementation.*
See PrintFails on the Python Wiki for details, workarounds, and links to more information. There are also a few hundred dups of this problem on SO (although many of them will be specific to Python 2.x, without mentioning it).
* Windows has a whole separate set of APIs for printing UTF-16 to the terminal, so Python could detect that stdout is a Windows terminal, and if so encode to UTF-16 and use the special APIs instead of encoding to the terminal's charset and using the standard ones. But this raises a bunch of different problems (e.g., different ways of printing to stdout getting out of sync). There's been discussion about making these changes, but even if everyone were to agree and the patch were written tomorrow, it still wouldn't help you until you upgrade to whatever future version of Python it's added to…
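One workaround in the meantime is to replace whatever the terminal cannot show before printing. A minimal Python 3 sketch: printable is a hypothetical helper, and the JSON snippet is a stand-in for the Reddit response.

```python
import json
import sys


def printable(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# Stand-in for a title taken from the JSON feed.
title = json.loads('{"title": "Intel\\u2019s Sharp-Eyed Social Scientist"}')["title"]

print(printable(title))
```

This loses the unsupported characters, so it is only suitable for display, not for data you intend to store or compare.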
@N-Saba, what is the string that causes the error to be thrown?
In my test case, this looks to be a version-specific bug in python 2.7.3.
In the feed I was parsing, the "title" field had the following value:
u'title': u'Intel\u2019s Sharp-Eyed Social Scientist'
I get the expected right single quote char when I call either of these, in python 2.7.6.
python -c "print {u'title': u'Intel\u2019s Sharp-Eyed Social Scientist'}['title']"
Intel’s Sharp-Eyed Social Scientist
In 2.7.3, I get the error, unless I encode the value that I pulled by KeyName.
print {u'title': u'Intel\u2019s Sharp-Eyed Social Scientist'}['title']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5: ordinal not in range(128)
print {u'title': u'Intel\u2019s Sharp-Eyed Social Scientist'}['title'].encode('utf-8', 'replace')
Intel’s Sharp-Eyed Social Scientist
fwiw, @abarnert's command print(u'\2019') does not print the quote: '\201' is parsed as an octal escape, so the output is that character followed by "9". I think the intended code was print(u'\u2019').
I came across a similar error when attempting to write API JSON output to a .csv file via pd.DataFrame.to_csv() on a Windows install of Python 2.7.14.
Specifying the encoding as utf-8 fixed my process:
df.to_csv(filename, encoding='utf-8')  # df is a pandas DataFrame instance
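The same idea can be shown without pandas. The following Python 3 sketch uses the standard csv module instead of DataFrame.to_csv, and the file name and rows are made up for the demonstration.

```python
import csv
import os
import tempfile

rows = [["title"], ["Intel\u2019s Sharp-Eyed Social Scientist"]]

# Hypothetical output location, created in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "titles.csv")

# Passing encoding explicitly avoids depending on the platform default.
with open(path, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows(rows)

with open(path, "r", encoding="utf-8", newline="") as handle:
    round_tripped = list(csv.reader(handle))
```

Reading the file back with the same encoding returns the original rows, including the U+2019 apostrophe.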
For anyone encountering this on macOS, @abarnert's answer is correct and I was able to fix it by putting this at the top of the offending source file:
# magic to make everything work in Unicode
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
To clarify, this is making sure the terminal output accepts Unicode correctly.
I set IDLE (Python Shell) and Windows' CMD default font to Lucida Console (a font that supports Unicode) and these types of errors went away; you also no longer see boxes [][][][][][][][]
:)

How can I display native accents to languages in console in windows?

print "Español\nPortuguês\nItaliano".encode('utf-8')
Errors:
Traceback (most recent call last):
File "", line 1, in
print "Español\nPortuguês\nItaliano".encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 4: ordinal not in range(128)
I'm trying to make a multilingual console program in Windows. Is this possible?
I've saved the file in utf-8 encoding as well, I get the same error.
*EDIT
I'm just outputting text in this program. I changed to Lucida fonts, but I keep getting this:
[screenshot of the console output: http://img826.imageshack.us/img826/7312/foreignlangwindowsconso.png]
I'm just looking for a portable way to correctly display foreign languages in the console in Windows. If it can be done cross-platform, even better. I thought utf-8 was the answer, but all of you are telling me that fonts, etc. also play a part. So does anyone have a definitive answer?
Short answer:
# -*- coding: utf-8 -*-
print u"Español\nPortuguês\nItaliano".encode('utf-8')
The first line tells Python that your file is encoded in UTF-8 (your editor must use the same settings) and this line should always be on the beginning of your file.
Another thing is that Python 2 knows two different basestring objects - str and unicode. The u prefix will create such a unicode object instead of the default str object, which you can then encode as UTF-8 (but printing unicode objects directly should also work).
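The text-versus-bytes distinction, and the exact error from the question, can be reproduced with a short sketch that runs on both Python 2.7 and Python 3. The variable names are for illustration only.

```python
text = u"Espa\u00f1ol"              # text (unicode) string

as_utf8 = text.encode("utf-8")      # the n-tilde becomes two bytes: c3 b1
as_latin1 = text.encode("latin-1")  # the n-tilde becomes one byte: f1

# Byte 0xf1 at index 4 matches the position reported in the traceback,
# which suggests the bytes there were Latin-1/cp1252 rather than UTF-8.
byte_at_4 = bytearray(as_latin1)[4]

try:
    as_latin1.decode("ascii")       # what Python 2 attempts implicitly
except UnicodeDecodeError as exc:
    message = str(exc)
```

Calling .encode() on a Python 2 byte string triggers this implicit ascii decode first, which is why the original line fails with a decode error even though it asks for an encode.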
First of all, in Python 2.x you can't encode a str that has non-ASCII characters. You have to write
print u"Español\nPortuguês\nItaliano".encode('utf-8')
Using UTF-8 at the Windows console is difficult. You have to:
Set the Command Prompt font to a Unicode font (of which the only one available by default is Lucida Console), or else you get IBM437 encoding anyway.
Run chcp 65001.
Modify encodings._aliases to treat "cp65001" as an alias of UTF-8.
And even then, it doesn't seem to work right.
This works for me:
# coding=utf-8
print "Español\nPortuguês\nItaliano"
You might want to try running it using chcp 65001 && your_program.py. Also, try changing the command prompt font to Lucida Console.
