UnicodeDecodeError in textblob tutorial - python

I'm trying to run through the TextBlob tutorial in Windows (using Git Bash shell) with Python 3.3.
I've installed textblob and nltk, as well as their dependencies.
The Python code is:
from text.blob import TextBlob
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
tags = wiki.tags
I'm getting the following error:
Traceback (most recent call last):
File "textblob.py", line 4, in <module>
tags = wiki.tags
File "c:\Python33\lib\site-packages\text\decorators.py", line 18, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "c:\Python33\lib\site-packages\text\blob.py", line 357, in pos_tags
for word, t in self.pos_tagger.tag(self.raw)
File "c:\Python33\lib\site-packages\text\taggers.py", line 40, in tag
return pattern_tag(sentence, tokenize)
File "c:\Python33\lib\site-packages\text\en.py", line 115, in tag
for sentence in parse(s, tokenize, True, False, False, False, encoding).split():
File "c:\Python33\lib\site-packages\text\en.py", line 99, in parse
return parser.parse(unicode(s), *args, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1213, in parse
s[i] = self.find_tags(s[i], **kwargs)
File "c:\Python33\lib\site-packages\text\en.py", line 49, in find_tags
return _Parser.find_tags(self, tokens, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1161, in find_tags
map = kwargs.get( "map", None))
File "c:\Python33\lib\site-packages\text\text.py", line 967, in find_tags
tagged.append([token, lexicon.get(token, i==0 and lexicon.get(token.lower()) or None)])
File "c:\Python33\lib\site-packages\text\text.py", line 98, in get
return self._lazy("get", *args)
File "c:\Python33\lib\site-packages\text\text.py", line 79, in _lazy
self.load()
File "c:\Python33\lib\site-packages\text\text.py", line 367, in load
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 367, in <genexpr>
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 346, in _read
for line in f:
File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>
Any idea what is wrong here? Adding a 'u' before the string didn't help.

Release 0.7.1 fixes this issue, which means it's time for a
$ pip install -U textblob
The problem was that the en-lexicon.txt file used for part-of-speech tagging was opened with Windows' default platform encoding, cp1252. The file apparently contained characters that Python could not decode from that encoding. This was fixed by explicitly opening the file in utf-8 mode.
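For illustration, here is a minimal sketch of that fix, mirroring the load() code in the traceback above (the filename and parsing come from the traceback; this is not TextBlob's actual patch):
import codecs

# Open the lexicon with an explicit utf-8 encoding instead of the
# platform default (cp1252 on Windows), then parse it the way load() does.
with codecs.open('en-lexicon.txt', 'r', encoding='utf-8') as f:
    entries = [line.split(' ')[:2] for line in f if line.strip()]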

Related

UnicodeDecodeError when I use CUDA to train a dataset

I used chainer to train on some images, but there is an error.
I don't know whether it's a UnicodeDecodeError or an error in the installation of cupy.
P:\dcgans\chainer-DCGAN\chainer-DCGAN>python DCGAN.py
Traceback (most recent call last):
File "DCGAN.py", line 279, in <module>
train_dcgan_labeled(gen, dis)
File "DCGAN.py", line 171, in train_dcgan_labeled
zvis = (xp.random.uniform(-1, 1, (100, nz), dtype=np.float32))
File "P:\Python35\lib\site-packages\cupy\random\distributions.py", line 132, in uniform
return rs.uniform(low, high, size=size, dtype=dtype)
File "P:\Python35\lib\site-packages\cupy\random\generator.py", line 235, in uniform
rand = self.random_sample(size=size, dtype=dtype)
File "P:\Python35\lib\site-packages\cupy\random\generator.py", line 153, in random_sample
RandomState._1m_kernel(out)
File "cupy/core/elementwise.pxi", line 552, in cupy.core.core.ElementwiseKernel.__call__ (cupy\core\core.cpp:43810)
File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret (cupy\util.cpp:1480)
File "cupy/core/elementwise.pxi", line 409, in cupy.core.core._get_elementwise_kernel (cupy\core\core.cpp:42156)
File "cupy/core/elementwise.pxi", line 12, in cupy.core.core._get_simple_elementwise_kernel (cupy\core\core.cpp:34787)
File "cupy/core/elementwise.pxi", line 32, in cupy.core.core._get_simple_elementwise_kernel (cupy\core\core.cpp:34609)
File "cupy/core/carray.pxi", line 87, in cupy.core.core.compile_with_cache (cupy\core\core.cpp:34264)
File "P:\Python35\lib\site-packages\cupy\cuda\compiler.py", line 133, in compile_with_cache
base = _empty_file_preprocess_cache[env] = preprocess('', options)
File "P:\Python35\lib\site-packages\cupy\cuda\compiler.py", line 99, in preprocess
pp_src = pp_src.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 27-28: invalid continuation byte
It seems nvcc generated non-UTF-8 output and CuPy failed to decode it.
This is a bug in CuPy (I posted an issue: #378).
A possible solution for the time being is to replace 'utf-8' in cupy/cuda/compiler.py, at the line pp_src = pp_src.decode('utf-8'), with something that matches your environment. For example, in a Japanese environment 'cp932' should work, and 'cp936' should perhaps work for simplified Chinese.
You could also try locale.getdefaultlocale()[1] as a universal solution (be sure to import locale).
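For the time being, a minimal sketch of that decode step (pp_src here is a placeholder for the real nvcc preprocessor output):
import locale

# Decode with the platform's preferred encoding instead of a hard-coded
# 'utf-8', falling back to utf-8 if the locale reports no encoding.
pp_src = b'example nvcc output'  # placeholder bytes
encoding = locale.getdefaultlocale()[1] or 'utf-8'
text = pp_src.decode(encoding)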
Update: The fix has been merged. It should be fixed in upcoming CuPy v1.0.3.

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 6898: invalid start byte - Reading file using Argument Parser in Python

I am implementing the code from this link: Glove implementation
I am reading a file from the specified path using ArgumentParser in Python.
parser.add_argument('corpus', metavar='corpus_path',
                    type=partial(codecs.open, encoding='utf-8'))
I am using this command in the command prompt to pass the arguments:
python Glove_python_bbc.py "C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt" --vocab-path C:/Users/JAYASHREE/Documents/NLP/vocabulary --cooccur-path C:/Users/JAYASHREE/Documents/NLP/cooccur_matrix -w 10 --min-count 10 --vector-path C:/Users/JAYASHREE/Documents/NLP/word-vector -s 40 --iterations 10 --learning-rate 0.1 --save-often
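For context, a self-contained sketch of this pattern (only the corpus argument is modeled; the parser description and loop body are illustrative, not from the original script):
import codecs
from argparse import ArgumentParser
from functools import partial

parser = ArgumentParser(description='Build GloVe vectors from a corpus')
parser.add_argument('corpus', metavar='corpus_path',
                    type=partial(codecs.open, encoding='utf-8'))
args = parser.parse_args()

# Iterating the codecs-wrapped file is what triggers the decode; byte
# 0xa3 is '£' in cp1252/latin-1, which suggests the corpus is not
# actually utf-8 encoded.
for line in args.corpus:
    pass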
But I am getting the following error:
2017-08-06 23:03:46,171 Fetching vocab..
2017-08-06 23:03:46,171 Building vocab from corpus
Traceback (most recent call last):
File "Glove_python_bbc.py", line 383, in <module>
main(parse_args())
File "Glove_python_bbc.py", line 352, in main
vocab = get_or_build(arguments.vocab_path, build_vocab, corpus)
File "Glove_python_bbc.py", line 93, in get_or_build
obj = build_fn(*args, **kwargs)
File "Glove_python_bbc.py", line 112, in build_vocab
for line in corpus:
File "C:\Users\JAYASHREE\Anaconda2\lib\codecs.py", line 699, in next
return self.reader.next()
File "C:\Users\JAYASHREE\Anaconda2\lib\codecs.py", line 630, in next
line = self.readline()
File "C:\Users\JAYASHREE\Anaconda2\lib\codecs.py", line 545, in readline
data = self.read(readsize, firstline=True)
File "C:\Users\JAYASHREE\Anaconda2\lib\codecs.py", line 492, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 6898: invalid start byte
[Screenshot of the file being read was attached to the original question.]

spyder - python IDE utf8 encoding loading issue

I am unable to get Spyder to load. I am running Windows 7, 64-bit.
I have installed Anaconda 2.3.0 (64-bit) and have attempted to run the bundled Spyder.
I have also tried the latest standalone Spyder build and encountered the exact same issue.
When running Spyder via cmd I receive the following error message:
"
Traceback (most recent call last):
File "C:\Anaconda\scripts\spyder", line 2, in <module>
from spyderlib import start_app
File "C:\Anaconda\lib\site-packages\spyderlib\start_app.py", line 13, in <modu
le>
from spyderlib.config import CONF
File "C:\Anaconda\lib\site-packages\spyderlib\config.py", line 718, in <module
>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "C:\Anaconda\lib\site-packages\spyderlib\userconfig.py", line 215, in __i
nit__
self.load_from_ini()
File "C:\Anaconda\lib\site-packages\spyderlib\userconfig.py", line 260, in loa
d_from_ini
self.readfp(configfile)
File "C:\Anaconda\lib\ConfigParser.py", line 324, in readfp
self._read(fp, filename)
File "C:\Anaconda\lib\ConfigParser.py", line 479, in _read
line = fp.readline()
File "C:\Anaconda\lib\codecs.py", line 678, in readline
return self.reader.readline(size)
File "C:\Anaconda\lib\codecs.py", line 533, in readline
data = self.read(readsize, firstline=True)
File "C:\Anaconda\lib\codecs.py", line 480, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 59: invalid
continuation byte
"
Is there some way to fix this?
Thanks
Never mind, figured it out:
Uninstall/reinstall Anaconda.
Delete the Libraries\Documents\.spyder2 folder.
Run Spyder. Wait a long time. Spyder loads.
yay.

Error when compiling with PyInstaller under Windows XP (non-English)

I have the app.spec file from Run Python binaries under Windows XP and am trying to compile it with PyInstaller under Windows XP (Russian localization).
I get the following error:
1763 INFO: Processing hook hook-email.message
Traceback (most recent call last):
File "C:\Anaconda\Scripts\pyinstaller-script.py", line 9, in <module>
load_entry_point('PyInstaller==2.1', 'console_scripts', 'pyinstaller')()
File "C:\Anaconda\lib\site-packages\PyInstaller\main.py", line 88, in run
run_build(opts, spec_file, pyi_config)
File "C:\Anaconda\lib\site-packages\PyInstaller\main.py", line 46, in run_build
PyInstaller.build.main(pyi_config, spec_file, **opts.__dict__)
File "C:\Anaconda\lib\site-packages\PyInstaller\build.py", line 1911, in main
config = configure.get_config(kw.get('upx_dir'))
File "C:\Anaconda\lib\site-packages\PyInstaller\configure.py", line 146, in get_config
find_PYZ_dependencies(config)
File "C:\Anaconda\lib\site-packages\PyInstaller\configure.py", line 116, in find_PYZ_dependencies
a.analyze_r('pyi_importers')
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\imptracker.py", line 166, in analyze_r
newnms = self.analyze_one(name, nm, imptyp, level)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\imptracker.py", line 227, in analyze_one
mod = self.doimport(nm, ctx, fqname)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\imptracker.py", line 299, in doimport
mod = parent.doimport(nm)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\modules.py", line 130, in doimport
mod = self.subimporter.getmod(nm)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\impdirector.py", line 139, in getmod
mod = owner.getmod(nm)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\owner.py", line 127, in getmod
mod = self._modclass()(nm, pth, co)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\modules.py", line 78, in __init__
self.scancode()
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\modules.py", line 99, in scancode
self.binaries = _resolveCtypesImports(self.binaries)
File "C:\Anaconda\lib\site-packages\PyInstaller\depend\utils.py", line 328, in _resolveCtypesImports
cpath = find_library(os.path.splitext(cbin)[0])
File "C:\Anaconda\lib\ctypes\util.py", line 54, in find_library
fname = os.path.join(directory, name)
File "C:\Anaconda\lib\ntpath.py", line 84, in join
result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 34: ordinal not in range(128)
What should I patch? This error is also reproducible under Windows 7 with Russian localization.
P.S. It's not the same issue, but maybe there is some advice in Run Python binaries under Windows XP.
The issue was resolved by avoiding print statements in the application and changing print calls to logging methods.
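For illustration, a minimal sketch of that change on Python 2 (the log file name and format are illustrative choices, not from the original application):
import logging

# Replace print statements with logging calls, as described above.
logging.basicConfig(filename='app.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger(__name__)

log.info(u'started')  # instead of: print u'started'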

same python source code on two different machines yields different behavior

Two machines, both running Ubuntu 14.04.1. The same source code is run on the same data. One works fine; the other throws a codec decode 0xe2 error. Why is this? (More importantly, how do I fix it?)
Offending code appears to be:
def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'
    return Text(tokenized)
OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The one that works was happy with the following:
>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)
The UnicodeDecodeError on the machine with issues gives the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
Breaking the input file down line by line (via split('\n')) and running each one through sent_tokenize leads us to the offending line:
If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.
Which is actually:
>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
Update: both machines show UnicodeDecodeError for:
unicode(bar[5])
But only one machine shows an error for:
sent_tokenize(bar[5])
Different NLTK versions!
The version that doesn't barf is using NLTK 2.0.4; the version throwing an exception is 3.0.0.
NLTK 2.0.4 was perfectly happy with
sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)')
NLTK 3.0.0 needs unicode (as pointed out by @tdelaney in the comments above). So to get results, you need:
sent_tokenize(u'(\u201cCisco\u201d)')
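In practice, a minimal sketch of the fix on the NLTK 3.0.0 machine, assuming the input file is UTF-8 encoded (\xe2\x80\x9c is the UTF-8 byte sequence for the curly quote \u201c):
import io
from nltk.tokenize import sent_tokenize

# Decode the file to unicode on read instead of handing raw bytes to
# sent_tokenize; the path is the one used in the question.
with io.open('in/train/legal/legal1a_lm_7.txt', encoding='utf-8') as fh:
    foo = fh.read()  # unicode on Python 2
sentences = sent_tokenize(foo)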
