Two machines, both running Ubuntu 14.04.1. The same source code is run on the same data. One works fine; the other throws a UnicodeDecodeError on byte 0xe2. Why is this? (More importantly, how do I fix it?)
Offending code appears to be:
def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'
    return Text(tokenized)
OK... I went into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The one that works was happy with the following:
>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)
The UnicodeDecodeError on the machine with issues gives the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
Breaking the input file down line by line (via split('\n')), and running each one through sent_tokenize leads us to the offending line:
If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco.
Which is actually:
>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
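(The line-by-line check was roughly the following sketch; bar here is assumed to be foo.split('\n') from the interactive session above:)
bar = foo.split('\n')
for i, line in enumerate(bar):
    try:
        sent_tokenize(line)
    except UnicodeDecodeError:
        print i, repr(line)   # Python 2 print statement; reports the offending line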
Update: both machines show UnicodeDecodeError for:
unicode(bar[5])
But only one machine shows an error for:
sent_tokenize(bar[5])
Different NLTK versions!
The version that doesn't barf is using NLTK 2.0.4; the version throwing an exception is 3.0.0.
NLTK 2.0.4 was perfectly happy with
sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)')
NLTK 3.0.0 needs unicode input (as pointed out by @tdelaney in the comments above). So to get results, you need:
sent_tokenize(u'(\u201cCisco\u201d)')
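In practice that means decoding the file's bytes to unicode before handing them to NLTK 3.x. A minimal sketch for Python 2.7, reusing the file from the example above:
import io
from nltk.tokenize import sent_tokenize

# io.open with an encoding returns unicode strings on Python 2,
# so sent_tokenize never falls back to the ascii codec
with io.open('in/train/legal/legal1a_lm_7.txt', encoding='utf-8') as fh:
    text = fh.read()

sentences = sent_tokenize(text)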
Related
I am using Python 3.7 (64-bit) and NLTK version 3.4.5.
When I try to convert text6 from nltk.book to tokens using word_tokenize, I get an error.
import nltk
from nltk.tokenize import word_tokenize
from nltk.book import *
tokens=word_tokenize(text6)
The code is run in IDLE 3.7.
Below is the error when I execute the last statement.
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
tokens=word_tokenize(text6)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\__init__.py", line 144, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1277, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1331, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1331, in <listcomp>
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1321, in span_tokenize
for sl in slices:
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1362, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 318, in _pair_iter
prev = next(it)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1335, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
Please help. Thanks in advance.
While doing some troubleshooting I created a sample nltk.text.Text object and tried to tokenize it with nltk.word_tokenize, and I still get the same error.
But calling nltk.word_tokenize() on a plain string works:
>>> tt="Python is a programming language"
>>> tokens2=nltk.word_tokenize(tt) #Not throwing error
>>> type(tt)
<class 'str'>
>>> type(text6)
<class 'nltk.text.Text'>
>>>
Check the NLTK data folder and make sure it is where NLTK expects it to be located.
Try using:
nltk.download('punkt')
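A quick way to do both checks in one go (a minimal sketch; nltk.data.path is the list of directories NLTK searches for its data):
import nltk

nltk.download('punkt')    # fetch the Punkt tokenizer models if they are missing
print(nltk.data.path)     # where NLTK will look for them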
I am trying to do some rudimentary corpus analysis with Python. I am getting the following error message(s):
Traceback (most recent call last):
File "<pyshell#28>", line 2, in <module>
print(len(poems.words(f)), f)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
for tok in self.iterate_from(self._toknum[-1]):
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
tokens = self.read_block(self._stream)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 134, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1220, in readline
new_chars = self._read(readsize)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1458, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python38-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 12: invalid start byte
My assumption is that there is a UTF-8 encoding error in one of the 202 text files I am looking at.
Is there any way of telling, from the error messages, which file or files have the problem?
Assuming that you know the file ids (the paths of your corpus files), you can open all of them with encoding="utf-8".
If you don't know the paths, assuming that you are using the nltk corpus loader, you can get them by using:
poems.fileids()
After that, for every file in your list of files (for example fileids) you can try:
for file_ in fileids:
    try:
        with open(file_, encoding="utf-8") as f_i:
            f_i.readlines()
    except UnicodeDecodeError:
        # report any file that cannot be decoded as UTF-8
        print("You got problems with the file: ", file_)
Anyway, your loader also has a parameter named "encoding" that you can use to set the correct encoding for your corpus. By default it is set to "utf-8".
More details here: nltk corpus loader
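For instance, if the files turn out to be Windows-1252 rather than UTF-8 (byte 0x97 is an em dash in cp1252), a loader along these lines would read them; the root path and file pattern here are illustrative:
from nltk.corpus.reader import PlaintextCorpusReader

# encoding overrides the reader's utf-8 default
poems = PlaintextCorpusReader('path/to/poems', r'.*\.txt', encoding='cp1252')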
I have had great success parsing RSS feeds from the National Hurricane Center using the feedparser module:
import feedparser
feedparser.parse('https://www.nhc.noaa.gov/gis-at.xml') #Works Fine
feedparser.parse('https://www.nhc.noaa.gov/gis-ep.xml') #Works Fine
However, when I try to read the superficially similar feed from the Central Pacific Hurricane Center, I generate a KeyError:
feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml') #Doesn't work
Is this a bug with feedparser? Is the CPHC's feed malformed? Is there an option that I've forgotten to specify? It seems the trouble is that there isn't a key named 'where', but I don't know why this isn't a problem for the NHC feeds. The stack trace is reproduced below:
>>> import feedparser
>>> feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 3956, in parse
saxparser.parse(source)
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 111, in parse
xmlreader.IncrementalParser.parse(self, source)
File ".../anaconda3/lib/python3.6/xml/sax/xmlreader.py", line 125, in parse
self.feed(buffer)
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 217, in feed
self._parser.Parse(data, isFinal)
File "/tmp/build/80754af9/python_1516124163501/work/Modules/pyexpat.c", line 414, in StartElement
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 370, in start_element_ns
AttributesNSImpl(newattrs, qnames))
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 2031, in startElementNS
self.unknown_starttag(localname, list(attrsD.items()))
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 666, in unknown_starttag
return method(attrsD)
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 1500, in _start_gml_point
self._parse_srs_attrs(attrsD)
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 1496, in _parse_srs_attrs
context['where']['srsName'] = srsName
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 356, in __getitem__
return dict.__getitem__(self, key)
KeyError: 'where'
I know this is an old question, but I ran into this issue myself and fixing it became my first open-source contribution :)
Is this a bug with feedparser?
Yes, it was.
Is the CPHC's feed malformed?
Also yes, or at least it doesn't follow the GeoRSS GML model to the letter. If you check the GMLPoint description you will see the following structure:
<georss:where>
    <gml:Point>
        <gml:pos>45.256 -71.92</gml:pos>
    </gml:Point>
</georss:where>
but the feed data is structured this way:
<gml:Point>
    <gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
So that's why the KeyError: 'where' occurs: the enclosing where tag is absent.
This was fixed in feedparser's 6.0.9 hotfix (see https://github.com/kurtmckee/feedparser/pull/306).
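So upgrading feedparser to 6.0.9 or later makes the feed parse without the KeyError. A minimal check, using the URL from the question:
# pip install --upgrade "feedparser>=6.0.9"
import feedparser

feed = feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml')
print(feed.bozo, len(feed.entries))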
I'm trying to run through the TextBlob tutorial in Windows (using Git Bash shell) with Python 3.3.
I've installed textblob and nltk as well as any dependencies.
The Python code is:
from text.blob import TextBlob
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
tags = wiki.tags
I'm getting the following error:
Traceback (most recent call last):
File "textblob.py", line 4, in <module>
tags = wiki.tags
File "c:\Python33\lib\site-packages\text\decorators.py", line 18, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "c:\Python33\lib\site-packages\text\blob.py", line 357, in pos_tags
for word, t in self.pos_tagger.tag(self.raw)
File "c:\Python33\lib\site-packages\text\taggers.py", line 40, in tag
return pattern_tag(sentence, tokenize)
File "c:\Python33\lib\site-packages\text\en.py", line 115, in tag
for sentence in parse(s, tokenize, True, False, False, False, encoding).split():
File "c:\Python33\lib\site-packages\text\en.py", line 99, in parse
return parser.parse(unicode(s), *args, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1213, in parse
s[i] = self.find_tags(s[i], **kwargs)
File "c:\Python33\lib\site-packages\text\en.py", line 49, in find_tags
return _Parser.find_tags(self, tokens, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1161, in find_tags
map = kwargs.get( "map", None))
File "c:\Python33\lib\site-packages\text\text.py", line 967, in find_tags
tagged.append([token, lexicon.get(token, i==0 and lexicon.get(token.lower()) or None)])
File "c:\Python33\lib\site-packages\text\text.py", line 98, in get
return self._lazy("get", *args)
File "c:\Python33\lib\site-packages\text\text.py", line 79, in _lazy
self.load()
File "c:\Python33\lib\site-packages\text\text.py", line 367, in load
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 367, in <genexpr>
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 346, in _read
for line in f:
File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>
Any idea what is wrong here? Adding a 'u' before the string didn't help.
Release 0.7.1 fixes this issue, which means it's time for a
$ pip install -U textblob
The problem was that the en-lexicon.txt file used for part-of-speech tagging opened the file using Windows' default platform encoding, cp1252. The file apparently had characters that Python could not decode from this encoding. This was fixed by explicitly opening the file in utf-8 mode.
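The fix boils down to forcing UTF-8 when reading the lexicon instead of relying on the platform default. A sketch of the idea (not the actual textblob patch; the helper name is illustrative):
import codecs

def read_lexicon(path, encoding='utf-8'):
    # open explicitly as UTF-8 so Windows' cp1252 default is never used
    with codecs.open(path, 'r', encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]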
I have a question that is a bit hard to explain. I'm forking devsniper's application 'customers' as a base for a POS system for a local computer shop. The original application uses MySQL; however, it is critical that this application uses my client's original data. So I am presented with two options:
1) I can migrate the SQLite Database to a MySQL DB
2) I can modify the program to use the SQLite DB (Preferred)
However, whenever I try to pull up the customers page, I get the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I am not sure where to start in describing my problem, since it isn't obvious precisely what is causing it, so I will start with the traceback.
Traceback (most recent call last):
File "/home/tabras/posenv/local/lib/python2.7/site-packages/pyramid-1.4.2-py2.7.egg/pyramid/mako_templating.py", line 232, in __call__
result = template.render_unicode(**system)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/template.py", line 452, in render_unicode
as_unicode=True)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 783, in _render
**_kwargs_for_callable(callable_, data))
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 815, in _render_context
_exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 841, in _exec_template
callable_(context, *args, **kwargs)
File "/home/tabras/posenv/customers/customers/templates/base/index.html", line 102, in render_body
${next.body()}
File "/home/tabras/posenv/customers/customers/templates/customer/list.html", line 19, in render_body
<%include file="listPartial.html"/>
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 710, in _include_file
callable_(ctx, **_kwargs_for_include(callable_, context._data, **kwargs))
File "/home/tabras/posenv/customers/customers/templates/customer/listPartial.html", line 50, in render_body
${pager(customers)}
File "/home/tabras/posenv/customers/customers/templates/base/uiHelpers.html", line 10, in render_pager
${items.pager(format="$link_previous ~2~ $link_next",
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/paginate.py", line 716, in pager
self._pagerlink(self.next_page, symbol_next) or ''
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/paginate.py", line 855, in _pagerlink
return HTML.a(text, href=link_url, onclick=onclick_action, **self.link_attr)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/html/builder.py", line 213, in __call__
return make_tag(self._tag, *args, **kw)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/html/builder.py", line 308, in make_tag
chunks.extend(escape(x) for x in args)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/html/builder.py", line 308, in <genexpr>
chunks.extend(escape(x) for x in args)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Post Solution Edit:
The problem was here:
${items.pager(format="$link_previous ~2~ $link_next",
              symbol_previous="«",
              symbol_next="»",
              link_attr=link_attr,
              curpage_attr=curpage_attr,
              dotdot_attr=dotdot_attr,
              onclick="$('.list-partial').load('%s'); return false;")}
For some reason the '»' character and its counterpart were throwing the error. I simply changed them to standard ASCII characters and everything was golden.
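The likely underlying cause (my reading of the traceback, not stated in the original thread): under Python 2, '»' in a UTF-8 source file is the byte string '\xc2\xbb', and concatenating it with a unicode string forces an implicit ASCII decode, which is exactly the "can't decode byte 0xc2 in position 0" failure above:
# Python 2: mixing a non-ASCII byte string with unicode triggers an ascii decode
u'link' + '\xc2\xbb'   # raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0
u'link' + u'\xbb'      # works: u'link\xbb'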
Yeah, you were right about slowing down, Michael -- it was a really simple error. In uiHelpers.html there was a unicode character '»' which was causing the problem for some reason. Simply changing that to '>' fixed it. This was a good lesson in reading the traceback more carefully; thanks for the feedback.
-Tabras