Error using nltk word_tokenize - python

I am doing some exercises from the NLTK book on accessing text from the web and from disk (Chapter 3). When calling word_tokenize I get an error.
This is my code:
>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> tokens = nltk.word_tokenize(raw)
And this is the traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
prev = next(it)
File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: cannot use a string pattern on a bytes-like object
Can someone please explain to me what is going on here and why I cannot seem to use word_tokenize properly?
Many thanks!

You have to convert the downloaded content (which urlopen returns as a bytes object) into a string using decode('utf-8'):
>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> raw = raw.decode('utf-8')
>>> tokens = nltk.word_tokenize(raw)
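As a side note (not part of the original answer), the HTTP response usually declares its own charset, so you can read it from the headers instead of hard-coding 'utf-8'. A minimal sketch, falling back to UTF-8 when the server declares no charset:
import nltk
from urllib.request import urlopen

url = "http://www.gutenberg.org/files/2554/2554.txt"
response = urlopen(url)
# use the charset announced by the server, defaulting to UTF-8
charset = response.headers.get_content_charset() or "utf-8"
raw = response.read().decode(charset)
tokens = nltk.word_tokenize(raw)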

I was getting an Error 404 for that URL, so I changed it. This works for me; you can change the URL to the one below, and maybe it works for you as well.
from urllib import request
url = "https://ia803405.us.archive.org/21/items/crimeandpunishme02554gut/2554.txt"
raw = request.urlopen(url).read()
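Note that request.urlopen(url).read() still returns bytes, so the decode step from the answer above still applies before tokenizing (assuming the mirror serves plain UTF-8 text):
import nltk
from urllib import request

url = "https://ia803405.us.archive.org/21/items/crimeandpunishme02554gut/2554.txt"
raw = request.urlopen(url).read().decode('utf-8')  # bytes -> str before tokenizing
tokens = nltk.word_tokenize(raw)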

Related

Resource punkt not found. But, it is downloaded and installed

I have the following columns in a dataframe.
Unnamed: 0, title, publication, author, year, month, title.1, content, len_article, gensim_summary, split_words, first_100_words
I am trying to run this small piece of code.
import nltk
nltk.download('punkt')
# TOKENIZE
df.first_100_words = df.first_100_words.str.lower()
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language = 'en'))
The last line of code throws this error:
Traceback (most recent call last):
File "<ipython-input-129-42381e657774>", line 2, in <module>
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language = 'en'))
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "<ipython-input-129-42381e657774>", line 2, in <lambda>
df['tokenized_first_100'] = df.first_100_words.apply(lambda x: word_tokenize(x, language = 'en'))
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 144, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 105, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\data.py", line 868, in load
opened_resource = _open(resource_url)
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\data.py", line 993, in _open
return find(path_, path + ['']).open()
File "C:\Users\ryans\Anaconda3\lib\site-packages\nltk\data.py", line 701, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/en.pickle
Searched in:
- 'C:\\Users\\ryans/nltk_data'
- 'C:\\Users\\ryans\\Anaconda3\\nltk_data'
- 'C:\\Users\\ryans\\Anaconda3\\share\\nltk_data'
- 'C:\\Users\\ryans\\Anaconda3\\lib\\nltk_data'
- 'C:\\Users\\ryans\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- ''
**********************************************************************
I'm pretty new to all the tokenization stuff.
The sample code is from this site.
https://github.com/AustinKrause/Mod_5_Text_Summarizer/blob/master/Notebooks/Text_Cleaning_and_KMeans.ipynb
I found this issue and it helped: https://github.com/b0noI/dialog_converter/issues/7
Just add:
nltk.download('punkt')
SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
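If the LookupError persists even after downloading punkt, note that the traceback shows NLTK trying to load tokenizers/punkt/en.pickle: the punkt models are named by full language ('english', 'german', ...), not by ISO code, and 'english' is already the default. A minimal sketch, with a toy DataFrame standing in for the real one:
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# toy stand-in for the real DataFrame from the question
df = pd.DataFrame({'first_100_words': ['This is a test. It has two sentences.']})

df['first_100_words'] = df['first_100_words'].str.lower()
# 'en' is not a punkt model name; use the full language name (the default)
df['tokenized_first_100'] = df['first_100_words'].apply(
    lambda x: word_tokenize(x, language='english'))
print(df['tokenized_first_100'][0])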

UnicodeDecodeError: 'ascii' codec can't decode byte in Textranking code [duplicate]

This question already has answers here:
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
(20 answers)
Closed 5 years ago.
When I execute the below code
import networkx as nx
import numpy as np
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("QC")
txt = fp.read()
sents = textrank(txt)
print sents
I get the following error
Traceback (most recent call last):
File "Textrank.py", line 44, in <module>
sents = textrank(txt)
File "Textrank.py", line 10, in textrank
sentences = sentence_tokenizer.tokenize(document)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
for el in it:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
I am executing the code on Ubuntu. To get the text, I referred to this website:
https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file QC (not QC.txt) and copy-pasted the text paragraph by paragraph into the file.
Kindly help me resolve the error.
Thank you.
Please try the following and see if it works for you.
import networkx as nx
import numpy as np
import sys
reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)
    similarity_graph = normalized * normalized.T
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("QC")
txt = fp.read()
sents = textrank(txt.encode('utf-8'))
print sents
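If you would rather avoid the reload(sys)/setdefaultencoding hack, an alternative sketch (my suggestion, not the original answer) is to decode the file explicitly, in line with the other answers on this page, and pass unicode straight to the tokenizer:
import io

# textrank() as defined above; assumes the QC file is UTF-8 encoded
with io.open("QC", encoding="utf-8") as fp:
    txt = fp.read()          # unicode, not bytes

sents = textrank(txt)
print(sents)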

word_tokenize in nltk not taking a list of string as argument

from nltk.tokenize import word_tokenize
music_comments = [['So cant you just run the bot outside of the US? ', ''], ["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", ''], ['Can they do something about all the fucking bots on Tinder next? \n\nEdit: Holy crap my inbox just blew up ', '']]
print(word_tokenize(music_comments[1]))
I found this other question which says to pass a list of strings to word_tokenize, but in my case after running the above I get the following output:
Traceback (most recent call last):
File "testing.py", line 5, in <module>
print(word_tokenize(music_comments[1]))
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
prev = next(it)
File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
What is the problem? What am I missing?
You are feeding a list with two items into tokenize():
["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", '']
i.e. the sentence and an empty string.
Changing your code to this should do the trick:
print(word_tokenize(music_comments[1][0]))
def word_tokenize(self, s):
    """Tokenize a string to split off punctuation other than periods"""
    return self._word_tokenizer_re().findall(s)
This is part of the source code for nltk.tokenize.punkt.
The input to word_tokenize() should be a string, not a list.
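If the goal is to tokenize every comment, each inner list holds the comment text at index 0, so a sketch along those lines (using the first two comments from the question) would be:
from nltk.tokenize import word_tokenize

music_comments = [
    ['So cant you just run the bot outside of the US? ', ''],
    ["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", ''],
]

# tokenize the text element of each [comment, ''] pair
tokenized = [word_tokenize(comment[0]) for comment in music_comments]
print(tokenized[1])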

How to do tokenization of text file in format UTF-8 in python

I want to do tokenization and create a file containing the tokenized words, with stop words removed, for sentiment analysis. I am trying the code below, but it gives an error. The code is:
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
stopset = set(stopwords.words('english'))
with open('Grey.txt', 'r') as text_file, open('step3.txt', 'w') as outFile:
    text = text_file.read()
    tokens = word_tokenize(str(text))
    tokens = [w for w in tokens if not w in stopset]
    print(tokens)
    outFile.write(str(tokens))
    outFile.close()
and the error is:
(C:\Users\sama\Anaconda2) C:\Users\sama\Anaconda2\Amazon Project>python sw.py
Traceback (most recent call last):
File "sw.py", line 15, in <module>
tokens=word_tokenize(str(text))
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\__init__.py",
line 109, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\__init__.py",
line 94, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line
1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line
1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 311, in _pair_iter
for el in it:
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 1291, in _slices_from_text
if self.text_contains_sentbreak(context):
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 1337, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 1472, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 310, in _pair_iter
prev = next(it)
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 577, in _annotate_first_pass
for aug_tok in tokens:
File "C:\Users\sama\Anaconda2\lib\site-packages\nltk\tokenize\punkt.py",
line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12:
ordinal not in range(128)
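The error is the same ascii-decode failure as in the related questions above, so the same fix should apply: read the file as Unicode before tokenizing rather than wrapping it in str(). A minimal sketch, assuming Python 2, a UTF-8 encoded Grey.txt, and the punkt and stopwords data already installed:
import io
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

with io.open('Grey.txt', 'r', encoding='utf-8') as text_file, \
     io.open('step3.txt', 'w', encoding='utf-8') as outFile:
    text = text_file.read()                  # unicode, no str() re-encoding
    tokens = word_tokenize(text)
    tokens = [w for w in tokens if w not in stopset]
    print(tokens)
    outFile.write(u' '.join(tokens))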

Strange behaviour with nltk sentence tokenizer and special characters

I get some strange behavior when using the sent_tokenizer for German text.
Example Code:
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität. Tolles Teil."):
    print sent
This fails with the error:
Traceback (most recent call last):
for sent in sent_tokenize("Super Qualität. Tolles Teil."):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
return tokenizer.tokenize(text)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
whereas:
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität des Produktes. Tolles Teil."):
    print sent
works perfectly
I found the solution on the nltk homepage.
Caution: when tokenizing a Unicode string, make sure you are not using
an encoded version of the string (it may be necessary to decode it
first, e.g. with s.decode("utf8")).
So
text = "Super Qualität. Tolles Teil."
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize(text.decode('utf8')):
    print sent
works like a charm.
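On Python 3 the decode step is unnecessary, since string literals are already Unicode; a minimal equivalent sketch:
import nltk

# Python 3: str is already Unicode, so no .decode() is needed
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität. Tolles Teil."):
    print(sent)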
