nltk NERTagger UnicodeDecodeError in python

nltk NERTagger UnicodeDecodeError in python - python

I am writing a program in python 2.7.6 that uses nltk with Stanford named entity tagger in Window 7 professional to tag a text and print the result as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text )
print words
answer = st.tag(words )
print answer
The last print statement in the program suppose to return a tuple consisting of five lists as:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
answer = st.tag(words )
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
return self.tag_sents([tokens])[0]
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in
range(128)
Note that if I remove the number, '-1976' from the text string the program tags and prints the correct answer. But if the number '-1976' is within the text, I always have the above error.
In this forum, somebody suggested to me to change the default encoding in the stanford.py of the nltk. When I changed the default encoding in the stanford.py from ascii to UTF-16 and replaced the the last print statement of the above code with the following looping:
for i, word_pos in enumerate(answer):
word, pos = word_pos
print i , word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Please any clues on how to deal with this issue? Thanks in advance.

This worked for me: specify the encoding argument as UTF-8 when you create NERTagger object
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')

Open terminal(cmd), and write;
chcp
It should return something like;
active code page: 857
Then, write;
chcp 1254
After then, in your .py script, to the top of your script write;
# -*- coding: cp1254 -*-
This should solve your problem.If it's not, copy these codes and paste to the top of your script.
# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')
I had many problems with decoding before, these methods solved.
ASCII can decode only 2^7 = 128 characters, that's why you getting that error.As you see in the error sentence ordinal not in range(128) .
And check this website please.Use arrow keys for switching pages :-) I believe it's going to solve your problem.

At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

I was dealing with the same problem and I solved it by adding the encoding options on internals.py in nltk.
You must open internals.py saven on:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the method java and adding this line after #construct the full command string (about line 147)
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
This section code must look like:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.

Related

Opening a text file, and receiving a encoding error, tried multiple methods no hope

I'm trying to open up a password database file (consists of a bunch of common passwords) and I'm getting the following error:
Attempts so far..
Code:
f = open("crackstation-human-only.txt", 'r')
for i in f:
print(i)
Error Code:
Traceback (most recent call last):
File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
for i in f:
File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 753: character maps to <undefined>
After doing some research I was told to attempt encoding = 'utf-8' which I later discovered was basically guessing and hoping that the file would show all the outputs
Code:
f = open("crackstation-human-only.txt", 'r', encoding = 'utf-8')
for i in f:
print(i)
Error:
Traceback (most recent call last):
File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
for i in f:
File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 5884: invalid continuation byte
After receiving this error message, I was recommended to attempt to download a text editor like 'Sublime Text 3', and to open the console end enter the command 'Encoding()', but unfortunately it wasn't able to detect the encoding.
My professor was able to use bash to 'grep cat' the lines in the file (I honestly know very little about bash so if anyone else knows those terms i'm not sure if that will help them out)
If anyone has any suggestions on what I can do in order to get this to work out I would greatly appreciate it.
I will post the link to the text document if anyone is interested in seeing what types of characters are within the file.
Link to the file, it's a .txt from my school/professors domain
UPDATE:
I have a fellow classmate that is running elementary OS, and he was using the terminal to write his python program which would iterate through the file, and he was using the encoding 'latin-1', he was able to output more characters than me, I'm on Windows 10, using Eclipse-atom for all my scripts.
So there seems to be something that's causing me possibly not to get the correct outputs based on these factors, i'm guessing because it just seems that way based on the results,
I will be installing elementary-os and attempting all the solutions there, to see if I can get this file to work out. I'll add another update soon!

Faced a similar problem a while ago, and more often I've found that setting
encoding = 'raw_unicode_escape'
has worked for me
For your particular case, I tried all Python 3 supported encoding types and found
raw_unicode_escape
mbcs
palmos
Try either of the above to read your file
f = open("crackstation-human-only.txt", 'r', encoding = 'mbcs')
For more information on encodings, refer https://docs.python.org/2.4/lib/standard-encodings.html
Hope this helps.
re:
With the link above i made a list of encoding formats to try on your file. I hadn't saved my previous work, which was more in detail, but this code should do the same. I re-ran it now as follows:
enc_list = ['big5big5-tw,',
'cp037IBM037,',
'cp437437,',
'cp737Greek',
'cp850850,',
'cp855855,',
'cp857857,',
'cp861861,',
'cp863863,',
'cp865865,',
'cp869869,',
'cp875Greek',
'cp949949,',
'cp1006Urdu',
'cp1140ibm1140Western',
'cp1251windows-1251Bulgarian,',
'cp1253windows-1253Greek',
'cp1255windows-1255Hebrew',
'cp1257windows-1257Baltic',
'euc_jpeucjp,',
'euc_jisx0213eucjisx0213Japanese',
'gb2312chinese,',
'gb18030gb18030-2000Unified',
'iso2022_jpcsiso2022jp,',
'iso2022_jp_2iso2022jp-2,',
'iso2022_jp_3iso2022jp-3,',
'iso2022_krcsiso2022kr,',
'iso8859_2iso-8859-2,',
'iso8859_4iso-8859-4,',
'iso8859_6iso-8859-6,',
'iso8859_8iso-8859-8,',
'iso8859_10iso-8859-10',
'iso8859_14iso-8859-14,',
'johabcp1361,',
'koi8_uUkrainian',
'mac_greekmacgreekGreek',
'mac_latin2maclatin2,',
'mac_turkishmacturkishTurkish',
'shift_jiscsshiftjis,',
'shift_jisx0213shiftjisx0213,',
'utf_16_beUTF-16BEall',
'utf_16_le',
'utf_7',
'utf_8',
'base64_codec',
'bz2_codec',
'hex_codec',
'idna',
'mbcs',
'palmos',
'punycode',
'quopri_codec',
'raw_unicode_escape',
'rot_13',
'string_escape',
'undefined',
'unicode_escape',
'unicode_internal',
'uu_codec',
'zlib_codec'
]
for encode in enc_list:
try:
with open(r"crackstation-human-only.txt", encoding=encode) as f:
temp = len(f.read())
except:
enc_list.remove(encode)
print(enc_list)
Run this code on your machine and you'll get a list of encodings you can try on your file. The output i received was
['cp037IBM037,', 'cp737Greek', 'cp855855,', 'cp861861,', 'cp865865,', 'cp875Greek', 'cp1006Urdu', 'cp1251windows-1251Bulgarian,', 'cp1255windows-1255Hebrew', 'euc_jpeucjp,', 'gb2312chinese,', 'iso2022_jpcsiso2022jp,', 'iso2022_jp_3iso2022jp-3,', 'iso8859_2iso-8859-2,', 'iso8859_6iso-8859-6,', 'iso8859_10iso-8859-10', 'johabcp1361,', 'mac_greekmacgreekGreek', 'mac_turkishmacturkishTurkish', 'shift_jisx0213shiftjisx0213,', 'utf_16_le', 'utf_8', 'bz2_codec', 'idna', 'mbcs', 'palmos', 'quopri_codec', 'raw_unicode_escape', 'string_escape', 'unicode_escape', 'uu_codec']

You do have some interesting characters in there. Even though your code does work for me, I'd suggest using a try/except block to catch the lines your system can't handle and skip them:
with open("crackstation-human-only.txt", 'r') as f:
for i in f:
try:
print(i)
except UnicodeDecodeError:
continue
Alternatively, try using open with
the binary read mode 'rb' instead of 'r'
the errors='replace' argument, but that will not do what you want.
see the open documentation

How to find string and return it to stdout in Python

I am getting familiar with Python & am struggling to do the below with BeautifulSoup, Python.
What is expected:
*If the output of the script below contains the string 5378, it should email me with the line the string appears.
#! /usr/bin/env python
from bs4 import BeautifulSoup
from lxml import html
import urllib2,re
import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)
BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"
webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name
I tried using findall(5378, name), but it returns to empty braces like this [].
I am struggling with Unicode issues if I am trying to use it along with grep.
$ python dell.py | grep 5378
Traceback (most recent call last):
File "dell.py", line 18, in <module>
print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)
Can someone tell me what am I doing wrong in both cases?

The function findall (from the re module) expects the first parameter to be a a regular expression, which is a string, but you provided an integer. Try this instead:
re.findall("5378", name)
When printed this will output [u'5378'] when it found something or [] when it didn't.
I suspect you want to retrieve the product name from the number, which means you have to iterate through elements in findcolumn. We can use re.search() here to check for a single match within the element's texts.
for input_element in findcolumn.find_all("div"):
name = unicode(input_element.text.strip())
if re.search("5378", name) != None:
print unicode(name)
As for the unicode error, there are a bunch of solutions, depending on your operating system and configuration: Reconfigure your system locale on Ubuntu or Encode your script output with .encode()/unicode().

Encode error scraping

Scraping site with chineese simbols .
How do i scrap chineese simbolse ??
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring
URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'
def parse_items():
f = urlopen(URL)
list_html = f.read().decode('utf-8')
list_doc = fromstring(list_html)
for elem in list_doc.cssselect(ITEM_PATH):
a = elem.cssselect('a')[0]
href = a.get('href')
title = a.text
em = elem.cssselect('em')[0]
title2 = em.text
print(href, title, title2)
def main():
parse_items()
if __name__ == '__main__':
main()
Error looks like this.
Error looks like this
Error looks like this
Error looks like this
Error looks like this
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

From the print syntax and the imports, I assume that you use a Python3 version, since it can matter for unicode.
So, we can expect that href, title and title2 are all unicode strings (or Python 3 strings). But the print function will try to convert the strings to an encoding acceptable by the output system - for a reason I cannot know, you system uses by default ASCII, so the error.
How to fix:
the best way would be to make your system accept unicode. On Linux or other unixes, you can declare an UTF8 charset in LANG environment variable (export LANG=en_US.UTF-8), on Windows you can try chcp 65001 but this latter if far from being sure
if it does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out offending characters, because Python3 natively uses unicode strings.
I would use:
import sys
def u_filter(s, encoding = sys.stdout.encoding):
return (s.encode(encoding, errors='replace').decode(encoding)
if isinstance(s, str) else s)
That means: if s is a unicode string encode it in the encoding used for stdout, replacing any non convertible character by a replacement char, and decode it back into a now clean string
and next:
def fprint(*args, **kwargs):
fargs = [ u_filter(arg) for arg in args ]
print(*fargs, **kwargs)
means: filter out any offending character from unicode strings and print the remaining unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)

UnicodeEncodeError when writing to file

I have a python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an xml file and prints the contents in a new format. On my local machine, I can run the script with stdout to the terminal or to a file (i.e. > myFile.txt), and both work fine.
However, on the server (ssh), when I print to terminal everything works fine, but printing to the file (which is what I really need) gives UnicodeEncodeError: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are in utf-8 encoding, and utf-8 is declared in the magic comment.
If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.
If I use print( x.encode('utf-8') ), then it prints code-style bits (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').
If I $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
I have checked all of the locale variables and the relevant ones match what I have on my local machine.
I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the python code is working on one computer, I am not sure that it is relevant, but I am adding it below:
# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET
corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
for sent in body :
depDOMs = [(0,'') for i in range(len(sent)+1)]
for word in sent :
if word.tag == 'LF' :
pass
elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
ID = word.attrib['ID']
try :
Form = word.text.replace(' ','_')
except AttributeError :
Form = '_'
try :
Lemma = word.attrib['LEMMA'].replace(' ', '_')
except KeyError :
Lemma = '*NULL*'
CPOS = word.attrib['FEAT'].split()[0]
POS = word.attrib['FEAT'].replace( ' ' , '_' )
Feats = '_'
Head = word.attrib['DOM']
if Head == '_root' :
Head = '0'
try :
DepRel = word.attrib['LINK']
except KeyError :
DepRel = 'ROOT'
PHead = '_'
PDepRel = '_'
try:
if word.attrib['NODETYPE'] == 'FANTOM' :
word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
except KeyError :
pass
print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
else :
print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
print()

The underlying issue may be caused by a miss configuration of Linux's locales, meaning that Python is being too cautious when printing non-ASCII chars.
Confirm locale configuration with locale. If there's a problem, you'll see something like:
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
Fix this with:
$ sudo locale-gen "en_US.UTF-8"
(replace "en_US.UTF-8" with the locale that's not working). For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue

You can find important information related to the error you are experiencing in the attributes of the UnicodeError based exception.
Quoting the documentation:
UnicodeError has attributes that describe the encoding or decoding
error. For example, err.object[err.start:err.end] gives the particular
invalid input that the codec failed on.
encoding
The name of the encoding that raised the error.
reason
A string describing the specific codec error.
object
The object the codec was attempting to encode or decode.
start
The first index of invalid data in object.
end
The index after the last invalid data in object.

Reading a tag from XML and printing it

I have the following input XML file,i read the rel_notes tag and print it...running into the following error
Input XML:
<rel_notes>
• Please move to this build for all further test and development activities
• Please use this as base build to verify compilation and sanity before any check-in happens
</rel_notes>
Sample python code:
file = open('data.xml,'r')
from xml.etree import cElementTree as etree
tree = etree.parse(file)
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
OUTPUT
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
File "C:\python2.7.3\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 9: character maps to <undefined>

The issue is with printing Unicode to Windows console. Namely, the character '•' can't be represented in cp437 used by your console.
To reproduce the problem, try:
print u'\u2022'
You could set PYTHONIOENCODING environment variable to instruct python to replace all unrepresentable characters with corresponding xml char references:
T:\> set PYTHONIOENCODING=cp437:xmlcharrefreplace
T:\> python your_script.py
Or encode the text to bytes before printing:
print u'\u2022'.encode('cp437', 'xmlcharrefreplace')
answer to your initial question
To print text of each <build_location/> element:
import sys
from xml.etree import cElementTree as etree
input_file = sys.stdin # filename or file object
tree = etree.parse(input_file)
print('\n'.join(elem.text for elem in tree.iter('build_location')))
If input file is large; iterparse() could be used:
import sys
from xml.etree import cElementTree as etree
input_file = sys.stdin
context = iter(etree.iterparse(input_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
if event == 'end' and elem.tag == 'build_location':
print(elem.text)
root.clear() # free memory

I don't think the entire snippet above is completely helpful. But, UnicodeEncodeError usually happens when the ASCII characters aren't handled properly.
Example:
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
Its already explained clearly in this answer: Python: Convert Unicode to ASCII without errors
This should at least solve the UnicodeEncodeError.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

nltk NERTagger UnicodeDecodeError in python - python

This worked for me: specify the encoding argument as UTF-8 when you create NERTagger object st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')

At the top of your app add: import sys reload(sys) sys.setdefaultencoding("utf-8")

Related

Opening a text file, and receiving a encoding error, tried multiple methods no hope

How to find string and return it to stdout in Python

Encode error scraping

UnicodeEncodeError when writing to file

Reading a tag from XML and printing it

Categories

Resources