I am using the Twitter streaming API (tweepy) to capture several tweets. I do this in Python 2.7.
After I have collected a corpus of tweets, I break each tweet into words and add each word to a dictionary as a key, where the values track how often each word appears in positive or negative sentences.
When I retrieve the words as keys of the dictionary and try to process them for the next iteration, I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
errors.
The weird thing is that before I place them as dictionary keys, I encode them without errors. Here is a sample of the code:
pos = {}
neg = {}
for status in corpus:
    p = s.analyze(status).polarity
    words = []
    # gather real words
    for w in status.split(' '):
        try:
            words.append(w.encode('utf-8'))
        except UnicodeDecodeError as e:
            print(e)
    # assign sentiment of the sentence to the words
    for w in words:
        if w not in pos:
            pos[w] = 0
            neg[w] = 0
        if p >= 0:
            pos[w] += 1
        else:
            neg[w] += 1
k = pos.keys()
k = [i.encode('utf-8') for i in k]  # <-- for this line I get an error
p = [v for i, v in pos.items()]
n = [v for i, v in neg.items()]
So this piece of code catches no errors while splitting the words, but it throws an error when trying to encode the keys again. I should note that normally I wouldn't try to encode the keys again, as I would assume they are already properly encoded. But I added this extra encoding to narrow down the source of the error.
Am I missing something? Do you see anything wrong with my code?
To avoid confusion, here is a sample closer to the original code, which does not try to encode the keys again:
k = ['happy']
for i in range(3):
    print('sampling twitter --> {}'.format(i))
    myStream.filter(track=k)  # <-- this is where I will receive the error in the second iteration
    for status in corpus:
        p = s.analyze(status).polarity
        words = []
        # gather real words
        for w in status.split(' '):
            try:
                words.append(w.encode('utf-8'))
            except UnicodeDecodeError as e:
                print(e)
        # assign sentiment of the sentence to the words
        for w in words:
            if w not in pos:
                pos[w] = 0
                neg[w] = 0
            if p >= 0:
                pos[w] += 1
            else:
                neg[w] += 1
    k = pos.keys()
(please suggest a better title for the question)
You get a decode error while you are trying to encode a string. This seems weird, but it is due to the implicit decode/encode mechanism of Python 2.
Python lets you encode (unicode) strings to obtain bytes, and decode bytes to obtain (unicode) strings. This means that Python can encode only strings and decode only bytes.
So when you try to encode bytes, Python (which does not know how to encode bytes) first implicitly decodes the bytes to obtain a string it can encode, and it uses its default encoding (ASCII) to do that.
This is why you get a decode error while trying to encode something: the implicit decoding.
That means that you are probably trying to encode something which is already encoded.
Note that the error message says "'ascii' codec can't decode ...". That's because when you call encode on something that is already a bytestring in Python 2, it tries to decode it to Unicode first using the default codec.
I'm not sure why you thought that encoding again would be a good idea. Don't do it; the strings are already byte strings, so leave them as they are.
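A minimal interpreter session (Python 2) illustrating that implicit decode; the byte values are just an example:

>>> b = u'\u2019'.encode('utf-8')   # already a byte string: '\xe2\x80\x99'
>>> b.encode('utf-8')               # Python first tries b.decode('ascii') ...
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)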
Related
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function - it gets the UnicodeDecodeError as a parameter - so you can write a handler that attempts to decode the offending data as "cp1252" and continues decoding the rest of the string as UTF-8.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, the error is raised again at the first out-of-range(128) character if that next character is also not valid UTF-8 - that is, the decoding "walks back" if consecutive non-ASCII, non-UTF-8 chars are found.
The workaround for this is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it - in this short example I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno and a whack of other Google searches and other pounding I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85' : '...', # u'\u2026' ... character.
    '96' : '-',   # u'\u2013' en-dash
    '97' : '-',   # u'\u2014' em-dash
    '91' : "'",   # u'\u2018' left single quote
    '92' : "'",   # u'\u2019' right single quote
    '93' : '"',   # u'\u201C' left double quote
    '94' : '"',   # u'\u201D' right double quote
    '95' : "*"    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically, I attempt to turn the text into UTF-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
jsbueno's solution is good, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Just came into this today, so here is my problem and my own solution:
original_string = 'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 3
            else:
                output = output + s[ii]
                ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
I'm trying to save specific contents of a dictionary to a file, but when I try to write them I get the following error:
Traceback (most recent call last):
  File "P4.py", line 83, in <module>
    outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)
And here is the code:
from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)

with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word, count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))

"""
2) TAGGING USING THE MODEL
Given the test files, assign the most probable tag to each word according to
the model. Save the result in files where each line has this format:
Word Prediction
"""
file = open("lexic.txt", "r")  # we open the lexic file (our model) (try with this one)
data = file.readlines()
file.close()

diccionario = {}

"""
In this portion of code we iterate over the lines of the .txt document and we create a dictionary with a word as a key and a list as a value
Key: word
Value: list ([tag, #occurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Here we separate the string into a list: [word, tag, occurrences], word=sintagma[0], tag=sintagma[1], occurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"):  # We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])):  # Here we check whether the word was already included in the dictionary
            aux_list = diccionario.get(sintagma[0])  # We know the word already exists in the dict, so we get the list of values for it
            aux_list.append([sintagma[1], sintagma[2]])  # We append the tag and the occurrences for this concrete word to the list
            diccionario.update({sintagma[0]: aux_list})  # Update the value with the new list (new list = previous list + newly appended element)
        else:  # If the key does not exist in the dict yet, we add the values to an empty list (no need to append)
            aux_list_else = ([sintagma[1], sintagma[2]])
            diccionario.update({sintagma[0]: aux_list_else})

"""
Here we create a new dictionary based on the dictionary created before; in this new dictionary (diccionario2) we want to keep the following
information:
Key: word
Value: list ([suggestedTag, #occurrencesOfTheWordInTheDocument, probability])
To retrieve the information from diccionario, we have to keep in mind:
In case we have more than one tag associated to a word (keyword), we access the first tag with keyword[0], and the occurrencesWithTheTag with keyword[1];
from the second entry onward, we access the information this way:
diccionario.get(keyword)[2][0] -> with this we access the second tag
diccionario.get(keyword)[2][1] -> with this we access the second occurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access the third tag
...
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())  # We create a dictionary with the keys from diccionario and we set all the values to None

with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))  # tagSugerido is the tag with the most occurrences for a concrete keyword
        maximo = float(diccionario.get(keyword)[1])  # maximo is a variable for the maximum number of occurrences for a keyword
        if ((len(diccionario.get(keyword))) > 2):  # in case we have more than 2 tags for a concrete word
            suma = float(diccionario.get(keyword)[1])
            for i in range(2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8')
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo / suma)
            diccionario2.update({keyword: ([tagSugerido, suma, probabilidad])})
        else:
            diccionario2.update({keyword: ([diccionario.get(keyword)[0], diccionario.get(keyword)[1], 1])})
        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))
The desired output will look like this:
keyword(String) tagSugerido(String):
Hello NC
Friend N
Run V
...etc
The conflictive line is:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
Thank you.
Like zmo suggested:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
A note on unicode in Python 2
Your software should only work with unicode strings internally, converting to a particular encoding on output.
To prevent making the same error over and over again, you should make sure you understand the difference between the ASCII and UTF-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
ASCII needs just one byte per character and can only represent the 128 characters of the ASCII charset. UTF-8 needs up to four bytes per character and can represent the complete Unicode charset (the interpreter session after the two lists below illustrates this).
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
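A short interpreter session (Python 2) illustrating those rules; the characters are only examples:

>>> u'A'.encode('ascii')    # code point < 128: representable in ASCII
'A'
>>> u'é'.encode('ascii')    # code point >= 128: ASCII cannot represent it
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> u'A'.encode('utf-8')    # code point < 128: one byte
'A'
>>> u'é'.encode('utf-8')    # U+00E9, between 128 and 0x7ff: two bytes
'\xc3\xa9'
>>> u'€'.encode('utf-8')    # U+20AC, above 0x7ff: three bytes
'\xe2\x82\xac'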
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a Unicode string. A str can hold text in any encoding, such as ASCII or UTF-8, while a unicode object holds abstract code points.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 both derive from the common base class basestring:

      basestring
         /    \
       str   unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']) returns an 8-bit string version of the Unicode string
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u'': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
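A minimal Python 2 sketch of that workflow (decode on input, work with unicode internally, encode on output); the file names are only placeholders:

# -*- coding: utf-8 -*-
import codecs

# decode on the way in: codecs.open gives back unicode objects
with codecs.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# work with unicode internally
text = text.upper()

# encode on the way out: codecs.open encodes the unicode for us
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)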
As you're not giving simple, concise code to illustrate your question, I'll just give you general advice on what the error probably is:
If you're getting a decode error, it's that tagSugerido is read as ASCII and not as Unicode. To fix that, you should do:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))
to store it as unicode.
Then you're likely to get an encode error at the write() stage, and you should fix your write the following way:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
I literally answered a very similar question moments ago. And when working with unicode strings, switch to Python 3; it'll make your life easier!
If you cannot switch to Python 3 just yet, you can make your Python 2 behave almost like Python 3 by using the __future__ import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
N.B.: instead of doing:
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
which will fail to properly close the file descriptor if an exception occurs during readlines; you should instead do:
with open("lexic.txt", "r") as f:
data=f.readlines()
which will take care of always closing the file even upon failure.
N.B.2: Avoid using file as a variable name, as it shadows a built-in Python type; use f or lexic_file instead.
I'm falling into unicode hell.
My environment is Unix, Python 2.7.3:
LC_CTYPE=zh_TW.UTF-8
LANG=en_US.UTF-8
I'm trying to dump hex-encoded data in a human-readable format. Here is the simplified code:
#! /usr/bin/env python
# encoding:utf-8
import sys
s=u"readable\n" # previous result keep in unicode string
s2="fb is not \xfb" # data read from binary file
s += s2
print s # method 1
print s.encode('utf-8') # method 2
print s.encode('utf-8','ignore') # method 3
print s.decode('iso8859-1') # method 4
# method 1-4 display following error message
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb
# in position 0: ordinal not in range(128)
f = open('out.txt','wb')
f.write(s)
I just want to print out the 0xfb.
I should describe this more here. The key is 's += s2', where s keeps my previously decoded string and s2 is the next string that should be appended to s.
If I modify it as follows, the error occurs when writing the file:
s=u"readable\n"
s2="fb is not \xfb"
s += s2.decode('cp437')
print s
f=open('out.txt','wb')
f.write(s)
# UnicodeEncodeError: 'ascii' codec can't encode character
# u'\u221a' in position 1: ordinal not in range(128)
I wish the result of out.txt is
readable
fb is not \xfb
or
readable
fb is not 0xfb
[Solution]
#! /usr/bin/env python
# encoding:utf-8
import sys
import binascii

def fmtstr(s):
    r = ''
    for c in s:
        if ord(c) > 128:
            r = ''.join([r, "\\x" + binascii.hexlify(c)])
        else:
            r = ''.join([r, c])
    return r

s = u"readable"
s2 = "fb is not \xfb"
s += fmtstr(s2)
print s
f = open('out.txt', 'wb')
f.write(s)
I strongly suspect that your code is actually erroring out on the previous line: the s += s2 one. s2 is just a series of bytes, which can't be arbitrarily tacked on to a unicode object (which is instead a series of code points).
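A quick interpreter check of that suspicion (Python 2), using the strings from the question:

>>> s = u"readable\n"
>>> s += "fb is not \xfb"   # Python implicitly tries to decode the bytes as ASCII, which fails on \xfb
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb in position 10: ordinal not in range(128)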
If you had intended the '\xfb' to represent U+FB, LATIN SMALL LETTER U WITH CIRCUMFLEX, it would have been better to assign it like this instead:
s2 = u"\u00fb"
But you said that you just want to print out \xHH codes for control characters. If you just want it to be something humans can understand which still makes it apparent that special characters are in a string, then repr may be enough. First, don't have s be a unicode object, because you're treating your strings here as a series of bytes, not a series of code points.
s = s.encode('utf-8')
s += s2
print repr(s)
Finally, if you don't want the extra quotes on the outside that repr adds, for nice pretty printing or whatever, there's not a simple builtin way to do that in Python (that I know of). I've used something like this before:
import re

controlchars_re = re.compile(r'[\x00-\x1f\x7f-\xff]')

def _show_control_chars(match):
    txt = repr(match.group(0))
    return txt[1:-1]

def escape_special_characters(s):
    return controlchars_re.sub(_show_control_chars, s.replace('\\', '\\\\'))
You can pretty easily tweak the controlchars_re regex to define which characters you care about escaping.
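A hypothetical usage of the helper above (Python 2), reusing the byte string from the question:

>>> s = u"readable\n".encode('utf-8') + "fb is not \xfb"
>>> print escape_special_characters(s)
readable\nfb is not \xfb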
I have two python dictionaries containing information about japanese words and characters:
vocabDic : contains vocabulary, key: word, value: dictionary with information about it
kanjiDic : contains kanji ( single japanese character ), key: kanji, value: dictionary with information about it
Now I would like to iterate through each character of each word in the vocabDic and look up this character in the kanji dictionary. My goal is to create a csv file which I can then import into a database as join table for vocabulary and kanji.
My Python version is 2.6
My code is as following:
import csv

kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',',
                                  quotechar='|', quoting=csv.QUOTE_MINIMAL)
kanjiVocabJoinCount = 1

# loop through dictionary
for key, val in vocabDic.iteritems():
    if val['lang'] == 'jpn':  # only check japanese words
        vocab = val['text']
        print vocab
        # loop through vocab string
        for v in vocab:
            test = kanjiDic.get(v)
            print v
            print test
            if test is not None:
                print str(kanjiVocabJoinCount) + ',' + str(test['id']) + ',' + str(val['id'])
                kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount), str(test['id']), str(val['id'])])
                kanjiVocabJoinCount = kanjiVocabJoinCount + 1
If I print the variables to the command line, I get:
vocab : works, prints in japanese
v ( one character of the vocab in the for loop ) : �
test ( character looked up in the kanjiDic ) : None
To me it seems like the for loop messes the encoding up.
I tried various functions ( decode, encode.. ) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.
From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.
For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:
In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8') # val['text']
Out[43]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'
If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.
However if you loop over the unicode object, you'd get one unicode character at a time:
In [45]: for v in u'債務の天井':
....: print(v)
債
務
の
天
井
Note that the first unicode character, encoded in utf-8, is 3 bytes:
In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'
That's why looping over the bytes, printing one byte at a time, (e.g. print \xe5) fails to print a recognizable character.
So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:
vocab=val['text'].decode('utf-8')
If you are not sure what encoding val['text'] is in, post the output of
print(repr(vocab))
and maybe we can guess the encoding.
I know there are quite a few solutions for this problem, but mine is peculiar in the sense that I might get truncated UTF-16 data and yet have to make a best effort at dealing with conversions where decode and encode will fail with UnicodeDecodeError. So I came up with the following code in Python.
Please let me know your comments on how I can improve it for faster processing.
try:
    # conversion to ascii if utf16 data is formatted correctly
    input = open(filename).read().decode('UTF16')
    asciiStr = input.encode('ASCII', 'ignore')
    open(filename).close()
    return asciiStr
except:
    # if fail with UnicodeDecodeError, then use brute force
    # to decode truncated data
    try:
        unicode = open(filename).read()
        if (ord(unicode[0]) == 255 and ord(unicode[1]) == 254):
            print("Little-Endian format, UTF-16")
            leAscii = "".join([(unicode[i]) for i in range(2, len(unicode), 2) if 0 < ord(unicode[i]) < 127])
            open(filename).close()
            return leAscii
        elif (ord(unicode[0]) == 254 and ord(unicode[1]) == 255):
            print("Big-Endian format, UTF-16")
            beAscii = "".join([(unicode[i]) for i in range(3, len(unicode), 2) if 0 < ord(unicode[i]) < 127])
            open(filename).close()
            return beAscii
        else:
            open(filename).close()
            return None
    except:
        open(filename).close()
        print("Error in converting to ASCII")
        return None
What about:
data = open(filename).read()
try:
    data = data.decode("utf-16")
except UnicodeDecodeError:
    data = data[:-1].decode("utf-16")
I.e. if it's truncated mid-way through a code unit, snip the last byte off, and do it again. That should get you back to a valid UTF-16 string, without having to try to implement a decoder yourself.
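A quick, self-contained sanity check of that approach (Python 2), simulating truncation by dropping the last byte of a UTF-16 encoding:

data = u"hello world".encode("utf-16")[:-1]   # truncated mid code unit
try:
    text = data.decode("utf-16")
except UnicodeDecodeError:
    text = data[:-1].decode("utf-16")
print text   # hello worl  (only the broken final character is lost)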
To tolerate errors you could use the optional second argument to the byte-string's decode method. In this example the dangling third byte ('c') is replaced with the "replacement character" U+FFFD:
>>> 'abc'.decode('UTF-16', 'replace')
u'\u6261\ufffd'
There is also an 'ignore' option which will simply drop bytes that can't be decoded:
>>> 'abc'.decode('UTF-16', 'ignore')
u'\u6261'
While it is common to desire a system that is "tolerant" of incorrectly encoded text, it is often quite difficult to define precisely what the expected behavior is in these situations. You may find that the one who provided the requirement to "deal with" incorrectly encoded text does not fully grasp the concept of character encoding.
This just jumped out at me as a "best practice" improvement. File accesses should really be wrapped in with blocks. This will handle opening and cleaning up for you.
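A minimal sketch of that suggestion applied to the code in the question; the function name is only a placeholder:

def utf16_file_to_ascii(filename):
    # the with block closes the file even if decoding fails
    with open(filename, 'rb') as f:
        data = f.read()
    try:
        return data.decode('UTF-16').encode('ASCII', 'ignore')
    except UnicodeDecodeError:
        return None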