Converting non ascii characters to ascii from dictreader - python

There are many questions about Python and unicode/str handling, but none of the answers work for me.
First, a file is opened using DictReader and each row is appended to a list. Then a value from each dict is passed to a function that should convert it to unicode.
Step One is getting the data
f = csv.DictReader(open(filename, "r"))
data = []
for row in f:
    data.append(row)
Step Two is getting a string value from the dict and replacing the accents (found this from other posts)
s = data[i].get('Name')
strip_accents(s)
def strip_accents(s):
    try: s = unicode(s)
    except: s = s.encode('utf-8')
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s
I use try/except because some strings have accents and others don't. What I cannot figure out is why unicode(s) works on a str with no accents but fails on a str with accents:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 11: ordinal not in range(128)
I have seen posts on this but the answers do not work. When I use type(s), it says it is <type 'str'> . So I tried to read the file as unicode
f = csv.DictReader(codecs.open(filename,"r",encoding='utf-8'))
But as soon as it goes to read
data = []
for row in f:
    data.append(row)
This error occurs:
File "F:...files.py", line 9, in files
for row in f:
File "C:\Python27\lib\csv.py", line 104, in next
row = self.reader.next()
File "C:\Python27\lib\codecs.py", line 684, in next
return self.reader.next()
File "C:\Python27\lib\codecs.py", line 615, in next
line = self.readline()
File "C:\Python27\lib\codecs.py", line 530, in readline
data = self.read(readsize, firstline=True)
File "C:\Python27\lib\codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 0: invalid start byte
Is this error caused by the way DictReader handles unicode? How can I get around it?
More tests. As @univerio pointed out, one item that causes the failures is encoded as ISO-8859-1.
Modifying the open statement to:
f = csv.DictReader(codecs.open(filename,"r",encoding="cp1252"))
produces a slightly different error:
File "F:...files.py", line 9, in files
for row in f:
File "C:\Python27\lib\csv.py", line 104, in next
row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
Using the basic open statement and modifying strip_accents() such as:
try: s = unicode(s)
except: s = s.decode("iso-8859-1").encode('utf8')
print type(s)
s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
return str(s)
prints that the type is still str and errors on
s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
TypeError: must be unicode, not str
Based on the question "Python: Converting from ISO-8859-1/latin1 to UTF-8", modifying the line to
s = unicode(s.decode("iso-8859-1").encode('utf8'))
produces a different error:
except: s = unicode(s.decode("iso-8859-1").encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
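The errors above all come down to which codec gets applied to the byte 0xfc. As a minimal sketch (written for Python 3, where bytes and str are distinct types, not the Python 2.7 used in the question), the following checks which encodings accept that byte. The helper name try_decode and the sample value are made up for illustration.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Sample value (an assumption): 0xfc is u-umlaut in cp1252 and ISO-8859-1.
raw = b"M\xfcller"

for encoding in ("ascii", "utf-8", "iso-8859-1", "cp1252"):
    print(encoding, repr(try_decode(raw, encoding)))
```

ASCII and UTF-8 both reject the byte, which matches the two UnicodeDecodeError messages quoted in the question, while the two single-byte Western encodings decode it.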

I think this should work:
def strip_accents(s):
    s = s.decode("cp1252")  # decode from cp1252 encoding instead of the implicit ascii encoding used by unicode()
    s = unicodedata.normalize('NFKD', s).encode('ascii','ignore')
    return s
Opening the file with the correct encoding didn't work because the Python 2 csv module does not support unicode input: DictReader expects byte strings (str), so each value has to be decoded after it is read.
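For comparison, here is a rough Python 3 sketch of the same idea. In Python 3 the csv module works on text, so the file can be opened with the right encoding directly. The cp1252 default is taken from the discussion above; the function name read_ascii_names and the assumption that the column is called 'Name' are illustrative only.

```python
import csv
import unicodedata


def strip_accents(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def read_ascii_names(filename, encoding="cp1252"):
    """Read the 'Name' column of a CSV file and return ASCII-only values."""
    # newline="" is what the csv module expects for files it reads.
    with open(filename, "r", encoding=encoding, newline="") as handle:
        return [strip_accents(row["Name"]) for row in csv.DictReader(handle)]
```

Note that 'ignore' silently drops characters that have no ASCII decomposition, so this is lossy by design.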

Reference: @Duncan's answer to "UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)", which uses repr:
print repr(ch)
Example:
string = 'Ka\u011f KO\u011e52 \u0131 \u0130\u00f6\u00d6 David \u00fc K\u00dc\u015f\u015e \u00e7 \u00c7'
print (repr(string))
It prints:
'Kağ KOĞ52 ı İöÖ David ü KÜşŞ ç Ç'

Related

Python 3: Combining files gives "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 1: invalid start byte"

I have a binary file that was split into 7 pieces (for a CTF I am doing) and I have to reorder them:
from itertools import permutations
lst = ["1","2","3","4","5","6","7"]
for combo in permutations(lst, 7): # 2 for pairs, 3 for triplets, etc
    a = ''.join(combo)
    with open('/home/bitnami/fragmented/output/'+a, 'wb') as outfile:
        for fname in combo:
            with open('frag_'+fname) as infile:
                outfile.write(infile.read())
However I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 1: invalid start byte
Any advice that doesn't compromise the integrity of the files?
Thanks
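The error most likely comes from open('frag_'+fname): without a mode it opens the fragment in text mode, so read() tries to decode the raw bytes as UTF-8. A minimal sketch of reading the fragments in binary mode instead follows; the function name combine_fragments is made up for illustration, and the paths are whatever the caller supplies.

```python
def combine_fragments(fragment_paths, output_path):
    """Concatenate the fragments, in the given order, as raw bytes."""
    with open(output_path, "wb") as outfile:
        for path in fragment_paths:
            # "rb" returns bytes without decoding, so no codec is involved
            # and the data is copied unchanged.
            with open(path, "rb") as infile:
                outfile.write(infile.read())
```

Inside the permutations loop this could be called as combine_fragments(['frag_' + n for n in combo], output_name), where output_name is the destination path built from the combination.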

UnicodeDecodeError when reading from CSV/List: Unexpected End of Data

I am trying to use something called DeepMoji to grade a CSV full of tweets. The tweets must be encoded in Unicode. I have been able to make it work with a small dataset, but with my dataset of over 200,000 points I am receiving this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 254: unexpected end of data.
The code and the fix I have tried are below, but it gives the same error. Does anyone have any ideas?
TEST_SENTENCES = []
with open('Cleaned_Data3.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        TEST_SENTENCES.append(row["Tweet"])
try:
    [x.encode('utf-8') for x in TEST_SENTENCES]
except:
    for rows in TEST_SENTENCES: #attempt to fix the problem
        str=unicode(str, errors='replace')
Here is the full error code.
Traceback (most recent call last):
File "C:\Users\pjame\Desktop\DeepMoji-master\examples\score_texts_emojis.py", line 24, in <module>
for row in reader:
File "C:\Python27\lib\site-packages\unicodecsv\py2.py", line 217, in next
row = csv.DictReader.next(self)
File "C:\Python27\lib\csv.py", line 108, in next
row = self.reader.next()
File "C:\Python27\lib\site-packages\unicodecsv\py2.py", line 128, in next
for value in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 254: unexpected end of data
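The traceback shows the failure happening while the reader decodes a row, before the try/except is ever reached, so any error handling has to be applied at read time. As a hedged sketch in Python 3 (not the Python 2.7 and unicodecsv setup shown in the traceback), the file can be opened with errors="replace" so that undecodable bytes become U+FFFD instead of raising. The function name read_column is illustrative.

```python
import csv


def read_column(path, column, encoding="utf-8"):
    """Return one CSV column, replacing undecodable bytes with U+FFFD."""
    with open(path, "r", encoding=encoding, errors="replace", newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]
```

A lone 0xe2 with "unexpected end of data" suggests a multi-byte character that was cut off, so replacing it hides the symptom; it may still be worth checking how the file was produced.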

How does one print a Unicode character code in Python?

I would like to print a Unicode character's code in Python, not the actual glyph it represents.
For example, if u is a list of unicode characters:
>>> u[0]
u'\u0103'
>>> print u[0]
ă
I would like to output the character code as a raw string: u'\u0103'.
I have tried to just print it to a file, but this doesn't work without encoding it in UTF-8.
>>> w = open('~/foo.txt', 'w')
>>> print>>w, u[0].decode('utf-8')
Traceback (most recent call last):
File "<pyshell#33>", line 1, in <module>
print>>w, u[0].decode('utf-8')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0103' in position 0: ordinal not in range(128)
>>> print>>w, u[0].encode('utf-8')
>>> w.close()
Encoding it results in the glyph ă being written to the file.
How can I write the character code?
To print the escaped form of a unicode string, encode it with the raw_unicode_escape codec:
>>> s = u'\u0103'
>>> print s.encode('raw_unicode_escape')
\u0103
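For readers on Python 3, a small sketch of two equivalent approaches: formatting the code point from ord(), and using the unicode_escape codec. The helper names code_point_label and escaped are made up for illustration.

```python
def code_point_label(ch):
    """Format a single character as U+XXXX."""
    return "U+{:04X}".format(ord(ch))


def escaped(text):
    """Return the backslash-escaped form of text as an ASCII-only str."""
    return text.encode("unicode_escape").decode("ascii")


print(code_point_label("\u0103"))
print(escaped("\u0103"))
```

Because the result of escaped() contains only ASCII characters, it can be written to a file without choosing a further encoding.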

Ordinal not in range, but I can't find it: removing non-ASCII characters

I got an error while trying to convert a CSV to KML using minidom's .toprettyxml with:
for row in csvReader:
    placemarkElement = createPlacemark(kmlDoc, row, order)
    documentElement.appendChild(placemarkElement)
kmlFile = open(fileName, 'w')
kmlFile.write(kmlDoc.toprettyxml(' ', newl='\n', encoding='utf-8'))
the error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 5: ordinal not in range(128)
I have a lot of records, so I decided to print out whatever falls outside this range, since it doesn't seem to be 0xa0. I used:
def removeNonAscii(s):
    for i in s:
        if ord(i) >= 128:
            sys.stdout.write("\r{0}".format(i) + '\n')
            sys.stdout.write("\r{0}".format(ord(i)) + '\n')
    return "".join(i for i in s if ord(i) < 128)
to remove the offending chars from the output, and print out what they were.
The only thing that prints out is "á", with ordinal code 160.
I searched for this in both Notepad++ and Excel but cannot find it, so I don't understand where it is coming from.
I would rather know what I am removing with removeNonAscii, in case it is important.
I'm using utf-8 encoding, if that helps.
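One observation that may help: 160 in decimal is 0xa0 in hexadecimal, so the character that was printed is the same byte the error message names. The sketch below (Python 3 syntax) shows what that byte means under two single-byte encodings. That the console was displaying it with an old DOS code page such as cp437 is an assumption, not something the question states.

```python
import unicodedata

raw = b"\xa0"

print(hex(160))                      # decimal 160 written in hexadecimal

as_latin1 = raw.decode("latin-1")    # how Latin-1 interprets the byte
print(unicodedata.name(as_latin1))   # the official Unicode name of that character

as_cp437 = raw.decode("cp437")       # how a DOS console code page interprets it
print(as_cp437)
```

In Latin-1 the byte is a no-break space, which looks like an ordinary space in an editor and would explain why a search for "á" finds nothing.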

How to Reverse Hebrew String in Python?

I am trying to reverse Hebrew string in Python:
line = 'אבגד'
reversed = line[::-1]
print reversed
but I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 0: ordinal not in range(128)
Care to explain what I'm doing wrong?
EDIT:
I'm also trying to save the string into a file using:
w1 = open('~/fileName', 'w')
w1.write(reverseLine)
but now I get:
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-3: character maps to <undefined>
Any ideas how to fix that, too?
You need more than a plain string reversal to flip Hebrew backwards, because numbers and other embedded left-to-right text keep their own order.
The algorithm is much more complicated.
All the answers on this page (to date) will most likely scramble your numbers and non-Hebrew text.
For most cases you should use
from bidi.algorithm import get_display
print get_display(text)
Adding u in front of the hebrew string works for me:
In [1]: line = u'אבגד'
In [2]: reversed = line[::-1]
In [3]: print reversed
דגבא
For your second question, you can use:
import codecs
w1 = codecs.open("~/fileName", "w", "utf-8")
w1.write(reversed)
to write the unicode string to the file fileName.
Alternatively, without using codecs, you will need to encode the reversed string as UTF-8 when writing to the file:
with open('~/fileName', 'w') as f:
    f.write(reversed.encode('utf-8'))
You need to use a unicode string constant:
line = u'אבגד'
reversed = line[::-1]
print reversed
A plain string literal is treated as ASCII by default. Use u'' for unicode:
line = u'אבגד'
reversed = line[::-1]
print reversed
Make sure you're using unicode objects
line = unicode('אבגד', 'utf-8')
reversed = line[::-1]
print reversed
Found how to write to file:
w1 = codecs.open('~/fileName', 'w', encoding='utf-8')
w1.write(reverseLine)
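For completeness, a short Python 3 sketch of the same two steps, where every str is already unicode and open() accepts an encoding argument. The function name write_reversed and the temporary output path are illustrative. As noted in the bidi answer above, a plain slice reversal is only appropriate when the text contains no digits or left-to-right runs.

```python
import os
import tempfile


def write_reversed(text, path):
    """Reverse text code point by code point and write it as UTF-8."""
    reversed_text = text[::-1]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(reversed_text)
    return reversed_text


line = "\u05d0\u05d1\u05d2\u05d3"  # the Hebrew letters alef, bet, gimel, dalet
out_path = os.path.join(tempfile.gettempdir(), "reversed_hebrew.txt")
result = write_reversed(line, out_path)
```

Note that open() does not expand '~' by itself; os.path.expanduser can turn a path such as '~/fileName' into an absolute one first.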
