Using an ASCII number as a character in Python

I am trying to print a list of dicts to a file encoded in latin-1. Each field is to be separated by ASCII character 254, and each line should end with ASCII character 20.
When I try to use a character greater than 128 I get: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 12: ordinal not in range(128)"
This is my current code. Could someone help me with how to write char 254 as the delimiter and char 20 as the line ending when using DictWriter?
Thanks
My code:
with codecs.open("test.dat", "w", "ISO-8859-1") as outputFile:
    delimiter = chr(254)
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)

ASCII only defines character codes 0-127.
Codes in the range 128-255 are not defined in ASCII, only in encodings that extend it, such as ANSI code pages, latin-1, or the Unicode encodings.
In your case the string is effectively being encoded twice: the csv module hands the codecs writer a byte string, which Python first tries to decode with the default ascii codec before re-encoding it as ISO-8859-1, and that decode fails on byte 0xfe.
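A minimal reproduction of that implicit decode, as a sketch:

import codecs

f = codecs.open("test.dat", "w", "ISO-8859-1")
# write() receives a byte string here; codecs first decodes it with the
# default ascii codec before encoding to ISO-8859-1, so byte 0xfe raises
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe ...
f.write(chr(254))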
It works if you use the standard built-in open function without specifying a codec:
with open("test.dat", "w") as outputFile:  # omit the codec stuff here
    delimiter = chr(254)
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)
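The question also asked about ending each record with char 20; DictWriter accepts a lineterminator argument for that. A minimal Python 2 sketch, assuming file_dict holds latin-1-compatible byte strings (the sample data here is made up):

import csv

file_dict = [{"id": "1", "name": "caf\xe9"}]  # latin-1 byte strings (assumed shape)

with open("test.dat", "wb") as outputFile:
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys,
                                 delimiter=chr(254),      # field separator, byte 0xfe
                                 lineterminator=chr(20))  # record terminator, byte 0x14
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)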

Related

Dictionary keys with unicode characters raise Error

I am writing a CSV parser. The CSV file has strings with unidentified characters, and a JSON file has a map to the correct strings.
file.csv
0,�urawska A.
1,Polnar J�zef
dict.json
{
    "\ufffdurawska A.": "\u017burawska A.",
    "Polnar J\ufffdzef": "Polnar J\u00f3zef"
}
parse.py
import csv
import json

proper_names = json.load(open('dict.json'))

with open('file.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print proper_names[row[1].decode('utf-8')]
Traceback (most recent call last):
  File "parse.py", line 9, in <module>
    print proper_names[row[1].decode('utf-8')]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017b' in position 0: ordinal not in range(128)
How can I use that dict with decoded strings?
I could reproduce the error and identify where it occurs. In fact, a dictionary with unicode keys causes no problem; the error occurs when you try to print a unicode character that cannot be represented in ascii. If you split the print into two lines:
for row in reader:
    val = proper_names[row[1].decode('utf-8')]
    print val
the error will occur on the print line.
You must encode it back with a suitable charset. The one I know best is latin-1, but it cannot represent \u017b, so here again I use utf-8:
for row in reader:
    val = proper_names[row[1].decode('utf-8')]
    print val.encode('utf8')
or directly
for row in reader:
    print proper_names[row[1].decode('utf-8')].encode('utf8')
If I look at the error message, I think the issue is the value, not the key (\u017b is in the value).
So you also have to encode the result:
print proper_names[row[1].decode('utf-8')].encode('utf-8')
(edit: fixes to address comments for future reference)
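For reference, the mechanics behind the failing print: Python 2 encodes a unicode string with sys.stdout's encoding before writing, and falls back to ascii when that is unset. A quick sketch:

import sys

print sys.stdout.encoding  # terminal encoding, or None when piped (then ascii is assumed)

name = u'\u017burawska A.'
print name.encode('utf-8')  # an explicit encode sidesteps the implicit ascii encode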

Ordinal not in range, but I can't find it: removing non-ASCII characters

I get an error when trying to convert a CSV to KML using minidom's toprettyxml with:
for row in csvReader:
    placemarkElement = createPlacemark(kmlDoc, row, order)
    documentElement.appendChild(placemarkElement)
kmlFile = open(fileName, 'w')
kmlFile.write(kmlDoc.toprettyxml(' ', newl='\n', encoding='utf-8'))
the error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 5: ordinal not in range(128)
I have a lot of records, so I decided to print out whatever falls outside this range, as I couldn't see any 0xa0 myself. I used:
def removeNonAscii(s):
    for i in s:
        if ord(i) > 128:
            sys.stdout.write("\r{0}".format(i) + '\n')
            sys.stdout.write("\r{0}".format(ord(i)) + '\n')
    return "".join(i for i in s if ord(i) < 128)
to remove the offending chars from the output, and print out what they were.
The only thing that prints out is "á", with ordinal code 160.
I searched for this both in Notepad++ and in Excel, but cannot find it, so I don't understand where this is coming from?
I would rather know what I am removing with removeNonAscii, in case it is important.
I'm using utf-8 encoding, if that helps.
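A note that may explain the mystery: byte 0xa0 is a non-breaking space in latin-1/cp1252, which is invisible in most editors, while on a Windows console using code page 437/850 it renders as 'á'. A small sketch for locating offending bytes by offset, assuming the raw file bytes are at hand ('source.csv' is a placeholder name):

def report_non_ascii(raw):
    # Print the offset and value of every byte outside the ASCII range,
    # so the character can be located in the source file.
    for pos, ch in enumerate(raw):
        if ord(ch) > 127:
            print "offset %d: byte 0x%02x (%r)" % (pos, ord(ch), ch)

report_non_ascii(open('source.csv', 'rb').read())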

Converting non-ASCII characters to ASCII from DictReader

There are many questions on Python and unicode/strings; however, none of the answers work for me.
First, a file is opened using DictReader and each row is put into an array. Then the dict value is sent to be converted to unicode.
Step one is getting the data:
f = csv.DictReader(open(filename, "r"))
data = []
for row in f:
    data.append(row)
Step two is getting a string value from the dict and stripping the accents (found in other posts):
s = data[i].get('Name')
strip_accents(s)

def strip_accents(s):
    try: s = unicode(s)
    except: s = s.encode('utf-8')
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
    return s
I use the try/except because some strings have accents and others don't. What I cannot figure out is why unicode(s) works on a str with no accents yet fails on a str with accents:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 11: ordinal not in range(128)
I have seen posts on this but the answers do not work. When I use type(s), it says it is <type 'str'>. So I tried to read the file as unicode:
f = csv.DictReader(codecs.open(filename,"r",encoding='utf-8'))
But as soon as it goes to read
data = []
for row in f:
    data.append(row)
This error occurs:
File "F:...files.py", line 9, in files
for row in f:
File "C:\Python27\lib\csv.py", line 104, in next
row = self.reader.next()
File "C:\Python27\lib\codecs.py", line 684, in next
return self.reader.next()
File "C:\Python27\lib\codecs.py", line 615, in next
line = self.readline()
File "C:\Python27\lib\codecs.py", line 530, in readline
data = self.read(readsize, firstline=True)
File "C:\Python27\lib\codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 0: invalid start byte
Is this error caused by the way DictReader handles unicode? How can I get around it?
More tests: as @univerio pointed out, one item causing the failures is ISO-8859-1 encoded.
Modifying the open statement to:
f = csv.DictReader(codecs.open(filename,"r",encoding="cp1252"))
produces a slightly different error:
File "F:...files.py", line 9, in files
for row in f:
File "C:\Python27\lib\csv.py", line 104, in next
row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
Using the basic open statement and modifying strip_accents() so that its body reads:
    try: s = unicode(s)
    except: s = s.decode("iso-8859-1").encode('utf8')
    print type(s)
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
    return str(s)
prints that the type is still str, and it errors on:
s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
TypeError: must be unicode, not str
Based on Python: Converting from ISO-8859-1/latin1 to UTF-8, modifying it to:
s = unicode(s.decode("iso-8859-1").encode('utf8'))
produces a different error:
except: s = unicode(s.decode("iso-8859-1").encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
I think this should work:
def strip_accents(s):
    s = s.decode("cp1252")  # decode from cp1252 instead of the implicit ascii encoding used by unicode()
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
    return s
The reason opening the file with the correct encoding didn't work is that DictReader doesn't seem to handle unicode strings correctly.
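A quick usage sketch under the same assumption that the file is cp1252-encoded ('people.csv' is a placeholder filename; the 'Name' column is from the question):

import csv
import unicodedata

def strip_accents(s):
    s = s.decode("cp1252")  # byte string from the csv module -> unicode
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

with open('people.csv', 'rb') as f:
    for row in csv.DictReader(f):
        print strip_accents(row['Name'])  # e.g. 'M\xfcller' -> 'Muller'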
For reference, see UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128); per @Duncan's answer, inspect values with:
print repr(ch)
Example:
string = 'Ka\u011f KO\u011e52 \u0131 \u0130\u00f6\u00d6 David \u00fc K\u00dc\u015f\u015e \u00e7 \u00c7'
print (repr(string))
It prints:
'Kağ KOĞ52 ı İöÖ David ü KÜşŞ ç Ç'
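In Python 2 the same trick is handy because repr also reveals whether you have bytes or unicode; a sketch:

s = 'Ka\xc4\x9f'                 # the utf-8 bytes for u'Kağ'
print repr(s)                    # -> 'Ka\xc4\x9f' (a byte string)
print repr(s.decode('utf-8'))    # -> u'Ka\u011f' (a unicode string)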

Python process a csv file to remove unicode characters greater than 3 bytes

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are longer than 3 bytes in UTF-8. (I'm sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row by row and, wherever I spot a unicode character of more than 3 bytes, replace it with a replacement character.
# -*- coding: utf-8 -*-
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

ifile = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

# skip header row
next(reader, None)

for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])

ifile.close()
ofile.close()
I'm currently getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So this does iterate properly through some rows, but it stops when it gets to the strange unicode characters.
I'd really appreciate some pointers; I'm completely confused. I've replaced 'utf8' with 'latin1' and unicode(c).encode with unicode(c).decode, and I keep getting this same error.
Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:
writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
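A quick interactive illustration of that fallback (the sample bytes are the utf-8 encoding of 'café'):

>>> unicode('caf\xc3\xa9')            # no codec given, so ascii is assumed
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> unicode('caf\xc3\xa9', 'utf8')    # explicit codec decodes correctly
u'caf\xe9'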
If you use your file objects as context managers, there is no need to manually close them:
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')

with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)
I also moved the replacement action to a separate function, and used a generator expression to produce all rows on demand for the writer.writerows() call.
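Continuing from the snippet above, a quick sanity check of the pattern (the emoji is just an arbitrary character that needs 4 bytes in UTF-8):

sample = u'ok \U0001F600'.encode('utf8')
# The emoji falls outside the BMP ranges in re_pattern, so it is replaced with
# U+FFFD (twice on narrow Python 2 builds, where it is stored as a surrogate pair).
print limit_to_BMP(sample)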

How to Reverse Hebrew String in Python?

I am trying to reverse a Hebrew string in Python:
line = 'אבגד'
reversed = line[::-1]
print reversed
but I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 0: ordinal not in range(128)
Care to explain what I'm doing wrong?
EDIT:
I'm also trying to save the string into a file using:
w1 = open('~/fileName', 'w')
w1.write(reverseLine)
but now I get:
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-3: character maps to <undefined>
Any ideas how to fix that, too?
You need more than string reversal to flip Hebrew backwards, due to the opposite ordering of numbers etc.
The algorithm is much more complicated; all the answers on this page (to this date) will most likely mangle your numbers and non-Hebrew text.
For most cases you should use:
from bidi.algorithm import get_display
print get_display(text)
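For instance, a sketch with mixed Hebrew and digits (assuming the python-bidi package, which provides bidi.algorithm, is installed):

# -*- coding: utf-8 -*-
from bidi.algorithm import get_display

text = u'אבגד 123'
print get_display(text).encode('utf-8')  # Hebrew letters reordered for display, '123' left intact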
Adding u in front of the Hebrew string works for me:
In [1]: line = u'אבגד'

In [2]: reversed = line[::-1]

In [3]: print reversed
דגבא
To your second question, you can use:
import codecs

w1 = codecs.open("~/fileName", "w", "utf-8")  # mode "w", not "r", to write; note Python does not expand '~'
w1.write(reversed)
to write the unicode string to fileName.
Alternatively, without using codecs, you will need to encode reversed string with utf-8 when writing to file:
with open('~/fileName', 'w') as f:
    f.write(reversed.encode('utf-8'))
You need to use a unicode string constant:
line = u'אבגד'
reversed = line[::-1]
print reversed
Plain string literals default to being treated as ascii; use u'' for unicode:
line = u'אבגד'
reversed = line[::-1]
print reversed
Make sure you're using unicode objects:
line = unicode('אבגד', 'utf-8')
reversed = line[::-1]
print reversed
Found how to write to file:
w1 = codecs.open('~/fileName', 'w', encoding='utf-8')
w1.write(reverseLine)
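Putting the pieces together, a minimal end-to-end sketch (note that 'reversed' shadows the built-in of the same name, so a different variable name is used here):

# -*- coding: utf-8 -*-
import codecs
import os.path

line = u'אבגד'
reversed_line = line[::-1]

# expanduser is needed because open() does not expand '~' itself
with codecs.open(os.path.expanduser('~/fileName'), 'w', encoding='utf-8') as w1:
    w1.write(reversed_line)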
