Dictionary keys with unicode characters raise Error - python

I am writing a CSV parser.
The CSV file has strings with unidentified characters, and a JSON file has a map from those strings to the correct ones.
file.csv
0,�urawska A.
1,Polnar J�zef
dict.json
{
"\ufffdurawska A.": "\u017burawska A.",
"Polnar J\ufffdzef": "Polnar J\u00f3zef"
}
parse.py
import csv
import json
proper_names = json.load(open('dict.json'))
with open('file.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print proper_names[row[1].decode('utf-8')]
Traceback (most recent call last):
  File "parse.py", line 9, in <module>
    print proper_names[row[1].decode('utf-8')]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017b' in position 0: ordinal not in range(128)
How can I use that dict with the decoded strings?

I could reproduce the error and identify where it occurs. In fact, a dictionary with unicode keys causes no problem; the error occurs when you try to print a unicode character that cannot be represented in ASCII. If you split the print into two lines:
for row in reader:
    val = proper_names[row[1].decode('utf-8')]
    print val
the error occurs on the print line.
You must encode it back with a suitable charset. The one I know best is latin1, but it cannot represent \u017b, so I use utf8 again:
for row in reader:
    val = proper_names[row[1].decode('utf-8')]
    print val.encode('utf8')
or directly
for row in reader:
    print proper_names[row[1].decode('utf-8')].encode('utf8')
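For comparison, here is a minimal Python 3 sketch of the same lookup. It assumes the data from the question, held in memory in the hypothetical variables `csv_text` and `json_text` rather than read from `file.csv` and `dict.json`. In Python 3 the csv module already yields `str`, so there is no `decode` step, and the dictionary lookup with non-ASCII keys works directly. The helper name `fix_names` is made up for illustration.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for file.csv and dict.json from the question.
csv_text = "0,\ufffdurawska A.\n1,Polnar J\ufffdzef\n"
json_text = r'''{"\ufffdurawska A.": "\u017burawska A.", "Polnar J\ufffdzef": "Polnar J\u00f3zef"}'''

proper_names = json.loads(json_text)


def fix_names(csv_file, mapping):
    """Return the corrected name for each row, or the raw value if unmapped."""
    reader = csv.reader(csv_file, delimiter=',')
    return [mapping.get(row[1], row[1]) for row in reader]


fixed = fix_names(io.StringIO(csv_text), proper_names)
```

With real files you would probably pass `encoding='utf-8'` to `open()` for both files. Whether `print(fixed[0])` then works still depends on the encoding of the terminal or output stream, which is the same root cause as in the Python 2 traceback.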

If I look at the error message, I think that the issue is the value, not the key. (\u017b is in the value)
So you also have to encode the result:
print proper_names[row[1].decode('utf-8')].encode('utf-8')
(edit: fixes to address comments for future reference)

Related

Reading Korean through a CSV in Python

I am having an issue reading a CSV file containing English and Korean characters into Python. I have tested my code without the Korean and it works fine.
Code (Python - 3.6.4)
import csv
with open('Kor3.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)
Error
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2176: character maps to <undefined>
CSV File Output: This has been converted from Excel to Unicode text, and then the filename was changed to CSV. I think this is the root of the problem.
Would it be better to read from an Excel or another format?
Sample Input (2 Columns)
생일 축하해요 Happy birthday
Just declare the encoding when opening the file:
with open('Kor3.csv', 'r', encoding='utf-8') as f:
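Use Python 3. The csv functions work with Unicode strings by default, although open() still falls back to the platform's default encoding unless you pass encoding explicitly.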
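A small sketch of declaring the encoding, using temporary files so it runs on its own. The file names, the helper `read_rows`, and the sample row are hypothetical. The second half rests on an assumption: Excel's "Unicode Text" export is usually UTF-16 and tab-delimited, so if `encoding='utf-8'` still fails on such a file, `encoding='utf-16'` with a tab delimiter may be worth trying.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8", delimiter=","):
    """Read a delimited text file with an explicitly declared encoding."""
    # newline="" is what the csv documentation recommends for csv file objects.
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))


sample = [["생일 축하해요", "Happy birthday"]]

with tempfile.TemporaryDirectory() as tmp:
    # Case 1: a comma-separated file saved as UTF-8.
    utf8_path = os.path.join(tmp, "kor_utf8.csv")
    with open(utf8_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(sample)
    rows_utf8 = read_rows(utf8_path)

    # Case 2 (assumption): a tab-separated file saved as UTF-16.
    utf16_path = os.path.join(tmp, "kor_utf16.txt")
    with open(utf16_path, "w", encoding="utf-16", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(sample)
    rows_utf16 = read_rows(utf16_path, encoding="utf-16", delimiter="\t")
```

Both reads should give back the same list of rows. The 'charmap' error in the question appears when no encoding is given and the platform default is a single-byte code page.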
In the end I just went with importing from the Excel file. I think this was an issue with the CSV rather than Python. Thanks for your help.
from xlrd import open_workbook
wb = open_workbook('Korean.xlsx')
values = []
for s in wb.sheets():
    #print 'Sheet:',s.name
    for row in range(1, s.nrows):
        col_names = s.row(0)
        col_value = []
        for name, col in zip(col_names, range(s.ncols)):
            value = (s.cell(row,col).value)
            try : value = str(int(value))
            except : pass
            col_value.append((value))
        values.append(col_value)
print(values) #test
print(values[0][1],values[1][1]) #test2

Extract json fields and write them into a csv with python

I've got a very big json with multiple fields and I want to extract just some of them and then write them into a csv.
Here is my code:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import json
import csv
data_file = open("book_data.json", "r")
values = json.load(data_file)
data_file.close()
with open("book_data.csv", "wb") as f:
    wr = csv.writer(f)
    for data in values:
        value = data["identifier"]
        value = data["authors"]
        for key, value in data.iteritems():
            wr.writerow([key, value])
It gives me this error:
  File "json_to_csv.py", line 22, in <module>
    wr.writerow([key, value])
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
But I declare the utf-8 encoding at the top of the file, so I don't know what's wrong there.
Thanks
You need to encode the data:
wr.writerow([key.encode("utf-8"), value.encode("utf-8")])
The difference is equivalent to:
In [8]: print u'\u2019'.encode("utf-8")
’
In [9]: print str(u'\u2019')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-9-4e3ad09ee31b> in <module>()
----> 1 print str(u'\u2019')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 0: ordinal not in range(128)
If you have a mixture of strings, lists and other values, you can use isinstance to check what you have; if you have a list, iterate over it and encode each item:
with open("book_data.csv", "wb") as f:
    wr = csv.writer(f)
    for data in values:
        for key, value in data.iteritems():
            wr.writerow([key, ",".join([v.encode("utf-8") for v in value]) if isinstance(value, list) else value.encode("utf8")])
To just write the three columns creator, contributor and identifier, just pull the data using the keys:
import csv
with open("book_data.csv", "wb") as f:
    wr = csv.writer(f)
    for dct in values:
        authors = dct["authors"]
        wr.writerow((",".join(authors["creator"]).encode("utf-8"),
                     "".join(authors["contributor"]).encode("utf-8"),
                     dct["identifier"].encode("utf-8")))
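The answers above target Python 2, where the csv module writes byte strings. As a rough Python 3 sketch of the same three-column export: the file object does the encoding, so the per-field `.encode("utf-8")` calls go away. The record layout and sample values are hypothetical, modelled on the keys used above, and `write_books` is a made-up helper name.

```python
import csv
import io

# Hypothetical records shaped like the question's data.
values = [
    {
        "identifier": "id-1",
        "authors": {"creator": ["O\u2019Brien, A."], "contributor": ["Smith, B."]},
    },
]


def write_books(records, out):
    """Write creator, contributor and identifier columns to a text stream."""
    wr = csv.writer(out)
    for dct in records:
        authors = dct["authors"]
        wr.writerow((",".join(authors["creator"]),
                     ",".join(authors["contributor"]),
                     dct["identifier"]))


# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open("book_data.csv", "w", encoding="utf-8", newline="") instead.
buf = io.StringIO()
write_books(values, buf)
csv_text = buf.getvalue()
```

The `# -*- coding: utf-8 -*-` line only declares the encoding of the source file itself. It has no effect on how data is encoded when written, which is why the original script still failed.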

Using ASCII number to character in python

I am trying to print a list of dicts to file that's encoded in latin-1. Each field is to be separated by an ASCII character 254 and the end of line should be ASCII character 20.
When I try to use a character that is greater than 128 I get "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 12: ordinal not in range(128)"
This is my current code. Could someone help me with how to encode ASCII character 254 and how to add an end-of-line ASCII character 20 when using DictWriter?
Thanks
my Code:
with codecs.open("test.dat", "w", "ISO-8859-1") as outputFile:
    delimiter = (chr(254))
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)
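ASCII only contains character codes 0-127.
Codes in the range 128-255 are not defined in ASCII, only in character sets that extend it, such as the Windows "ANSI" code pages, latin-1 or Unicode.
In your case, the file returned by codecs.open expects unicode text. The csv writer hands it byte strings containing chr(254), so Python 2 first tries to decode them with the default ASCII codec before encoding to ISO-8859-1, and that implicit decode fails.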
It works if you use the standard built-in open function without specifying a codec:
with open("test.dat", "w") as outputFile: # omit the codec stuff here
    delimiter = (chr(254))
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)
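The answer above does not cover the end-of-line character. A possible Python 3 sketch, under the assumption that "ASCII character 20" means decimal 20 (0x14): the csv writer accepts a `lineterminator` argument, and in Python 3 `chr(254)` is the text character U+00FE, which latin-1 encodes as the single byte 0xFE. The sample `file_dict` and the helper `dump_rows` are hypothetical.

```python
import csv
import io

# Hypothetical sample data standing in for file_dict from the question.
file_dict = [{"id": "1", "name": "caf\u00e9"}, {"id": "2", "name": "tea"}]


def dump_rows(rows, binary_out):
    """Write rows as latin-1 with 0xFE between fields and 0x14 after each record."""
    text_out = io.TextIOWrapper(binary_out, encoding="latin-1", newline="")
    writer = csv.DictWriter(
        text_out,
        fieldnames=list(rows[0].keys()),
        delimiter=chr(254),       # U+00FE, one byte (0xFE) in latin-1
        lineterminator=chr(20),   # written in place of the default "\r\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    text_out.flush()
    text_out.detach()  # leave the underlying binary stream open for the caller


raw = io.BytesIO()
dump_rows(file_dict, raw)
payload = raw.getvalue()
```

For a real file, `open("test.dat", "w", encoding="latin-1", newline="")` could replace the `BytesIO` and `TextIOWrapper` pair. Note that Python's csv reader ignores `lineterminator`, so reading such a file back needs its own record splitting.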

Write zlib compressed utf8 data to a file

I have a file with data encoded in utf-8. I would like to read the data, remove whitespaces, separate words with a newline, compress the entire content and write them to a file. This is what I am trying to do :
with codecs.open('1020104_4.utf8', encoding='utf8', mode='r') as fr :
    data = re.split(r'\s+',fr.read().encode('utf8'))
    #with codecs.open('out2', encoding='utf8', mode='w') as fw2 :
    data2 = ('\n'.join(data)).decode('utf8')
    data3 = zlib.compress(data2)
    #fw2.write(data3)
However I get an error :
Traceback (most recent call last):
  File "tmp2.py", line 17, in <module>
    data3 = zlib.compress(data2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-48: ordinal not in range(128)
How can I write this data to a file?
I think your encoding-foo is just the wrong way round; in Python 3 this would be a lot clearer ☺.
First, when splitting you want to do this on decoded data, i.e. on Unicode strings, which you already get from read since you are using codecs.open, so the first line should be
data = re.split(r'\s+', fr.read())
Consequently, before passing data to zlib you want to convert it to bytes by encoding it:
data2 = ('\n'.join(data)).encode('utf8')
data3 = zlib.compress(data2)
In the last step you want to write it to a binary file handle:
with open("output", "wb") as fw:
    fw.write(data3)
You can shorten this a bit by using the gzip module instead:
with codecs.open('1020104_4.utf8', encoding='utf8', mode='r') as fr:
    data = re.split(r'\s+', fr.read())
with gzip.open('out2', mode='wb') as fw2 :
    data2 = ('\n'.join(data)).encode('utf8')
    fw2.write(data2)
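To make the order of operations concrete, here is a small Python 3 sketch of the round trip on an in-memory string: split the text, join with newlines, encode to bytes, compress, then reverse the steps. The helper names and the sample text are made up. It uses `str.split()` rather than `re.split(r'\s+', ...)`, which is a deliberate substitution: `str.split()` also drops the empty strings that leading or trailing whitespace would otherwise produce.

```python
import zlib


def compress_words(text):
    """Split on whitespace, join with newlines, encode to UTF-8, compress."""
    words = text.split()  # text in, text out: still str at this point
    return zlib.compress("\n".join(words).encode("utf-8"))  # bytes in, bytes out


def decompress_words(blob):
    """Reverse of compress_words: decompress, decode, split on newlines."""
    return zlib.decompress(blob).decode("utf-8").split("\n")


sample = "na\u00efve  caf\u00e9\tr\u00e9sum\u00e9\n"
blob = compress_words(sample)
words = decompress_words(blob)
```

`blob` is a `bytes` object, so it would be written with a file opened in `"wb"` mode, as in the answer above.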

Python process a csv file to remove unicode characters greater than 3 bytes

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are greater than 3 bytes. (Sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row-by-row, and wherever I spot unicode characters of >3 bytes, replace them with a replacement character.
# -*- coding: utf-8 -*-
import csv
import re
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
ifile = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
#skip header row
next(reader, None)
for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])
ifile.close()
ofile.close()
I'm currently getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So this does iterate properly through some rows, but stops when it gets to the strange unicode characters.
I'd really appreciate some pointers; I'm completely confused. I've replaced 'utf8' with 'latin1' and unicode(c).encode to unicode(c).decode and I keep getting this same error.
Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:
writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
If you use your file objects as context managers, there is no need to manually close them:
import csv
import re
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')
with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None) # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)
I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
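For readers on Python 3, the same filtering idea can be sketched without any decode or encode step, since csv fields are already `str`. This is a sketch, not the answer's code: the pattern here matches the supplementary planes directly, which on Python 3 should be equivalent to the negated character class above, and `limit_to_bmp` with its sample input is illustrative only.

```python
import re

# Code points above U+FFFF are exactly those that need 4 bytes in UTF-8.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def limit_to_bmp(value, replacement="\ufffd"):
    """Replace every character outside the Basic Multilingual Plane."""
    return NON_BMP.sub(replacement, value)


cleaned = limit_to_bmp("ok \U0001F600 done")

# Longest UTF-8 encoding of any single character left in the result.
max_len = max(len(ch.encode("utf-8")) for ch in cleaned)
```

In a csv loop this would be applied per field, for example `writer.writerow([limit_to_bmp(c) for c in row])`, with both files opened using `encoding='utf-8'` and `newline=''`.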
