I've got a very big JSON file with multiple fields, and I want to extract just some of them and write them to a CSV file.
Here is my code:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import json
import csv

data_file = open("book_data.json", "r")
values = json.load(data_file)
data_file.close()

with open("book_data.csv", "wb") as f:
    wr = csv.writer(f)
    for data in values:
        value = data["identifier"]
        value = data["authors"]
        for key, value in data.iteritems():
            wr.writerow([key, value])
It gives me this error:

  File "json_to_csv.py", line 22, in <module>
    wr.writerow([key, value])
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
But I declare the utf-8 encoding at the top, so I don't know what's wrong.
Thanks
You need to encode the data:
wr.writerow([key.encode("utf-8"), value.encode("utf-8")])
The difference is equivalent to:
In [8]: print u'\u2019'.encode("utf-8")
’

In [9]: print str(u'\u2019')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-9-4e3ad09ee31b> in <module>()
----> 1 print str(u'\u2019')

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 0: ordinal not in range(128)
If your values are a mixture of strings and lists, you can use isinstance to check what you have; if you have a list, iterate over it and encode each element:
with open("book_data.csv", "wb") as f:
wr = csv.writer(f)
for data in values:
for key, value in data.iteritems():
wr.writerow([key, ",".join([v.encode("utf-8") for v in value]) if isinstance(value, list) else value.encode("utf8")])
To write just the three columns creator, contributor and identifier, pull the data out using the keys:
import csv

with open("book_data.csv", "wb") as f:
    wr = csv.writer(f)
    for dct in values:
        authors = dct["authors"]
        wr.writerow((",".join(authors["creator"]).encode("utf-8"),
                     "".join(authors["contributor"]).encode("utf-8"),
                     dct["identifier"].encode("utf-8")))
Related
First of all, I know there is a standard way of doing the task I state in the title. For example,
import csv

with open('test.txt', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
When I apply this code to my data file (~262 MB) in a Jupyter terminal, I get this:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-cbed80c58499> in <module>()
      2 with open('CarRecord.txt', encoding='utf-8') as f:
      3     reader = csv.reader(f)
----> 4     for row in reader:
      5         print(row)

//anaconda/envs/py35/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 74: invalid start byte
Okay, position 74 is in the first row of my data file, where the first Chinese character comes up. So I do another quick test: I copy the first few rows from my data file and paste them into a new file. Running the same code on the test file works as I would expect, without any error messages.
Does anyone have any ideas?
------updated following the ideas in the comment:-------
import csv

with open('CarRecord.txt', mode='rb') as f:
    decoded_file = f.read().decode('utf-16')
    reader = csv.reader(decoded_file, delimiter=',')
    for row in reader:
        print(row)
now I get:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-37-3708b52ef0a3> in <module>()
      1 import csv
      2 with open('CarRecord.txt', mode='rb') as f:
----> 3     decoded_file = f.read().decode('utf-16')
      4     reader = csv.reader(decoded_file, delimiter=',')
      5     for row in reader:

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1780-1781: illegal UTF-16 surrogate
This is not a precise answer to the question. It turns out that the original data file -- although it contains non-ASCII characters -- was not actually saved as UTF-8 (the offending byte 0xa9 points to a legacy encoding). So I saved a new copy of the data file encoded as UTF-8, and the standard method of reading a CSV file just worked.
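For future readers, a minimal recovery sketch: decode with the file's real legacy encoding and re-save as UTF-8. Given the Chinese text, 'gbk' or 'big5' are plausible guesses (both allow 0xa9 as a lead byte), but that is an assumption -- check where the data actually came from:

import csv

# Assumed legacy encoding; adjust 'gbk' to whatever the source actually used.
with open('CarRecord.txt', encoding='gbk') as src:
    text = src.read()

# Re-save as UTF-8 so the standard reading code works from now on.
with open('CarRecord_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)

# Side note: csv.reader wants an iterable of lines; passing a plain string
# (as in the update above) iterates it character by character.
for row in csv.reader(text.splitlines(), delimiter=','):
    print(row)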
This question already has answers here:
Python CSV DictReader with UTF-8 data
I am trying to convert a CSV file to a JSON file. During that process, when I try to write to the JSON file, I get a unicode error halfway through:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u06ec' in position 933: ordinal not in range(128)
my code:
import csv
import json
import codecs

csvfile = codecs.open('my.csv', 'r', encoding='utf-8', errors='ignore')
jsonfile = codecs.open('my.json', "w", encoding='utf-8', errors='ignore')
fieldnames = ("Title", "Date", "Text", "Country", "Page", "Week")
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
    row['Text'] = row['Text'].encode('ascii', errors='ignore')  # error occurs on this line
    json.dump(row, jsonfile)
    jsonfile.write('\n')
example of a row:
{'Country': 'UK', 'Title': '12345', 'Text': " hi there hi john i currently ", 'Week': 'week2', 'Page': 'homepage', 'Date': '1/3/16'}
Don't convert to ASCII.
JSON handles unicode natively.
Simply remove the .encode("ascii", ...) part.
Also, you don't need to set an encoding on the file object you use for JSON, because json already serialises unicode correctly.
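Concretely, the loop becomes this minimal sketch (reusing the reader and jsonfile from the question; it assumes the CSV bytes are valid UTF-8 -- see the binary-mode edit below for handling invalid bytes):

for row in reader:
    # no .encode('ascii', ...) -- json.dump escapes non-ASCII itself
    # (ensure_ascii=True is the default, so the output stays plain ASCII)
    json.dump(row, jsonfile)
    jsonfile.write('\n')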
I edited my code to read the CSV file as binary. That gave me another issue of an invalid byte, which I solved by converting the text string to unicode.
This is the working code:
csvfile = open('my.csv', 'rb')
jsonfile = codecs.open('my.json', "w")
fieldnames = ("Title", "Date", "Text", "Country", "Page", "Week")
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
    print row
    row['Text'] = unicode(row['Text'], errors='replace')
    json.dump(row, jsonfile)
    jsonfile.write('\n')
I'm writing a CSV parser. The CSV file has strings with garbled characters, and a JSON file maps them to the correct strings.
file.csv:

0,�urawska A.
1,Polnar J�zef

dict.json:

{
    "\ufffdurawska A.": "\u017burawska A.",
    "Polnar J\ufffdzef": "Polnar J\u00f3zef"
}
parse.py:

import csv
import json

proper_names = json.load(open('dict.json'))

with open('file.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print proper_names[row[1].decode('utf-8')]
Traceback (most recent call last):
  File "parse.py", line 9, in <module>
    print proper_names[row[1].decode('utf-8')]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017b' in position 0: ordinal not in range(128)
How can I use that dict with decoded strings?
I could reproduce the error and identify where it occurs. In fact, a dictionary with unicode keys causes no problem; the error occurs when you try to print a unicode character that cannot be represented in ASCII. If you split the print across two lines:
for row in reader:
    val = proper_names[row[1].decode('utf-8')]
    print val
the error will occur on the print line.
You must encode it back with a suitable charset. The one I know best is latin1, but it cannot represent \u017b, so I use utf8 again:
for row in reader:
    val = proper_names[row[1].decode('utf-8')]
    print val.encode('utf8')
or directly
for row in reader:
    print proper_names[row[1].decode('utf-8')].encode('utf8')
Looking at the error message, I think the issue is the value, not the key (\u017b is in the value).
So you also have to encode the result:
print proper_names[row[1].decode('utf-8')].encode('utf-8')
(edit: fixes to address comments for future reference)
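An alternative, if there are many print statements, is to wrap sys.stdout once with the standard codecs module so that every print encodes automatically. A small sketch, reusing reader and proper_names from the question:

import codecs
import sys

# Wrap stdout so unicode strings are UTF-8-encoded on the way out; helpful
# when Python 2 detects stdout's encoding as ascii (e.g. under a pipe).
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

for row in reader:
    print proper_names[row[1].decode('utf-8')]  # no explicit .encode() needed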
I have a file with data encoded in utf-8. I would like to read the data, remove whitespace, separate the words with newlines, compress the entire content, and write it to a file. This is what I am trying to do:
with codecs.open('1020104_4.utf8', encoding='utf8', mode='r') as fr:
    data = re.split(r'\s+', fr.read().encode('utf8'))
    #with codecs.open('out2', encoding='utf8', mode='w') as fw2:
    data2 = ('\n'.join(data)).decode('utf8')
    data3 = zlib.compress(data2)
    #fw2.write(data3)
However I get an error :
Traceback (most recent call last):
  File "tmp2.py", line 17, in <module>
    data3 = zlib.compress(data2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-48: ordinal not in range(128)
How can I write this data to a file?
I think your encoding-foo is just the wrong way round; in Python 3 this would be a lot clearer ☺.
First, when splitting, you want to operate on decoded data, i.e. on Unicode strings, which you already get from read() since you are using codecs.open. So the first line should be:
data = re.split(r'\s+', fr.read())
Consequently, before passing data to zlib you want to convert it to bytes by encoding it:
data2 = ('\n'.join(data)).encode('utf8')
data3 = zlib.compress(data2)
In the last step you want to write it to a binary file handle:
with open("output", "wb") as fw:
fw.write(data3)
You can shorten this a bit by using the gzip module instead:
import codecs
import gzip
import re

with codecs.open('1020104_4.utf8', encoding='utf8', mode='r') as fr:
    data = re.split(r'\s+', fr.read())

with gzip.open('out2', mode='wb') as fw2:
    data2 = ('\n'.join(data)).encode('utf8')
    fw2.write(data2)
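To sanity-check the round trip, you can read the compressed file back with gzip and decode it; a quick sketch, assuming the out2 file written above:

import gzip

with gzip.open('out2', mode='rb') as fr2:
    restored = fr2.read().decode('utf8')  # bytes back to a unicode string
print restored.splitlines()[:5]           # first few whitespace-separated words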
I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are greater than 3 bytes. (Sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row-by-row, and wherever I spot unicode characters of >3 bytes, replace them with a replacement character.
# -*- coding: utf-8 -*-
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

ifile = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

# skip header row
next(reader, None)

for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])

ifile.close()
ofile.close()
I'm currently getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So this does iterate properly through some rows, but stops when it hits the strange unicode characters.
I'd really appreciate some pointers; I'm completely confused. I've replaced 'utf8' with 'latin1' and unicode(c).encode with unicode(c).decode, and I keep getting the same error.
Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:
writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
If you use your file objects as context managers, there is no need to manually close them:
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')

with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)
I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
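For reference, a minimal Python 3 sketch of the same filter (same file names assumed): there the csv module yields str directly, so no decode/encode round trip is needed.

import csv
import re

re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

with open('sourcefile.csv', encoding='utf-8', newline='') as ifile, \
     open('outputfile.csv', 'w', encoding='utf-8', newline='') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # skip the header row, as before
    writer.writerows([re_pattern.sub('\uFFFD', c) for c in row] for row in reader)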