I'm able to export a MySQL table into a CSV file via the Python csv module, but there are no UTF-8 characters (for example: ???? instead of ąöę).
The table data is in UTF-8 format (phpMyAdmin shows me the correct data).
I found some information saying that in Python all data should be decoded as UTF-8 and then encoded back to UTF-8 when writing the CSV, via e.g. UnicodeWriter (because the native csv module doesn't support Unicode correctly).
I tried a lot, but with no success.
Question: is there any example script that exports a UTF-8 MySQL database to a UTF-8 CSV file in Python?
I use Ubuntu 14.04, and there is a problem with mysql.connector, so I am using MySQLdb with Gord Thompson's code:
# -*- coding: utf-8 -*-
import csv
import MySQLdb
from UnicodeSupportForCsv import UnicodeWriter
import sys
reload(sys)
sys.setdefaultencoding('utf8')
#sys.setdefaultencoding('Cp1252')

conn = MySQLdb.Connection(db='sampledb', host='localhost',
                          user='sampleuser', passwd='samplepass')
crsr = conn.cursor()
crsr.execute("SELECT * FROM rfid")
with open(r'test.csv', 'wb') as csvfile:
    uw = UnicodeWriter(
        csvfile, delimiter=',',
        quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in crsr.fetchall():
        uw.writerow([unicode(col) for col in row])
The error still exists: UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 2: invalid continuation byte
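For what it's worth, 0xf3 is 'ó' in Latin-1/cp1252 (and in the Polish cp1250), and a lone 0xf3 byte is never valid UTF-8, which suggests the connection was returning single-byte-encoded data rather than UTF-8. An illustrative REPL check:

>>> '\xf3'.decode('utf8')       # raises UnicodeDecodeError, as above
>>> '\xf3'.decode('latin-1')
u'\xf3'
>>> print '\xf3'.decode('latin-1')
ó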
MySQL is great at converting character sets, but you need to tell it to set up the connection using the correct collation.
By default it returns the data just as it was put into the database. Add the required charset to the connection:

conn = MySQLdb.Connection(db='sampledb', host='localhost',
                          user='sampleuser', passwd='samplepass',
                          charset='utf8')  # MySQLdb expects MySQL's spelling 'utf8', not 'utf-8'
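If MySQLdb still hands back byte strings, use_unicode=True makes it decode text columns for you; a small untested sketch reusing the question's sample credentials and rfid table:

import MySQLdb

# charset='utf8' sets the connection character set; use_unicode=True makes
# text columns arrive as unicode objects, ready for UnicodeWriter without
# any manual .decode() call.
conn = MySQLdb.Connection(db='sampledb', host='localhost',
                          user='sampleuser', passwd='samplepass',
                          charset='utf8', use_unicode=True)
crsr = conn.cursor()
crsr.execute("SELECT * FROM rfid")
rows = crsr.fetchall()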
Is this helpful?
This works for me with Python 2.7.5 and MySQL Connector/Python 2.0.4:
# -*- coding: utf-8 -*-
import csv
import mysql.connector
from UnicodeSupportForCsv import UnicodeWriter

conn = mysql.connector.connect(
    host='localhost', port=3307,
    user='root', password='whatever',
    database='mydb')
crsr = conn.cursor()
crsr.execute("SELECT * FROM vocabulary")
with open(r'C:\Users\Gord\Desktop\test.csv', 'wb') as csvfile:
    uw = UnicodeWriter(
        csvfile, delimiter=',',
        quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in crsr.fetchall():
        uw.writerow([unicode(col) for col in row])
The UnicodeWriter class is taken directly from the last example on the documentation page for the csv module, which I stored in a file named "UnicodeSupportForCsv.py":
import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
Finally, it works! Thanks to Gord Thompson and Prikkeldraad.
Thanks, guys!
# -*- coding: utf-8 -*-
import csv
import MySQLdb
from UnicodeSupportForCsv import UnicodeWriter
import sys
reload(sys)
sys.setdefaultencoding('utf8')
#sys.setdefaultencoding('Cp1252')

conn = MySQLdb.Connection(db='testdb', host='localhost', user='testuser',
                          passwd='testpasswd', use_unicode=0, charset='utf8')
crsr = conn.cursor()
crsr.execute("SELECT * FROM rfid")
with open(r'test.csv', 'wb') as csvfile:
    uw = UnicodeWriter(
        csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in crsr.fetchall():
        uw.writerow([unicode(col) for col in row])
Try this one; it makes things easy for you:
https://github.com/jdunck/python-unicodecsv
unicodecsv is a drop-in replacement for Python 2.7's csv module which supports unicode strings without hassle. Supported versions are Python 2.6, 2.7, 3.3, 3.4, 3.5, and PyPy 2.4.0.
>>> import unicodecsv as csv
>>> from io import BytesIO
>>> f = BytesIO()
>>> w = csv.writer(f, encoding='utf-8')
>>> _ = w.writerow((u'é', u'ñ'))
>>> _ = f.seek(0)
>>> r = csv.reader(f, encoding='utf-8')
>>> next(r) == [u'é', u'ñ']
True
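Applied to the MySQL export from the original question, a minimal untested sketch (same sample credentials and rfid table as above) could look like this:

import MySQLdb
import unicodecsv

# unicodecsv encodes each unicode cell to the requested encoding on the way
# out, so no hand-rolled UnicodeWriter class is needed.
conn = MySQLdb.Connection(db='sampledb', host='localhost',
                          user='sampleuser', passwd='samplepass',
                          charset='utf8', use_unicode=True)
crsr = conn.cursor()
crsr.execute("SELECT * FROM rfid")
with open('test.csv', 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    writer.writerows(crsr.fetchall())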
Related
I am new to Python, and I have a question about how to use Python to read and write CSV files. My file contains words like Germany, French, etc. According to my code, the file can be read correctly in Python, but when I write it into a new CSV file, the unicode becomes strange characters.
The data is like:
American,美国人
French,法国人
German,德国人
And my code is:
import csv

f = open('xxx.csv', 'rb')
reader = csv.reader(f)
wt = open('lll.csv', 'wb')
writer = csv.writer(wt, quoting=csv.QUOTE_ALL)
for row in reader:
    writer.writerow(row)
wt.close()
f.close()
And the result is like:
What should I do to solve the problem?
Another alternative:
Use the code from the unicodecsv package ...
https://pypi.python.org/pypi/unicodecsv/
>>> import unicodecsv as csv
>>> from io import BytesIO
>>> f = BytesIO()
>>> w = csv.writer(f, encoding='utf-8')
>>> _ = w.writerow((u'é', u'ñ'))
>>> _ = f.seek(0)
>>> r = csv.reader(f, encoding='utf-8')
>>> next(r) == [u'é', u'ñ']
True
This module is API compatible with the STDLIB csv module.
Make sure you encode and decode as appropriate.
This example will roundtrip some example text in utf-8 to a csv file and back out to demonstrate:
# -*- coding: utf-8 -*-
import csv

tests = {'German': [u'Straße', u'auslösen', u'zerstören'],
         'French': [u'français', u'américaine', u'épais'],
         'Chinese': [u'中國的', u'英語', u'美國人']}

with open('/tmp/utf.csv', 'w') as fout:
    writer = csv.writer(fout)
    writer.writerows([tests.keys()])
    for row in zip(*tests.values()):
        row = [s.encode('utf-8') for s in row]
        writer.writerows([row])

with open('/tmp/utf.csv', 'r') as fin:
    reader = csv.reader(fin)
    for row in reader:
        temp = list(row)
        fmt = u'{:<15}' * len(temp)
        print fmt.format(*[s.decode('utf-8') for s in temp])
Prints:
German         Chinese        French
Straße         中國的         français
auslösen       英語           américaine
zerstören      美國人         épais
There is an example at the end of the csv module documentation that demonstrates how to deal with Unicode. Below is copied directly from that example. Note that the strings read or written will be Unicode strings. Don't pass a byte string to UnicodeWriter.writerows, for example.
import csv, codecs, cStringIO

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        '''next() -> unicode

        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        '''writerow(unicode) -> None

        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
with open('xxx.csv', 'rb') as fin, open('lll.csv', 'wb') as fout:
    reader = UnicodeReader(fin)
    writer = UnicodeWriter(fout, quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow(line)
Input (UTF-8 encoded):
American,美国人
French,法国人
German,德国人
Output:
"American","美国人"
"French","法国人"
"German","德国人"
Because str in Python 2 is actually bytes, if you want to write unicode to a CSV you must encode the unicode to str using the UTF-8 encoding.

def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')
Use class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):
py2
The csvfile: open(fp, 'w')
Pass keys and values as bytes encoded with UTF-8:
writer.writerow({py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in row.items()})
py3
The csvfile: open(fp, 'w')
Pass a normal dict containing str as the row to writer.writerow(row)
Final code:

import csv
import sys

is_py2 = sys.version_info[0] == 2

def py2_unicode_to_str(u):
    # unicode only exists in Python 2
    assert isinstance(u, unicode)
    return u.encode('utf-8')

with open('file.csv', 'w') as f:
    if is_py2:
        data = [{u'Python中国': u'Python中国', u'Python中国2': u'Python中国2'}]
        # just one more line to handle this: encode every key and value
        data = [{py2_unicode_to_str(k): py2_unicode_to_str(v) for k, v in row.items()}
                for row in data]
        fields = list(data[0])
        writer = csv.DictWriter(f, fieldnames=fields)
        for row in data:
            writer.writerow(row)
    else:
        data = [{'Python中国': 'Python中国', 'Python中国2': 'Python中国2'}]
        fields = list(data[0])
        writer = csv.DictWriter(f, fieldnames=fields)
        for row in data:
            writer.writerow(row)
Conclusion
In Python 3, just use str, which is unicode.
In Python 2, use unicode to handle text and use str when doing I/O.
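A minimal sketch of that Python 2 rule (file names hypothetical): decode bytes at the input boundary, keep unicode inside, encode back to bytes at the output boundary.

# bytes -> unicode at the input boundary
with open('input.txt', 'rb') as f:
    text = f.read().decode('utf-8')

# work with unicode internally
text = text.upper()

# unicode -> bytes at the output boundary
with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))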
I had the very same issue. The answer is that you are doing it right already; it is a problem with MS Excel. Try opening the file with another editor and you will notice that your encoding was already successful. To make MS Excel happy, move from UTF-8 to UTF-16. This should work:
import csv, codecs, StringIO

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel_tab, encoding="utf-16", **kwds):
        # Redirect output to a queue
        self.queue = StringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        # Force BOM
        if encoding == "utf-16":
            f.write(codecs.BOM_UTF16)
        self.encoding = encoding

    def writerow(self, row):
        # Modified from original: now using unicode(s) to deal with e.g. ints
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = data.encode(self.encoding)
        # strip BOM
        if self.encoding == "utf-16":
            data = data[2:]
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
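A usage sketch (file name hypothetical). Note that the default dialect above is csv.excel_tab, so the output is tab-separated, which Excel pairs happily with UTF-16:

with open('excel_friendly.csv', 'wb') as f:
    writer = UnicodeWriter(f)                   # utf-16 with BOM, tab-separated
    writer.writerow([u'Köln', u'Straße', 42])   # unicode(s) also copes with the int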
I couldn't respond to Mark above, but I just made one modification which fixed the error caused when data in the cells was not unicode, e.g. float or int data. I replaced the encoding line in the UnicodeWriter class with: self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row]), so that it became:
class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        '''writerow(unicode) -> None

        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
You will also need to "import types".
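As an aside, the same guard can be spelled with isinstance, which avoids the extra import; a sketch of the equivalent helper:

def encode_cell(s):
    # Encode unicode cells to UTF-8 bytes; leave ints, floats and str untouched.
    return s.encode("utf-8") if isinstance(s, unicode) else s

# the writerow line then becomes:
# self.writer.writerow([encode_cell(s) for s in row])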
I don't think this is the best answer, but it's probably the most self-contained answer and also the funniest.
UTF7 is a 7-bit ASCII encoding of unicode. It just so happens that UTF7 makes no special use of commas, quotes, or whitespace. It just passes them through from input to output. So really it makes no difference if you UTF7-encode first and then parse as CSV, or if you parse as CSV first and then UTF7-encode. Python 2's CSV parser can't handle unicode, but python 2 does have a UTF-7 encoder. So you can encode, parse, and then decode, and it's as if you had a unicode-capable parser.
import csv
import io

def read_csv(path):
    with io.open(path, 'rt', encoding='utf8') as f:
        lines = f.read().split("\r\n")
        lines = [l.encode('utf7').decode('ascii') for l in lines]
        reader = csv.reader(lines, dialect=csv.excel)
        for row in reader:
            yield [x.encode('ascii').decode('utf7') for x in row]

for row in read_csv("lol.csv"):
    print(repr(row))
lol.csv
foo,bar,foo∆bar,"foo,bar"
output:
[u'foo', u'bar', u'foo\u2206bar', u'foo,bar']
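The same trick works in the write direction; here is an untested sketch that mirrors read_csv (file name hypothetical):

import csv
import io

def write_csv(path, rows):
    # Encode every unicode cell to 7-bit UTF-7 and let the csv module add
    # commas and quotes (plain ASCII passes through UTF-7 untouched), then
    # decode the whole buffer back to unicode and save it as UTF-8.
    buf = io.BytesIO()
    writer = csv.writer(buf, dialect=csv.excel)
    for row in rows:
        writer.writerow([x.encode('utf7') for x in row])
    with io.open(path, 'wt', encoding='utf8') as f:
        f.write(buf.getvalue().decode('utf7'))

write_csv("lol2.csv", [[u'foo', u'bar', u'foo\u2206bar', u'foo,bar']])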
Hello StackOverflow community.
I am a fairly new user of Python, so sorry in advance for the silliness of this question! I have tried to fix it for hours but still haven't figured it out.
I am trying to import a large dataset of text to manipulate it in Python.
This data set is in .csv and I've had problems reading it because of encoding problems.
I have tried to encode it in UTF-8 text with notepad++
I have tried the csv.reader module in Python
Here is an example of my code:
import csv

with open('twitter_test_python.csv') as csvfile:
    #for file5 in csvfile:
    #    file5.readline()
    #csvfile = csvfile.encode('utf-8')
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row = " ".join(row)
        row2 = str.split(row)
        listsw = []
        for mots in row2:
            if mots not in sw:  # sw: stopword list defined elsewhere in my script
                del mots
        print row2
But when I import my data into Python I still have encoding problems (accents, etc.), whichever method I use.
How can I encode my data so that it is read properly in Python?
Thanks!
The csv module documentation provides an example of how to deal with unicode:
import csv, codecs, cStringIO

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        '''next() -> unicode

        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        '''writerow(unicode) -> None

        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
with open('twitter_test_python.csv', 'rb') as fin:
    reader = UnicodeReader(fin)
    for line in reader:
        #do stuff
        print line
Alexey Smirnov's answer is elegant but maybe a bit complicated for a beginner. So let me give an example closer to the code in the question.
When you read in files with Python 2 you get the content as str, not unicode. Probably you want to convert it as soon as possible. However, the documentation of the csv module says "This version of the csv module doesn’t support Unicode input." So you should encode the output of csv.reader, not the input. Inserting it into your code results in:
import csv

with open('twitter_test_python.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row = " ".join(row)
        row = unicode(row, encoding="utf-8")
        row2 = row.split()
However, you might want to consider whether joining the cells just to split them again is really what you want. Without that, the code would look like the following; the result is different if the list elements contain spaces.
import csv

with open('twitter_test_python.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row2 = list(unicode(cell, encoding="utf-8") for cell in row)
If you want to write something back to a file, you should first convert the unicode back to a str, e.g. with unicode.encode("utf-8").
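For example, a write-back sketch along those lines (output file name hypothetical):

import csv

with open('twitter_test_python.csv') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin, delimiter=str(','), quotechar=str('|'))
    writer = csv.writer(fout)
    for row in reader:
        cells = [unicode(cell, encoding="utf-8") for cell in row]
        # encode each unicode cell back to UTF-8 bytes before csv sees it
        writer.writerow([cell.encode("utf-8") for cell in cells])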
I have huge CSV files and they contain '\xc3\x84' style characters instead of German umlauts, because I scraped the HTML using BeautifulSoup and wrote it to the CSV files using Python 2.7.8.
I managed to replace all those characters with the help of this:
Python 2.7.1: How to Open, Edit and Close a CSV file
and now my code looks like this:
import csv

new_rows = []
umlaut = {'\\xc3\\x84': 'Ä', '\\xc3\\x96': 'Ö', '\\xc3\\x9c': 'Ü',
          '\\xc3\\xa4': 'ä', '\\xc3\\xb6': 'ö', '\\xc3\\xbc': 'ü'}

with open('file1.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        new_row = row
        for key, value in umlaut.items():
            new_row = [x.replace(key, value) for x in new_row]
        new_rows.append(new_row)

with open('file2.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
When I open the csv I see KÃ¶ln instead of Köln and other "German umlaut" problems.
I can solve this problem manually by opening the CSV file with Notepad and then saving it as UTF-8, but I want to do it automatically with Python.
I do not quite get how to use the UnicodeWriter:
https://docs.python.org/2/library/csv.html#examples
The answers and solutions I found here on stackoverflow are all a little bit complicated.
My question are, how would I use for example the UnicodeWriter right in my case?
Do you know any super easy function that does something like file2.encode('utf-8')?
If such an easy function doesn't exist in Python, why not, given that encoding errors are so common?
Instead of using your own mapping, you can use string-escape encoding:
>>> print '\\xc3\\x84'.decode('string-escape')
Ä
import csv

def iter_decode(it):
    for line in it:
        yield line.decode('string-escape')

with open('file1.csv') as csvFile, open('file2.csv', 'w') as f:
    reader = csv.reader(iter_decode(csvFile))
    writer = csv.writer(f)
    for row in reader:
        writer.writerow(row)
Given that you have a UnicodeWriter from the docs:
import csv, codecs, cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
use it like so:

from __future__ import unicode_literals

# Open in binary mode: UnicodeWriter already encodes to the target encoding
# itself, so a codecs-wrapped text stream would try to encode a second time.
f = open("somefile.csv", mode='wb')
writer = UnicodeWriter(f)
for data in some_buffer:
    writer.writerow(data)
Afternoon,
I am having some trouble with a SQLite to CSV python script. I have searched high and I have searched low for an answer but none have worked for me, or I am having a problem with my syntax.
I want to replace characters within the SQLite database which fall outside of the ASCII table (larger than 128).
Here is the script I have been using:
#!/opt/local/bin/python
import sqlite3
import csv, codecs, cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

conn = sqlite3.connect('test.db')
c = conn.cursor()

# Select whichever rows you want in whatever order you like
c.execute('select ROWID, Name, Type, PID from PID')

writer = UnicodeWriter(open("ProductListing.csv", "wb"))
# Make sure the list of column headers you pass in are in the same order as your SELECT
writer.writerow(["ROWID", "Product Name", "Product Type", "PID", ])
writer.writerows(c)
I have tried to add 'replace' as indicated here, but got the same error: Python: Convert Unicode to ASCII without errors for CSV file.
The error is the UnicodeDecodeError:
Traceback (most recent call last):
  File "SQLite2CSV1.py", line 53, in <module>
    writer.writerows(c)
  File "SQLite2CSV1.py", line 32, in writerows
    self.writerow(row)
  File "SQLite2CSV1.py", line 19, in writerow
    self.writer.writerow([unicode(s).encode("utf-8") for s in row])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 65: ordinal not in range(128)
Obviously I want the code to be robust enough that, if it encounters characters outside of these bounds, it replaces them with a character such as '?' (\x3f).
Is there a way to do this within the UnicodeWriter class? And a way I can make the code robust that it won't produce these errors.
Your help is greatly appreciated.
If you just want to write an ASCII CSV, simply use the stock csv.writer(). To ensure that all values passed are indeed ASCII, use encode('ascii', errors='replace').
Example:
import csv

rows = [
    [u'some', u'other', u'more'],
    [u'umlaut:\u00fd', u'euro sign:\u20ac', '']
]

with open('/tmp/test.csv', 'wb') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        asciifiedRow = [item.encode('ascii', errors='replace') for item in row]
        print '%r --> %r' % (row, asciifiedRow)
        writer.writerow(asciifiedRow)
The console output for this is:
[u'some', u'other', u'more'] --> ['some', 'other', 'more']
[u'umlaut:\xfd', u'euro sign:\u20ac', ''] --> ['umlaut:?', 'euro sign:?', '']
The resulting CSV file contains:
some,other,more
umlaut:?,euro sign:?,
With access to a unix environment, here's what worked for me
sqlite3.exe a.db .dump > a.sql;
tr -d "[\\200-\\377]" < a.sql > clean.sql;
sqlite3.exe clean.db < clean.sql;
(It's not a Python solution, but maybe it will help someone else due to its brevity. This solution STRIPS OUT all non-ASCII characters; it doesn't try to replace them.)