Writing unicode data in csv - python

I know similar questions have been asked many times, but I seriously have not been able to get the csv writer to write properly to a CSV file (it shows garbage).
I am trying to use UnicodeWriter as mentioned in the official docs.
ff = open('a.csv', 'w')
writer = UnicodeWriter(ff)  # UnicodeWriter recipe from the Python 2 csv docs
st = unicode('Displaygrößen', 'utf-8')  # gives u'Displaygr\xf6\xdfen'
writer.writerow([st])
This does not give me any decoding or encoding errors, but it writes the word Displaygrößen as DisplaygrÃ¶ÃŸen, which is not good. Can anyone tell me what I am doing wrong here?

You are writing the file in UTF-8, but you don't indicate that in your csv file.
You should write the UTF-8 BOM at the beginning of the file. Add this:
import codecs

ff = open('a.csv', 'w')
ff.write(codecs.BOM_UTF8)
Your csv file should then open correctly in the program trying to read it.
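Put together, a minimal Python 2 sketch (assuming the UnicodeWriter recipe from the csv docs is defined in scope):
import codecs

ff = open('a.csv', 'wb')   # binary mode keeps the csv module from mangling newlines on Windows
ff.write(codecs.BOM_UTF8)  # the BOM lets Excel detect UTF-8
writer = UnicodeWriter(ff)  # recipe class from the Python 2 csv docs
writer.writerow([u'Displaygr\xf6\xdfen'])
ff.close()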

Opening the file with codecs.open should fix it.
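A sketch of that suggestion; note that the Python 2 csv module itself expects byte strings, so this writes the row by hand instead of going through csv.writer:
import codecs

with codecs.open('a.csv', 'w', encoding='utf-8-sig') as f:
    f.write(u'Displaygr\xf6\xdfen' + u'\n')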

Related

How to correctly put extended ASCII characters into a CSV file?

I'm trying to write data from an array containing extended ASCII characters to a CSV file. Below is a small example of the code I'm using on the real file.
The array text_array represents an array containing only one row.
import csv
text_array = [["Á","Â","Æ","Ç","Ö","×","Ø","Ù","Þ","ß","á","â","ã","ä","å","æ"]]
with open("/Files/out.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(text_array)
The output I'm getting in the CSV file is wrong; it shows these characters:
à Â Æ Ç Ö × Ø Ù Þ ß á â ã ä å æ
I found that the code below fixes the issue in Python 3.4, but I'm working in Python 2.7.
c = csv.writer(open("Out.csv", 'w', newline='', encoding='utf-8'))
How can I fix this?
UPDATE
I received some links in comments, but it is difficult for me to understand what needs to be done to fix this issue. Could someone show an example, please?
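For Python 2.7, one route (the same one used in an answer further down this page) is the third-party unicodecsv package, which mirrors the stdlib csv API but accepts an encoding argument; a minimal sketch:
import unicodecsv as csv

text_array = [[u"Á", u"Â", u"Æ", u"Ç", u"Ö", u"×", u"Ø", u"Ù",
               u"Þ", u"ß", u"á", u"â", u"ã", u"ä", u"å", u"æ"]]
with open("out.csv", "wb") as f:
    # utf_8_sig prepends a BOM so Excel detects the encoding
    writer = csv.writer(f, encoding='utf_8_sig')
    writer.writerows(text_array)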

Saving data from python to excel file as CSV UTF-8 file format

I have been trying to save data as an Excel file of type CSV UTF-8 (Comma delimited) (*.csv), which is different from the normal
CSV (Comma delimited) (*.csv) file type. It displays the Unicode text when opened in Excel. I can save as that file type easily from Excel, but from Python I am only able to save it as a normal csv. That does not cause loss of data, but when opened it shows text like "à¤à¤‰à¤Ÿà¤¾" instead of "एउटा".
If I copy the text by opening it with Notepad into the Excel file and then manually save the file as CSV UTF-8, it preserves the correct display. But doing so is time consuming, since all values appear on the same line in Notepad and I have to separate them in the Excel file.
So I just want to know how I can save data in Excel's CSV UTF-8 format using Python.
I have tried the following code, but it results in a normal csv file:
import codecs
import unicodecsv as csv

input_text = codecs.open('input.txt', encoding='utf-8')
all_text = input_text.read()
text_list = all_text.split()
output_list = [['Words', 'Tags']]
for input_word in text_list:
    word_tag_list = [input_word, 'O']
    output_list.append(word_tag_list)
with codecs.open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)
You need to indicate to Excel that this is a UTF-8 file. Unfortunately the only way to do this is by prepending a special byte sequence to the front of the file. Python will do this automatically if you use a special encoding.
with codecs.open("output.csv", "w", encoding="utf_8_sig") as f:
I have found the answer. The encoding='utf_8_sig' should be given to the csv.writer method to write the file in Excel's CSV UTF-8 format. The previous code can be written as:
with open("output.csv", "wb") as f:
writer = csv.writer(f, dialect='excel', encoding='utf_8_sig')
writer.writerows(output_list)
However, there was a problem when the data had a comma at the end, e.g. "भने,". In this case I didn't need the comma, so I removed it with the following code inside the for loop:
import re

if re.search(r'.,$', input_word):
    input_word = re.sub(r',$', '', input_word)
Finally I was able to obtain the desired output, with the Unicode characters displayed correctly and the extra comma at the end of the data removed. So, if anyone knows how to ignore a comma at the end of data in an Excel file, you can comment here. Thanks.
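For what it's worth, if the trailing comma were meant to be kept, the csv writer already quotes fields that contain commas, so they stay inside one cell; a quick sketch with the same unicodecsv writer:
import io
import unicodecsv as csv

buf = io.BytesIO()
writer = csv.writer(buf, encoding='utf_8_sig')
writer.writerow([u'भने,', u'O'])
print(buf.getvalue())  # the field with the comma comes out quoted: "भने,",O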

Mixed encoding in csv file

I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8, while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, resulting in weird sequences for non-ASCII (mainly Latin) characters like Ã© or Ã³.
I cannot simply change the file encoding, because half of it would be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which weren't, and all I have is the original .csv file, which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and misinterpreted characters (bad Unicode) can be inferred. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far, haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content shows exactly the same text as before. I believe this has to do with the way ftfy is inferring the content's encoding.
Nevertheless, if I run ftfy.fix_text("PÃºblica que cotiza en MÃ©xico") it will show the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file, try to fix them with ftfy, and then import the file back into R, but it seems a little complicated.
Any suggestions?
In fact, there was mixed encoding for random cells in several places. Probably there was an issue when exporting the data from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formatted characters, it assumes that the whole line is encoded the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library that provides functionality to work painlessly with csv (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys

def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)
    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")
    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)
    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
A small suggestion: divide and conquer.
Try using one tool (ftfy?) to normalize the whole file to the same encoding (and save it as a plain text file), and only then try parsing it as csv.
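A minimal sketch of that suggestion (though, as described in the answer above, per-cell fixing turned out to be more reliable for this particular file):
import ftfy

# fix the encoding of the whole file first, parse it as csv afterwards
with open("data.csv", "r", encoding="UTF8") as src:
    fixed = ftfy.fix_text(src.read())
with open("data_fixed.csv", "w", encoding="UTF8") as dst:
    dst.write(fixed)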

Unicode (UTF-8) can't display correctly? (Python)

I have the following code in Python:
# myFile.csv tends to look like:
# 'a1', 'ふじさん', 'c1'
# 'a2', 'ふじさん', 'c2'
# 'a3', 'ふじさん', 'c3'
import codecs

s = u"unicodeText"  # unicodeText like ふじさん بعدة أش 일본富士山Ölkələr
with codecs.open('myFile.csv', 'w+', 'utf-8') as f:  # codecs open
    f.write(s.encode('utf-8', 'ignore'))
I was using Vim to edit the code and Vim to open "myFile.csv".
The Unicode text displays fine in the terminal, but not in Excel, nor in a browser.
My platform is OS X.
I don't know if it is a configuration problem or if I actually coded it the wrong way; if you have any idea, please advise. Deeply appreciated!
Change open to codecs.open.
Thanks for pointing out f.close(); deleted.
Excel (at least on Windows) likes a Unicode BOM at the start of a .csv file, even with UTF-8. There is a codec for that: utf-8-sig.
Also, Python 3's normal open is all that is required, and there is no need for f.close() inside a with:
#coding:utf8
data = '''\
a1,ふじさん,c1
a2,ふじさん,c2
a3,ふじさん,c3
'''
with open('myFile.csv', 'w', encoding='utf-8-sig') as f:
    f.write(data)
It seems you're trying to open the file in text mode (because you specify an encoding), but then you try to write binary data (because you encode the text before writing it to the file). You need to either open the file as binary and write encoded text, or open it as text and write text.
Furthermore, your attempt to open it as text isn't even working, because you're passing 'utf-8' as the buffering parameter instead of the encoding parameter. See the open() documentation.
But even if you did all that correctly, this still wouldn't really help you with an Excel file, because .xls files have a complicated binary structure. I recommend you use something like xlrd to read them and XlsxWriter to write them.
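For the spreadsheet route, a minimal XlsxWriter sketch (filename and cell contents are just placeholders):
import xlsxwriter

workbook = xlsxwriter.Workbook('myFile.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, u'ふじさん')  # XlsxWriter accepts unicode strings directly
workbook.close()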
Here is a simple example that should work for .csv:
with open('file.csv', 'w', encoding='utf-8') as fh:
    fh.write('This >µ< is a unicode GREEK LETTER MU\n')
or alternatively
with open('file.csv', 'wb') as fh:
    fh.write('This >µ< is a unicode GREEK LETTER MU\n'.encode('utf-8'))
codecs.open opens a wrapped reader/writer that does the encoding/decoding for you, so you do not need to encode your string before writing. You do need to pass the 'ignore' parameter in your open call:
with codecs.open('myFile.csv', 'w+', 'utf-8', 'ignore') as f:
    f.write(s)
Note that you do not need to call close as you use a with statement.
Original answer, scratch that:
The third parameter of open is buffering, which requires an integer.
You should pass the encoding like this:
with open('myFile.xls', 'w+', encoding='utf-8') as f:
Note that you open the file in text mode, so there is no need to encode the string for writing.
Also, your file mode 'w+' is a bit odd. I'm not sure, but I think it will truncate your file. If you want to append to the file, you should use 'a' as the mode.

Raw string for variables in python?

I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
from csv import writer

addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly the output I want, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want; unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string literal. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
import csv

with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)  # textfilepath: path to the input text file
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open mode; I was getting extra blank records in my csv when using just 'a' (on Windows, Python 2's csv module wants the file opened in binary mode, otherwise newline translation inserts blank rows).
I created this a while back. It helps you write to a csv file.
import csv

def write2csv(fileName, theData):
    theFile = open(fileName + '.csv', 'a')
    wr = csv.writer(theFile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    wr.writerow(theData)
    theFile.close()  # close so the row is flushed to disk
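A hypothetical usage (file name and row contents are only illustrative):
write2csv('output', ['word', 'tag'])  # appends one row to output.csv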
