Pandas DataFrame's accented characters appearing garbled in Excel - python

With:
# -*- coding: utf-8 -*-
at the top of my .ipynb, Jupyter is now displaying accented characters correctly.
When I export a pandas DataFrame containing accented characters to CSV (with .to_csv()):
... the characters do not render properly when the CSV is opened in Excel.
This is the case whether I set encoding='utf-8' or not. Is pandas/Python doing all that it can here, and this is an Excel issue? Or can something be done before the export to CSV?
Python: 2.7.10
Pandas: 0.17.1
Excel: Excel for Mac 2011

If you want to keep the accents, try encoding='iso-8859-1':
df.to_csv(path, encoding='iso-8859-1', sep=';')

I had a similar problem, also on a Mac. I noticed that the unicode string showed up fine when I opened the csv in TextEdit, but showed up garbled when I opened it in Excel.
Thus, I don't think there is any way to successfully export unicode to Excel with to_csv, but I'd expect the default to_excel writer to suffice:
df.to_excel('file.xlsx', encoding='utf-8')

I also ran into the same problem. When I checked the DataFrame in the Jupyter notebook, I saw that everything was in order.
The problem happens when I try to open the file directly (as it has a .csv extension, Excel opens it directly).
The solution for me was to open a new blank Excel workbook and import the file from the "Data" tab, like this:
Import External Data
Import Data from text
I chose the file
In the import wizard window, in the "File origin" drop-down list, I chose "65001 : Unicode (UTF-8)"
Then I just chose the right delimiter, and that was it for me.

I think using a different Excel writer helps; I recommend xlsxwriter:
import pandas as pd

df = ...  # your DataFrame

writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
df.to_excel(writer)
writer.save()

Maybe try this function for your columns if you can't get Excel to cooperate. It will remove the accents using the unicodedata library:
import unicodedata
def remove_accents(input_str):
    # 'unicode' is the Python 2 text type (the question uses Python 2.7)
    if isinstance(input_str, unicode):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    else:
        return input_str

I had the same problem, and writing to .xlsx and renaming to .csv didn't solve the problem (for application-specific reasons I won't go into here), nor was I able to successfully use an alternate encoding as Juliana Rivera recommended. 'Manually' writing the data as text worked for me.
with open(RESULT_FP + '.csv', 'w+') as rf:
    for row in output:
        row = ','.join(list(map(str, row))) + '\n'
        rf.write(row)
Sometimes I guess you just have to go back to basics.

I encountered a similar issue when attempting to read_json followed by a to_excel:
df = pandas.read_json(myfilepath)
# causes garbled characters
df.to_excel(sheetpath, encoding='utf8')
# also causes garbled characters
df.to_excel(sheetpath, encoding='latin1')
Turns out, if I load the json manually with the json module first, and then export with to_excel, the issue doesn't occur:
import json
import pandas

with open(myfilepath, encoding='utf8') as f:
    j = json.load(f)
df = pandas.DataFrame(j)
df.to_excel(sheetpath, encoding='utf8')

Related

The CSV file format that I'm using is strange and cannot be used

My CSV file is in the following format.
"AUS" "market"
"DEFAULT_LATITUDE" "-13614588"
"DEFAULT_LONGITUDE" "52188316"
There is an unspecified amount of whitespace between each field (I guess it's \t), and each field is enclosed in double quotation marks.
When I open my CSV file using Excel, it fits into each cell well.
However, I'd like to read the data into Python (using pandas or the csv module).
Which option should I use?
Here is my code and output.
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv("export.csv", delimiter='\t')
    print(data)
I've solved it.
When I copied the actual tab character from my file (via Ctrl+C/V in Notepad) and used it as the delimiter, the file was read correctly.
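For reference, a minimal sketch of the same idea in code; it assumes the delimiter really is a literal tab and every field is wrapped in double quotes (the header=None guess is mine, since the sample rows look like key/value pairs):
import pandas as pd

data = pd.read_csv(
    "export.csv",
    sep="\t",         # a literal tab character as the delimiter
    quotechar='"',    # strip the surrounding double quotation marks
    header=None,      # assumption: the sample rows are key/value pairs with no header row
)
print(data)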

Read Excel with data in Hindi in Python Pandas

I am new to Python pandas and am working on a small application where I want to read an Excel file that has data in the Hindi language.
The issue I am facing is that pandas is not able to read the Hindi words and places arbitrary '?' symbols instead.
I have tried setting the encoding to utf-8, but that is not working either.
My Excel Data :
Python Code :
df = pd.read_csv("Vegaretable_List.csv", encoding='utf-8')
Output :
['?? ' '??? ' '???? ' '????? ' '????']
Any help will be appreciated.
Thanks in advance.
The problem shouldn't occur if the file is read in using the same encoding it was created with.
If you get "???", it means the csv or excel file was saved with a different encoding.
Here is a table of the standard encodings.
Alternatively, you can open your file in an appropriate program and re-save it as UTF-8, so that your code can read it.
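If you don't know which encoding the file was saved with, you can try guessing it first. A minimal sketch, assuming the third-party chardet package is installed (the file name is taken from the question):
import chardet
import pandas as pd

# read the raw bytes and let chardet guess the encoding
with open("Vegaretable_List.csv", "rb") as f:
    guess = chardet.detect(f.read())

df = pd.read_csv("Vegaretable_List.csv", encoding=guess["encoding"])
print(df.head())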
Also See:
SO: Encoding Error in Panda read_csv
Do not create a CSV file; instead, use an Excel file in .xlsx format. Python will read the Hindi text. I did this and it worked.
dataset = pd.read_excel("Data.xlsx")
Here Data.xlsx contains all the Hindi text that you gave.
Best of luck
Assuming that your Excel/CSV file has content similar to this:
मिशल
बहादुर
मेरी
जेन
जॉन
स्मिथ
The encoding type is correct. It's just that you have to iterate through the data to get it back.
For .CSV
import csv
with open('customers.csv', 'r', encoding='utf-8') as file:
    data = csv.reader(file)
    for row in data:
        print(row)
For .XLSX
import pandas as pd

# .xlsx is a binary (zipped) format, so it can't be opened as plain text;
# read it with pandas instead and iterate over the rows
df = pd.read_excel('customers.xlsx')
for row in df.itertuples(index=False):
    print(row)

Saving data from python to excel file as CSV UTF-8 file format

I have been trying to save data as an Excel file of type CSV UTF-8 (Comma delimited) (*.csv), which is different from the normal
CSV (Comma delimited) (*.csv) file. That type displays the Unicode text correctly when opened in Excel. I can save a file that way easily from Excel, but from Python I am only able to save it as a normal CSV. This does not lose data, but when opened it shows text like "à¤à¤‰à¤Ÿà¤¾" instead of "एउटा".
If I copy the text (opened with Notepad) into the Excel file and then manually save the file as CSV UTF-8, the correct display is preserved. But doing so is time consuming, since all values appear on a single line in Notepad and I have to separate them in the Excel file.
So I just want to know how I can save data in Excel's CSV UTF-8 format using Python.
I have tried the following code, but it results in a normal CSV file.
import codecs
import unicodecsv as csv
input_text = codecs.open('input.txt', encoding='utf-8')
all_text = input_text.read()
text_list = all_text.split()
output_list = [['Words','Tags']]
for input_word in text_list:
    word_tag_list = [input_word, 'O']
    output_list.append(word_tag_list)

with codecs.open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)
You need to indicate to Excel that this is a UTF-8 file. Unfortunately the only way to do this is by prepending a special byte sequence to the front of the file. Python will do this automatically if you use a special encoding.
with codecs.open("output.csv", "w", "encoding="utf_8_sig") as f:
I have found the answer. encoding='utf_8_sig' should be passed to the csv.writer (here, the unicodecsv writer) to write the file in Excel's CSV UTF-8 format. The previous code can be written as:
with open("output.csv", "wb") as f:
writer = csv.writer(f, dialect='excel', encoding='utf_8_sig')
writer.writerows(output_list)
However, there was a problem when the data has a , at the end, e.g. "भने,". In this case I didn't need the comma, so I removed it with the following code inside the for loop:
import re

if re.search(r'.,$', input_word):
    input_word = re.sub(',$', '', input_word)
Finally I was able to obtain the desired output, with the Unicode characters displayed correctly and the extra comma at the end of the data removed. So, if anyone knows how to ignore a trailing comma in the Excel file, you can comment here. Thanks.

Cannot write Japanese characters with pandas in Python

I'm trying to write data with Japanese characters to a CSV file.
But the CSV does not contain the correct Japanese characters:
def write_csv(columns, data):
    df = pd.DataFrame(data, columns=columns)
    df.to_csv("..\Report\Report.csv", encoding='utf-8')

write_csv(["法人番号", "法人名称", "法人名称カナ"], [])
and CSV:
æ³•äººç•ªå· æ³•äººå称 法人å称カナ
How can I accomplish this?
Your code is OK; I just tried it. I'm guessing the CSV file itself is fine, but you're opening it as cp1252 instead of UTF-8.
What software are you using to open this CSV?
If you're using Microsoft Excel, make sure to use "Import" instead of "Open" so that you can choose the encoding.
With Google Sheets or LibreOffice it should Just Work.
Another possible explanation is that there's something wrong with your data in the first place. Here's how you can check that (I just took a few random characters from this generator):
df = pd.DataFrame(['勘してろむ説彼ふて惑岐とや尊続セヲ狭題'])
df.to_csv('report.csv', encoding='utf-8')
Try opening that the same way. If it opens correctly but the other doesn't, the problem is in your code.
For me utf_8_sig worked like a charm.
df.to_csv("..\Report\Report.csv", encoding='utf_8_sig')

Mixed encoding in csv file

I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8, while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, resulting in weird characters for non-ASCII (mainly Latin) characters like é or ó.
I cannot simply change the file encoding because half of it would be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which ones didn't, and all I have is the original .csv file which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and misinterpreted characters (bad Unicode) can be inferred. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far, haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content shows exactly the same text as before. I believe this has to do with the way ftfy infers the content's encoding.
Nevertheless, if I run ftfy.fix_text("Pública que cotiza en México") it will show the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file and try to fix it with ftfy, and then import the file back into R, but it seems a little complicated.
Any suggestions?
In fact, there was mixed encoding for random cells in several places. Probably there was an issue when exporting the data from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formed characters, it assumes that the whole line is encoded the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library that provides functionality to work painlessly with CSV (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys

def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)
    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")
    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)
    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
A small suggestion: divide and conquer.
Try using one tool (ftfy?) to bring the whole file into the same encoding (and save it as a plain-text file), and only then try parsing it as CSV.
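A minimal sketch of that divide-and-conquer idea; the file names are placeholders, and reading the cleaned file back with pandas at the end is my assumption about how it would then be consumed:
import ftfy
import pandas as pd

# step 1: repair the text of the whole file and save it back as plain UTF-8
with open("data.csv", "r", encoding="utf-8") as src:
    fixed = ftfy.fix_text(src.read())
with open("data_fixed.csv", "w", encoding="utf-8") as dst:
    dst.write(fixed)

# step 2: only now parse the cleaned file as CSV
df = pd.read_csv("data_fixed.csv")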
