Read Excel with data in Hindi in Python Pandas

I am new to Python Pandas and am working on a small application wherein I want to read an Excel file whose data is in Hindi.
The issue I am facing is that pandas is not able to read the Hindi words and places arbitrary '?' symbols instead.
I have tried setting the encoding to utf-8, but that is not working either.
My Excel data:
Python code:
import pandas as pd

df = pd.read_csv("Vegaretable_List.csv", encoding='utf-8')
Output:
['?? ' '??? ' '???? ' '????? ' '????']
Any help will be appreciated.
Thanks in advance.

The problem shouldn't occur if the file is read with the same encoding it was created with.
If you get "???", it means the CSV or Excel file was saved with a different encoding.
Here is a table of the standard encodings.
Alternatively, you could open your file in an appropriate program and re-save it as UTF-8, so that it can be read by your code.
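If you don't know the original encoding, one option is to detect it first and pass the result to pandas; a minimal sketch, assuming the third-party chardet package is installed:
import chardet
import pandas as pd

# Detect the encoding from the raw bytes, then read with that encoding
with open("Vegaretable_List.csv", "rb") as f:
    detected = chardet.detect(f.read())

df = pd.read_csv("Vegaretable_List.csv", encoding=detected["encoding"])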
Also See:
SO: Encoding Error in Panda read_csv

Do not create a CSV file; instead use an Excel file in .xlsx format. Python will read the Hindi text. I did this and it worked.
dataset = pd.read_excel("Data.xlsx")
Here Data.xlsx contains all the Hindi text that you gave.
Best of luck
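Note: on recent pandas versions, reading .xlsx requires the openpyxl package to be installed; a minimal sketch, assuming the same Data.xlsx:
import pandas as pd

# openpyxl is the engine recent pandas versions use for .xlsx files
dataset = pd.read_excel("Data.xlsx", engine="openpyxl")
print(dataset.head())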

Assuming that your Excel/CSV file has content similar to this:
मिशल
बहादुर
मेरी
जेन
जॉन
स्मिथ
The encoding type is correct. It's just that you have to iterate through the data to get it back.
For .CSV
import csv

with open('customers.csv', 'r', encoding='utf-8') as file:
    data = csv.reader(file)
    for row in data:
        print(row)
For .XLSX
# An .xlsx file is a zipped binary format, so it cannot be opened as
# UTF-8 text; read it with openpyxl (or pandas.read_excel) instead.
from openpyxl import load_workbook

wb = load_workbook('customers.xlsx')
for row in wb.active.iter_rows(values_only=True):
    print(row)

Related

Writing in same row in Gujarati in python using CSV writer

My code:
import csv

# Create a list of Gujarati strings
strings = [['હેલો, વર્લ્ડ!', 'સુપ્રભાત', 'મારા નામ હેઠળ છે']]

# Open the CSV file in 'w' mode
with open('Gujarati.csv', 'w', encoding='utf-16', newline='') as f:
    # Create a CSV writer
    writer = csv.writer(f)
    # Write the strings to the CSV file
    writer.writerows(strings)
I am trying to write each string as a separate column, but I don't know why everything ends up in the same column. I want the values to be in separate columns. I don't know what else to write, but feel free to ask me anything anytime.
I appreciate any help you can provide
https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba#:~:text=You%20can%20import%20data%20from,to%20import%2C%20and%20click%20Import.
Import or export text (.txt or .csv) files: you can change the separator character used in both delimited and .csv text files. This may be necessary to ensure that the import or export operation works the way you want it to.
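For what it's worth, a hedged sketch of two common workarounds, reusing the strings from the question: Excel tends to treat UTF-16 text files as tab-delimited, so either write tabs as the separator, or switch to UTF-8 with a BOM and keep commas:
import csv

strings = [['હેલો, વર્લ્ડ!', 'સુપ્રભાત', 'મારા નામ હેઠળ છે']]

# Option 1: keep UTF-16, but use tabs so Excel splits the columns
with open('Gujarati.csv', 'w', encoding='utf-16', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(strings)

# Option 2: UTF-8 with a BOM, keeping commas as the separator
with open('Gujarati.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(strings)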

I am facing a problem with the .CSV format in pandas

I will explain in detail:
I have an Excel file, and my client is using a tool which reads files in .csv format only.
I open the Excel file in Excel and save it in .CSV format using the Save As option; call this File_1.
I wrote Python code using the pandas module and converted that Excel file into CSV; call this File_2.
My client's tool is able to read File_1 but not File_2. Why? What could the problem be?
My observations:
When I read File_1 in pandas (the one converted to .CSV manually), I have to specify encoding="ISO-8859-1", otherwise it gives a Unicode error.
Ex: pd.read_csv("File_1.csv", encoding="ISO-8859-1")
But when I read File_2 in pandas, it reads without any error.
Ex: pd.read_csv("File_2.csv")
So what could be the reason the client tool cannot read File_2? Is it a Unicode problem or something else?
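One possibility, given the observations above: the client tool likely expects the ANSI-style encoding that Excel's Save As produces, while pandas writes UTF-8 by default. A sketch (the source file name is hypothetical) that writes File_2 in that same encoding:
import pandas as pd

df = pd.read_excel("input.xlsx")  # hypothetical source file
# Match the encoding Excel's Save As produced for File_1
df.to_csv("File_2.csv", index=False, encoding="ISO-8859-1")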

How to correctly put extended ASCII characters into a CSV file?

I'm trying to write some data in an array that contains extended ASCII characters to a CSV file. Below is a small example of the code I'm using on the real file.
The array text_array represents an array containing only one row.
import csv
text_array = [["Á","Â","Æ","Ç","Ö","×","Ø","Ù","Þ","ß","á","â","ã","ä","å","æ"]]
with open("/Files/out.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(text_array)
The output I'm getting in the CSV file is wrong, showing these characters:
à Â Æ Ç Ö × Ø Ù Þ ß á â ã ä å æ
I found that the code below fixes the issue in Python 3.4, but I'm working with Python 2.7.
c = csv.writer(open("Out.csv", 'w', newline='', encoding='utf-8'))
How can I fix this?
UPDATE
I received some links in the comments, but it is difficult for me to understand what needs to be done to fix this issue. Could someone show an example, please?
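A sketch for Python 2.7, assuming the goal is a CSV that Excel reads as UTF-8: write a UTF-8 BOM first, then encode each cell to UTF-8 bytes before handing rows to the csv writer:
# -*- coding: utf-8 -*-
import codecs
import csv

text_array = [[u"Á", u"Â", u"Æ", u"Ç", u"Ö", u"×", u"Ø", u"Ù"]]

with open("/Files/out.csv", "wb") as f:
    f.write(codecs.BOM_UTF8)  # the BOM lets Excel detect UTF-8
    writer = csv.writer(f)
    for row in text_array:
        writer.writerow([cell.encode("utf-8") for cell in row])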

Saving data from python to excel file as CSV UTF-8 file format

I have been trying to save the data as an Excel file of type CSV UTF-8 (Comma delimited) (*.csv), which is different from the normal
CSV (Comma delimited) (*.csv) file. It displays the Unicode text when opened in Excel. I can save as that file type easily from Excel, but from Python I am only able to save it as a normal CSV. This does not cause loss of data, but when opened it shows text like "à¤à¤‰à¤Ÿà¤¾" instead of "एउटा".
If I copy the text by opening it with Notepad into the Excel file and then manually save the file as CSV UTF-8, it preserves the correct display. But doing so is time consuming, since all values appear on the same line in Notepad and I have to separate them in the Excel file.
So I just want to know how I can save data in Excel's CSV UTF-8 format using Python.
I have tried the following code, but it results in a normal CSV file.
import codecs
import unicodecsv as csv

input_text = codecs.open('input.txt', encoding='utf-8')
all_text = input_text.read()
text_list = all_text.split()

output_list = [['Words','Tags']]
for input_word in text_list:
    word_tag_list = [input_word,'O']
    output_list.append(word_tag_list)

with codecs.open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)
You need to indicate to Excel that this is a UTF-8 file. Unfortunately the only way to do this is by prepending a special byte sequence (a BOM) to the front of the file. Python will do this automatically if you use a special encoding:
with codecs.open("output.csv", "w", encoding="utf_8_sig") as f:
I have found the answer. encoding="utf_8_sig" should be passed to the csv.writer method to write the file as CSV UTF-8. The previous code can be written as:
with open("output.csv", "wb") as f:
    writer = csv.writer(f, dialect='excel', encoding='utf_8_sig')
    writer.writerows(output_list)
However, there was a problem when the data has a , at the end, e.g. "भने,". In this case I didn't need the comma, so I removed it with the following code inside the for loop:
import re

if re.search(r'.,$', input_word):
    input_word = re.sub(',$', '', input_word)
Finally, I was able to obtain the desired output, with the Unicode characters displayed correctly and the extra comma at the end of the data removed. So, if anyone knows how to ignore a comma at the end of data in an Excel file, you can comment here. Thanks.
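For what it's worth, the same CSV UTF-8 output can also be produced with plain pandas; a sketch, using sample words from this thread:
import pandas as pd

df = pd.DataFrame({'Words': [u'एउटा', u'भने'], 'Tags': ['O', 'O']})
# utf-8-sig prepends the BOM, so Excel opens the file as CSV UTF-8
df.to_csv('output.csv', index=False, encoding='utf-8-sig')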

Pandas DataFrame's accented characters appearing garbled in Excel

With:
# -*- coding: utf-8 -*-
at the top of my .ipynb, Jupyter is now displaying accented characters correctly.
When I export to CSV (with .to_csv()) a pandas DataFrame containing accented characters:
... the characters do not render properly when the CSV is opened in Excel.
This is the case whether I set encoding='utf-8' or not. Is pandas/Python doing all that it can here, and this is an Excel issue? Or can something be done before the export to CSV?
Python: 2.7.10
Pandas: 0.17.1
Excel: Excel for Mac 2011
If you want to keep accents, try with encoding='iso-8859-1'
df.to_csv(path,encoding='iso-8859-1',sep=';')
I had similar problem, also on a Mac. I noticed that the unicode string showed up fine when I opened the csv in TextEdit, but showed up garbled when I opened in Excel.
Thus, I don't think there is any way to successfully export Unicode to Excel with to_csv, but I'd expect the default to_excel writer to suffice:
df.to_excel('file.xlsx', encoding='utf-8')
I also had the same inconvenience. When I checked the DataFrame in the Jupyter notebook, I saw that everything was in order.
The problem happens when I try to open the file directly (since it has a .csv extension, Excel can open it directly).
The solution for me was to open a new blank Excel workbook and import the file from the "Data" tab, like this:
1. Import External Data
2. Import Data from text
3. Choose the file
4. In the import wizard window, where it says "File origin", choose "65001 : Unicode (UTF-8)" from the drop-down list
5. Then just choose the right delimiter, and that was it for me.
I think using a different Excel writer helps; I recommend xlsxwriter:
import pandas as pd
df = ...
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
df.to_excel(writer)
writer.save()
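Note that writer.save() was removed in later pandas releases; a sketch of the equivalent context-manager form (the DataFrame here is hypothetical):
import pandas as pd

df = pd.DataFrame({'name': [u'café', u'Ñandú']})  # hypothetical data
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False)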
Maybe try this function for your columns if you can't get Excel to cooperate. It will remove the accents using the unicodedata library:
import unicodedata

def remove_accents(input_str):
    # 'unicode' is the Python 2 text type
    if type(input_str) == unicode:
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    else:
        return input_str
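A hedged usage example (the DataFrame and column name are hypothetical):
df['name'] = df['name'].apply(remove_accents)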
I had the same problem, and writing to .xlsx and renaming to .csv didn't solve it (for application-specific reasons I won't go into here), nor was I able to successfully use an alternate encoding as Juliana Rivera recommended. 'Manually' writing the data as text worked for me:
with open(RESULT_FP + '.csv', 'w+') as rf:
    for row in output:
        row = ','.join(list(map(str, row))) + '\n'
        rf.write(row)
Sometimes I guess you just have to go back to basics.
I encountered a similar issue when attempting to read_json followed by a to_excel:
import pandas

df = pandas.read_json(myfilepath)
# causes garbled characters
df.to_excel(sheetpath, encoding='utf8')
# also causes garbled characters
df.to_excel(sheetpath, encoding='latin1')
Turns out, if I load the json manually with the json module first, and then export with to_excel, the issue doesn't occur:
import json

with open(myfilepath, encoding='utf8') as f:
    j = json.load(f)

df = pandas.DataFrame(j)
df.to_excel(sheetpath, encoding='utf8')
