Cannot write Japanese characters with pandas in Python

I'm trying to write data containing Japanese characters to a CSV file, but the characters in the resulting CSV are not correct:
import pandas as pd

def write_csv(columns, data):
    df = pd.DataFrame(data, columns=columns)
    # Raw string so the backslashes aren't treated as escape sequences
    df.to_csv(r"..\Report\Report.csv", encoding='utf-8')

write_csv(["法人番号", "法人名称", "法人名称カナ"], [])
and the CSV comes out as:
æ³•äººç•ªå· æ³•äººå称 法人å称カナ
How can I accomplish this?

Your code is OK; I just tried it. I'm guessing the CSV file is good, but you're opening it as cp1252 instead of UTF-8.
What software are you using to open this CSV?
If you're using Microsoft Excel, make sure to use "Import" instead of "Open" so that you can choose the encoding.
With Google Sheets or LibreOffice it should Just Work.
Another possible explanation is that there's something wrong with your data in the first place. Here's how you can check that (I just took a few random characters from this generator):
import pandas as pd

df = pd.DataFrame(['勘してろむ説彼ふて惑岐とや尊続セヲ狭題'])
df.to_csv('report.csv', encoding='utf-8')
Try opening that the same way. If it opens correctly but the other doesn't, the problem is in your code.

For me, utf_8_sig worked like a charm.
df.to_csv(r"..\Report\Report.csv", encoding='utf_8_sig')
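utf_8_sig writes a UTF-8 byte order mark (BOM) at the start of the file, which is what Excel keys on to auto-detect UTF-8. A minimal check that the BOM actually made it into the file (assuming the same path as above):
# The UTF-8 BOM is the three bytes EF BB BF at the start of the file.
with open(r"..\Report\Report.csv", "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")  # True when the BOM is present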

Related

Find & replace data in a CSV using Python on Zapier

I'm new to Python, Zapier and pretty much everything, so forgive me if this is easy or impossible...
I'm trying to import multiple CSVs into Zapier for an automated workflow, however they contain dot points that aren't encoded as UTF-8, which is all Zapier can read.
It consistently errors with:
"'utf-8' codec can't decode byte 0x95 in position 829: invalid start byte"
After talking to Zapier support, they've suggested using Python to find and replace these dot points with an asterisk or dash, then import the corrected CSV into my Zapier workflow.
This is what I have written so far as a Python action in Zapier (just trying to read the CSV to start with), with no luck:
import csv
with open(input_data['file'], 'r') as file:
reader = csv.reader(file)
for row in reader:
print(row)
Is this possible?
Thanks!
[Screenshot: Zapier trying to import a CSV with bullet points]
[Screenshot: my current Python code (not working), attempting to find & replace bullet points in the CSVs]
This is possible, but it's a little tricky. Zapier is confusing when it comes to files. On your computer, files are a series of bytes. But in Zapier, a file is usually a url that points to the actual file. This is great for cross-app compatibility, but tricky to work with in code.
You're trying to open a URL as a file in Python, which isn't working. Instead, make a request for that file, then read its content as bytes. Try this:
import csv
import io
import requests

file_data = requests.get(input_data['file'])
reader = csv.reader(file_data.content.decode('utf-8').splitlines(), delimiter=',')
result = io.StringIO()  # a string buffer to collect the modified CSV
writer = csv.writer(result)
for row in reader:
    # row is a list of cell strings; make your modifications here, e.g.
    # row = [cell.replace('\u2022', '-') for cell in row]
    writer.writerow(row)
return [{'data': result.getvalue()}]
The result buffer is there because you want to end up with a string that you can then re-package as a CSV in your cloud storage of choice (Google Drive, Dropbox, etc.).
You can also test this locally instead of in the Zapier editor (I find that's a bit easier to iterate with). Simply get the file URL from the code step (it'll be something like https://zapier.com/engine/...) and make a local Python file with:
input_data = {'file': 'https://zapier.com/engine/...'}
...
You'll also need to pip install requests if you don't have it.
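For the specific error in the question, byte 0x95 is the bullet character (•) in Windows-1252, so the file is most likely cp1252-encoded rather than UTF-8. Here is a sketch of a full local test under that assumption (the URL is a placeholder):
import csv
import io
import requests

# Placeholder URL; inside Zapier this would come from input_data['file'].
url = 'https://zapier.com/engine/...'

raw = requests.get(url).content
# 0x95 decodes to the bullet (•) in Windows-1252, which matches the error.
text = raw.decode('cp1252')

result = io.StringIO()
writer = csv.writer(result)
for row in csv.reader(text.splitlines()):
    # Replace bullets with dashes in every cell, as Zapier support suggested.
    writer.writerow([cell.replace('\u2022', '-') for cell in row])

print(result.getvalue())  # plain CSV text, safe to re-import as UTF-8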

Pandas DataFrame's accented characters appearing garbled in Excel

With:
# -*- coding: utf-8 -*-
at the top of my .ipynb, Jupyter is now displaying accented characters correctly.
When I export to csv (with .to_csv()) a pandas data frame containing accented characters:
... the characters do not render properly when the csv is opened in Excel.
This is the case whether I set the encoding='utf-8' or not. Is pandas/python doing all that it can here, and this is an Excel issue? Or can something be done before the export to csv?
Python: 2.7.10
Pandas: 0.17.1
Excel: Excel for Mac 2011
If you want to keep the accents, try encoding='iso-8859-1':
df.to_csv(path, encoding='iso-8859-1', sep=';')
I had a similar problem, also on a Mac. I noticed that the unicode string showed up fine when I opened the CSV in TextEdit, but showed up garbled when I opened it in Excel.
Thus, I don't think there is any way to successfully export unicode to Excel with to_csv, but I'd expect the default to_excel writer to suffice:
df.to_excel('file.xlsx', encoding='utf-8')
I ran into the same problem. When I checked the DataFrame in the Jupyter notebook, everything was in order.
The problem happens when I open the file directly (since it has a .csv extension, Excel opens it directly).
The solution for me was to open a new blank Excel workbook and import the file from the "Data" tab, like this:
1. Import External Data
2. Import Data from Text
3. Choose the file
4. In the import wizard window, under "File origin", choose "65001 : Unicode (UTF-8)"
Then I just chose the right delimiter, and that was it for me.
I think using a different Excel writer helps; I recommend xlsxwriter:
import pandas as pd
df = ...
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
df.to_excel(writer)
writer.save()
Maybe try this function on your columns if you can't get Excel to cooperate. It removes the accents using the unicodedata library:
import unicodedata

def remove_accents(input_str):
    # Python 2: only unicode strings need normalizing
    if isinstance(input_str, unicode):
        # Decompose accented characters into base character + combining mark,
        # then drop the combining marks.
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    else:
        return input_str
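For example, to strip accents from an entire DataFrame before exporting (a sketch, assuming df is the frame you're about to write out; applymap applies the function cell by cell):
# Apply remove_accents to every cell, then export as usual.
df = df.applymap(remove_accents)
df.to_csv('report.csv', encoding='utf-8')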
I had the same problem, and writing to .xlsx and renaming to .csv didn't solve the problem (for application-specific reasons I won't go into here), nor was I able to successfully use an alternate encoding as Juliana Rivera recommended. 'Manually' writing the data as text worked for me.
with open(RESULT_FP + '.csv', 'w+') as rf:
    for row in output:
        row = ','.join(list(map(str, row))) + '\n'
        rf.write(row)
Sometimes I guess you just have to go back to basics.
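On Python 3 you can keep the same manual approach but state the encoding explicitly (a sketch, assuming output is the same list of rows as above; note this does not quote cells containing commas, so reach for the csv module if your data might contain them):
# Same manual approach, with an explicit encoding; the BOM helps Excel
# detect UTF-8 when it opens the file.
with open('result.csv', 'w', encoding='utf-8-sig') as rf:
    for row in output:
        rf.write(','.join(map(str, row)) + '\n')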
I encountered a similar issue when attempting to read_json followed by a to_excel:
import pandas

df = pandas.read_json(myfilepath)
# causes garbled characters
df.to_excel(sheetpath, encoding='utf8')
# also causes garbled characters
df.to_excel(sheetpath, encoding='latin1')
Turns out, if I load the json manually with the json module first, and then export with to_excel, the issue doesn't occur:
import json

with open(myfilepath, encoding='utf8') as f:
    j = json.load(f)
df = pandas.DataFrame(j)
df.to_excel(sheetpath, encoding='utf8')

How to open a .data file extension

I am working on some side stuff where the data provided is in a .data file. How do I open a .data file to see what the data looks like, and also how do I read from a .data file programmatically in Python? I'm on Mac OS X.
NOTE: The data I am working with is for one of the KDD Cup challenges.
Kindly try using Notepad or Gedit to check the delimiters in the file (.data files are text files too). Once you have confirmed this, you can use the read_csv method from the pandas library:
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")
It vastly depends on what is in it. It could be a binary file or it could be a text file.
If it is a text file then you can open it in the same way you open any file (f=open(filename,"r"))
If it is a binary file you can just add a "b" to the open command (open(filename,"rb")). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (csv python module) or an xml parsing library (an example of which is lxml)
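If you're unsure which case you have, here's a quick sketch to guess whether the file is text or binary (the filename is a placeholder):
# Peek at the first bytes; if they decode as UTF-8, it's probably text.
with open('file.data', 'rb') as f:
    head = f.read(256)
try:
    head.decode('utf-8')
    print('Looks like text')
except UnicodeDecodeError:
    print('Probably binary')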
After further info from the above and looking at the page, the format is:
Data Format
The datasets use a format similar to the text export format of relational databases:
One header line with the variable names
One line per instance
Tab separators between the values
Missing values appear as consecutive tabs
Therefore see this answer:
parsing a tab-separated file in Python
I would advise processing one line at a time rather than loading the whole file, but if you have the RAM, why not...
I suspect it doesn't open in Sublime because the file is huge, but that is just a guess.
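A minimal sketch of that line-at-a-time approach (the filename is a placeholder; empty strings are the missing values described above):
import csv

with open('dataset.data', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        # Consecutive tabs show up as empty strings; map them to None.
        row = [value if value else None for value in row]
        print(row)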
To get a quick overview of what the file may contain, you could do this from a terminal using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
In case you forget to pass the -v option to cat, and it is a binary file, you could mess up your terminal and need to reset it:
$ reset
I was just dealing with this issue myself, so I thought I would share my answer. I have a .data file and was unable to open it by simply right-clicking it. macOS recommended I open it with Xcode, so I tried that, but it did not work.
Next I tried opening it with a program named Brackets. It is a text editor primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm, as I am a Python programmer. PyCharm worked as well, and I was also able to read from the file with the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
    print(line, end="")
This works for me:
import pandas as pd
# define your file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
In other words, just treat it as a CSV file if it is separated with ','.
Solution from #mustious.
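If the file has no header row (common for these KDD/UCI .data files), tell pandas so, and also tell it what marks a missing value (a sketch; the '?' marker is an assumption about your particular file):
import pandas as pd

# header=None because most UCI-style .data files ship without a header row;
# na_values='?' is an assumption -- check what your file actually uses.
data = pd.read_csv('processed.cleveland.data', header=None, na_values='?')
print(data.head())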

Python - Convert CSV to DBF

I would like to convert a CSV file to DBF using Python (for use in geocoding, which is why I need the DBF file). I can easily do this in Stat/Transfer or similar programs, but I would like to do it as part of my script rather than going out to an external program. There appear to be a lot of questions/answers about converting DBF to CSV, but I am not having any luck the other way around.
An answer using dbfpy is fine; I just haven't figured out exactly how to do it.
As an example of what I am looking for, here is some code I found online for converting dbf to csv:
import csv, arcgisscripting
from dbfpy import dbf

gp = arcgisscripting.create()
try:
    inFile = gp.GetParameterAsText(0)   # Input
    outFile = gp.GetParameterAsText(1)  # Output
    dbfFile = dbf.Dbf(open(inFile, 'r'))
    csvFile = csv.writer(open(outFile, 'wb'))
    headers = range(len(dbfFile.fieldNames))
    allRows = []
    for row in dbfFile:
        rows = []
        for num in headers:
            rows.append(row[num])
        allRows.append(rows)
    csvFile.writerow(dbfFile.fieldNames)
    for row in allRows:
        print row
        csvFile.writerow(row)
except:
    print gp.getmessage()
It would be great to get something similar for going the other way around.
Thank you!
Duplicate question at: Convert .csv file into .dbf using Python?
A promising answer there (among others) is:
Use the csv library to read your data from the csv file. The third-party dbf library can write a dbf file for you.
For example, you could try:
http://packages.python.org/dbf/
http://code.activestate.com/recipes/362715-dbf-reader-and-writer/
You could also just open the CSV file in OpenOffice or Excel and save it in dBase format.
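As a rough illustration of the csv-plus-dbf suggestion above, here is a minimal sketch using the third-party dbf package (pip install dbf). The field specs and filenames are made up, and the exact open()/append() details vary between versions of the library, so treat this as a starting point rather than a definitive recipe:
import csv
import dbf

# dBase-style field specs: C(n) = character field, N(w,d) = numeric field.
# These names and widths are hypothetical; match them to your CSV columns.
table = dbf.Table('out.dbf', 'name C(50); value N(10,2)')
table.open(mode=dbf.READ_WRITE)

with open('in.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        table.append((row[0], float(row[1])))

table.close()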
I assume you want to create attribute files for the Esri Shapefile format or something like that. Keep in mind that DBF files usually use ancient character encodings like CP 850. This may be a problem if your geo data contains names in foreign languages. However, Esri may have specified a different encoding.
EDIT: just noted that you do not want to use external tools.

Get the inputs from Excel and use those inputs in python script

How do I get inputs from an Excel file and use those inputs in a Python script?
Take a look at xlrd
This is the best reference I found for learning how to use it: http://www.dev-explorer.com/articles/excel-spreadsheets-and-python
Not sure if this is exactly what you're talking about, but:
If you have a very simple Excel file (i.e. basically just one table filled with string values, nothing fancy), and all you want to do is basic processing, then I'd suggest just converting it to a CSV (comma-separated values file). This can be done via "Save as..." in Excel and selecting CSV.
This is just a file with the same data as the Excel file, except represented as lines separated with commas:
cell A:1, cell A:2, cell A:3
cell B:1, cell B:2, cell B:3
This is then very easy to parse using standard Python functions (e.g. readlines to get each line of the file; each line is then just a string you can split on ",").
This is of course only helpful in some situations, like when you get a log from a program and want to quickly run a Python script to handle it.
Note: as was pointed out in the comments, splitting the string on "," is actually not very robust, since you run into all sorts of problems (quoted fields containing commas, for instance). Better to use the csv module (which another answer here shows how to use).
import win32com.client

Excel = win32com.client.Dispatch("Excel.Application")
Excel.Workbooks.Open(file_path)  # file_path is the full path to the workbook
Cells = Excel.ActiveWorkbook.ActiveSheet.Cells
Cells(row, column).Value = Input  # write a value into a cell
Output = Cells(row, column).Value  # read a value back out
If you can save it as a CSV file with headers:
Attrib1, Attrib2, Attrib3
value1.1, value1.2, value1.3
value2.1, ...
then I would highly recommend looking at the built-in csv module.
With that you can do things like:
import csv

csvFile = csv.DictReader(open("csvFile.csv", "r"))
for row in csvFile:
    print row['Attrib1'], row['Attrib2']
