Pandas read excel with Chinese filename


I am trying to load as a pandas dataframe a file that has Chinese characters in its name.
I've tried:
df=pd.read_excel("url/某物2008.xls")
and
import sys
df=pd.read_excel("url/某物2008.xls", encoding=sys.getfilesystemencoding())
But the response is something like: "no such file or directory 'url/\xa1\xa92008.xls'"
I've also tried renaming the files with os.rename, but the filenames aren't even read properly (asking Python to just print the filenames yields only question marks or squares).

df=pd.read_excel(u"url/某物2008.xls", encoding=sys.getfilesystemencoding())
may work, but you may have to declare the source encoding at the top of the file (for example # -*- coding: utf-8 -*-).

Try this for Unicode conversion:
df=pd.read_excel(u"url/某物2008.xls", encoding='utf-8')
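
When the name typed in the code and the name stored on disk might differ only in how the Unicode characters are composed, one thing worth trying is to list the directory and compare normalized names. The following is a minimal sketch under that assumption, for Python 3; the helper name find_by_name and the temporary demo file are hypothetical and only there for illustration.

```python
import tempfile
import unicodedata
from pathlib import Path


def find_by_name(directory, wanted):
    """Return the entry in `directory` whose name matches `wanted`
    after NFC normalization, or None if there is no such entry."""
    target = unicodedata.normalize("NFC", wanted)
    for entry in Path(directory).iterdir():
        if unicodedata.normalize("NFC", entry.name) == target:
            return entry
    return None


# Demo in a throwaway directory (the file is empty, it only has the name).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "某物2008.xls").write_bytes(b"")
    found = find_by_name(tmp, "某物2008.xls")
    found_name = found.name if found else None
    missing = find_by_name(tmp, "missing.xls")
```

The path returned by find_by_name is the name exactly as the filesystem reports it, so it can be passed on to pd.read_excel instead of a hand-typed string.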

Related

Python pandas csv file unicode error and stuffs

I'm trying to read a csv file on python. The code goes like this -
import pandas as pd
df = pd.read_csv("C:\Users\User\Desktop\Inan")
print(df.head())
However, it keeps showing the unicode error. I tried adding the r prefix and changing the slashes in multiple ways, but it didn't work; it just showed different errors like "file not found". What can I do?

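Try this method; it may work. In a normal string, \U starts a Unicode escape, which is what triggers the unicode error; forward slashes avoid that.
df = pd.read_csv("C:/Users/User/Desktop/Inan.csv", encoding="utf-8")
Also include your file extension (.csv, .xlsx).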

Read input file as hexadecimal and output certain values, always fails on reading the file

What I'm trying to do is extract PNG images from files; by reading the hex data it's easy to find where they are hidden, because PNG images always start and end with certain values. I wrote a script that opens a .bin file, searches for those values and exports each match as a PNG. The problem is that in Python 2.7 nothing happens, and in Python 3 I get errors about the encoding of the file. I've tried the ignore errors and utf-8 encoding flags, but the problems still persist. The code in question:
import binascii
import re
import os
for directory, subdirectories, files in os.walk('.'):
    for file in files:
        if not file.endswith('.bin'):
            continue
        filenumber = 0
        with open(os.path.join(directory, file)) as f:
            hexaPattern = re.compile(
                r'(89504E47.*?AE426082)',
                re.IGNORECASE
            )
            for match in hexaPattern.findall(binascii.hexlify(f.read())):
                with open('{}-{}.png'.format(file, filenumber), 'wb+') as f:
                    f.write(binascii.unhexlify(match))
                filenumber += 1
So as you can see, it extracts the hex values beginning with "89504E47" from the imported file, and anything between that and "AE426082". I think the code for getting these values is fine, but I'm having trouble getting Python to actually read the file as hexadecimal. Thoughts?

Thank you @Thierry Lathuille, that fixed it. I used Python 3.9 and changed the open call to:
with open(os.path.join(directory, file), 'rb+') as f:
and everything was output correctly!
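
For reference, a minimal sketch of the same search done directly on raw bytes rather than on a hex string, assuming Python 3 and data read from a file opened in binary mode. The function name extract_pngs and the sample bytes are made up for illustration; the two markers are simply the byte form of 89504E47 and AE426082 from the question.

```python
import re

# Byte equivalents of the hex markers used in the question.
PNG_START = bytes.fromhex("89504E47")
PNG_END = bytes.fromhex("AE426082")

# DOTALL lets "." match newline bytes, which are common in binary data.
PNG_PATTERN = re.compile(
    re.escape(PNG_START) + b".*?" + re.escape(PNG_END),
    re.DOTALL,
)


def extract_pngs(data):
    """Return every byte run in `data` that starts with PNG_START
    and ends at the next PNG_END (non-greedy match)."""
    return PNG_PATTERN.findall(data)


# Synthetic input: two fake "images" surrounded by unrelated bytes.
sample = (
    b"junk" + PNG_START + b"\r\n\x1a\nfirst" + PNG_END
    + b"gap" + PNG_START + b"second" + PNG_END + b"tail"
)
found = extract_pngs(sample)
```

With a real file, data would come from open(path, 'rb').read(), and each element of the result could be written out with open(name, 'wb').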

Pandas Read CSV for file address with \t in it

This may be a redundant question because I know that I can rename the file and solve the issue, but I'm still pretty new at this and it would be really useful information for the future. Thanks in advance to respondents!
So, I have a CSV file which is a table exported from SQL with the filename "t_SQLtable" located in a sub-folder of my working directory.
In order to open the file in Pandas I use the following command:
SQLfile= pd.read_csv('SUBFOLDER\t_SQLtable.csv', sep=',')
This is the error I receive:
FileNotFoundError: [Errno 2] File SUBFOLDER _SQLtable.csv does not exist: 'SUBFOLDER\t_SQLtable.csv'
My understanding is that Pandas is reading the <\t> as a tab and thus is not able to find the file, because that's not the file name it is looking for. But I don't know how to format the text in order to tell Pandas how to recognize the <t> as part of the filename. Would anyone know how to resolve this?
Thank you!

Folder paths can be written with /, which does not start an escape sequence:
SQLfile= pd.read_csv('SUBFOLDER/t_SQLtable.csv', sep=',')
In future, if you want to keep \t without it being treated as a tab, use a raw string. Compare:
print('SUBFOLDER\t_SQLtable.csv')
print(r'SUBFOLDER\t_SQLtable.csv')
which prints:
SUBFOLDER _SQLtable.csv
SUBFOLDER\t_SQLtable.csv

Try either of these:
SQLfile= pd.read_csv('SUBFOLDER\\t_SQLtable.csv', sep=',')
SQLfile= pd.read_csv('SUBFOLDER/t_SQLtable.csv', sep=',')
If that doesn't work, then try this:
import os
file_path = os.path.join(os.getcwd(), "SUBFOLDER", "t_SQLtable.csv")
SQLfile= pd.read_csv(file_path, sep=',')

Simply do what you did before, except add an r right before the string:
SQLfile = pd.read_csv(r'SUBFOLDER\t_SQLtable.csv', sep=',')
Adding r before the opening quote makes Python treat it as a raw string, so backslash escape sequences such as \t are not interpreted.
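
The spellings suggested in the answers above can be compared side by side. This is a small sketch; the variable names are only illustrative, and PureWindowsPath is used so the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

plain = 'SUBFOLDER\t_SQLtable.csv'      # \t is turned into a tab character
raw = r'SUBFOLDER\t_SQLtable.csv'       # raw string: backslash and t are kept
escaped = 'SUBFOLDER\\t_SQLtable.csv'   # escaped backslash, same text as raw
forward = 'SUBFOLDER/t_SQLtable.csv'    # forward slash, no escaping involved

has_tab = '\t' in plain
same_text = raw == escaped
# Windows paths treat / and \ as the same separator.
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)
```

Here has_tab, same_text and same_path all evaluate to True, which is why only the first spelling fails to find the file.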

How to open a .data file extension

I am working on side stuff where the data provided is in a .data file. How do I open a .data file to see what the data looks like and also how do I read from a .data file programmatically through python? I have Mac OSX
NOTE: The Data I am working with is for one of the KDD cup challenges

Kindly try using Notepad or Gedit to check delimiters in the file (.data files are text files too). After you have confirmed this, then you can use the read_csv method in the Pandas library in python.
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")

It vastly depends on what is in it. It could be a binary file or it could be a text file.
If it is a text file then you can open it in the same way you open any file (f=open(filename,"r"))
If it is a binary file you can just add a "b" to the open command (open(filename,"rb")). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (csv python module) or an xml parsing library (an example of which is lxml)
After further info from above, and after looking at the page, the format is:
Data Format
The datasets use a format similar as that of the text export format from relational databases:
One header lines with the variables names
One line per instance
Separator tabulation between the values
There are missing values (consecutive tabulations)
Therefore see this answer:
parsing a tab-separated file in Python
I would advise processing one line at a time rather than loading the whole file, but if you have the RAM, why not...
I suspect it doesn't open in Sublime because the file is huge, but that is just a guess.
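
The format described above (one header line, tab-separated values, empty fields for missing values) can be read one line at a time with the standard csv module. This is only a sketch: the function name iter_rows, the choice of None for missing values, and the sample columns are assumptions made for illustration, not part of the dataset description.

```python
import csv
import io


def iter_rows(lines):
    """Yield one dict per data line of a tab-separated source with a
    header line, turning empty fields (missing values) into None."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield {key: (value if value != "" else None)
               for key, value in row.items()}


# Small in-memory stand-in for a .data file; the second column of the
# first data line is missing (two consecutive tabs).
sample = io.StringIO("Var1\tVar2\tVar3\n1\t\t3\n4\t5\t6\n")
rows = list(iter_rows(sample))
```

For a real file, iter_rows can be given an open file object, for example open(path, newline=""), so that only one line is held in memory at a time.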

To get a quick overview of what the file may contain, you could do this within a terminal, using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
If you forget to pass the -v option to cat and the file is binary, you could mess up your terminal and need to reset it:
$ reset

I was just dealing with this issue myself, so I thought I would share my answer. I have a .data file and was unable to open it by simply right-clicking it. macOS recommended I open it using Xcode, so I tried that, but it did not work.
Next I tried opening it with a program named "Brackets". It is a text editor primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm, as I am a Python programmer. PyCharm worked as well, and I was also able to read from the file using the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
    print(line, end="")

It works for me.
import pandas as pd
# define your file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
In other words, just treat it as a CSV file if it is separated with ','.
Solution from @mustious.

Python Win32, how to save an XLS as a CSV?

I'm loading up a .xlsx with win32com and would like to save the results as a csv when I'm done.
myworkbook.SaveAs('results.csv')
gives me an xlsx file with a csv extension. How do I save as an actual CSV?

I think that if you add the type after the filename, it should work. (Can't test right now.)
I think the type for CSV (DOS) is 24.
myworkbook.SaveAs('results.csv', 24)

Here are the docs for saveAs:
http://msdn.microsoft.com/en-us/library/bb214129.aspx
from win32com.client import constants as c
myWorkBook.SaveAs('results.csv', c.xlCSV)

You have to specify the type after the filename.
For CSV the following modes are available:
xlCSV = 6          # Comma separated value.
xlCSVMac = 22      # Comma separated value.
xlCSVMSDOS = 24    # Comma separated value.
xlCSVWindows = 23  # Comma separated value.
The available file formats are listed in the XlFileFormat enumeration documentation, and the SaveAs method has its own reference page. Even though there is no example for Python, the parameters and values should be the same.

I have not used this library but it might be worth giving a shot:
http://pypi.python.org/pypi/ooxml
