How to open a .data file extension - python

I am working on side stuff where the data provided is in a .data file. How do I open a .data file to see what the data looks like and also how do I read from a .data file programmatically through python? I have Mac OSX
NOTE: The Data I am working with is for one of the KDD cup challenges

Kindly try using Notepad or Gedit to check delimiters in the file (.data files are text files too). After you have confirmed this, then you can use the read_csv method in the Pandas library in python.
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")

It vastly depends on what is in it. It could be a binary file or it could be a text file.
If it is a text file then you can open it in the same way you open any file (f=open(filename,"r"))
If it is a binary file you can just add a "b" to the open command (open(filename,"rb")). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (csv python module) or an xml parsing library (an example of which is lxml)
After further into from above and looking at the page the format is:
Data Format
The datasets use a format similar as that of the text export format from relational databases:
One header lines with the variables names
One line per instance
Separator tabulation between the values
There are missing values (consecutive tabulations)
Therefore see this answer:
parsing a tab-separated file in Python
I would advise trying to process one line at a time rather than loading the whole file, but if you have the ram why not...
I suspect it doesnt open in sublime because the file is huge, but that is just a guess.

To get a quick overview of what the file may content you could do this within a terminal, using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
In case you forget to pass the -v option to cat and if is a binary file you could mess your terminal and therefore need to reset it:
$ reset

I was just dealing with this issue myself so I thought I would share my answer. I have a .data file and was unable to open it by simply right clicking it. MACOS recommended I open it using Xcode so I tried it but it did not work.
Next I tried open it using a program named "Brackets". It is a text editing program primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm as I am a Python Programmer. Pycharm worked as well and I was also able to read from the file using the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
print(line, end="")

It works for me.
import pandas as pd
# define your file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
I mean that just take it as a csv file if it is seprated with ','.
solution from #mustious.

Related

How does one read a .dif file with Python

I am working on a project that requires me to read a file with a .dif extension. Dif stands for data information exchange. The file opens nicely in Open Office Calc. Then you can easily save as a csv file, however when I open in Python all I get are random characters that don't make sense. Here is the last code that I tried just to see if I could read.
txt = open('C:\myfile.dif', 'rb').read()
print txt
I would even be open to programatically converting the file to csv first. before opening if someone knows how to do that. As always, any help is much appreciated. Below is a partial screenshot of what I get when I run the code.
Hadn't heard of this file format. Went and got a sample here.
I tested your method and it works fine:
>>> content = open(r"E:\sample.dif", 'rb').read()
>>> print (content)
b'TABLE\r\n0,1\r\n"EXCEL"\r\nVECTORS\r\n0,8\r\n""\r\nTUPLES\r\n0,3\r\n""\r\nDATA\r\n0,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"Welcome to File Extension FYI Center!"\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n""\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"ID"\r\n1,0\r\n"Type"\r\n1,0\r\n"Description"\r\n-1,0\r\nBOT\r\n0,1\r\nV\r\n1,0\r\n"ASP"\r\n1,0\r\n"Active Server Pages"\r\n-1,0\r\nBOT\r\n0,2\r\nV\r\n1,0\r\n"JSP"\r\n1,0\r\n"JavaServer Pages"\r\n-1,0\r\nBOT\r\n0,3\r\nV\r\n1,0\r\n"PNG"\r\n1,0\r\n"Portable Network Graphics"\r\n-1,0\r\nBOT\r\n0,4\r\nV\r\n1,0\r\n"GIF"\r\n1,0\r\n"Graphics Interchange Format"\r\n-1,0\r\nBOT\r\n0,5\r\nV\r\n1,0\r\n"WMV"\r\n1,0\r\n"Windows Media Video"\r\n-1,0\r\nEOD\r\n'
>>>
The question is what is in the file and how do you want to handle it. Personally I liked:
with open(r"E:\sample.dif", 'rb') as f:
for line in f:
print (line)
In the first code block, that long line that has a b'' (for bytes!) in front of it can be iterated on \r\n:
b'TABLE\r\n'
b'0,1\r\n'
b'"EXCEL"\r\n'
b'VECTORS\r\n'
b'0,8\r\n'
b'""\r\n'
b'TUPLES\r\n'
b'0,3\r\n'
b'""\r\n'
b'DATA\r\n'
b'0,0\r\n'
.
.
.
b'"Windows Media Video"\r\n'
b'-1,0\r\n'
b'EOD\r\n'

python cannot open and edit a .reg file

I am trying to edit a .reg file in python to replace strings in a file. I can do this for any other file type such as .txt.
Here is the python code:
with open ("C:/Users/UKa51070/Desktop/regFile.reg", "r") as myfile:
data=myfile.read()
print data
It returns an empty string
I am not sure why you are not seeing any output, perhaps you could try:
print len(data)
Depending on your version of Windows, your REG file will be saved using UTF-16 encoding, unless you specifically export it using the Win9x/NT4 format.
You could try using the following script:
import codecs
with codecs.open("C:/Users/UKa51070/Desktop/regFile.reg", encoding='utf-16') as myfile:
data = myfile.read()
print data
It's probably not a good idea to edit .reg files manually. My suggestion is to search for a Python package that handles it for you. I think the _winreg Python built-in library is what you are looking for.

Excel fails to open Python-generated CSV files

I have many Python scripts that output CSV files. It is occasionally convenient to open these files in Excel. After installing OS X Mavericks, Excel no longer opens these files properly: Excel doesn't parse the files and it duplicates the rows of the file until it runs out of memory. Specifically, when Excel attempts to open the file, a prompt appears that reads: "File not loaded completely."
Example of code I'm using to generate the CSV files:
import csv
with open('csv_test.csv', 'wb') as f:
writer = csv.writer(f)
writer.writerow([1,2,3])
writer.writerow([4,5,6])
Even the simple file generated by the above code fails to load in Excel. However, if I open the CSV file in a text editor and copy/paste the text into Excel, parse it with text to columns, and then save as CSV from Excel, then I can reopen the CSV file in Excel without issue. Do I need to pass an additional parameter in my scripts to make Excel parse the CSV files the same way it used to? Or is there some setting I can change in OS X Mavericks or Excel? Thanks.
Maybe I had the similar problem, the error message "SYLK: File format is not valid" when open python autogenerated csv file. The solution is really funny. The first two characters must not be I and D in uppercase (ID). Also see "SYLK: File format is not valid" error message when you open file.
Possible solution1: use *.txt instead of *.csv. In this case Excel (at least, 2010) will show you an import data wizard where you can specify delimiters, character encoding, field types, etc.
UPD: Solution2:
The python "csv" module has a "dialect" feature. For example, the following modification of your code generates valid csv file for my environment (Python 2.7, Excel 2010, Windows7, locale with ";" list delimiters):
import csv
with open('csv_test2.csv', 'wb') as f:
csv.excel.delimiter=';'
writer = csv.writer(f, dialect=csv.excel)
writer.writerow([1,2,3])
writer.writerow([4,5,6])

Error with urlopen: new-line character seen in unquoted field

I am using urllib.urlopen with Python 2.7 to read csv files located on an external webserver:
# Try & Except statements removed for clarity
import urllib
import csv
url = ...
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
do_something()
All 100+ files can be read fine, except one that has been updated recently and that returns:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
The file is accessible here. According to my text editor, its mode is Mac (CR), as opposed to Windows (CRLF) for the other files.
I found that based on this thread, python urlopen will handle correctly all formats of newlines. Therefore, the problem is likely to come from somewhere else. I have no clue though. The file opens fine with all my text editors and my speadsheet editors.
Does any one have any idea how to diagnose the problem ?
* EDIT *
The creator of the file informed me by email that I was not the only one to experience such issues. Therefore, he decided to make it again. The code above now works fine again. Unfortunately, using a new file also means that the issue can no longer be reproduced, and the solutions tested properly.
Before closing the question, I want to thank all the stackers who dedicated some of their time to figure out a solution and post it here.
It might be a corrupt .csv file? Otherwise, this code runs perfectly.
#!/usr/bin/python
import urllib
import csv
url = "http://www.football-data.co.uk/mmz4281/1213/I1.csv"
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
print row
Credits to J.F. Sebastian for the .csv file.
Altough, you might want to consider sharing the specific .csv file with us? So we can try to re-create the error.
The following code runs without any error:
#!/usr/bin/env python
import csv
import urllib2
r = urllib2.urlopen('http://www.football-data.co.uk/mmz4281/1213/I1.csv')
for row in csv.reader(r):
print row
I was having the same problem with a downloaded csv.
I know the fix would be to use open with 'rU'. But I would rather not have to save the file to disk, just to open back up into a variable. That seems unnecessary.
file = open(filepath,'rU')
mydata = csv.reader(file)
So if someone has a better solution that would be nice. Stackoverflow links that got me this far:
CSV new-line character seen in unquoted field error
Open the file in universal-newline mode using the CSV Django module
I found what I actually wanted with stringIO, or cStringIO, or io:
Using Python, how do I to read/write data in memory like I would with a file?
I ended up getting io working,
import csv
import urllib2
import io
# warning its a 20MB csv
url = 'http://poweredgec.com/latest_poweredge-11g.csv'
urlRead = urllib2.urlopen(url).read()
ramFile = io.open(urlRead, mode='w')
openRamFile = open(ramFile, 'rU')
csvCurrent = csv.reader(openRamFile)
csvTuple = map(tuple, csvCurrent)
print csvTuple

Get the inputs from Excel and use those inputs in python script

How to get the inputs from excel and use those inputs in python.
Take a look at xlrd
This is the best reference I found for learning how to use it: http://www.dev-explorer.com/articles/excel-spreadsheets-and-python
Not sure if this is exactly what you're talking about, but:
If you have a very simple excel file (i.e. basically just one table filled with string-values, nothing fancy), and all you want to do is basic processing, then I'd suggest just converting it to a csv (comma-seperated value file). This can be done by "saving as..." in excel and selecting csv.
This is just a file with the same data as the excel, except represented by lines seperated with commas:
cell A:1, cell A:2, cell A:3
cell B:1, cell B:2, cell b:3
This is then very easy to parse using standard python functions (i.e., readlines to get each line of the file, then it's just a list that you can split on ",").
This if of course only helpful in some situations, like when you get a log from a program and want to quickly run a python script which handles it.
Note: As was pointed out in the comments, splitting the string on "," is actually not very good, since you run into all sorts of problems. Better to use the csv module (which another answer here teaches how to use).
import win32com
Excel=win32com.client.Dispatch("Excel.Application")
Excel.Workbooks.Open(file path)
Cells=Excel.ActiveWorkBook.ActiveSheet.Cells
Cells(row,column).Value=Input
Output=Cells(row,column).Value
If you can save as a csv file with headers:
Attrib1, Attrib2, Attrib3
value1.1, value1.2, value1.3
value2,1,...
Then I would highly recommend looking at built-in the csv module
With that you can do things like:
csvFile = csv.DictReader(open("csvFile.csv", "r"))
for row in csvFile:
print row['Attrib1'], row['Attrib2']

Categories