I am new to Python and machine learning. I have a data file on which I want to apply binary classification, but I am unable to guess its format or load it in Python. Can someone help me out here?
In the dataset, the first column is the class and there are 100 features. I am using pandas I/O to load it and tried read_csv, but it's not working. It's also definitely not JSON. (I have only used these formats so far, so pardon me in advance if it is some well-known format!)
You can try sklearn.datasets.load_svmlight_file to read the file.
Here's an example, adapted from the documentation, of how to use it:
from joblib import Memory  # sklearn.externals.joblib is deprecated; import joblib directly
from sklearn.datasets import load_svmlight_file

# Cache the parsed file on disk so repeated runs don't re-parse it
mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("mysvmlightfile")
    return data[0], data[1]

X, y = get_data()
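Note that load_svmlight_file returns the features X as a scipy.sparse matrix and the labels y as a NumPy array; most scikit-learn estimators accept the sparse matrix directly, so you usually don't need to convert it to a dense array.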
It's a plain text file. Judging by the first row, it looks like the libsvm (svmlight) format.
See this for a reference.
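For illustration (the values below are made up), each line in that format is <label> <index>:<value> ..., with feature indices starting at 1 and zero-valued features omitted:
+1 1:0.43 3:0.12 100:0.2
-1 2:0.87 10:1.0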
This may be a simple question, and I apologize if it's too simple. But I have some data in a CSV:
Date,Open,High,Low,Close,Adj Close,Volume
1993-01-29,43.968750,43.968750,43.750000,43.937500,26.453930,1003200
1993-02-01,43.968750,44.250000,43.968750,44.250000,26.642057,480500
1993-02-02,44.218750,44.375000,44.125000,44.343750,26.698507,201300
1993-02-03,44.406250,44.843750,44.375000,44.812500,26.980742,529400
1993-02-04,44.968750,45.093750,44.468750,45.000000,27.093624,531500
1993-02-05,44.968750,45.062500,44.718750,44.968750,27.074818,492100
1993-02-08,44.968750,45.125000,44.906250,44.968750,27.074818,596100
1993-02-09,44.812500,44.812500,44.562500,44.656250,26.886669,122100
....
I want to create a "training set": basically 10 rows of data (I can figure out the normalizing, etc.) sampled at random from anywhere in the file. I think I'll have to use pandas to do the loading, maybe?
If what I'm trying to ask is unclear, please add comments and I will adjust the question accordingly. Thank you.
import pandas as pd

# Load the CSV and draw 10 rows uniformly at random
sample = pd.read_csv('myfile.csv').sample(n=10)
You should load the file only once and then sample from it as you go:
df = pd.read_csv('myfile.csv')
sample1 = df.sample(n=10)
sample2 = df.sample(n=10)
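If you need the draws to be reproducible (useful when comparing experiments), note that sample also accepts a random_state seed:
sample1 = df.sample(n=10, random_state=42)  # returns the same 10 rows every run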
To read a CSV, you need to import pandas. Use this code:
import pandas as pd
data = pd.read_csv("filename.csv")
Put filename.csv in quotation marks. If your file is in a different folder, pass the full path in quotes, e.g.
"C:/Users/user/Desktop/folder/file.csv"
(Forward slashes work on Windows too and avoid backslash-escape problems.)
I struggled with the following for a couple of hours yesterday. I figured out a workaround, but I'd like to understand a little more of what's going on in the background and, ideally, I'd like to remove the intermediate file from my code just for the sake of elegance. I'm using Python, by the way, and files_df starts off as a pandas DataFrame.
Can you help me understand why the following code gives me an error?
files_json = files_df.to_json(orient='records')
for file_json in files_json:
    print(file_json)  # do stuff
But this code works?
files_json = files_df.to_json(orient='records')
with open('export_json.json', 'w') as f:
    f.write(files_json)

with open('export_json.json') as data:
    files_json = json.load(data)

for file_json in files_json:
    print(file_json)  # do stuff
Obviously, the export/import is converting the data somehow into a usable format. I would like to understand that a little better and know if there is some option within the pandas files_df.to_json command to perform the same conversion.
json.load is the opposite of json.dump: you export the DataFrame to a file as a JSON string, then the standard library parses it back into a Python structure (here, a list of dicts). The key point is that to_json returns a plain string, so iterating over files_json directly just iterates over its characters one at a time.
Try files_df.to_dict(orient='records') instead; it gives you the list of dicts directly, with no JSON step at all.
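A sketch of both file-free routes, assuming files_df is your DataFrame (the stand-in columns below are made up for illustration):
import json
import pandas as pd

files_df = pd.DataFrame({'name': ['a.txt', 'b.txt'], 'size': [1, 2]})  # stand-in data

# Route 1: parse the JSON string in memory instead of via a file
records = json.loads(files_df.to_json(orient='records'))

# Route 2: skip JSON entirely
records = files_df.to_dict(orient='records')

for file_json in records:
    print(file_json)  # each item is a dict, one per row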
I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the quite large JSON file I was working with. I tried a Python module called sqlbiter but, like the others, was never really able to get it to output or convert the file.
I'm not sure where to go now. If anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can use the pandas module for this data-processing task as follows:
First, read the JSON file using with, open, and json.load.
Second, reshape the data a bit: turn the large dictionary that has one main key per airport into a list of dictionaries instead.
Third, let pandas convert that list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, use pandas' to_csv function to write your DataFrame to disk as a CSV file.
It would look something like this:
import pandas as pd
import json
# Read the raw JSON: one big dict keyed by airport
with open('./airports.json.txt', 'r') as f:
    j = json.load(f)

l = list(j.values())  # {airport: {...}, ...} -> [{...}, {...}, ...]
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
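Since you mentioned SQLite as an alternative target: the same DataFrame can be written straight into a database with to_sql. A minimal sketch using the standard-library sqlite3 module (the database and table names here are just examples):
import sqlite3

conn = sqlite3.connect('airports.db')  # creates the file if it doesn't exist
df.to_sql('airports', conn, if_exists='replace', index=False)
conn.close()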
You need to load your JSON file and parse it so that all the fields are available, or load the contents into a dictionary. Then you could use pyodbc to write those fields to the database, or write them to a CSV (import csv first).
But this is just a general idea; you need to study Python and how to do each step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."  # build your statement here
    cursor1.execute(sql_order)
cursor1.commit()  # commit once after the loop rather than per row
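If the target really is SQLite, you don't even need pyodbc; here is a minimal sketch with the standard-library sqlite3 module and parameterized inserts (the table, columns, and rows are made up for illustration):
import sqlite3

rows = [('LAX', 'Los Angeles'), ('JFK', 'New York')]  # stand-in data parsed from your JSON

conn = sqlite3.connect('airports.db')
conn.execute('CREATE TABLE IF NOT EXISTS airports (code TEXT, city TEXT)')
conn.executemany('INSERT INTO airports VALUES (?, ?)', rows)  # ? placeholders keep values safely quoted
conn.commit()
conn.close()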
In the scikit-learn Python library there are many datasets that can be accessed easily with the following commands. For example, to load the iris dataset:
from sklearn import datasets

iris = datasets.load_iris()
And we can now assign data and target/label variables as follows:
X = iris.data    # assigns the feature matrix to X
Y = iris.target  # assigns the labels to Y
My question is: how can I build a similar data dictionary from my own data, whether in CSV, XML, or any other format, so the data can be loaded easily and the features/labels accessed just as conveniently?
Is this possible? Can someone help me?
By the way, I am using the Spyder (Anaconda) platform by Continuum.
Thanks!
I see at least two (easy) solutions to your problem.
First, you can store your data in whichever structure you like.
# Storing in a list
my_list = []
my_list.append(iris.data)
my_list[0] # your data
# Storing in a dictionary
my_dict = {}
my_dict["data"] = iris.data
my_dict["data"] # your data
Or, you can create your own class:
class MyStructure:
    def __init__(self, data, target):
        self.data = data
        self.target = target

my_class = MyStructure(iris.data, iris.target)
my_class.data  # your data
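Worth noting: the object that load_iris() returns is a Bunch (a dict that also allows attribute access), and you can build one yourself from sklearn.utils so your own data behaves exactly like the built-in datasets (X and y below are placeholders for whatever you load):
from sklearn.utils import Bunch
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # your feature matrix
y = np.array([0, 1])                    # your labels

my_dataset = Bunch(data=X, target=y)
my_dataset.data       # attribute access, as with iris.data
my_dataset['target']  # dict-style access also works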
Hope it helps
If ALL you want to do is read data from CSV files and have it organized, I would recommend simply using either pandas or NumPy's genfromtxt function.
mydata = numpy.genfromtxt(filepath, *params)
If the CSV is formatted regularly, you can, for example, pick up the names of each column by specifying:
mydata = numpy.genfromtxt(filepath, unpack=True, names=True, delimiter=',')
Then you can access any column's data you want by simply typing its name/header:
mydata['your header']
(Pandas also has a similar convenient way of grabbing data in an organized manner from CSV or similar files.)
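For completeness, a quick sketch of the pandas equivalent (mirroring the genfromtxt call above; the header name is hypothetical):
import pandas as pd

filepath = 'myfile.csv'  # same path you would pass to genfromtxt
mydata = pd.read_csv(filepath)
mydata['your header']  # columns are accessed by name, just as with genfromtxt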
However if you want to do it the long way and learn:
Simply put, you want to write a class for the data that you are using, complete with its own access, modify, read, and #dosomething methods. Instead of giving code for this, I think you would benefit more from reading, for example, the iris class, or an introduction to simple classes from any beginner's guide to object-oriented programming.
To do what you want, for an object MyData, you could have, for example:
a read(#file) method that reads from a given file of some expected format and returns some specified structure (for reading from CSV files, you can simply use NumPy's loadtxt method),
a modify(#some attribute) method,
etc.
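A minimal sketch of such a class (the CSV layout, with the label in the first column, and the method names are assumptions for illustration):
import numpy as np

class MyData:
    """Features and labels loaded from a CSV file."""
    def __init__(self, data=None, target=None):
        self.data = data
        self.target = target

    def read(self, path):
        raw = np.loadtxt(path, delimiter=',')  # assumes a purely numeric CSV
        self.target = raw[:, 0]  # label assumed to be in the first column
        self.data = raw[:, 1:]   # remaining columns are the features
        return self

mydata = MyData().read('myfile.csv')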
I came across a DF file which is encoded in a binary format, but when I open it in Vim I can still see strings like "pandas.core.frame" and "numpy.core.multiarray", so I guess it is related to Python. However, I know little about the Python language, and though I have tried the pandas and numpy modules, I failed to read the file. Could you give any suggestions on this issue? Thank you in advance. Here is the Dropbox link to the DF file: https://www.dropbox.com/s/b22lez3xysvzj7q/flux.df
Looks like a DataFrame stored with pickle; use pandas' read_pickle() to read it:
import pandas as pd
df = pd.read_pickle('flux.df')
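Once loaded, you can inspect it like any DataFrame. One caveat worth knowing: pickle files can execute arbitrary code when loaded, so only unpickle files from sources you trust.
print(df.head())   # first few rows
print(df.dtypes)   # column types
print(df.shape)    # (rows, columns)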