I am new to Scikit-Learn and I want to convert a collection of data that I have already labelled into a dataset. I have converted the .csv file into a NumPy array; however, one problem I have run into is splitting the data into a training set based on the presence of a flag in the second column. I want to know how to access a particular row and column of a .csv file using pandas. The following is my code:
import numpy as np
import pandas as pd
import csv
import nltk
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from nltk.classify import ClassifierI
from statistics import mode
def numpyfy(fileid):
    data = pd.read_csv(fileid, encoding='latin1')
    #pd.readline(data)
    target = data["String"]
    data1 = data.ix[1:, :-1]
    #print(data)
    return data1
def learn(fileid):
    trainingsetpos = []
    trainingsetneg = []
    datanew = numpyfy(fileid)
    if(datanew.ix['Status']==1):
        trainingsetpos.append(datanew.ix['String'])
    if(datanew.ix['Status']==0):
        trainingsetneg.append(datanew.ix['String'])
    print(list(trainingsetpos))
You can use boolean indexing to split the data, something like:
import pandas as pd

def numpyfy(fileid):
    df = pd.read_csv(fileid, encoding='latin1')
    target = df.pop('String')
    data = df.iloc[1:, :-1]  # .ix is deprecated; use positional indexing
    return target, data

def learn(fileid):
    target, data = numpyfy(fileid)
    trainingsetpos = data[data['Status'] == 1]
    trainingsetneg = data[data['Status'] == 0]
    print(trainingsetpos)
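Here is a minimal, self-contained sketch of the boolean-indexing split above, using a toy DataFrame in place of the CSV (the column names "String" and "Status" come from the question; the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the labelled CSV data from the question.
df = pd.DataFrame({
    "String": ["good", "bad", "fine", "awful"],
    "Status": [1, 0, 1, 0],
})

# Boolean indexing: each comparison yields a boolean Series,
# which selects the matching rows when passed to df[...].
trainingsetpos = df[df["Status"] == 1]
trainingsetneg = df[df["Status"] == 0]
```

Each resulting frame keeps only the rows whose flag matches, with the original index preserved.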
I have a CSV file with rows and columns separated by commas. The file contains headers (strings) and values. I want to filter the data on a condition: for example, there is a column called "pmra", and I want to keep all rows whose pmra values are between -2.6 and -2.0. How can I do that? I tried np.where but it did not work. Thanks for your help.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
filename="NGC188_C.csv"
data = pd.read_csv(filename)
ra = data["ra"]
dec = data["dec"]
parallax = data["parallax"]
pm_ra = data["pmra"]
pm_dec = data["pmdec"]
g_band = data["phot_g_mean_mag"]
bp_rp = data["bp_rp"]
You can use something like:
data[(data["pmra"] >= -2.6) & (data["pmra"] <= -2)]
There is also another approach: you can use the between method:
data["pmra"].between(-2.6, -2)
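The two approaches are equivalent; a minimal sketch with made-up pmra values (the real file NGC188_C.csv has many more columns) shows that the comparison mask and between produce the same selection:

```python
import pandas as pd

# Toy stand-in for the catalogue; only the pmra column matters here.
data = pd.DataFrame({"pmra": [-3.0, -2.5, -2.2, -1.9, 0.4]})

# Explicit comparisons combined with & ...
mask_cmp = (data["pmra"] >= -2.6) & (data["pmra"] <= -2)
# ... and the equivalent between() call (inclusive on both ends by default).
mask_between = data["pmra"].between(-2.6, -2)

filtered = data[mask_between]
```

Note the parentheses around each comparison in the first form: `&` binds more tightly than `>=`, so omitting them raises an error.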
Suppose we have a pandas data frame df with a column ids and about 5 rows. In the code below, why do I still get the length of the filtered data frame as 5?
import pickle
import gzip
import bz2
import pandas as pd
import os
import sys
import _pickle as cPickle
from downcast import reduce
def load(filename):
    """
    Load from filename using pickle
    #param filename: name of file to load from
    #type filename: str
    """
    try:
        f = bz2.BZ2File(filename, 'rb')
    except OSError:
        sys.stderr.write('File ' + filename + ' cannot be read\n')
        return
    myobj = cPickle.load(f)
    f.close()
    return myobj
df=pd.DataFrame({"ids":[1,2,3,4,5]})
print(df.shape)
sfile = bz2.BZ2File('df_list_small', 'w')
pickle.dump(df, sfile)
This gives a shape of (5, 1).
df_new= load('df_list_small')
df_new = reduce(df_new)
all_groups = {ident:df_new for ident,df_new in df_new.groupby('ids')}
ids = 1
df_test = all_groups[ids]
print(df_test.shape)
This gives a shape of (1, 1).
So maybe it works only for certain files?
I figured it out. The filtered data frame had the same dimensions as the original only because, in that case, they happened to be equal. With a different ids column, the dimensions of the filtered data frame would have been different.
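A minimal sketch of what is going on, using the same toy ids column as above (and skipping the pickle round-trip, which does not affect the shapes):

```python
import pandas as pd

df = pd.DataFrame({"ids": [1, 2, 3, 4, 5]})

# groupby splits df into one sub-frame per distinct id;
# every id here is unique, so each group has exactly one row.
all_groups = {ident: g for ident, g in df.groupby("ids")}
df_test = all_groups[1]
```

Selecting one group therefore gives a (1, 1) frame; it would only match the original's shape if all rows shared a single id.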
I'm new to Python, and I'm struggling with extracting the data linked to a max value.
Here's my code
import numpy as np
import pandas as pd
import statistics
import scipy
from pandas import read_csv
filename = 'airbnb.csv'
airbnb = pd.read_csv(filename)
airbnb.head()
table_quartier =airbnb.groupby(['Arrondissement','Quartier']).agg({'Nombre avis': "sum"})
arrondissement_plus = table_quartier.loc[table_quartier['Nombre avis'].idxmax()]
arrondissement_plus.reset_index()
**I'm looking to return 'Arrondissement' as the answer. How do I extract this info from the index found in 'arrondissement_plus'?**
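One way to see what is available: after grouping by two columns, idxmax returns a tuple from the MultiIndex, and its first element is the Arrondissement. A minimal sketch with made-up data (the real airbnb.csv has more columns and rows):

```python
import pandas as pd

# Toy stand-in for the Airbnb data; column names come from the question.
airbnb = pd.DataFrame({
    "Arrondissement": [1, 1, 2, 2],
    "Quartier": ["A", "B", "C", "D"],
    "Nombre avis": [10, 5, 30, 7],
})

table_quartier = airbnb.groupby(["Arrondissement", "Quartier"]).agg({"Nombre avis": "sum"})

# idxmax on a MultiIndexed frame gives the (Arrondissement, Quartier) label pair.
best = table_quartier["Nombre avis"].idxmax()
arrondissement = best[0]
```

So `best` is a tuple like `(2, 'C')`, and indexing its first element extracts the Arrondissement alone.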
I am working with a data frame titled price_df, and I would like to drop the rows that contain '4wd' in the column drive-wheels. I have tried price_df2 = price_df.drop(index='4wd', axis=0) and a few other variations after reading the pandas docs, but I keep getting errors. Could anyone point me to the correct way to drop the rows whose drive-wheels value is 4wd? Below is the code I ran before trying to drop the values:
# Cleaned up Dataset location
fileName = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
# Import libraries
from scipy.stats import norm
import numpy as np
import pandas as pd
import math
import numpy.random as nr
price_df = pd.read_csv(fileName)
round(price_df.head(),2) #getting an overview of that data
price_df.loc[:,'drive-wheels'].value_counts()
price_df2 = price_df.drop(index='4wd', axis=0)
You can use pd.DataFrame.query with backticks around the column name, since it contains a hyphen:
price_df.query('`drive-wheels` != "4wd"')
Try this:
price_df = pd.read_csv(fileName)
mask = price_df["drive-wheels"] =="4wd"
price_df = price_df[~mask]
Get a subset of your data with this one-liner (bracket indexing is required here, because the hyphen in drive-wheels makes attribute access invalid):
price_df2 = price_df[price_df['drive-wheels'] != '4wd']
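The three approaches above select the same rows; a minimal sketch with a toy frame in place of the automobile CSV (column names from the question, values made up) checks that they agree:

```python
import pandas as pd

# Toy stand-in for the automobile price data.
price_df = pd.DataFrame({
    "drive-wheels": ["4wd", "fwd", "rwd", "4wd"],
    "price": [100, 200, 300, 400],
})

# 1. query with backticks, needed because the column name has a hyphen.
via_query = price_df.query('`drive-wheels` != "4wd"')
# 2. build a boolean mask of the unwanted rows, then negate it.
mask = price_df["drive-wheels"] == "4wd"
via_mask = price_df[~mask]
# 3. direct bracket indexing on the inequality.
via_bracket = price_df[price_df["drive-wheels"] != "4wd"]
```

Note that `.drop(index='4wd')` fails because drop matches index *labels*, not column values; the row labels here are 0..3.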
I have recently started working with LabelBinarizer by running the following code. (here are the first couple of rows of the CSV file that I'm using):
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
#import matplotlib.pyplot as plot
#--------------------------------
label_conv = LabelBinarizer()
appstore_original = pd.read_csv("AppleStore.csv")
#--------------------------------
lb_conv = label_conv.fit_transform(appstore_original["cont_rating"])
column_names = label_conv.classes_
print(column_names)
print(lb_conv)
I get lb_conv and the column names. So my question is:
how can I attach the binarized columns to appstore_original, using column_names as the column names?
If anyone could help, that would be great.
try this:
lb = LabelBinarizer()
df = pd.read_csv("AppleStore.csv")
df = df.join(pd.DataFrame(lb.fit_transform(df["cont_rating"]),
                          columns=lb.classes_,
                          index=df.index))
To make sure that the newly created DataFrame has the same index elements as the original (which we need for the join), we specify index=df.index in the constructor call.
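A minimal, self-contained sketch of the same join, with a toy cont_rating column in place of AppleStore.csv (the rating values are made up but follow the App Store's format):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the AppleStore data.
df = pd.DataFrame({"cont_rating": ["4+", "9+", "12+", "4+"]})

lb = LabelBinarizer()
# fit_transform returns a NumPy one-hot array; wrap it in a DataFrame,
# naming the columns after the learned classes and reusing df's index.
onehot = pd.DataFrame(lb.fit_transform(df["cont_rating"]),
                      columns=lb.classes_,
                      index=df.index)
df = df.join(onehot)
```

After the join, df keeps the original cont_rating column plus one 0/1 column per class.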