I am practicing data processing with scikit-learn, and I'm looking at the Classification Probability example. I've successfully run the model using the data set from the built-in datasets import. Now I want to do the same thing with a CSV file, so I've downloaded the same dataset and am trying to load it into my code.
iris = np.loadtxt('./iris.csv', delimiter=',', skiprows=1)
X = iris.data[:, 0:2]
y = iris.target
However, I get an error stating ValueError: could not convert string to float: 'setosa'. I understand that this comes from the CSV, since 'setosa' is the name of the flower. Is there another way to import this CSV file that avoids this problem?
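For this you can use pandas, which reads the string column without raising an error:
import pandas as pd
data = pd.read_csv("iris.csv")
data.head()  # show the first 5 rows
X = data.drop(["target"], axis=1)  # "target" stands for the label column; use whatever name your CSV has
Y = data["target"]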
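Alternatively, you can try NumPy's genfromtxt, although I would personally recommend pandas:
from numpy import genfromtxt
# with the default float dtype, values that cannot be parsed (such as 'setosa') become nan
my_data = genfromtxt('my_file.csv', delimiter=',')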
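As a rough sketch of the pandas route, the snippet below rebuilds something like iris.data[:, 0:2] and iris.target from a CSV. The inline CSV text and its column names (including "species") are assumptions made for illustration; the real file may use different headers.

```python
import io
import pandas as pd

# A tiny stand-in for iris.csv; the header names are assumptions.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

data = pd.read_csv(io.StringIO(csv_text))

# First two feature columns as a NumPy array, similar to iris.data[:, 0:2]
X = data.iloc[:, 0:2].to_numpy()

# Encode the string labels as integer codes, similar to iris.target
y, class_names = pd.factorize(data["species"])
```

With a real file you would pass the path to pd.read_csv instead of the StringIO object.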
I am using sklearn for KNN regressor:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv")#pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
a = theta.get("YearBuilt")
b = theta.get("YrSold")
A = a.to_frame()
B = b.to_frame()
glasses = [A,B]
x = pd.concat(glasses)
#getting target data
y = theta.get("SalePrice")
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y)
I get this error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could someone please explain this? My target values are in the hundreds of thousands and my input values are in the thousands, and there are no blanks in the data.
Before answering the question, let me refactor the code. You are using a dataframe, so you can index a single field or multiple fields directly without the extra steps you've used:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv") # pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
x = theta[["YearBuilt", "YrSold"]] # index multiple fields
#getting target data
y = theta["SalePrice"] # index single field
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y) # fit KNN
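Regarding your error: it means the input passed to fit contains NaN or infinite values, or values too large for float64. In your original code the likely source is pd.concat(glasses), which by default stacks the two one-column frames on top of each other (axis=0). Each row then has a value in one column and NaN in the other, and x ends up with twice as many rows as y. Selecting both columns at once, as above, avoids that. If the CSV itself contains NaN or inf values, you can filter them out before building x and y:
import numpy as np
theta = theta.replace([np.inf, -np.inf], np.nan)
theta.dropna(inplace=True)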
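To see where NaNs can come from even when the CSV has no blanks, here is a small sketch comparing the default pd.concat behaviour with direct column selection. The three rows of data are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for train.csv
theta = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001],
    "YrSold": [2008, 2007, 2008],
    "SalePrice": [208500, 181500, 223500],
})

# Default concat stacks rows (axis=0): 6 rows, each with one NaN
stacked = pd.concat([theta[["YearBuilt"]], theta[["YrSold"]]])

# Selecting both columns keeps 3 rows and introduces no NaN
side_by_side = theta[["YearBuilt", "YrSold"]]

nan_count_stacked = int(stacked.isna().sum().sum())
nan_count_selected = int(side_by_side.isna().sum().sum())
print(stacked.shape, nan_count_stacked)
print(side_by_side.shape, nan_count_selected)
```

Passing axis=1 to pd.concat would also place the columns side by side.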
I have a DataFrame and I'm trying to apply some ML algorithms to it.
I'm using pandas to handle it, but I'm having several problems:
As you can see in the 3rd cell, I have split Y into Ytr and Yts. After this the dataframe loses its column names. I've tried to name the column again, but it doesn't work.
In the 4th cell, I'm trying to use a conditional statement to create a subset of Y in which the Y values are 1 (it is named ytr1), but it returns an empty dataframe.
Any suggestions on the whole code would be really appreciated, since I'm not very experienced with pandas.
Note: if you haven't worked with Jupyter notebooks, #%% just means a new cell.
#%%
from pandas import DataFrame as df
import random
import numpy as np
import pandas as pd
import re
#%%
# Preparing the DataFrame
labels = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\labels.csv', header=None)
ll = labels.loc[:, 0].tolist()
data = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\pima-indians-diabetes2.csv', names=ll)
i = data.columns.values.tolist() # i is the labels of the csv file
i[-1]
#%%
# Spliting the Dataset
X = data.drop(i[-1], axis=1)
Y = data.iloc[:, 8]
Y = Y.to_frame()
Y = pd.DataFrame(Y.values.reshape(-1, 1), columns=i[-1])
tr_idx = data.sample(frac=0.7).index
Xtr = df(X[X.index.isin(tr_idx)])
Xts = df(X[~X.index.isin(tr_idx)])
Ytr = df(Y[X.index.isin(tr_idx)], columns='result')
Yts = df(Y[~X.index.isin(tr_idx)], columns=i[-1])
#%%
# splitting the Classes
ytr1 = Ytr.drop(Ytr[Ytr.iloc[0]!=1].index)
X: all the columns except the labels/classes, which are 0 or 1
Y: the last column of the CSV file, loaded as labels
Xtr: the fraction of X that I'm planning to use for training
Xts: the fraction of X that I'm planning to use for testing
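One possible way to keep the column names through the split and to select the rows where the label is 1 is sketched below. The data and the column names ("glucose", "bmi", "result") are invented for illustration, and random_state is fixed only to make the sketch repeatable. Note that Ytr.iloc[0] in the code above selects the first row rather than a column, which is one reason the comparison does not behave as intended.

```python
import pandas as pd

# Invented stand-in for the Pima data; the last column holds the 0/1 label
data = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 27.2],
    "result": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

label = data.columns[-1]
X = data.drop(columns=label)
Y = data[[label]]  # double brackets keep a one-column DataFrame with its name

# 70% of the row labels for training, the remainder for testing
tr_idx = data.sample(frac=0.7, random_state=0).index
Xtr, Xts = X.loc[tr_idx], X.drop(tr_idx)
Ytr, Yts = Y.loc[tr_idx], Y.drop(tr_idx)

# Boolean mask on the label column: training rows whose label is 1
ytr1 = Ytr[Ytr[label] == 1]
```

Indexing with .loc and .drop on the same index keeps X and Y aligned without rebuilding the frames.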
I don't know whether my code is correct or not, but I got this error:
bad input shape (1, 301)
from sklearn import svm
import pandas as pd
clf = svm.SVC(gamma='scale')
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
x = [df.age[1:302], df.sex[1:302], df.cp[1:302], df.trestbps[1:302], df.chol[1:302], df.fbs[1:302], df.restecg[1:302], df.thalach[1:302], df.exang[1:302], df.oldpeak[1:302], df.slope[1:302], df.ca[1:302], df.thal[1:302]]
y = [df.target[1:302]]
clf.fit(x, y)
This is a very simple fix.
You need all the columns from df in x except the target column, for that, just do:
x = df.drop('target', axis=1)
And your target column will be:
y = df['target']
And now do your fit:
clf.fit(x, y)
It will work.
PS: What you were doing was passing a list of Series, one per feature, so each feature became a row instead of a column. What you need to do instead is pass the feature columns and the target directly from the dataframe.
Some more references for you to get started and keep going:
Read more about what to pass to the fit method here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit
Here is a super basic tutorial from the folks of scikit themselves: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
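The shape difference can be checked without fitting anything. This is a small sketch with invented values and only two of the feature columns; fit expects features shaped (n_samples, n_features) and a target shaped (n_samples,).

```python
import numpy as np
import pandas as pd

# Invented rows standing in for heart.csv
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

# A list of Series puts each feature on its own row: (n_features, n_samples)
list_of_series = [df.age, df.chol]
wrong_shape = np.asarray(list_of_series).shape

# Dropping the target keeps one row per sample: (n_samples, n_features)
x = df.drop("target", axis=1)
y = df["target"]
print(wrong_shape, x.shape, y.shape)
```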
Using CountVectorizer, I extracted feature vectors from thousands of emails and saved them in a CSV file:
dictionary = open (r'''C:\Users\User\Desktop\csmp3\stemmedDictionary.txt''',"r")
dic = list(set(dictionary.read().splitlines()))
cv = CountVectorizer(vocabulary = dic, binary = True)
#~PRESENCE FEATURE VECTOR~#
#TRAIN
pdt = open (r'''C:\Users\User\Desktop\csmp3\presence-dataset-training-stemmed.csv''',"w")
matWriter = csv.writer(pdt,delimiter = ',')
for i in range (1,2): #45252
    processed_email = open(r'''C:\Users\User\Desktop\csmp3\processed\processed'''+str(i)+'''.txt''',"r")
    presence_array = cv.transform(processed_email)
    matWriter.writerow(presence_array)
    processed_email.close()
pdt.close()
This is part of a Spam Filtering using Naive Bayes project and our data set is rather large. I'm hoping to use this sparse matrix for Bernoulli Naive Bayes' partial fit method. I just can't quite figure out how to load the sparse matrix from the file. I've already tried numpy.loadtxt but it gives me "ValueError: could not convert string to float: "
Any help would be appreciated! Thank you!
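One approach that may fit here is to skip CSV for the matrix and store it in SciPy's .npz format with scipy.sparse.save_npz, then read it back with scipy.sparse.load_npz and feed it to partial_fit in batches. The sketch below swaps in that technique; the vocabulary, emails, labels, batch size and file name are all invented for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Invented stand-ins for the stemmed dictionary, the emails and the labels
vocabulary = ["free", "meeting", "money", "report"]
emails = ["free money free", "meeting report", "money report", "free meeting"]
labels = np.array([1, 0, 0, 1])  # 1 = spam, 0 = not spam (assumed encoding)

cv = CountVectorizer(vocabulary=vocabulary, binary=True)
presence = cv.transform(emails)  # SciPy sparse matrix, one row per email

# Save the sparse matrix to disk and load it back
path = os.path.join(tempfile.mkdtemp(), "presence.npz")
sparse.save_npz(path, presence)
loaded = sparse.load_npz(path)

# Train incrementally, one batch of rows at a time
clf = BernoulliNB()
batch_size = 2
for start in range(0, loaded.shape[0], batch_size):
    stop = start + batch_size
    clf.partial_fit(loaded[start:stop], labels[start:stop], classes=[0, 1])
```

For a data set too large to transform in one go, each batch could be transformed and saved to its own .npz file instead.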
I need to create a synthetic dataset because I have to fix a clustering algorithm for my university thesis, and I need a small dataset to test it with.
I managed to create one with sklearn's make_classification, but the program takes as input a CSV file that contains the features of the dataset.
Does anyone know how I can create a synthetic dataset directly as CSV, or export the one created with sklearn to a CSV file?
You can export a NumPy array to a CSV file using numpy.savetxt.
This example uses a BytesIO instance as output; you would pass a file name instead.
In [1]: import io
In [2]: import numpy as np
In [3]: x = np.random.randn(5, 2)
In [4]: x
Out[4]:
array([[-0.13114465, -0.72491874],
       [-0.08375738, -1.23769691],
       [-0.5583027 , -0.24086865],
       [ 0.04590227, -0.6582806 ],
       [-0.21433652, -0.78924272]])
In [5]: buf = io.BytesIO()
In [6]: np.savetxt(buf, x, delimiter=',')
In [7]: print(buf.getvalue().decode())
-1.311446488105691699e-01,-7.249187409818331762e-01
-8.375738326459475358e-02,-1.237696910731503452e+00
-5.583026953882282983e-01,-2.408686450946319058e-01
4.590226685041418758e-02,-6.582805971999975414e-01
-2.143365241670896482e-01,-7.892427231682124233e-01
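If you also want a header row and the labels in the same file, a pandas-based variant might look like the sketch below. The column names ("feature_0" and so on, plus "label") and the make_classification parameters are assumptions chosen for illustration.

```python
import io

import pandas as pd
from sklearn.datasets import make_classification

# Small synthetic dataset; the parameter values are arbitrary
X, y = make_classification(
    n_samples=20, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Hypothetical column names for the CSV header
columns = [f"feature_{i}" for i in range(X.shape[1])]
frame = pd.DataFrame(X, columns=columns)
frame["label"] = y

# Write to an in-memory buffer; pass a file name to write to disk instead
buf = io.StringIO()
frame.to_csv(buf, index=False)

# Read it back to confirm the layout
round_trip = pd.read_csv(io.StringIO(buf.getvalue()))
```

For a clustering test you could leave out the label column so the file contains only the features.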