I am working on a machine learning project. I need to create two Python scripts:
1) one that trains and saves a classifier
2) one that uses that classifier to produce a text file of labels.
In the first script I just save the model. Then, in the second script, I apply that model to a different dataset containing text, produce the predicted labels (ham or spam), and save those predicted labels in a text file.
Basically, I have a list of texts with labels, ham or spam.
I created a classifier using the Linear Regression model. I had two different files of training data (texts_training and labels_training), so I loaded my training data into variables called texts and labels. Then I worked on the classifier. This is what I have for the classifier:
#classifier.py
import numpy as np

def features(words):
    fe = np.ndarray((len(words), 56))
    for t, text in enumerate(words):
        if "money" in text:
            money = 1
        else:
            money = 0
        # ...(55 more features)
        fe[i, :] = [money, ...]
    return fe

fe = features(words)
fe.shape
>>> (1000, 56)
from sklearn import preprocessing
X = fe
label = preprocessing.LabelEncoder()
label.fit(labels)
y = label.transform(labels)
y.shape
>>> (1000,)
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
#Model
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf = clf.fit(X, y)
import pickle
f = open("clf.pkl", "wb")  # pickle files must be opened in binary mode
pickle.dump(clf, f)
f.close()
Now, I am loading this model into a different script (both scripts are saved in the same folder). This script basically has to use that classifier to save the labels it produces in a .txt file.
#system.py
import numpy as np

def features(words):
    fe = np.ndarray((len(words), 56))
    for t, text in enumerate(words):
        if "money" in text:
            money = 1
        else:
            money = 0
        # ...(55 more features)
        feat[t, :] = [money, ...]
    return fe

fe = features(words)
X = feat
from sklearn import preprocessing
label = preprocessing.LabelEncoder()
label.fit(labels)
label = label.transform(labels)
y = label
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
import pickle
loaded_model = pickle.load(open('clf.pkl', 'rb'))  # binary mode, matching how the model was saved
class_output = loaded_model.predict(X)
print(class_output)
>>> array([ 0.06140778,  0.053107  ,  0.14343903, ...,  0.05701325,
            0.18738435, -0.08788421])
f = open("labels_produced.txt", "w")
for output in class_output:
    if output == 0:
        f.write("ham\n")
    else:
        f.write("spam\n")
f.close()
However, how do I compute spam or ham for the new dataset, since none of the values in class_output are equal to 0? My features were set to be either 0 or 1.
I am a beginner and I have been struggling with this all day. I do not understand why I get this error or how to fix it. If someone helps, I would really appreciate it.
You are trying to iterate over an object. That is what the error means:
'LinearRegression' object is not iterable
This can be seen by doing:
type(clf)
>>> sklearn.linear_model.base.LinearRegression
clf is a LinearRegression object, with its own series of attributes. You cannot iterate over it like you try to do in the line:
for output in class_output:
if output == 0:
# etc
You need to extract the required attributes from your LinearRegression object clf before you save them to pickle or before you try to iterate over them.
There are several attributes contained in the LinearRegression object.
For example, the following would give you the coefficients of your LinearRegression fit:
Coefficients = clf.coef_
Once you decide which attribute of clf you actually want to iterate over, you can extract it in the way shown above.
Edit: A list of attributes in the LinearRegression object is available here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
coef_ : array, shape (n_features, ) or (n_targets, n_features)
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
residues_ : array, shape (n_targets,) or (1,) or empty
Sum of residuals. Squared Euclidean 2-norm for each target passed during the fit. If the linear regression problem is under-determined (the number of linearly independent rows of the training matrix is less than its number of linearly independent columns), this is an empty array. If the target vector passed during the fit is 1-dimensional, this is a (1,) shape array.
New in version 0.18.
intercept_ : array
Independent term in the linear model.
Edit: Great example available here:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
Further question:
In your function features, it looks like you pass in your texts and then save whether a feature is 0 or 1 in the array feat. But at the end of the function, you return fe instead of returning feat. If you return feat you may get the information you want. Also, you iterate with the variables t, text but then in fe[i, :] = [money, ...] you assign the values in your array using the variable i. Should i be replaced with t?
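As a side note, since the target here is a binary category (ham/spam), a classification model such as LogisticRegression would be a more natural choice than LinearRegression: its predict method returns discrete class labels instead of continuous values, so the 0/1 comparison in the write loop would work. Below is a minimal sketch of how the two scripts could fit together, assuming the features function and the training texts and labels from the question are available; the file names, new_texts, and the use of LogisticRegression are my own placeholders, not the original code.

# train_and_save.py (hypothetical name) -- a sketch, not the original script
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X = features(texts)                 # (n_samples, 56) feature matrix from the question's function
encoder = LabelEncoder()
y = encoder.fit_transform(labels)   # "ham"/"spam" -> 0/1

clf = LogisticRegression()
clf.fit(X, y)

with open("clf.pkl", "wb") as f:    # pickle needs binary mode
    pickle.dump((clf, encoder), f)

# predict_and_write.py (hypothetical name) -- load the model and write discrete labels
import pickle

with open("clf.pkl", "rb") as f:
    clf, encoder = pickle.load(f)

X_new = features(new_texts)         # new, unlabelled texts (placeholder variable)
pred = clf.predict(X_new)           # array of 0s and 1s, not floats

with open("labels_produced.txt", "w") as f:
    for name in encoder.inverse_transform(pred):
        f.write(name + "\n")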
Related
I'm trying to use a basic Naive-Bayes Classifier in Python using VSC. My attempts all yield 0.0 accuracy.
This is the sample data: a CSV without a header, in the format
class,"['item1','item2','etc']"
The goal is to fit this data to a Multinomial NB model. This is my attempt at it:
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

df = pandas.read_csv('file.csv', delimiter=',', names=['class','words'], encoding='utf-8')
#x is independent var/feature
X = df.drop('class', axis=1)
#y is dependent var/label
Y = df['class']
#split data into train/test splits, use 25% of data for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
#create a sparse matrix of words; each word is assigned a number and its frequency is counted
#(i.e. word "x" occurs n amount of times in class Z), rows are classes, columns are words
cv = CountVectorizer()
X_tr = cv.fit_transform(X_train.words)
X_te = cv.transform(X_test.words)
model = MultinomialNB()
model.fit(X_tr, y_train)
y_pred = model.predict(X_te)
print(metrics.accuracy_score(y_test, y_pred))
# accuracy = accuracy_score(y_test,y_pred)*100
# print(accuracy)
As I understand it, the following occurs:
A dataframe, df, is created and split into X and Y (words and classes).
The data is split into training/testing groups.
The count vectorizer, cv, assigns an index to each word and counts how many times a certain word occurs in a certain class (word occurrences as numbers).
A multinomial NB model is created and fit with the training data (X_train.words is used so that only the "words" column is vectorized).
The model is tested with the testing data and an accuracy score is printed.
I've already tried:
Checking the shape of X_test and X_train: they match as I think they should, with an equal number of columns (words) and a 6:3 ratio of rows (classes, per the train/test split).
Checking the variable types: the training and testing X's are all sparse matrices (<class 'scipy.sparse.csr.csr_matrix'>) and the training/testing y's are, per the parameters of model.fit, array-like of shape (n_samples,) (pandas Series).
The issue is that the accuracy is 0.0, meaning something is wrong. Perhaps the greater issue is that I have no idea what.
The problem is that your whole data frame has a length of just 9 rows, so your model doesn't learn anything. Also, I checked your dataset and I don't think you can make a sentence classifier from it, as there are no sentences in your dataset.
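As a rough sanity check (a sketch, assuming the df, cv, X_tr, y_test and y_pred variables from the snippet above), it can help to look at how much data and vocabulary the model actually sees, and to compare predictions against the true labels directly:

# rough diagnostics, assuming df, cv, X_tr, y_test, y_pred from the snippet above
print(df.shape)                      # with only ~9 rows there is very little to learn from
print(df['class'].value_counts())    # check that every class is represented

print(X_tr.shape)                    # (n_train_docs, vocabulary_size)
print(len(cv.vocabulary_))           # how many distinct tokens the vectorizer found

for true, pred in zip(y_test, y_pred):
    print(true, pred)                # see where the predictions go wrong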
I'm using the dataset from Kaggle - Cardiovascular Disease Dataset.
The model has been trained, and what I want to do is label a single input (a row of 13 values) supplied dynamically.
The shape of the dataset is 13 features + 1 target, 66k rows.
#prepare dataset for train and test
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

dfCardio = load_csv("cleanCardio.csv")
y = dfCardio['cardio']
x = dfCardio.drop('cardio', axis=1, inplace=False)
model = knn = KNeighborsClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model.fit(x_train, y_train)
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The model is trained; what I want to do is predict the label of this single row:
['69','1','151','22','37','0','65','140','90','2','1','0','0','1']
and have it return 0 or 1 for the target.
So I wrote this code:
import numpy as np
import pandas as pd
single = np.array(['69','1','151','22','37','0','65','140','90','2','1','0','0','1'])
singledf = pd.DataFrame(single)
final=singledf.transpose()
prediction = model.predict(final)
print(prediction)
but it gives the error: query data dimension must match training data dimension.
How can I fix the labelling for a single row? Why am I not able to predict a single case?
Each instance in your dataset has 13 features and 1 label.
x = dfCardio.drop('cardio',axis = 1, inplace=False)
This line in the code removes what I assume is the label column from the data, leaving only the (13) feature columns.
The feature vector on which you are trying to predict, is 14 elements long. You can only predict on feature vectors that are 13 elements long because that is what the model was trained on.
If you are looking for a quick, working solution, you can use this:
import numpy as np
import pandas as pd
single = np.array([['69','1','151','22','37','0','65','140','90','2','1','0','0']])
prediction = model.predict(single)
print(prediction)
I disagree with the others, this is not a problem with including the target.
I had this problem too. The only way I got around it was to input part of x.
So:
x2 = x.iloc[0:3]
then give the first row a new value:
x2.iloc[0] = single
ypred = model.predict(x2)
and just look at ypred[0].
Or try a dataframe with 2 values
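Another option (a sketch, assuming x is the 13-column training feature frame and model is the fitted KNeighborsClassifier from above) is to build the single row as a one-row DataFrame with the same columns as the training data, so the dimensions line up by construction:

import pandas as pd

# one row with the 13 numeric feature values, in the same order as the training columns
row = [69, 1, 151, 22, 37, 0, 65, 140, 90, 2, 1, 0, 0]
single_df = pd.DataFrame([row], columns=x.columns)

prediction = model.predict(single_df)
print(prediction)  # array containing a single 0 or 1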
I want to start developing an application using machine learning. I want to classify text as spam or not spam. I have 2 files, spam.txt and ham.txt, each containing thousands of sentences. Say I want to use a classifier such as LogisticRegression.
For example, as I saw on the Internet, to fit my model I need to do something like this:
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
So here comes my question: what are X_train and y_train actually? How can I obtain them from my sentences? I searched on the Internet but did not understand; this is my last resort, as I am pretty new to this topic. Thank you!
According to the documentation (see here):
X corresponds to your float feature matrix of shape (n_samples, n_features) (aka. the design matrix of your training set)
y is the float target vector of shape (n_samples,) (the label vector). In your case, label 0 could correspond to a spam example, and 1 to a ham one
The question is now about how to get a float feature matrix from text data.
A common scheme is to use a tf-idf vectorisation (more on this here), which is available in sklearn.
The vectorisation can be chained with the logistic regression via the Pipeline API of sklearn.
This is roughly how the code would look:
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
# prepare string data
with open('spam.txt', 'r') as f:
    spam = f.readlines()
with open('ham.txt', 'r') as f:
    ham = f.readlines()
text_train = list(chain(spam, ham))
# prepare labels
labels_train = np.concatenate((np.zeros(len(spam)),np.ones(len(ham))))
# build pipeline
vectorizer = TfidfVectorizer()
regressor = LogisticRegression()
pipeline = Pipeline([('vectorizer', vectorizer), ('regressor', regressor)])
# fit pipeline
pipeline.fit(text_train, labels_train)
# test predict
test = ["Is this spam or ham?"]
pipeline.predict(test)  # returns 0 (spam) or 1 (ham)
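If you want the textual label back rather than a number, a small follow-up sketch (using the same 0 = spam, 1 = ham convention with which labels_train was built above) is to map the prediction through a dictionary:

# 0 was assigned to spam and 1 to ham when labels_train was built above
label_names = {0: "spam", 1: "ham"}
pred = pipeline.predict(["Is this spam or ham?"])
print([label_names[int(p)] for p in pred])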
import pandas as pd
import numpy
from sklearn import cross_validation  # renamed to sklearn.model_selection in newer scikit-learn versions
from sklearn.naive_bayes import GaussianNB
fi = "df.csv"
# Open the file for reading and read in data
file_handler = open(fi, "r")
data = pd.read_csv(file_handler, sep=",")
file_handler.close()
# split the data into training and test data
train, test = cross_validation.train_test_split(data,test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
train_features = train.ix[:,0:127]  # .ix is removed in newer pandas; .iloc works the same here
train_label = train.iloc[:,127]
test_features = test.ix[:,0:127]
test_label = test.iloc[:,127]
naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)
print "test_data\n",test_data["p_malw"]
print "Accuracy:", naive_b.score(test_features,test_label)
I have written this code to accept input from a CSV file with 128 columns, where 127 columns are features and the 128th column is the class label.
I want to predict the probability that each sample belongs to each class (there are 5 classes, 1-5), print it in the form of a matrix, and determine the class of the sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.
GaussianNB.predict_proba returns the probabilities of the samples for each class in the model. In your case, it should return a result with five columns and the same number of rows as your test data. You can verify which column corresponds to which class using naive_b.classes_, so it is not clear why you are saying that this is not the desired output. Your problem probably comes from the fact that you are assigning the output of predict_proba to a single data frame column. Try:
pred_prob = naive_b.predict_proba(test_features)
instead of
test_data["p_malw"] = naive_b.predict_proba(test_features)
and verify its shape using pred_prob.shape. The second dimension should be 5.
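For example, here is a small sketch (assuming the naive_b and test_features variables from the question) that lines each probability column up with its class label:

import pandas as pd

pred_prob = naive_b.predict_proba(test_features)
print(pred_prob.shape)  # (n_test_samples, 5) for five classes

# column i of pred_prob corresponds to naive_b.classes_[i]
prob_df = pd.DataFrame(pred_prob, columns=naive_b.classes_, index=test_features.index)
print(prob_df.head())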
If you want the predicted label for each sample, you can use the predict method, followed by a confusion matrix to see how many labels have been predicted correctly.
from sklearn.metrics import confusion_matrix
naive_b.fit(train_features, train_label)
pred_label = naive_b.predict(test_features)
confusion_m = confusion_matrix(test_label, pred_label)
confusion_m
Here is some useful reading.
sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba
sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
I converted two columns of a pandas dataframe into numpy arrays to use as the features and labels for a machine learning problem.
Code:
from sklearn.cross_validation import ShuffleSplit  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier

train_index, test_index = next(iter(ShuffleSplit(len(labels), train_size=0.2, test_size=0.80, random_state=42)))
features_train, features_test = X[train_index], X[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features)
print pred
Features is currently an array of frequency counts (I used a CountVectorizer earlier to fit and transform my original pandas dataframe column). I have the full list of predicted labels stored as pred, but I would like the corresponding feature row for each label, so that I can return the list of labels to my pandas dataframe.
The ordering of the predictions is the same as that of the data you passed in (and, as @Ulf pointed out, you are using the term "feature" incorrectly here: a feature is a column of your matrix, a particular thing you are counting with the CountVectorizer; the rows are observations, samples, data points, and those are what you currently call features). So, to see the sample-label pairs, you can simply zip them together:
pred = clf.predict(features)
for sample, label in zip(features, pred):
print sample, label
If you actually want to recover what each column means, your CountVectorizer is your guy. Somewhere in your code you created it
vectorizer = CountVectorizer( ... )
and later used it
... = vectorizer.fit_transform( ... )
now you can use it to transform your samples back through
pred = clf.predict(features)
for sample, label in zip(features, pred):
print vectorizer.inverse_transform(np.array([sample])), label
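If the end goal is simply to attach the predicted labels back to the original pandas dataframe, and the feature matrix was built from that dataframe in row order, one possible sketch (assuming the original dataframe is called df, which is not shown in the question) is:

# assumes `df` is the original dataframe and `features` was built from it in row order
pred_all = clf.predict(features)       # one prediction per row of `features`
df['predicted_label'] = pred_all       # aligns by position with the original rows
print df[['predicted_label']].head()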