I'm quite new to machine learning techniques, and I'm having trouble following some of the scikit-learn documentation and other Stack Overflow posts. I'm trying to create a simple model from a bunch of medical data that will help me predict which of three classes a patient could fall into.
I load the data via pandas, convert all the objects to integers (Male = 0, Female=1 for example), and run the following code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import ExtraTreesClassifier
# Upload data file with all integers:
data = pd.read_csv('datafile.csv')
y = data["Target"]
features = list(data.columns[:-1]) # Last column being the target data
x = data[features]
ydata = label_binarize(y, classes=[0, 1, 2])
n_classes = ydata.shape[1]
X_train, X_test, y_train, y_test = train_test_split(x, ydata, test_size=.5)
model2 = ExtraTreesClassifier()
model2.fit(X_train, y_train)
out = model2.predict(X_test)
print(np.min(out), np.max(out))
The predicted values of out range between 0.0 and 1.0, but the classes I am trying to predict are 0,1, and 2. What am I missing?
That's normal behaviour in scikit-learn.
There are two approaches possible:
A: You use label_binarize
Binarizing transforms y from shape [n_samples] to shape [n_samples, n_classes]: one dimension is added, and each integer class label becomes a row of binary indicator values.
Because fit receives this 2-D target, classifier.predict() also returns results of shape [n_predict_samples, n_classes], with 0 and 1 as the only values. That's what you observe!
Example output: [[0 0 0 1], [1 0 0 0], [0 1 0 0]] = predictions for classes 3, 0, 1
B: You skip label_binarize (scikit-learn handles multi-class targets automatically)
Without binarizing (assuming your data uses integer markers for classes), y has shape [n_samples].
Because fit receives this 1-D target, classifier.predict() returns results of shape [n_predict_samples], possibly with values other than 0 and 1.
Example output corresponding to the example above: [3 0 1]
Both outputs are mentioned in the docs here:
predict(X)
Returns:
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
Remark: the above behaviour should be valid for most/all classifiers! (not only ExtraTreesClassifier)
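A minimal sketch of the two approaches side by side may make the difference concrete. The data here is random and purely illustrative (it is an assumption standing in for the medical data file), and the variable names model_a, model_b, out_a and out_b are hypothetical:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for the real data: 200 samples, 5 integer features, 3 classes.
rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(200, 5))
y = rng.randint(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Approach A: binarized target -> predict() returns one 0/1 column per class.
y_train_bin = label_binarize(y_train, classes=[0, 1, 2])
model_a = ExtraTreesClassifier(n_estimators=50, random_state=0)
model_a.fit(X_train, y_train_bin)
out_a = model_a.predict(X_test)
print(out_a.shape, np.unique(out_a))

# Approach B: plain integer target -> predict() returns the class labels directly.
model_b = ExtraTreesClassifier(n_estimators=50, random_state=0)
model_b.fit(X_train, y_train)
out_b = model_b.predict(X_test)
print(out_b.shape, np.unique(out_b))
```

If the goal is simply to get 0, 1 or 2 back from predict(), approach B is the shorter route.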
I'm trying to develop a machine learning algorithm using LinearSVC and another one using Convolutional Neural Networks to classify DNA sequences.
I've had to one hot encode the DNA sequences and then I stored the resulting arrays for each sequence in a list.
But when I do the train-test split step I wasn't able to use it.
My DNA sequences are like this (not my real dataset, which is way bigger, just to exemplify. All the sequences are in the file 'seqs_for_test.fasta'):
>TE_seq1
CCATAAACTATCTAAATAAGCACTTTTCTGGCTCTCTGGCCCCCCTTCTTCTTTTTGGGAAGGTGACAG
AGGGTAAAAGGGCTCTCTGCCGTGCGAGGCTCCTCACAGACACACAGCAAGAAAGAAGCGCCGCGCAGCA
>TE_seq2
GATAGCCCCTCTCCCAGCCCCAGTCTGATCCCTAACCCTAACTCCACGGCTCCTGTCTCTACCCCCGTCT
CTTTCTTCTTGTACCCTAGTCCCCCAGATCATTAGCTCCCTGCTCGGGCCCAGGGTTTTAAGAGAAGCCC
>TE_seq3
TGACTCAAGTCATGCTACCCAGCCCCGTCTTCTTAAAAATGAGACATGTTGAGACACCCTGCTTTTCGCC
TACAAACACATCCATTCTCTATACTTAGTCTTATTTAAATTCTATCCTCTGTATGTCTAGTCCTGGGGGT
>RD_seq4
TGCTCGCCCCCCAGGAAGTGCAGAGACCGCCTGGGTGTGACTGTTTTTAGGCCTAACAAAGGCACAGAAA
CACCCGTGCGGTCTCTGTATCCCCTGGAGGTATTTCTCCCCATTAGTTTGCTTGACACTAAGTTTTTAAA
>RD_seq5
TAAAAAAAGCTTATTAAGTCCCTAGAACCTGGGACCTATCTACCCAAGTTTTAAAACCTTACTTTTAAGG
CTACATTTTTTTATTTTGACTGTTTTACCATAAGGTCACATATAGGAAACCCCCACTGTCCTAATAAAAA
>RD_seq6
CTAATCTCCTGTTGGCTGACTTACATCAGTTTGGGAAGTTGTTCATGATGACTCTGCGACGATCAAGAAG
GACCAGGACTCTCCCTGGACACCTCAGGGACTTCTTGCTGGAGGGCACCATACATCAGTTTGCCAGCAAA
Here is my code for LinearSVC:
import pandas as pd
import numpy as np
from numpy import array
from numpy import argmax
from Bio import SeqIO
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
with open('../fasta/seqs_for_test.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    sequences = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        sequences.append(seq_record.seq.lower())
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(sequences, name='sequence')
# Gathering Series into a pandas DataFrame and rename index as ID column
fasta_frame = pd.DataFrame(dict(ID=s1, sequence=s2)).set_index(['ID'])
fasta_frame
label_serie = pd.Series()
fasta_frame.insert(1, "label", label_serie)
# Transposable element (TE) == 0; Random (RD) == 1.
fasta_frame.loc[fasta_frame.index.str.contains(r'TE_'),'label'] = 0
fasta_frame.loc[fasta_frame.index.str.contains(r'RD_'),'label'] = 1
fasta_frame
# empty list to store ohe array sequences
res_arr = []
for index, row in fasta_frame['sequence'].items():
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(row)
    # print(integer_encoded)
    # binary encode
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    # print(index)
    # print(onehot_encoded)
    # append ohe arrays
    res_arr.append(onehot_encoded)
y = fasta_frame['label']
# y
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(res_arr,
y,
test_size = 0.20,
random_state=42)
# print(x_train)
# print(y_train)
# print(x_test)
# print(y_test)
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
modelo = LinearSVC()
modelo.fit(x_train, y_train)
previsoes = modelo.predict(x_test)
acuracia = accuracy_score(y_test, previsoes) * 100
print("accuracy was %.2f%%" % acuracia)
I've tried to reshape, np.vstack and other ways but got no success.
How can I use the list of arrays as my training set?
Error message:
ValueError: Found array with dim 3. Estimator expected <= 2.
Your problem is that the SVM expects, for each training example, a fixed number n of one-dimensional features, and then tries to find a separating hyperplane in this n-dimensional feature space. If you one-hot encode DNA sequences of length m, you actually get m features of dimension 4 each. The LinearSVC implementation is not adapted to this situation (I am not sure whether SVMs are applicable in general to features that are not one-dimensional; what would a space spanned by arbitrary-dimensional features look like?).
If you want to use sklearn's SVM implementations, you have to find a workaround that "formally" reduces the dimension of your features to one. One possibility is to flatten your sequence representation: starting from one DNA sequence of dimension [140, 4], you create a flattened representation of dimension [560, 1] by concatenating the one-hot vectors of the individual positions.
Maybe an example is illustrative:
Suppose the example DNA sequence "AC" is one-hot encoded to [[1, 0, 0, 0], [0, 1, 0, 0]]. You then flatten the input to [1, 0, 0, 0, 0, 1, 0, 0], so that you can train an SVM on DNA sequences of length 2.
Why does this work?
The SVM will have 8 weights (ignoring bias terms). The first weight captures the importance of adenine occurring as the first nucleotide. The second weight captures the importance of cytosine as the first nucleotide. The fifth weight captures the importance of adenine occurring as the second nucleotide, and so forth. Now, if the "AC" DNA sequence comes along and we want to classify it, all weights are ignored except those corresponding to adenine as the first nucleotide and cytosine as the second nucleotide.
If your DNA-sequences are not all of a fixed length, you will have to zero-pad them. This means appending to their flattened sequence representation zeros until they are as long as the longest sequence in your dataset.
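The flattening and zero-padding described above could be sketched as follows. This is only an illustration under stated assumptions: the helper one_hot_flat, the fixed alphabet order "acgt", and the toy sequences and labels are all hypothetical and not taken from the original dataset. A fixed alphabet is used instead of fitting an encoder per sequence so that each column always refers to the same nucleotide.

```python
import numpy as np
from sklearn.svm import LinearSVC

ALPHABET = "acgt"  # assumed fixed nucleotide order: a, c, g, t


def one_hot_flat(seq, max_len):
    """One-hot encode a DNA string and flatten it to a vector of length 4 * max_len.

    Positions beyond len(seq) stay all-zero, which is the zero-padding.
    Characters outside ALPHABET are also left as all-zero rows.
    """
    encoded = np.zeros((max_len, len(ALPHABET)))
    for pos, base in enumerate(seq.lower()):
        col = ALPHABET.find(base)
        if col >= 0:
            encoded[pos, col] = 1.0
    return encoded.ravel()


# Toy sequences of unequal length with made-up labels (0 = TE, 1 = RD).
sequences = ["ACGTAC", "TTGACA", "ACGT", "GGCATT", "CCATAA", "TGAC"]
labels = [0, 0, 0, 1, 1, 1]

max_len = max(len(s) for s in sequences)
X = np.array([one_hot_flat(s, max_len) for s in sequences])
print(X.shape)  # (6, 24): a 2-D array, which is what LinearSVC expects

model = LinearSVC()
model.fit(X, labels)
print(model.predict(X[:2]))
```

With this representation, train_test_split and fit receive a single 2-D array rather than a list of 2-D arrays, which avoids the "Found array with dim 3" error.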
I'm trying to implement a custom kernel, specifically the exponential Chi-Squared kernel, to pass as the kernel parameter to sklearn's SVM, but when I run it the following error is raised:
ValueError: X.shape[0] should be equal to X.shape[1]
I read about the broadcasting that numpy's functions perform to speed up computation, but I can't work out how to fix the error.
The code is:
import numpy as np
from sklearn import svm, datasets
# import the iris dataset (http://en.wikipedia.org/wiki/Iris_flower_data_set)
iris = datasets.load_iris()
train_features = iris.data[:, :2] # Here we only use the first two features.
train_labels = iris.target
def my_kernel(x, y):
    gamma = 1
    return np.exp(-gamma * np.divide((x - y) ** 2, x + y))
classifier = svm.SVC(kernel=my_kernel)
classifier = classifier.fit(train_features, train_labels)
print("Train Accuracy : " + str(classifier.score(train_features, train_labels)))
Any help?
I believe the Chi-Squared kernel is already implemented for you (from sklearn.metrics.pairwise import chi2_kernel).
Like so:
from functools import partial
from sklearn import svm, datasets
from sklearn.metrics.pairwise import chi2_kernel
# import the iris dataset (http://en.wikipedia.org/wiki/Iris_flower_data_set)
iris = datasets.load_iris()
train_features = iris.data[:, :2] # Here we only use the first two features.
train_labels = iris.target
my_chi2_kernel = partial(chi2_kernel, gamma=1)
classifier = svm.SVC(kernel=my_chi2_kernel)
classifier = classifier.fit(train_features, train_labels)
print("Train Accuracy : " + str(classifier.score(train_features, train_labels)))
====================
EDIT:
So turns out the question is really about how one can implement the chi square kernel. My shot at this would be:-
def my_chi2_kernel(X):
    gamma = 1
    nom = np.power(X[:, np.newaxis] - X, 2)
    denom = X[:, np.newaxis] + X
    # NOTE: We need to fix some entries, since division by 0 is an issue here.
    # So we take all the index of would be 0 denominator, and fix them.
    zero_denom_idx = denom == 0
    nom[zero_denom_idx] = 0
    denom[zero_denom_idx] = 1
    return np.exp(-gamma * np.sum(nom / denom, axis=len(X.shape)))
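So in essence, x - y and x + y in the OP's attempt are wrong, since they are element-wise operations on the two input matrices rather than pairwise subtraction and addition between every row of x and every row of y.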
Curiously, the custom version seems to be faster than sklearn's cythonised version (at least for small dataset?)
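When a callable is passed as kernel, SVC calls it with two arrays and expects a Gram matrix of shape (n_samples_X, n_samples_Y) back, so a version usable inside SVC needs a second argument. Below is a hedged sketch of such a two-argument variant; the name exp_chi2_kernel is hypothetical, and the code assumes non-negative features (as chi2_kernel itself requires). It is checked against sklearn's chi2_kernel rather than presented as a replacement for it.

```python
import numpy as np
from sklearn import svm, datasets
from sklearn.metrics.pairwise import chi2_kernel


def exp_chi2_kernel(X, Y, gamma=1.0):
    """Pairwise exponential chi-squared kernel between the rows of X and the rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Broadcast to shape (n_X, n_Y, n_features): every row of X against every row of Y.
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    total = X[:, np.newaxis, :] + Y[np.newaxis, :, :]
    # Terms whose denominator is zero contribute nothing to the sum.
    safe_total = np.where(total == 0, 1.0, total)
    terms = np.where(total == 0, 0.0, diff ** 2 / safe_total)
    return np.exp(-gamma * terms.sum(axis=2))


iris = datasets.load_iris()
train_features = iris.data[:, :2]  # first two features, as in the question
train_labels = iris.target

# The custom kernel should agree with sklearn's implementation.
K_custom = exp_chi2_kernel(train_features, train_features, gamma=1.0)
K_sklearn = chi2_kernel(train_features, train_features, gamma=1.0)
print(np.allclose(K_custom, K_sklearn))

classifier = svm.SVC(kernel=exp_chi2_kernel)
classifier.fit(train_features, train_labels)
print("Train Accuracy : " + str(classifier.score(train_features, train_labels)))
```

The intermediate arrays have shape (n_X, n_Y, n_features), so memory use grows quickly with the number of samples; for larger data the built-in chi2_kernel is probably the safer choice.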
I'm following a book about machine learning in python and I just don't understand this code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection
from utilities import visualize_classifier
# Input file containing data
input_file = 'data_multivar_nb.txt'
# Load data from input file
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
# Create Naive Bayes classifier
classifier = GaussianNB()
# Train the classifier
classifier.fit(X, y)
# Predict the values for training data
y_pred = classifier.predict(X)
# Compute accuracy
accuracy = 100.0 * (y == y_pred).sum() / X.shape[0]
print("Accuracy of Naive Bayes classifier =", round(accuracy, 2), "%")
I just have a few questions:
What does data[:, :-1] and data[:, -1] do?
The input file is in the form of:
2.18,0.57,0
4.13,5.12,1
9.87,1.95,2
4.02,-0.8,3
1.18,1.03,0
4.59,5.74,1
How does the computing accuracy part work?
What is X.shape[0]?
Lastly how do I use the classifier to predict the y for new values?
When you index a numpy array you use square brackets similar to a list.
my_list[-1] returns the last item in the list.
For example.
my_list = [1, 2, 3, 4]
my_list[-1]
4
If you're familiar with list indexing then you will know what a slice is.
my_list[:-1] returns all items from the beginning to the last-but-one.
my_list[:-1]
[1, 2, 3]
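In your code, data[:, :-1] is simply indexing with slices in 2 dimensions: the first slice (:) selects all rows, and the second (:-1) selects every column except the last. Likewise, data[:, -1] selects the last column of every row. Look up the documentation on numpy arrays for more information. Understanding ndarrays is a prerequisite for using sklearn.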
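The remaining questions can be illustrated with a short sketch. It uses the six rows shown in the question typed in directly instead of reading data_multivar_nb.txt; the array new_points is a hypothetical pair of unseen samples added only to show the predict call.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# The six sample rows from the question: two features followed by the class label.
data = np.array([
    [2.18, 0.57, 0],
    [4.13, 5.12, 1],
    [9.87, 1.95, 2],
    [4.02, -0.8, 3],
    [1.18, 1.03, 0],
    [4.59, 5.74, 1],
])

X = data[:, :-1]  # all rows, every column except the last -> the features
y = data[:, -1]   # all rows, only the last column -> the labels
print(X.shape)    # (6, 2): X.shape[0] is the number of rows (samples)

classifier = GaussianNB()
classifier.fit(X, y)
y_pred = classifier.predict(X)

# (y == y_pred) is a boolean array; .sum() counts the True entries,
# i.e. the correct predictions. Dividing by the number of samples
# and multiplying by 100 gives a percentage.
accuracy = 100.0 * (y == y_pred).sum() / X.shape[0]
print("Accuracy of Naive Bayes classifier =", round(accuracy, 2), "%")

# Predicting new values: pass a 2-D array with one row per new sample
# and the same number of columns as the training features.
new_points = np.array([[4.0, 5.0], [1.5, 0.9]])  # hypothetical new samples
print(classifier.predict(new_points))
```

Note that this accuracy is measured on the same data the classifier was trained on, so it says little about how well the model handles unseen data.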
I'm working on what I thought was a fairly simple machine learning problem.
In this problem the y (label) I want to classify is a multi-class value. In this dataset I have 6 possible choices.
I've been using the preprocessing.LabelBinarizer() function to pivot my y set to an array of ones and zeros in the hope that this would be sufficient (e.g. [0 0 0 0 0 1]).
This code below fails on the model.fit() due to a ValueError: Found arrays with inconsistent numbers of samples: [ 217 1302] || 1302 is 217*6 BTW
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
It seems that the binarizer returns results that appear like 6 columns to the algorithm instead of 1 column containing an array of 6 digits.
I've tried to force it into an array model using the code below but then the fit function bails for another reason: ValueError: Unknown label type array([array[0,1,0,0,0]), arrary([0,1,0,0...])
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y_list = []
for x in api_y:
    item = {'gear': np.array(x)}
    y_list.append(item)
y = pd.DataFrame(y_list)
print("after changing to binary classes array y is "+repr(y.shape))
y = np.ravel(y)
I also tried the sklearn_pandas.DataFrameMapper to no avail as it also created 6 distinct fields vs. an array of 6 values represented as one field.
Any help or suggestions would be appreciated...full version of what I thought was right posted here for clarity:
#!/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
import sklearn_pandas
#
# load training data taken from 2 years of strava rides
df = pd.read_csv("gear_train.csv", index_col=0, parse_dates=True)
#
# Prepare data for logistic regression
#
y, X = dmatrices('gear ~ distance + moving_time + total_elevation_gain + average_speed + max_speed + average_cadence + has_heartrate + device_watts', df, return_type="dataframe")
#
# Fix up y to be a flattened array of 1 column (binary array?)
#
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
#
# run the logistic regression
#
model = LogisticRegression()
model = model.fit(X, y)
score = model.score(X, y)
#
# evaluate the model by splitting into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
predicted = model2.predict(X_test)
print("predicted="+repr(lb.inverse_transform(predicted)))
print(metrics.classification_report(y_test, predicted))
#
# do a 10-fold CV test to see if this model holds up
#
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores.mean())
The root cause of my problem was that the y field contained string values instead of numeric ones, for example b12345 as a key instead of 12345. Once I changed the code to use LabelEncoder to encode the labels (and to decode the predictions afterwards), it worked like a champ.
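A minimal sketch of that fix, assuming string gear identifiers like the b12345 mentioned above. The feature values and the identifiers other than b12345 are made up for illustration; only LabelEncoder, fit_transform and inverse_transform are the actual scikit-learn API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: string gear ids as labels and two numeric features per ride.
gear = np.array(["b12345", "b67890", "b24680", "b12345", "b67890", "b24680"])
X = np.array([
    [10.0, 1.0],
    [40.0, 3.0],
    [80.0, 6.0],
    [12.0, 1.2],
    [42.0, 2.9],
    [79.0, 6.3],
])

# Encode each distinct string as one integer: y stays 1-D, one value per sample,
# instead of the six indicator columns that LabelBinarizer produces.
encoder = LabelEncoder()
y = encoder.fit_transform(gear)
print(y.shape)  # (6,)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
predicted = model.predict(X)

# Decode the integer predictions back to the original gear ids.
print(encoder.inverse_transform(predicted))
```

Because y keeps one entry per row of X, the "inconsistent numbers of samples" error from the raveled binarizer output does not arise.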
I know the SVM (specifically linear SVC) has an optional parameter, probability=True, that you can set when you instantiate the model, after which model.predict_proba() is supposed to give the probability of each of its predictions along with the label (1 or 0). However, I keep getting the numpy error "use all() on an 1 dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().
Documentation example works fine for me setting the flag probability=True. The problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))
You can use CalibratedClassifierCV.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
pred_class = model.predict(X_test)
# predict_vec: the feature rows you want probabilities for, e.g. X_test
probability = model.predict_proba(predict_vec)
predict_proba returns an array with one row per sample and one column per class, holding the predicted probability of each class.
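For reference, here is a self-contained sketch of the calibration approach. LinearSVC itself has no predict_proba method, which is why the wrapper is needed. The dataset is generated with make_classification purely for illustration and is an assumption, not the asker's data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data standing in for the real features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Wrap LinearSVC so that its decision scores are calibrated into probabilities.
model = CalibratedClassifierCV(LinearSVC())
model.fit(X_train, y_train)

pred_class = model.predict(X_test)         # predicted labels, shape (50,)
probability = model.predict_proba(X_test)  # one column per class, shape (50, 2)

print(pred_class[:5])
print(probability[:5])
```

Both predict and predict_proba take feature rows as input (X_test here), never the labels.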