TFLearn: Create a Train Test set using Only tflearn - python

I'm using my own dataset and I want to build a deep neural network using tflearn.
This is part of my code:
import tflearn
from tflearn.data_utils import load_csv
#Load the CSV File
X, Y = load_csv('data.csv')
#Split Data in train and Test with tflearn
How can I write a function in TFLearn that splits X and Y into train_X, test_X, train_Y and test_Y?
I know how to do this with numpy and other libraries, but I would like to do it using tflearn.

In the fit method of the tflearn.DNN model (http://tflearn.org/models/dnn/), you can set the validation_set option to a float less than 1, and the model will then automatically split your input into a training set and a validation set during training.
Example
import tflearn
from tflearn.data_utils import load_csv
#Load the CSV File
X, Y = load_csv('data.csv')
# Define some network
network = ...
# Training
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit(X, Y, n_epoch=20, validation_set=0.1) # will use 10% for validation
This will create a validation set while training, which is different from a test set. If you just want a train and test set, I recommend taking a look at the train_test_split function from sklearn, which also can split your data for you.

I think the answer from Nicki is the simplest solution.
Another easy solution is to use sklearn's train_test_split():
from sklearn.model_selection import train_test_split
data, target = load_raw_data(data_size) # own method, data := ['hello','...'] target := [1 0 -1] label
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)
Or the numpy version:
import numpy as np
texts, target = load_raw_data(data_size) # own method, texts := ['hello','...'] target := [1 0 -1] label
train_indices = np.random.choice(len(target), round(0.8 * len(target)), replace=False)
test_indices = np.array(list(set(range(len(target))) - set(train_indices)))
x_train = [x for ix, x in enumerate(texts) if ix in train_indices]
x_test = [x for ix, x in enumerate(texts) if ix in test_indices]
y_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
y_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
So it's your choice, happy coding :)
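The list comprehensions in the NumPy version above test `ix in train_indices` for every element, which gets slow on large datasets. As a minimal sketch of the same 80/20 split using only the standard library, one could shuffle the indices once and reuse them for both lists. The function name `split_lists` and its `seed` parameter are illustrative and not part of any library.

```python
import random

def split_lists(data, target, train_ratio=0.8, seed=None):
    """Shuffle the indices once, then split data and target with the same indices."""
    if len(data) != len(target):
        raise ValueError("data and target must have the same length")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    cut = round(train_ratio * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [data[i] for i in train_idx]
    x_test = [data[i] for i in test_idx]
    y_train = [target[i] for i in train_idx]
    y_test = [target[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data: each text carries its own label so the pairing is easy to check.
texts = ["text %d" % i for i in range(10)]
labels = list(range(10))
x_train, x_test, y_train, y_test = split_lists(texts, labels, seed=0)
print(len(x_train), len(x_test))
```

Because both lists are indexed with the same shuffled indices, every text keeps its label.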

Related

Scikit-Learn Numpy - Use One Hot Encoder on only string or categorical columns in dataset

I have a simple logistic regression model below that uses one hot encoding to transform every X value. My question is: how can I modify the code below to use one hot encoding for every column except one (e.g. a single integer column)?
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
# one-hot encode input variables that are objects
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
I tried only feeding in 8 columns instead of 9 to OHE but got the error:
ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 9 features and the X has 8 features.
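One possible approach (a sketch, not the only way) is to let scikit-learn's ColumnTransformer apply OneHotEncoder to selected columns and pass the remaining column through unchanged, so the encoder is always fitted and applied on the same set of columns. The toy arrays and the choice of column indices below are made up for illustration and are not taken from the breast-cancer dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data: columns 0 and 1 are categorical, column 2 is an integer feature.
X_train = np.array([["a", "x", 1], ["b", "y", 2], ["a", "y", 3]], dtype=object)
X_test = np.array([["b", "x", 4]], dtype=object)

categorical_columns = [0, 1]  # assumed indices of the columns to encode
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_columns)],
    remainder="passthrough",  # leave every other column untouched
    sparse_threshold=0,       # always return a dense array
)

X_train_enc = transformer.fit_transform(X_train)
X_test_enc = transformer.transform(X_test)
print(X_train_enc.shape, X_test_enc.shape)
```

The encoded columns come first and the passed-through column is appended at the end, so the train and test matrices always have the same number of features.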

SMOTE - could not convert string to float

I think I'm missing something in the code below.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Split into training and test sets
# Testing Count Vectorizer
X = df[['Spam']]
y = df['Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
sm = pd.concat([X_resampled, y_resampled], axis=1)
as I'm getting the error
ValueError: could not convert string to float:
---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
An example of the data:
Spam                                      Value
Your microsoft account was compromised    1
Manchester United lost against PSG        0
I like cooking                            0
I think I need to transform both the train and test sets to fix the error, but I don't know how to apply the transformation to both. I've tried some examples from Google, but they haven't fixed the issue.
Convert the text data to numeric features before applying SMOTE, like below:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X_train.values.ravel())
X_train=vectorizer.transform(X_train.values.ravel())
X_test=vectorizer.transform(X_test.values.ravel())
X_train=X_train.toarray()
X_test=X_test.toarray()
and then add your SMOTE code:
X_train = pd.DataFrame(X_train)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
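To see why the conversion is needed: SMOTE builds synthetic minority samples by interpolating between a sample and one of its nearest neighbours, which only makes sense for numeric feature vectors. The following is a simplified sketch of a single interpolation step, not imbalanced-learn's actual implementation; the two vectors are made-up word counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class samples, already converted to numeric features
# (for example, word counts produced by CountVectorizer).
sample = np.array([1.0, 0.0, 2.0])
neighbor = np.array([3.0, 0.0, 0.0])

# One synthetic point somewhere on the line segment between the two samples.
lam = rng.uniform(0.0, 1.0)
synthetic = sample + lam * (neighbor - sample)
print(synthetic)

# This arithmetic has no meaning for raw strings, so text has to be
# vectorized before it is handed to SMOTE.
```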
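You can use SMOTENC instead of SMOTE. SMOTENC deals with categorical variables directly, but it expects the dataset to contain numeric features as well, so it suits a mix of categorical and numeric columns rather than a single free-text column.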
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC
Tokenizing your string data before feeding it into SMOTE is an option. You can use any tokenizer; a torch implementation would look something like this:
import torch
from imblearn.over_sampling import SMOTE
# dataset: your already tokenized dataset, yielding 'input_ids' and 'labels'
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)
X, y = [], []
for batch in dataloader:
    input_ids = batch['input_ids']
    labels = batch['labels']
    X.append(input_ids)
    y.append(labels)
# concatenate the batches and convert them to numpy arrays
X_tensor = torch.cat(X, dim=0)
y_tensor = torch.cat(y, dim=0)
X = X_tensor.numpy()
y = y_tensor.numpy()
smote = SMOTE(random_state=42, sampling_strategy=0.6)
X_resampled, y_resampled = smote.fit_resample(X, y)

Shuffle and split 2 numpy arrays so as to maintain their ordering with respect to each other

I have 2 numpy arrays X and Y, with shape X: [4750, 224, 224, 3] and Y: [4750,1].
X is the training dataset and Y is the correct output label for each entry.
I want to split the data into train and test sets to validate my machine learning model. I therefore want to split them randomly such that X and Y keep the correct pairing after the split, i.e. every row of X still has its corresponding label.
How can I achieve the above objective ?
This is how I would do it:
import numpy as np
def split(x, y, train_ratio=0.7):
    x_size = x.shape[0]
    train_size = int(x_size * train_ratio)
    # pick the training indices at random, without repetition
    train_indices = np.random.choice(x_size, size=train_size, replace=False)
    # boolean mask: True for training rows, False for test rows
    mask = np.zeros(x_size, dtype=bool)
    mask[train_indices] = True
    x_train, y_train = x[mask], y[mask]
    x_test, y_test = x[~mask], y[~mask]
    return (x_train, y_train), (x_test, y_test)
I simply choose the required number of indices I need (randomly) for my train set, remaining will be for the test set.
Then use a mask to select the train and test samples.
You can also use scikit-learn's train_test_split to split your data in just 2 lines of code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
sklearn.model_selection.train_test_split is a good choice!
But to craft one of your own:
import numpy as np
def my_train_test_split(X, Y, train_ratio=0.8):
    """return X_train, Y_train, X_test, Y_test"""
    n = X.shape[0]
    split = int(n * train_ratio)
    index = np.arange(n)
    np.random.shuffle(index)
    return X[index[:split]], Y[index[:split]], X[index[split:]], Y[index[split:]]
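A quick way to convince yourself that rows and labels stay aligned is to build toy data in which every entry of X encodes its own label, split it, and compare. The sketch below uses the same permutation idea; the helper name `shuffled_split`, the `seed` parameter and the small array shapes are assumptions chosen for illustration.

```python
import numpy as np

def shuffled_split(X, Y, train_ratio=0.8, seed=0):
    """Split X and Y with one shared random permutation of the row indices."""
    rng = np.random.default_rng(seed)
    index = rng.permutation(X.shape[0])
    split = int(X.shape[0] * train_ratio)
    train_idx, test_idx = index[:split], index[split:]
    return X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]

# Toy data: every "image" in X is filled with the value of its label in Y.
n = 50
Y = np.arange(n).reshape(n, 1)
X = np.zeros((n, 4, 4, 3)) + Y.reshape(n, 1, 1, 1)

X_train, Y_train, X_test, Y_test = shuffled_split(X, Y)

# Each row of X must still carry the value of its own label.
assert np.array_equal(X_train[:, 0, 0, 0], Y_train[:, 0])
assert np.array_equal(X_test[:, 0, 0, 0], Y_test[:, 0])
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
```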

How to operate multidimensional features in SVM or use multidimensional features to train model?

If I have this input:
"a1,b1,c1,d1;A1,B1,C1,D1;α1,β1,γ1,θ1;Label1"
"... ... "
"an,bn,cn,dn;An,Bn,Cn,Dn;αn,βn,γn,θn;Labelx"
Array expression:
[
[[a1,b1,c1,d1],[A1,B1,C1,D1],[α1,β1,γ1,θ1],[Label1]],
... ... ... ...
[[an,bn,cn,dn],[An,Bn,Cn,Dn],[αn,βn,γn,θn],[Labelx]]
]
Instance:
[... ... ... ...
[[58.32,453.65,980.50,540.23],[774.40,428.79,1101.96,719.79],[503.70,624.76,1128.00,1064.26],[1]],
[[0,0,0,0],[871.05,478.17,1109.37,698.36],[868.63,647.56,1189.92,1040.80],[1]],
[[169.34,43.41,324.46,187.96],[50.24,37.84,342.39,515.21],[0,0,0,0],[0]]]
Like this:
There are 3 rectangles, and the label indicates whether they intersect, contain one another, or have some other relationship.
I want to use 3 (or N) features to train a model with an SVM.
So far I have only studied the "python Iris SVM" example code. What should I do?
This is my attempt:
from sklearn import svm
import numpy as np
import matplotlib as mpl
from sklearn.model_selection import train_test_split
def label_type(s):
    it = {b'Situation_1': 0, b'Situation_2': 1, b'Unknown': 2}
    return it[s]
path = 'C:/Users/SEARECLUSE/Desktop/MNIST_DATASET/temp_test.data'
data = np.loadtxt(path, dtype=list, delimiter=';', converters={3:
label_type})
x, y = np.split((data), (3,), axis=1)
x = x[:, :3]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1,
train_size=0.6)
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())
Report Error:
Line: clf.fit(x_train, y_train.ravel())
ValueError: could not convert string to float:
If I try to convert the data:
x, y = np.split(float(data), (3,), axis=1)
Report Error:
Line: x, y = np.split(float(data), (3,), axis=1)
TypeError: only length-1 arrays can be converted to Python scalars
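scikit-learn's SVC expects each sample to be a flat feature vector, i.e. a 2-D input of shape (n_samples, n_features), so it cannot take nested per-rectangle features directly. I suggest you flatten your input features: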
x, y = np.split((data), (3,), axis=1)
x = x[:, :3]
# flatten the features
x = np.reshape(x,(len(x),-1))
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1,
train_size=0.6)
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())
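To make the flattening step concrete, here is a small sketch using the three example rows from the question, assuming they have already been parsed into a numeric array of shape (n_samples, 3, 4). The variable names are illustrative.

```python
import numpy as np

# Three samples, each with 3 rectangles described by 4 coordinates.
samples = np.array([
    [[58.32, 453.65, 980.50, 540.23], [774.40, 428.79, 1101.96, 719.79], [503.70, 624.76, 1128.00, 1064.26]],
    [[0, 0, 0, 0], [871.05, 478.17, 1109.37, 698.36], [868.63, 647.56, 1189.92, 1040.80]],
    [[169.34, 43.41, 324.46, 187.96], [50.24, 37.84, 342.39, 515.21], [0, 0, 0, 0]],
])
labels = np.array([1, 1, 0])

# Flatten each sample into one vector of 3 * 4 = 12 features.
flat = samples.reshape(len(samples), -1)
print(samples.shape, "->", flat.shape)  # (3, 3, 4) -> (3, 12)
```

After this step, `flat` and `labels` have the 2-D and 1-D shapes that train_test_split and SVC.fit expect.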
I have a few questions before I attempt an answer:
Q1. What kind of data are you using to train the SVM model? Is it image data? If so, is it RGB data? From the way you explained your data, it seems you intend to do image classification using an SVM. Correct me if I am wrong.
Assumption
Let's say you have image data. Then please convert it to grayscale, and convert the entire dataset into a numpy array. Check the numpy module to find out how to do that.
Once your data is a numpy array, you can apply your model.
Let me know if that helps.

How to use k-fold cross validation in scikit with naive bayes classifier and NLTK

I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross validation. How can I do it?
Your options are to either set this up yourself or use something like NLTK-Trainer since NLTK doesn't directly support cross-validation for machine learning algorithms.
I'd recommend probably just using another module to do this for you but if you really want to write your own code you could do something like the following.
Supposing you want 10-fold, you would have to partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and do this for each combination of subsets (10).
Assuming your training set is in a list named training, a simple way to accomplish this would be,
num_folds = 10
subset_size = len(training) // num_folds
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save accuracy
# find mean accuracy over all rounds
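As a runnable sketch of the loop above, the placeholder comments can be filled in with a pair of callables for training and prediction. The names `k_fold_accuracy`, `train_majority` and `predict_majority` are hypothetical, and the majority-label "classifier" is only a stand-in so the example runs without NLTK.

```python
from collections import Counter

def k_fold_accuracy(samples, num_folds, train_fn, predict_fn):
    """Mean accuracy over num_folds contiguous folds of (features, label) pairs."""
    subset_size = len(samples) // num_folds
    accuracies = []
    for i in range(num_folds):
        start, stop = i * subset_size, (i + 1) * subset_size
        testing = samples[start:stop]
        training = samples[:start] + samples[stop:]
        model = train_fn(training)
        correct = sum(predict_fn(model, x) == label for x, label in testing)
        accuracies.append(correct / len(testing))
    return sum(accuracies) / len(accuracies)

# Hypothetical stand-in classifier: always predicts the most common training label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def predict_majority(model, features):
    return model

# Toy corpus: one "neg" sample followed by three "pos" samples, repeated.
data = [({"word": i}, "pos" if i % 4 else "neg") for i in range(40)]
mean_accuracy = k_fold_accuracy(data, 10, train_majority, predict_majority)
print(mean_accuracy)
```

With NLTK, `train_fn` could be `nltk.NaiveBayesClassifier.train` and `predict_fn` could call `classifier.classify(features)`. As in the loop above, any samples left over when the data does not divide evenly into folds are only ever used for training.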
Actually, there is no need for the long loop given in the most upvoted answer. The choice of classifier is also irrelevant (it can be any classifier).
Scikit provides cross_val_score, which does all the looping under the hood.
from sklearn.model_selection import KFold, cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = ...  # any scikit-learn classifier
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))
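As a concrete illustration with the current sklearn.model_selection module, the following sketch runs 10-fold cross-validation end to end. The synthetic two-cluster data and the choice of GaussianNB are arbitrary stand-ins, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data: two Gaussian clusters with 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Shuffle before splitting because the labels are sorted by class.
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=k_fold)
print(scores.mean())
```

`scores` holds one accuracy value per fold, so its mean is the cross-validated accuracy.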
I've used both libraries: NLTK for naive Bayes and sklearn for cross-validation, as follows:
import nltk
from sklearn.model_selection import KFold
training_set = nltk.classify.apply_features(extract_features, documents)
cv = KFold(n_splits=10, shuffle=False)
for traincv, testcv in cv.split(list(range(len(training_set)))):
    # traincv and testcv are arrays of row indices for this fold
    train_fold = [training_set[i] for i in traincv]
    test_fold = [training_set[i] for i in testcv]
    classifier = nltk.NaiveBayesClassifier.train(train_fold)
    print('accuracy:', nltk.classify.util.accuracy(classifier, test_fold))
and at the end I calculated the average accuracy
A modification of the second answer that shuffles the data before splitting:
from sklearn.model_selection import KFold
cv = KFold(n_splits=10, shuffle=True, random_state=None)
Inspired by Jared's answer, here is a version using a generator:
def k_fold_generator(X, y, k_fold):
    subset_size = len(X) // k_fold  # integer division, so the result can be used in slices
    for k in range(k_fold):
        X_train = X[:k * subset_size] + X[(k + 1) * subset_size:]
        X_valid = X[k * subset_size:][:subset_size]
        y_train = y[:k * subset_size] + y[(k + 1) * subset_size:]
        y_valid = y[k * subset_size:][:subset_size]
        yield X_train, y_train, X_valid, y_valid
I am assuming that your data set X has N data points (= 4 in the example) and D features (= 2 in the example). The associated N labels are stored in y.
X = [[ 1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 0, 1, 1]
k_fold = 2
for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    # Train using X_train and y_train
    # Evaluate using X_valid and y_valid
    pass
