I think I'm missing something in the code below.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Split into training and test sets
# Testing Count Vectorizer
X = df[['Spam']]
y = df['Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
sm = pd.concat([X_resampled, y_resampled], axis=1)
as I'm getting the error
ValueError: could not convert string to float:
---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
An example of the data:
Spam Value
Your microsoft account was compromised 1
Manchester United lost against PSG 0
I like cooking 0
I'd consider transforming both the train and test sets to fix the issue that's causing the error, but I don't know how to apply the transformation to both. I've tried some examples from Google, but nothing has fixed the issue.
Convert the text data to numeric before applying SMOTE, like below.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# Learn the vocabulary from the training text only
vectorizer.fit(X_train.values.ravel())
X_train = vectorizer.transform(X_train.values.ravel())
X_test = vectorizer.transform(X_test.values.ravel())
# Convert the sparse matrices to dense arrays for the DataFrame step below
X_train = X_train.toarray()
X_test = X_test.toarray()
and then add your SMOTE code:
X_train = pd.DataFrame(X_train)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
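If you also want the concatenated frame from the question after resampling, one way (a sketch; get_feature_names_out needs scikit-learn 1.0+, older versions call it get_feature_names) is:
import numpy as np
import pandas as pd
# Label the resampled columns with the vectorizer's vocabulary
sm = pd.DataFrame(np.asarray(X_resampled), columns=vectorizer.get_feature_names_out())
sm['Value'] = np.asarray(y_resampled)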
You can use SMOTENC instead of SMOTE. SMOTENC deals with categorical variables directly.
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC
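Note that SMOTENC needs at least one continuous feature next to the categorical ones, so it won't run on a frame holding only raw text. A minimal sketch on made-up mixed data:
import pandas as pd
from imblearn.over_sampling import SMOTENC
# Made-up mixed data: one categorical column, one continuous column
X = pd.DataFrame({'colour': ['red', 'red', 'blue', 'red', 'blue', 'red'],
                  'amount': [1.0, 2.0, 3.0, 1.5, 2.5, 0.5]})
y = [0, 0, 0, 0, 1, 1]
# categorical_features takes the indices (or a boolean mask) of the categorical columns
smote_nc = SMOTENC(categorical_features=[0], k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)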
Tokenizing your string data before feeding it into SMOTE is another option. You can use any tokenizer; with torch, the implementation would look something like this:
import torch
from imblearn.over_sampling import SMOTE

# dataset: an already-tokenized torch Dataset yielding dicts with 'input_ids' and 'labels'
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)
X, y = [], []
for batch in dataloader:
    input_ids = batch['input_ids']
    labels = batch['labels']
    X.append(input_ids)
    y.append(labels)
X_tensor = torch.cat(X, dim=0)
y_tensor = torch.cat(y, dim=0)
# SMOTE works on numpy arrays, not tensors
X = X_tensor.numpy()
y = y_tensor.numpy()
smote = SMOTE(random_state=42, sampling_strategy=0.6)
X_resampled, y_resampled = smote.fit_resample(X, y)
Related
I have a simple logistic regression model below that uses one-hot encoding to transform every X value. My question is: how can I modify the code below to use one-hot encoding for every column except one (e.g. the single integer column)?
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=1)
# one-hot encode input variables that are objects
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
I tried only feeding in 8 columns instead of 9 to OHE but got the error:
ValueError: The number of features in X is different to the number of features of the fitted data. The fitted data had 9 features and the X has 8 features.
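One way to do this (a hedged sketch, not from the original post; it assumes the integer column sits at index 8 and is kept numeric rather than cast to str) is ColumnTransformer, which one-hot encodes the chosen columns and passes the remainder through untouched:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode columns 0-7, pass column 8 through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), list(range(8)))],
    remainder='passthrough')
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)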
I want to divide my sample into train/test sets (80/20 respectively), and after that I want to perform StratifiedKFold.
So let's take some data and divide it 80/20 using train_test_split:
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X=df.drop(df.columns[[1]], axis=1)
y=np.array(df[1])
y[y=='M']=0
y[y=='B']=1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Now when I want to see the result of my split, I get an error:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.
I've read about it and it's a common error connected to this function, but I'm not sure how to solve the issue.
I saw that we can perform this on the iris data:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)
And we will see the result. What am I doing differently that makes this function not work?
You need to recast your y as an array of integers:
y = y.astype(int)
I'm not really sure of the internals, but I guess that since y started as an array of strings and its elements were replaced one by one (first where y == 'M', later where y == 'B'), numpy never converts the array itself to an integer dtype.
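For completeness, a sketch of the corrected loop (X is a DataFrame here, so rows are selected positionally with .iloc):
y = y.astype(int)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X.iloc[train].shape, X.iloc[validation].shape)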
When you convert df[1] to zeros and ones, the type is still a string (or object in numpy):
y=np.array(df[1])
y[y=='M']=0
y[y=='B']=1
y.dtype
dtype('O')
You can either do it as:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,df[1])
5
Or:
y = (df[1] == 'B').astype(int)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,y)
When y is something in between (an object-dtype array holding integers), StratifiedKFold gets confused.
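If you want to see what scikit-learn makes of your target, type_of_target spells out the 'unknown' diagnosis (a quick check, not part of the original answer):
from sklearn.utils.multiclass import type_of_target
print(type_of_target(y))              # 'unknown' for the object-dtype array
print(type_of_target(y.astype(int)))  # 'binary' after the cast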
I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas dataframe with columns labeled 'Impression' (text), 'Volume' (floats), and 'Cancer' (label). I've been using only Impression to predict Cancer but would like to use Impression and Volume to predict Cancer instead.
My code previously that ran without issue:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
I've tried a few different ways to add the Volume predictor:
1) Only fit_transform the Impressions
X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
This throws the error
TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.
2) Call fit_transform on both Impressions and Volumes. Same code as above except for fit_transform line:
X_train = vectorizer.fit_transform(X_train)
This of course throws the error:
ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)
I'm pretty sure method #1 is the right way to go but I haven't been able to find any tutorials or solutions for how I can add the float predictor to this text classification model.
Any help would be appreciated!
ColumnTransformer() solves exactly this problem. Instead of manually appending the output of CountVectorizer to the other columns, we can set the remainder param to 'passthrough' in ColumnTransformer.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True, display='diagram')
data = pd.DataFrame({'Impression': ['this is the first text',
'second one goes like this',
'third one is very short',
'This is the final statement'],
'Volume': [123, 1, 2, 123],
'Cancer': [1, 0, 0, 1]})
X_train, X_test, y_train, y_test = train_test_split(
data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)
ct = make_column_transformer(
(CountVectorizer(), 'Impression'), remainder='passthrough')
pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
Use scikit-learn version 0.23.0 or later to see the visual representation of pipeline objects (the display param in set_config).
You can use hstack to combine the two features:
from scipy.sparse import hstack
import numpy as np

X_train = vectorizer.fit_transform(X_train)
# hstack expects a sequence of blocks; reshape Volume into a column
# (its rows must line up with the rows of X_train)
X_train_new = hstack([X_train, np.array(data['Volume'])[:, None]])
Now your new train contain both features. And if I may advice, use tfidfvectorizer instead of countvectorizer since tfidf considers the importance of words in each document/Impresion while countvectorizer only counts number of occurrences of words and hence a word like "THE" will have higher importance than those which really matter to us.
I'm using KNN and wanted to experiment with different normalizers (Normalizer(), MinMaxScaler(), StandardScaler(), etc.).
I have loaded the data into a variable called X:
X = pd.read_csv('C:/Users/rmahesh/documents/parkinson.csv')
After doing some data wrangling, I try and run this code:
from sklearn import preprocessing
from sklearn.decomposition import PCA
T = preprocessing.Normalizer().fit(X)
from sklearn.model_selection import train_test_split
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
from sklearn.svm import SVC
model = SVC()
model = model.fit(T_train, y_train)
score = model.score(T_test, y_test)
print(score)
The specific error code I am getting is this:
TypeError: Singleton array array(Normalizer(copy=True, norm='l2'), dtype=object) cannot be considered a valid collection.
The code in which the error is appearing is this line:
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size=0.3, random_state=7)
Any help would be greatly appreciated!
You're fitting your normalizer and then treating it as an array directly. Replace
T = preprocessing.Normalizer().fit(X)
With
T = preprocessing.Normalizer().fit_transform(X)
So that the actual output of the normalization is used instead. .fit() returns the Normalizer object itself.
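A related caveat (my addition, not part of the answer): Normalizer itself is stateless, but for the MinMaxScaler or StandardScaler mentioned in the question you should fit on the training split only and reuse that fit on the test split, e.g.:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
scaler = StandardScaler()
T_train = scaler.fit_transform(X_train)  # statistics come from the training data only
T_test = scaler.transform(X_test)        # the same statistics are applied to the test data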
I'm using my own dataset and I want to do a Deep Neural Network using tflearn.
This is a part of my code.
import tflearn
from tflearn.data_utils import load_csv
#Load the CSV File
X, Y = load_csv('data.csv')
#Split Data in train and Test with tflearn
How could I make a function in TFLearn to split X, Y and get train_X, test_X, train_Y, test_Y?
I know how to do with numpy and other libraries, but I would like to do using tflearn.
In the fit method of the tflearn.DNN model (http://tflearn.org/models/dnn/), you can set the option validation_set to a float less than 1, and the model will then automatically split your input into a training and validation set while training.
Example
import tflearn
from tflearn.data_utils import load_csv
#Load the CSV File
X, Y = load_csv('data.csv')
# Define some network
network = ...
# Training
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit(X, Y, n_epoch=20, validation_set=0.1) # will use 10% for validation
This will create a validation set while training, which is different from a test set. If you just want a train and test set, I recommend taking a look at the train_test_split function from sklearn, which also can split your data for you.
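A sketch combining the two: split with sklearn first, then pass the held-out pair to fit (validation_set also accepts an (X, Y) tuple):
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.2, random_state=42)
model.fit(train_X, train_Y, n_epoch=20, validation_set=(test_X, test_Y))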
The answer from Nicki is the simplest solution, I think.
But another easy solution is to use sklearn and train_test_split():
from sklearn.model_selection import train_test_split
data, target = load_raw_data(data_size) # own method, data := ['hello','...'] target := [1 0 -1] label
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)
Or the numpy version:
import numpy as np
texts, target = load_raw_data(data_size) # own method, texts := ['hello','...'] target := [1 0 -1] label
train_indices = np.random.choice(len(target), round(0.8 * len(target)), replace=False)
test_indices = np.array(list(set(range(len(target))) - set(train_indices)))
# Use sets for fast membership tests when filtering
train_set, test_set = set(train_indices), set(test_indices)
x_train = [x for ix, x in enumerate(texts) if ix in train_set]
x_test = [x for ix, x in enumerate(texts) if ix in test_set]
y_train = np.array([x for ix, x in enumerate(target) if ix in train_set])
y_test = np.array([x for ix, x in enumerate(target) if ix in test_set])
So it's your choice, happy coding :)