I'm trying to use LightGBM for multiclass text classification.
My pandas dataframe has two columns, 'contents' and 'category', set as follows.
Dataframe:
   contents             category
1  this is example1...  A
2  this is example2...  B
3  this is example3...  C
The actual dataframe consists of approximately 600 rows and 2 columns.
I'm trying to classify the text into 3 categories with the code below.
Code:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
stopwords1 = set(stopwords.words('english'))
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
import lightgbm as lgbm
from lightgbm import LGBMClassifier, LGBMRegressor
#--main code--#
X_train, X_test, Y_train, Y_test = train_test_split(df['contents'], df['category'], random_state = 0, test_size=0.3, shuffle=True)
count_vect = CountVectorizer(ngram_range=(1,2), stop_words=stopwords1)
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2', sublinear_tf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
lgbm_train = lgbm.Dataset(X_train_tfidf, Y_train)
lgbm_eval = lgbm.Dataset(count_vect.transform(X_test), Y_test, reference=lgbm_train)
params = {
'boosting_type':'gbdt',
'objective':'multiclass',
'learning_rate': 0.02,
'num_class': 3,
'early_stopping': 100,
'num_iteration': 2000,
'num_leaves': 31,
'is_enable_sparse': 'true',
'tree_learner': 'data',
'max_depth': 4,
'n_estimators': 50
}
clf_gbm = lgbm.train(params, lgbm_train, valid_sets=lgbm_eval)
predicted_LGBM = clf_gbm.predict(count_vect.transform(X_test))
print(accuracy_score(Y_test, predicted_LGBM))
Then I got this error:
ValueError: could not convert string to float: 'b'
I also tried converting the 'category' column from ['a', 'b', 'c'] to ints [0, 1, 2], but then got this error:
TypeError: Expected np.float32 or np.float64, met type(int64).
What's wrong with my code?
Any advice / suggestions will be greatly appreciated.
Thanks in advance.
I managed to deal with this issue. It's very simple, but noted here for reference.
Since LightGBM expects float32/float64 input, the 'category' labels must be numbers rather than strings,
and the input data should be converted to float32/float64 using .astype().
Change 1:
I added the following 4 lines after X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts):
X_train_tfidf = X_train_tfidf.astype('float32')
X_test_counts = count_vect.transform(X_test).astype('float32')
Y_train = Y_train.astype('float32')
Y_test = Y_test.astype('float32')
Change 2:
Convert the 'category' column from ['A', 'B', 'C', ...] to [0.0, 1.0, 2.0, ...].
Maybe just setting the attribute TfidfVectorizer(dtype=np.float32) would also work in this case.
And passing the vectorized data to LGBMClassifier is much simpler.
Update
Using TfidfVectorizer is much simpler:
tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True)
tfidf_vec.fit(df['contents'])  # learn the vocabulary and IDF weights
X_train_tfidf = tfidf_vec.transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)
clf_LGBM = lgbm.LGBMClassifier(objective='multiclass', verbose=-1, learning_rate=0.5,
                               max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000)
clf_LGBM.fit(X_train_tfidf, Y_train)
predicted_LGBM = clf_LGBM.predict(X_test_tfidf)
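For completeness, a minimal sketch of evaluating this simpler pipeline, reusing the accuracy_score import and the Y_test split from the question:
# Compare the held-out labels against the classifier's predictions
print(accuracy_score(Y_test, predicted_LGBM))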
Related
I am using the xgboost multiclass classifier as outlined in the example below. For each row in the X_test dataframe, the model outputs a list whose elements are the probabilities corresponding to each category 'a', 'b', 'c' or 'd', e.g. [0.44767836 0.2043365 0.15775423 0.19023092].
How can I tell which element in the list corresponds to which class / category (a, b, c or d)? My goal is to create 4 extra columns on the dataframe, a, b, c and d, with the matching probability as the row value in each column.
import numpy as np
import pandas as pd
import xgboost as xgb
import random
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
#Create Example Data
np.random.seed(312)
data = np.random.random((10000, 3))
y = [random.choice('abcd') for _ in range(data.shape[0])]
features = ["x1", "x2", "x3"]
df = pd.DataFrame(data=data, columns=features)
df['y']=y
#Encode target variable
labelencoder = preprocessing.LabelEncoder()
df['y_target'] = labelencoder.fit_transform(df['y'])
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(df[features], df['y_target'], test_size=0.2, random_state=42, stratify=y)
#Train Model
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {
    'objective': 'multi:softprob',
    'random_state': 20,
    'tree_method': 'gpu_hist',
    'num_class': 4
}
xgb_model = xgb.train(param, dtrain, 100)
predictions=xgb_model.predict(dtest)
print(predictions)
Predictions follow the same order as your encoded labels 0, 1, 2, 3. To recover the original target names, use the classes_ attribute of the LabelEncoder.
import pandas as pd
pd.DataFrame(predictions, columns=labelencoder.classes_)
>>>
a b c d
0 0.133130 0.214460 0.569207 0.083203
1 0.232991 0.275813 0.237639 0.253557
2 0.163103 0.248531 0.114013 0.474352
3 0.296990 0.202413 0.157542 0.343054
4 0.199861 0.460732 0.228247 0.111159
...
1995 0.021859 0.460219 0.235214 0.282708
1996 0.145394 0.182243 0.225992 0.446370
1997 0.128586 0.318980 0.237229 0.315205
1998 0.250899 0.257968 0.274477 0.216657
1999 0.252377 0.236990 0.221835 0.288798
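To get the four probability columns onto the dataframe, as the question asks, one sketch (assuming X_test keeps its original row index, as it does after train_test_split) is:
# Index the probabilities by X_test's original row labels, then join
proba_df = pd.DataFrame(predictions, columns=labelencoder.classes_, index=X_test.index)
df_with_proba = df.join(proba_df)  # rows outside the test split get NaN in a, b, c, d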
I have written the following code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Spam_model = LogisticRegression(solver='liblinear', penalty='l1')
print(X_train)
Spam_model.fit(X_train, Y_train)
pred = Spam_model.predict(X_test)
accuracy_score(Y_test,pred)
It's throwing the following error. What could be the reason for that?
Logistic regression works with numbers, not strings. You input one value (or more) and it predicts another. A float is a number with decimals; for example, 2 is an integer and 2.53 is a float. What you can do is
a = '0.67687980'
print(float(a))
which prints
0.6768798
However, you cannot do it with a string
a = 'Some string'
print(float(a))
as it raises:
ValueError: could not convert string to float: 'Some string'
If you're using data that isn't numeric, you should convert it all to numbers first to avoid this error.
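For example, a minimal sketch of turning string labels into numeric codes with pandas (pd.factorize is one option; sklearn's LabelEncoder works too):
import pandas as pd

labels = pd.Series(['spam', 'ham', 'spam', 'ham'])
codes, uniques = pd.factorize(labels)  # codes: array([0, 1, 0, 1]); uniques: Index(['spam', 'ham'])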
If you have text as data, you need to do feature extraction before applying the classifier. Using an old example from sklearn:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
X_train = newsgroups_train.data
Y_train = newsgroups_train.target
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)
X_test = newsgroups_test.data
Y_test = newsgroups_test.target
Data looks like this:
Y_train
array([0, 1, 1, ..., 1, 1, 1])
X_train[0][:50]
'From: bil#okcforum.osrhe.edu (Bill Conner)\nSubject'
Apply a vectorizer to convert the text into numerical features, then train the model:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
model = LogisticRegression(solver='liblinear', penalty='l1')
model.fit(X_train_vec, Y_train)
pred = model.predict(X_test_vec)
accuracy_score(Y_test,pred)
0.906030855539972
I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas dataframe with columns labeled 'Impression' (text), 'Volume' (floats), and 'Cancer' (label). I've been using only Impression to predict Cancer but would like to use Impression and Volume to predict Cancer instead.
My code previously that ran without issue:
X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
I've tried a few different ways to add the Volume predictor:
1) Only fit_transform the Impressions
X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
This throws the error
TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.
2) Call fit_transform on both Impression and Volume. Same code as above except for the fit_transform line:
X_train = vectorizer.fit_transform(X_train)
This of course throws the error:
ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)
I'm pretty sure method #1 is the right way to go but I haven't been able to find any tutorials or solutions for how I can add the float predictor to this text classification model.
Any help would be appreciated!
ColumnTransformer() solves exactly this problem. Instead of manually appending the output of CountVectorizer to the other columns, you can set the remainder param of ColumnTransformer to 'passthrough'.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True, display='diagram')
data = pd.DataFrame({'Impression': ['this is the first text',
'second one goes like this',
'third one is very short',
'This is the final statement'],
'Volume': [123, 1, 2, 123],
'Cancer': [1, 0, 0, 1]})
X_train, X_test, y_train, y_test = train_test_split(
data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)
ct = make_column_transformer(
(CountVectorizer(), 'Impression'), remainder='passthrough')
pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
Use scikit-learn version 0.23.0 or later to see the visual representation of pipeline objects (the display param in set_config).
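Because the ColumnTransformer lives inside the pipeline, you can predict straight from a raw dataframe with the same columns; a small sketch with a hypothetical new row:
# Hypothetical new sample with the same column layout as the training frame
new_row = pd.DataFrame({'Impression': ['another short text'], 'Volume': [50]})
print(pipeline.predict(new_row))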
You can use scipy.sparse.hstack to combine the two features.
import numpy as np
from scipy.sparse import hstack
X_train_text = vectorizer.fit_transform(X_train['Impression'])
# hstack expects a sequence of blocks; reshape Volume into a column vector
X_train_new = hstack([X_train_text, X_train['Volume'].values.reshape(-1, 1)])
Now your new training matrix contains both features. And if I may advise: use TfidfVectorizer instead of CountVectorizer, since TF-IDF weighs the importance of words within each document/Impression, while CountVectorizer only counts occurrences, so a common word like "the" would get more weight than the words that really matter.
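A minimal sketch of that swap, assuming the same X_train frame as above:
from sklearn.feature_extraction.text import TfidfVectorizer
# Drop-in replacement for CountVectorizer; weights terms by TF-IDF instead of raw counts
vectorizer = TfidfVectorizer()
X_train_text = vectorizer.fit_transform(X_train['Impression'])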
I'm trying to build a text-classification model on a database of site reviews (3 classes).
I cleaned the dataframe, tokenized it (with CountVectorizer), applied TF-IDF weighting (TfidfTransformer), and built an MNB model.
Now, after training and evaluating the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)
#### transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
tfidf_x, y, test_size=0.3, random_state=101
)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
My objective is to get the original indices of the reviews that the model predicted wrongly.
I managed to get the result like this:
predictions = c.predict(preprocessed_df['review_text'])
df2= preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
df2[df2['business_category']!=df2['prediction']]
I'm sure there is a more elegant way...
It seems there is another problem in your code: generally, the TF-IDF vectorizer is fit on the training data only, and the test data is then transformed with the fitted vectorizer so that it is in the same format. This is primarily done to avoid data leakage. Please refer to TfidfVectorizer: should it be used on train only or train+test. I have modified your code to suit your need.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
classification_report,
confusion_matrix,
accuracy_score,
roc_auc_score,
roc_curve,
)
df = pd.read_csv(
"https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=101
)
# TfidfTransformer expects a count matrix, not raw text, so use TfidfVectorizer here
vectorizer = TfidfVectorizer()
x_train_tf = vectorizer.fit_transform(x_train)
x_test_tf = vectorizer.transform(x_test)
mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)
incorrect_docs = x_test[predmnb != y_test]
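Since x_test is a pandas Series, it keeps the original dataframe index, so the misclassified row indices can be recovered directly; a short sketch:
# Plain boolean array, positionally aligned with x_test
wrong_mask = predmnb != y_test.values
wrong_idx = x_test.index[wrong_mask]
wrong_reviews = cleaned_df.loc[wrong_idx]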
I'm using pandas and scikit-learn to do some basic data cleaning and then ML. I have a words_df DataFrame that's 983 rows x 33,600 columns. The columns mostly come from running TF-IDF as below:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
corpus = result_df['_text'].tolist()
count_vect = CountVectorizer(min_df=1, stop_words='english')
dtm = count_vect.fit_transform(corpus)
word_counts = dtm.toarray()
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(word_counts)
words_df = pd.DataFrame(tfidf.todense(), columns=count_vect.get_feature_names())
I extract an X and a Y (input instances and their target values, in my case page views). X is a DataFrame and Y is a Series (I just use words_df['_pageviews']).
I then run:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
Unfortunately, I get this error:
TypeError: Expected sequence or array-like, got estimator _title
Is this because one of my columns is called _title? I'm not sure what else could be causing this error.
Thanks!