How to combine tfidf features with self-made features - python

For a simple web page classification system I am trying to combine some self-made features (frequency of HTML tags, frequency of certain word collocations) with the features obtained after applying tfidf. However, I am facing the following problem and don't really know how to proceed from here.
Right now I am trying to put all of these together in one dataframe, mainly following the code from this link:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names()
dense = X_train_counts.todense()
denselist = dense.tolist()
tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])
But this doesn't preserve the index (from 0 to 2464) I had in my original dataframe with the other features, nor does it seem to produce readable column names: instead of using the different words as column titles, it uses numbers.
Furthermore, I am not sure whether this is the right way to combine features, as it results in an extremely high-dimensional dataframe which will probably not benefit the classifiers.

You can use hstack to merge the two sparse matrices, without having to convert to dense format.
from scipy.sparse import hstack
hstack([X_train_counts, X_train_custom])
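For example, if your self-made features sit in a pandas DataFrame that is row-aligned with train_data (the name custom_features_df below is just a placeholder), a minimal sketch could look like this:
from scipy.sparse import csr_matrix, hstack

# hypothetical DataFrame of hand-crafted features (tag frequencies, collocation counts, ...),
# one row per document, in the same order as train_data
X_train_custom = csr_matrix(custom_features_df.values)

# stack the tf-idf matrix and the custom features column-wise, staying sparse
X_train_combined = hstack([X_train_counts, X_train_custom])
The combined sparse matrix can be passed directly to most scikit-learn classifiers, so you never need to materialise the huge dense dataframe.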

Related

Clustering sentence vectors in a dictionary

I'm working with a kind of unique situation. I have words in Language1 that I've defined in English. I then took each English word, took its word vector from a pretrained GoogleNews w2v model, and averaged the vectors for every definition. The result looks like this (shown with 3-dimensional vectors as an example):
L1_words={
'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}
What I want to do is cluster (using K-means probably, but I'm open to other ideas) the keys of the dict by their numpy-array values.
I've done this before with standard w2v models, but the issue I'm having is that this is a dictionary. Is there another data structure I can convert this to? I'm inclined to write it to a CSV / turn it into a pandas DataFrame and use pandas or R to work on it like that, but I'm told that floats are a problem when it comes to things requiring binary (as in: they lose information in unpredictable ways). I tried saving my dictionary to hdf5, but dictionaries are not supported.
Thanks in advance!
If I understand your question correctly, you want to cluster words according to their w2v representation, but you currently have it stored as a dictionary. If that's the case, I don't think it is a unique situation at all. All you have to do is convert the dictionary into a matrix and then perform the clustering on that matrix. If each row of the matrix corresponds to one word in your dictionary, you will be able to map the words back to their clusters afterwards.
I couldn't test the code below, so it may not be completely functional, but the idea is the following:
from nltk.cluster import KMeansClusterer
import nltk
# make the matrix with the words (keys() must be turned into a list so it can be indexed later)
words = list(L1_words.keys())
X = []
for w in words:
    X.append(L1_words[w])
# perform the clustering on the matrix
NUM_CLUSTERS = 3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
# print the cluster each word belongs to
for i in range(len(X)):
    print(words[i], assigned_clusters[i])
You can read more details in this link.
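If you prefer scikit-learn over nltk, a roughly equivalent sketch (again untested against your data, and assuming the same L1_words dictionary) would be:
import numpy as np
from sklearn.cluster import KMeans

words = list(L1_words.keys())
X = np.vstack([L1_words[w] for w in words])  # one row per word

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
for word, cluster in zip(words, kmeans.labels_):
    print(word, cluster)
Note that scikit-learn's KMeans uses Euclidean distance; to get behaviour closer to the cosine distance used above you could L2-normalise the rows first (e.g. with sklearn.preprocessing.normalize).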

How to use k means for a product recommendation dataset

I have a data set with columns titled product name, brand, rating (1:5), review text, and review-helpfulness. I need to propose a recommendation algorithm using the reviews, and I have to code it in Python. The data set is in .csv format.
To explore the nature of the data set I want to use k-means on it. How can I use k-means on this data set?
So far I have done the following:
1. data pre-processing,
2. review text data cleaning,
3. sentiment analysis,
4. assigning a sentiment score from 1 to 5 according to the sentiment value (given by the sentiment analysis) and tagging reviews as very negative, negative, neutral, positive, or very positive.
After these steps my data set has the columns: product name, brand, rating (1:5), review text, review-helpfulness, sentiment-value, sentiment-tag.
This is the link to the data set https://drive.google.com/file/d/1YhCJNvV2BQk0T7PbPoR746DCL6tYmH7l/view?usp=sharing
I tried k-means with the following code. It runs without error, but I don't know whether this output is useful, or whether there are other ways of using k-means on this data set that would give more useful results. How should I use k-means to learn more about this data?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# df is assumed to be the data set already loaded from the CSV file
df.info()
X = np.array(df.drop(['sentiment_value'], axis=1).astype(float))
y = np.array(df['rating'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
# output echoed by the interpreter (not a statement that does anything):
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)
plt.show()
You did not plot anything.
So nothing shows up.
Unless you are more specific about what you are trying to achieve, we won't be able to help. Figure out what exactly you want to predict. Do you just want to cluster products according to their sentiment score (which isn't especially promising), or do you want to predict actual product preferences on a new dataset?
If you want to build a recommendation system the only possibility (considering your dataset) would be to identify similar products according to the rating/sentiment. Is that what you want?
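If that is what you want, a rough sketch of the idea (the column names are taken from your question, so adjust them to your actual CSV; the file and product names are placeholders) could be:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("reviews.csv")  # placeholder file name

# one row per product: average rating and average sentiment across its reviews
product_profiles = df.groupby("product name")[["rating", "sentiment_value"]].mean()

X = StandardScaler().fit_transform(product_profiles)
product_profiles["cluster"] = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# products in the same cluster as a given product are candidate recommendations
target = product_profiles.loc["some product"]  # placeholder product name
candidates = product_profiles[product_profiles["cluster"] == target["cluster"]]
print(candidates.head())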

WordCloud.process_text vs sklearn's CountVectorizer

I would like to count term frequencies across the whole corpus. There are two ways to do that. The first is to use CountVectorizer and sum along axis=0, as below.
count_vec = CountVectorizer(tokenizer=cab_tokenizer, ngram_range=(1,2), stop_words=stopwords)
cv_X = count_vec.fit_transform(string_list)
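The summing step itself is roughly this (a sketch; term_freq and freq_dict are just illustrative names):
import numpy as np

# sum the counts over all documents (rows) to get one corpus-wide frequency per term
term_freq = np.asarray(cv_X.sum(axis=0)).ravel()
freq_dict = dict(zip(count_vec.get_feature_names(), term_freq))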
Another way is to use WordCloud.process_text() (see doc here), which results in a term-frequency dict. I used the stop words from a previously fitted TfidfVectorizer, obtained via tfidf_vec.get_stop_words().
text_freq = WordCloud(stopwords=stopwords, collocations=True).process_text(text)
Since I am using the stop words from the TfidfVectorizer, I expected the two approaches to behave the same; however, the features/terms I am getting are different (the length of the dict is less than the length of TfIdfVectorizer.get_feature_names()).
So I am wondering, what is the difference between using one over the other? Is one more accurate than the other?

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on data frame columns).
I have trained an XGBoost model on preprocessed data (scaled with MinMaxScaler), so I am in exactly that situation where the feature names are lost.
For instance:
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
my_model_name = XGBClassifier()
my_model_name.fit(X,Y)
where X and Y are the training data and labels respectively. The scaling above returns a 2D NumPy array, thereby discarding the feature names from the pandas DataFrame.
Thus, when I try to use plot_importance(my_model_name), it leads to the plot of feature importance, but only with feature names such as f0, f1, f2 etc., and not the actual feature names from the original data set.
Is there a way to map the feature names from the original training data to the feature importance plot generated, so that the original feature names are plotted in the graph? Any help in this regard is highly appreciated.
You can get the feature names by:
model.get_booster().feature_names
You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In that case calling model.get_booster().feature_names is not useful, because the returned names are in the form [f0, f1, ..., fn] and these are the names shown in the output of the plot_importance method as well.
But there are several ways to achieve what you want, supposing you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = X.columns if X was a pandas DataFrame.
Then you should be able to:
change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which should pick up the updated names and show them on the plot
or, since this method returns a matplotlib ax, you can modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to get the order of your features right)
or you can take model.feature_importances_ and combine it with your original feature names yourself, i.e. plot it yourself (a sketch of this option follows the list)
similarly, you can also use the model.get_booster().get_score() method and combine it with your feature names
or you can try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training since I usually use the Scikit-Learn API)
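As a sketch of the third option in the list above (plotting model.feature_importances_ yourself; orig_feature_names is assumed to hold your saved column names, and its order must match the columns used for training):
import matplotlib.pyplot as plt
import pandas as pd

# pair each importance with its original column name and plot a horizontal bar chart
importances = pd.Series(model.feature_importances_, index=orig_feature_names)
importances.sort_values().plot(kind="barh")
plt.xlabel("importance")
plt.show()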
EDIT:
Thanks to #Noob Programmer (see comments below), there might be some "inconsistencies" depending on which feature importance method is used. These are the most important ones:
xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
model.get_booster().get_score() also uses "weight" as the default (see get_score)
model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems that the result is normalized to sum to 1 (see this comment)
For more info on this topic, look at How to get feature importance.
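So if you want the different methods to agree, you can pass the importance type explicitly, e.g. (a minimal illustration, with model being the fitted classifier):
from xgboost import plot_importance

# request "gain" instead of the default "weight" in both places
plot_importance(model, importance_type="gain")
scores = model.get_booster().get_score(importance_type="gain")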
I tried the above answers, and they didn't work when loading the model after training.
So, the working code for me is:
model.feature_names
It returns a list of the feature names.
I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.:
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
Y = label  # the training labels, assumed to be loaded already
X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)
my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df,Y)
xgb.plot_importance(my_model_name)
plt.show()
This will show the original names.

Using scikit learn's GaussianNB with nltk doesn't work

I am trying to use nltk's wrapper for scikit-learn's classifiers. I use this code to train the classifier:
classifier = SklearnClassifier(GaussianNB())
classifier.train(self.training_set)
Where training_set looks like
[({'name':'Alpha Hotel', 'clicks':765, 'zip_code':75025},'no bookings')]
The error I am getting is
TypeError: A sparse matrix was passed, but dense data is required. Use
X.toarray() to convert to a dense numpy array.
I don't know how to convert to a dense array, especially since nltk's documentation for the train method says it requires "a list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings".
You have three features, but only two of them are in numerical format. You should first convert the 'name' feature to a number. If the name variable is categorical, you can encode it in a meaningful manner as described here:
http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
I think your labels are also limited, so you can encode them too. The last step is really easy: you just need to convert the nltk format to NumPy array format. Just read each feature in a loop and then insert your desired features into X (features) and Y (labels):
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
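As a rough sketch of that conversion (bypassing the nltk wrapper entirely and using the training_set structure from the question; DictVectorizer one-hot encodes the string features and LabelEncoder encodes the labels):
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

training_set = [({'name': 'Alpha Hotel', 'clicks': 765, 'zip_code': 75025}, 'no bookings')]

features = [featureset for featureset, label in training_set]
labels = [label for featureset, label in training_set]

# sparse=False gives the dense array that GaussianNB expects
X = DictVectorizer(sparse=False).fit_transform(features)
Y = LabelEncoder().fit_transform(labels)

GaussianNB().fit(X, Y)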
Maybe this is late, but it may help others who get the same problem (I got this problem yesterday).
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
As the error says, the data needs to be converted to a dense array, so I just converted it as suggested:
vector = vectorizer.transform(corpus).toarray()
So just adding .toarray() solves this problem.
;)
When I switch to MultinomialNB or BernoulliNB, neither of them raises an error, with or without .toarray().
Note: don't forget to fit and transform your text into a word representation (numeric values) first.
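A compact illustration of the difference (corpus and labels below are placeholders): MultinomialNB accepts the sparse matrix directly, while GaussianNB needs the dense array.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

corpus = ["good hotel many bookings", "bad hotel no bookings"]  # placeholder documents
labels = [1, 0]                                                 # placeholder labels

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(corpus)

MultinomialNB().fit(X_sparse, labels)         # sparse input works
GaussianNB().fit(X_sparse.toarray(), labels)  # needs dense input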
