WordCloud.process_text vs sklearn's CountVectorizer - python

I would like to count the term frequency across the corpus. To do that, there are two ways, which was using CountVectorizer and sum in axis=0 as below.
count_vec = CountVectorizer(tokenizer=cab_tokenizer, ngram_range=(1,2), stop_words=stopwords)
cv_X = count_vec.fit_transform(string_list)
Another way is using WordCloud.process_text() (see doc here) which will result in term-frequency dict. I used stopword from previously TfIdfVectorizer using tfidf_vec.get_stop_words().
text_freq = WordCloud(stopwords=stopwords, collocations=True).process_text(text)
The fact that I am using stopwords from the TfIdfVectorizer, I am expecting this to behave the same, however, the features/terms I am getting is different (length of the dict is less than TfIdfVectorizer.get_feature_names().
So, I am wondering, what is the different of using one over another? Is one more accurate than the other?

Related

I want to Train 4 more Word2vec models and average the resulting embedding matrices

I wrote the code below, I used Used spacy to restrict the words in the tweets to content words, i.e., nouns, verbs, and adjectives. Transform the words to lower case and add the POS with an underderscore. E.g.:
love_VERB old-fashioneds_NOUN
now I want to Train 4 more Word2vec models and average the resulting embedding matrices.
but I dont have any idea for it, can you help me please ?
# Tokenization of each document
from gensim.models.word2vec import FAST_VERSION
from gensim.models import Word2Vec
import spacy
import pandas as pd
from zipfile import ZipFile
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
zf.extractall()
# nrows , max amount of rows
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.values.tolist()
nlp = spacy.load('en_core_web_sm') # you can use other methods
# excluded tags
included_tags = {"NOUN", "VERB", "ADJ"}
vocab = [s for s in new_sentences]
sentences = documents[:103] # first 10 sentences
new_sentences = []
for sentence in sentences:
new_sentence = []
for token in nlp(sentence):
if token.pos_ in included_tags:
new_sentence.append(token.text.lower()+'_'+token.pos_)
new_sentences.append(new_sentence)
# initialize model
w2v_model = Word2Vec(
size=100,
window=15,
sample=0.0001,
iter=200,
negative=5,
min_count=1, # <-- it seems your min_count was too high
workers=-1,
hs=0
)
new_sentences
w2v_model.build_vocab(vocab)
w2v_model.train(vocab,
total_examples=w2v_model.corpus_count,
epochs=w2v_model.epochs)
w2v_model.wv['car_NOUN']
There's no reason to average together vectors from multiple training runs; it is more likely to destroy any value from the individual runs than provide any benefit.
No one run creates the 'right' final positions, nor do they all approach some idealized positions. Rather, each just creates a set-of-vectors that is internally comparable to others in that same co-trained set. Comparisons or combinations with vectors from other, non-interleaved training runs are usually going to be nonsense.
Instead, aim for one adequate run. If vectors move around a lot, in repeated runs, that's normal. But each reconfiguration should be about as useful, if used for word-to-word comparisons, or analysis of word neighborhoods/directions, or as input to downstream algorithms. If they vary wildly in usefulness, there are likely other inadequacies in the data or model parameters. (For example: too little data – word2vec requires lots to give meaningful results – or a model that's too large for the data – making it prone to overfitting.)
Other observations about your setup:
Just 103 sentences/texts is tiny for word2vec; you shouldn't expect the vectors from such a run to have any of the value that the algorithm would usually provide. (Running such a tiny dataset might be helpful for verifying no halting-errors in the process, or familiarize yourself with the steps/APIs, but the results will tell you nothing.)
min_count=1 is almost always a bad idea in word2vec and similar algorithms. Words that only appear once (or a few times) don't have the variety of subtly-different uses that are needed to train it into a balanced position against other words – so they wind up with weak/strange final positions, and the sheer number of such words dilutes the training effectiveness for other more-frequent words. The common practice of discarding rare words usually gets better results.
iter=200 is an extreme choice which is typically only valuable to try to squeeze results out of inadequate data. (In such a case, you might also have to reduce the vector-size from normal 100-plus dimensions.) So if you seem to need that, getting more data should be a top priority. (Using 20x more data is far, far better than using 20x more iterations on smaller data – but involves the same amount of training time.)

Clustering sentence vectors in a dictionary

I'm working with a kind of unique situation. I have words in Language1 that I've defined in English. I then took each English word, took its word vector from a pretrained GoogleNews w2v model, and average the vectors for every definition. The result, an example with a 3 dimension vector:
L1_words={
'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}
What I want to do is cluster (using K-means probably, but I'm open to other ideas) the keys of the dict by their numpy-array values.
I've done this before with standard w2v models, but the issue I'm having is that this is a dictionary. Is there another data set I can convert this to? I'm inclined to write it to a csv/make it into a pandas datafram and use Pandas or R to work on it like that, but I'm told that floats are problem when it comes to things requiring binary (as in: they lose information in unpredictable ways). I tried saving my dictionary to hdf5, but dictionaries are not supported.
Thanks in advance!
If I understand your question correctly, you want to cluster words according to their W2V representation, but you are saving it as dictionary representation. If that's the case, I don't think it is a unique situation at all. All you got to do is to convert the dictionary into a matrix and then perform clustering in the matrix. If you represent each line in the matrix as one word in your dictionary you should be able to reference the words back after clustering.
I couldn't test the code below, so it may not be completely functional, but the idea is the following:
from nltk.cluster import KMeansClusterer
import nltk
# make the matrix with the words
words = L1_words.keys()
X = []
for w in words:
X.append(L1_words[w])
# perform the clustering on the matrix
NUM_CLUSTERS=3
kclusterer = KMeansClusterer(NUM_CLUSTERS,distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
# print the cluster each word belongs
for i in range(len(X)):
print(words[i], assigned_clusters[i])
You can read more in details in this link.

Gensim's Doc2vec - inferred vector isn't similar

When I train Doc2vec (using Gensim's Doc2vec in Python) on corpus of about 10k documents (each has few hundred words) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would expect they would be at least somewhat similar.
That is I do model.docvecs['some_doc_id'] and model.infer_vector(documents['some_doc_id']).
Cosine distances between trained and inferred vectors for few first documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so terrible even for documents used for training, I can't even begin to try to infer unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4,
seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
Note on the side: Documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inferring).
Also I played with parameters around a bit, too, and results were similar. So if your suggestion would be something like "try increasing or decreasing this or that training parameter", I've most likely tried it. Maybe I just didn't come across the 'correct' parameters though.
Thanks for any suggestions as to what can I do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec). It doesn't have to be this one. If you know of another that can achieve better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence
files = {}
folder = 'some path' # each file contains few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()
documents = []
for k, v in files.items():
words = text_to_word_sequence(v, lower=True) # converts string to list of words, removes commas etc.
documents.append(TaggedDocument(tags=[k], words=words))
d2 = Doc2Vec(size=200, documents=documents)
for doc in documents:
trained = d2.docvecs[doc.tags[0]]
inferred = d2.infer_vector(doc.words, steps=50)
print(cosine(trained, inferred)) # cosine similarity from scipy
What is the type of your documents object, and are you sure that it is a multiply-iterable object, so that the model can do all of its 16 passes over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with all items as TaggedDocument-shaped objects with the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can be iterated over only once, and then ignoring any logged hints/warnings that no real training has happening. The inference/similarity results from such a model will be essentially random.)
Then for infer_vector(), does documents[tag] really return just the list-of-words it expects (not TaggedDocument or string)? (Users often supply strings, rather than lists-of-tokens, for training or inference words and get results that are just noise.)
Was there evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to common configuration. (And then even after the basics are working, limit changes to those that can be demonstrated as better via a repeatable scoring process.)
Some report needing much higher steps values – 100 or more – to get better inference results, though that would be most crucial for very-small documents (of a handful to couple dozen words) rather than the few-hundred-words documents you describe.
A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector-size (100) and larger number of iterations (15), it might be workable.
If you're still having problems, you should expand your question with more code showing how documents works, some suggestive example documents, and your cosine-similarity evaluation process – to see if there are any oversights at each of those steps.

Sentence matching with gensim word2vec: manually populated model doesn't work

I'm trying to solve a problem of sentence comparison using naive approach of summing up word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, few hundreds of people, so i wanted to give it a try before digging into doc2vec.
I prepare the data by cleaning it completely, removing stop words, tokenizing and lemmatizing. I use pre-trained model for word vectors which returns adequate results when finding similarities for some test words. Also tried summing up the sentence words to find similarities in the original model - the matches do make sense. The similarities would be around general sense of the phrase.
For sentence matching I'm trying the following: create an empty model
b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build vocab out of names (or person id's), no training
#first create vocab with an empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index]*0
#populate vocab from an array
b.build_vocab([personIds], update=True)
Summ each sentence's word vectors and store the results into the model for a corresponding id
#sentences are pulled from pandas dataset df. 'a' is a pre-trained model i use to get vectors for each word
def summ(phrase, start_model):
'''
vector addition function
'''
#starting with a vector of 0's
sum_vec = start_model.word_vec("cat_NOUN")*0
for word in phrase:
sum_vec += start_model.word_vec(word)
return sum_vec
for i, row in df.iterrows():
try:
personId = row["ID"]
summVec = summ(df.iloc[i,1],a)
#updating syn0 for each name/id in vocabulary
b.wv.syn0[b.wv.vocab[personId].index] = summVec
except:
pass
I understand that i shouldn't be expecting much accuracy here, but the t-SNE print doesn't show any clustering whatsoever. Finding similarities method also fails to find matches (<0.2 similarity coefficient basically for everything). [
Wondering if anyone has an idea of where did i go wrong? Is my approach valid at all?
Your code, as shown, neither does any train() of word-vectors (using your local text), nor does it pre-load any vectors from elsewhere. So any vectors which do exist – created by the build_vocab() calls – will still just be in their randomly-initialized starting locations, and be useless for any semantic purposes.
Suggestions:
either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
The update=True option for build_vocab() should be considered an expert, experimental option – only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property - those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on word-vectors, outside the model in your own data structures.

How to combine tfidf features with selfmade features

For a simple web page classification system I am trying to combine some selfmade features (frequency of HTML tags, frequency of certain word collocations) with the features obtained after applying tfidf. I am facing the following problem, however, and I don't really know how to proceed from here.
Right now I am trying to put all of these together in one dataframe, mainly by following the code from the following link :
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names()
dense = X_train_counts.todense()
denselist = dense.tolist()
tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])
But this doesn't return the index (from 0 to 2464) I had in my original dataframe with the other features, neither does it seem to produce readable column names and instead of using the different words as titles, it uses numbers.
Furthermore I am not sure if this is the right way to combine features as this will result in an extremely high-dimensional dataframe which will probably not benefit the classifiers.
You can use hstack to merge the two sparse matrices, without having to convert to dense format.
from scipy.sparse import hstack
hstack([X_train_counts, X_train_custom])

Categories