My teacher rarely answers emails, so forgive me for asking here. I'm having trouble understanding what he means by the highlighted part:
"Create a python program that will compute the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and it will compute a dictionary of words for the given corpus."
So is the input a list of strings like so: e = ["a", "a", "b", "f"]
Or is it just a list with a single string where I pull the individual words?
Full question:
Create a python program that will compute the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and it will compute a dictionary of words for the given corpus. Later, when a new document (i.e search document) is provided, your implementation should provide a list of documents that are similar to the given search document, in descending order of their similarity with the search document. For computing similarity between any two documents in our question, you can use the following distance measures (optionally you can also use any other measure as well).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between the two vectors, e.g. ||u − v||
Hint: A text document can be represented as a word vector against a given dictionary of words. So first compute the dictionary of words for a given text corpus, containing the unique words from the documents of the given corpus. Then transform every text document of the given corpus into vector form, i.e. create a word vector where 0 indicates the word is not in the document and 1 indicates that the word is present in the given document. In our question, a text document is just represented as a string, so the text corpus is nothing but a list of strings.
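To make the hint concrete, here is a minimal sketch of the 0/1 word-vector idea with the dot product as the similarity measure (the names and the tiny corpus below are made up for illustration, not the required solution):

corpus = ["this is the first doc", "this is the second doc", "and this is yet another doc"]

# dictionary of unique words for the corpus
dictionary = sorted({word for doc in corpus for word in doc.split()})

def to_vector(doc, dictionary):
    # 0/1 word vector: 1 if the dictionary word occurs in the document, else 0
    words = set(doc.split())
    return [1 if w in words else 0 for w in dictionary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

search = "yet another doc"
search_vec = to_vector(search, dictionary)
scores = [(doc, dot(to_vector(doc, dictionary), search_vec)) for doc in corpus]
print(sorted(scores, key=lambda pair: pair[1], reverse=True))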
The way I understand it, you are given a list of documents and you match a new document against them, for instance:
my_docs = ["this is the first doc", "this is the second doc", "and this is yet another doc"]
inpd = input('please enter a document: ')
my_vocab, v_dim = build_vocab(my_docs + [inpd]) # build vocabulary from doclist and new doc
my_vdocs = [build_dvec(doc, v_dim, my_vocab) for doc in my_docs] # vectorize documents
inpdv = build_dvec(inpd, v_dim, my_vocab) # vectorize input document
print('docs and score: ', [str(my_docs[itm[0]])+': '+str(itm[1]) for itm in dmtch(inpdv, my_vdocs)])
This would produce:
please enter a document: yet another doc
docs and score: ['and this is yet another doc: 3.0', 'this is the first doc: 1.0', 'this is the second doc: 1.0']
The best match is the last document in your list as it contains the most words from the input document. I have omitted the code for constructing the vocabulary, transforming the documents to vectors and matching them (=assignment).
Related
I am trying to retrieve the vocabulary of a text corpus with spacy.
The corpus is a list of strings with each string being a document from the corpus.
I came up with two different methods to create the vocabulary. Both work but yield slightly different results, and I don't know why.
The first approach results in a vocabulary size of 5000:
words = nlp(" ".join(docs))
vocab2 = []
for word in words:
    if word.lemma_ not in vocab2 and word.is_alpha:
        vocab2.append(word.lemma_)
The second approach results in a vocabulary size of 5001 -> a single word more:
vocab = set()
for doc in docs:
    doc = nlp(doc)
    for token in doc:
        if token.is_alpha:
            vocab.add(token.lemma_)
Why do the two results differ?
My best guess is that the model behind nlp() somehow produces different tokens when it gets the input as a whole versus one document at a time.
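A likely cause: lemmatisation depends on the part-of-speech tags, which depend on context, and joining all documents into one string changes the context (and sentence boundaries) around the document edges, so a token near a boundary can receive a different lemma. One way to track down the extra word is to diff the two vocabularies directly; a small sketch, where the model name and docs are placeholders:

import spacy

nlp = spacy.load("en_core_web_sm")   # placeholder model; any pipeline with a lemmatizer works
docs = ["This is the first document.", "Records were broken in the second one."]   # placeholder corpus

# vocabulary built from the whole corpus joined into one string
vocab_joined = {tok.lemma_ for tok in nlp(" ".join(docs)) if tok.is_alpha}

# vocabulary built document by document
vocab_per_doc = {tok.lemma_ for doc in docs for tok in nlp(doc) if tok.is_alpha}

# lemmas produced by only one of the two approaches
print("only in per-document vocab:", vocab_per_doc - vocab_joined)
print("only in joined vocab:", vocab_joined - vocab_per_doc)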
Is there a method in Python like findAssocs() in R?
I am trying to find word associations in Python. However, I can't find any package that has that method.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# token_doc is my list of document strings
vectorizer = CountVectorizer(min_df=5)
count = vectorizer.fit_transform(token_doc)
print(count)
print(vectorizer.vocabulary_)

tftrans = TfidfTransformer()
tfidf = tftrans.fit_transform(count).toarray()
print(tfidf)
print(vectorizer.vocabulary_)  # the vocabulary lives on the CountVectorizer, not on the TfidfTransformer
How can I get word-to-word correlations from a word list based on Tf-Idf?
Once the process I mentioned is done, I want to pass a word to a method and have it tell me which word is most strongly associated with it.
You can check what findAssocs is.
It takes an input word and returns a list of the words that are associated with it.
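There is no drop-in equivalent in scikit-learn, but findAssocs in R's tm package essentially correlates the column of the target term with every other term column of the document-term matrix, and that is straightforward to rebuild on top of a Tf-Idf matrix. A rough sketch (the helper name and the tiny corpus are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def find_assocs(docs, word, corlimit=0.1):
    # correlate the target term's tf-idf column with every other term's column
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs).toarray()   # shape: (n_docs, n_terms)
    vocab = vectorizer.vocabulary_                     # term -> column index
    target = tfidf[:, vocab[word]]
    assocs = {}
    for term, j in vocab.items():
        if term == word:
            continue
        r = np.corrcoef(target, tfidf[:, j])[0, 1]
        if r >= corlimit:
            assocs[term] = round(float(r), 2)
    return dict(sorted(assocs.items(), key=lambda kv: -kv[1]))

docs = ["the laptop bag fits the laptop",
        "a laptop case protects the laptop",
        "the camera bag holds the camera"]
print(find_assocs(docs, "laptop"))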
I'm trying TfidfVectorizer on a sentence taken from the Wikipedia page about the History of Portugal. However, I noticed that the TfidfVec.fit_transform method is ignoring certain words. Here's the sentence I tried with:
sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."
TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])
cols = [words[idx] for idx in tfidf.indices]
matrix = tfidf.todense()
pd.DataFrame(matrix,columns = cols,index=["Tf-Idf"])
Output of the dataframe:
Essentially, it is ignoring the words "Aroeira" and "Almonda".
But I don't want it to ignore those words, so what should I do? I can't find anywhere in the documentation where they talk about this.
Another question: why is the word "the" repeated? Should the algorithm consider just one "the" and compute its tf-idf?
tfidf.indices are just indices into the feature names of the TfidfVectorizer.
Getting words from the sentence by these indices is a mistake.
You should get the column names for your df from TfidfVec.get_feature_names().
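For example, a small sketch of the fix, reusing the sentence from the question (note that the vectorizer lowercases by default, so the features show up as "aroeira" and "almonda"):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."
TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])

# take the column names from the vectorizer itself, not from the raw sentence
cols = TfidfVec.get_feature_names()   # get_feature_names_out() on newer scikit-learn
print(pd.DataFrame(tfidf.todense(), columns=cols, index=["Tf-Idf"]))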
The output is giving two "the" because you have two in the sentence. The entire sentence is encoded and you're getting values for each of the indices. The reason the other two words are not appearing is that they are rare words. You can make them appear by reducing the threshold.
Refer to min_df and max_features:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
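For instance, a small two-document sketch (made-up corpus) of how min_df changes which terms survive:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cave of Aroeira", "the skull in the cave"]

# default min_df=1: every term that appears in at least one document is kept
print(sorted(TfidfVectorizer().fit(docs).vocabulary_))
# ['aroeira', 'cave', 'in', 'of', 'skull', 'the']

# min_df=2: a term must appear in at least two documents, so the rarer ones are dropped
print(sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_))
# ['cave', 'the']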
I have a list of product reviews/descriptions in excel and I am trying to classify them using Python based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words e.g. 'computer' but I am struggling to make it work for phrases e.g. 'laptop case'.
I have looked through a few answers but none worked for me, including:
using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2. But because you need to split the text up, that does not work (and I don't think text.count works for lists either).
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve splitting the body of text into individual words?
The code I currently have (that works for individual terms) is:
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if type(descriptions) is str:
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)
print(pool)
You're on the right track! You're currently splitting into single words, which facilitates finding occurrences of single words as you pointed out. To find phrases of length n you should split the text into chunks of length n, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams

sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
    print(gram)
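To tie this back to the original question, one way (a sketch with a made-up example string) to count specific phrases is to count the n-grams themselves:

from collections import Counter
from nltk import ngrams

text = 'I have a laptop case and a laptop bag and a spare laptop case'
phrases = ['laptop case', 'laptop bag']

# count every bigram in the text, then look up the phrases of interest
bigram_counts = Counter(' '.join(gram) for gram in ngrams(text.split(), 2))
print({phrase: bigram_counts[phrase] for phrase in phrases})
# {'laptop case': 2, 'laptop bag': 1}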
Sklearn's CountVectorizer is the standard way
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)
And if you want to see the counts as a dict:
count_dict = {k:v for k,v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v>0}
print(count_dict)
The default is unigrams; you can use bigrams or higher n-grams with the ngram_range parameter.
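For example, a small sketch (with made-up descriptions) of counting the phrases from the original question by passing ngram_range:

from sklearn.feature_extraction import text

descriptions = ['red laptop case with zip', 'black laptop bag', 'laptop case and laptop bag']

# unigrams and bigrams, so phrases such as "laptop case" become features of their own
vectorizer = text.CountVectorizer(ngram_range=(1, 2))
vec = vectorizer.fit_transform(descriptions)

totals = dict(zip(vectorizer.get_feature_names(), vec.toarray().sum(axis=0)))   # get_feature_names_out() on newer scikit-learn
print({k: int(v) for k, v in totals.items() if k in ('laptop case', 'laptop bag')})
# {'laptop bag': 2, 'laptop case': 2}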
I am trying to find out how to extract the collocates of a specific word out of a text. As in: what are the words that make a statistically significant collocation with, e.g., the word "hobbit" in the entire text corpus? I am expecting a result similar to a list of words (collocates) or maybe tuples (my word + its collocate).
I know how to make bi- and tri-grams using nltk, and also how to select only the bi- or trigrams that contain my word of interest. I am using the following code (adapted from this StackOverflow question).
import nltk
from nltk.collocations import *
corpus = nltk.Text(text) # "text" is a list of tokens
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(corpus)
# Only trigrams that appear 3+ times
tri_finder.apply_freq_filter(3)
# Only the ones containing my word
my_filter = lambda *w: 'Hobbit' not in w
tri_finder.apply_ngram_filter(my_filter)
print(tri_finder.nbest(trigram_measures.likelihood_ratio, 20))
This works fine and gives me a list of trigrams (one element of which is my word), each with its log-likelihood value. But I don't really want to select words only from a list of trigrams. I would like to make all possible N-gram combinations in a window of my choice (for example, all words in a window of 3 left and 3 right from my word - that would mean a 7-gram), and then check which of those N-gram words has a statistically relevant frequency paired with my word of interest. I would like to take the log-likelihood value for that.
My idea would be:
1) Calculate all N-gram combinations in different sizes containing my word (not necessarily using nltk, unless it allows calculating units larger than trigrams, but I haven't found that option),
2) Compute the log-likelihood value for each of the words composing my N-grams, and somehow compare it against the frequency of the N-gram they appear in (?). Here is where I get lost a bit... I am not experienced in this and I don't know how to think about this step.
Does anyone have suggestions on how I should do this?
And assuming I use the pool of trigrams provided by nltk for now: does anyone have ideas how to proceed from there to get a list of the most relevant words near my search word?
Thank you
Interesting problem ...
Related to 1): take a look at this thread; it has several nice solutions for making n-grams, for example:
from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print(grams)
The other way could be:
from gensim.models.phrases import Phrases, Phraser

# doc is assumed to be a list of tokenized sentences (each a list of tokens)
phrases = Phrases(doc, min_count=2)
bigram = Phraser(phrases)

phrases = Phrases(bigram[doc], min_count=2)
trigram = Phraser(phrases)

phrases = Phrases(trigram[doc], min_count=2)
quadgram = Phraser(phrases)
... (you could continue indefinitely)
min_count ignores all words and bigrams whose total count in the corpus is lower than that value, so it controls how frequent a pair must be before it can be merged into a phrase.
Related to 2): it's somewhat tricky to calculate the log-likelihood for more than two variables, since you have to count all the permutations. Have a look at this thesis, in which the author proposes a solution (page 26 contains a good explanation).
However, in addition to the log-likelihood function, there is the PMI (Pointwise Mutual Information) metric, which is based on the co-occurrence of a pair of words divided by their individual frequencies in the text. PMI is easy to understand and calculate, and you can use it for each pair of words.
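For example, a sketch of the PMI route with NLTK's window-based bigram finder; here tokens stands for the same token list as "text" in the question, and likelihood_ratio could be dropped in instead of pmi:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# tokens: the list of word tokens from the corpus (same as "text" in the question)
# window_size=4 pairs every word with the next three words, which in effect covers
# a window of 3 tokens to the left and 3 to the right of the target word
finder = BigramCollocationFinder.from_words(tokens, window_size=4)
finder.apply_freq_filter(3)                                           # pairs seen 3+ times
finder.apply_ngram_filter(lambda w1, w2: 'Hobbit' not in (w1, w2))    # keep pairs with the target

# score the surviving pairs; each entry is ((w1, w2), score), sorted best first
for pair, score in finder.score_ngrams(bigram_measures.pmi)[:20]:
    print(pair, score)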