I have two sentences in Python that represent sets of words the user gives as input, as a query for image retrieval software:
sentence1 = "dog is the"
sentence2 = "the dog is a very nice animal"
I have a set of images that have a description, so for example:
sentence3 = "the dog is running in your garden"
I want to retrieve all the images whose description is "very close" to the query entered by the user, but this description score should be normalized between 0 and 1, since it is only one part of a larger ranking that also considers geotagging and low-level image features.
Given that I create three sets using:
set_sentence1 = set(sentence1.split())
set_sentence2 = set(sentence2.split())
set_sentence3 = set(sentence3.split())
And compute the intersection between sets as:
intersection1 = set_sentence1.intersection(set_sentence3)
intersection2 = set_sentence2.intersection(set_sentence3)
How can I efficiently normalize the comparison?
I don't want to use Levenshtein distance, since I'm not interested in string similarity but in set similarity.
Maybe a metric like:
Similarity1 = (1.0 + len(intersection1))/(1.0 + max(len(set_sentence1), len(set_sentence3)))
Similarity2 = (1.0 + len(intersection2))/(1.0 + max(len(set_sentence2), len(set_sentence3)))
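For the example sentences above, a quick check by hand gives the same score for both queries (note that, because of the +1 smoothing, this metric never reaches 0 even for completely disjoint sets):
# intersection1 = {'dog', 'is', 'the'} -> 3 words, max(3, 7) = 7
Similarity1 = (1.0 + 3) / (1.0 + 7)   # 0.5
# intersection2 = {'the', 'dog', 'is'} -> 3 words, max(7, 7) = 7
Similarity2 = (1.0 + 3) / (1.0 + 7)   # 0.5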
Have you tried difflib?
Example from the docs:
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
... sys.stdout.write(line)
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
guido
--- 1,4 ----
! python
! eggy
! hamster
guido
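If what you want from difflib is a single normalized score rather than a diff, SequenceMatcher.ratio() already returns a value in [0, 1]. A minimal sketch, applied here to word lists (note this is sequence similarity, so word order still matters, unlike the set similarity you asked for):
from difflib import SequenceMatcher

words1 = "dog is the".split()
words3 = "the dog is running in your garden".split()

# ratio() = 2*M / (len(a) + len(b)), where M is the number of matched elements
score = SequenceMatcher(None, words1, words3).ratio()
print(score)  # 0.4 for these two word lists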
We can try Jaccard similarity: len(A intersection B) / len(A union B). More info at https://en.wikipedia.org/wiki/Jaccard_index
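A minimal sketch on the word sets already built above; the score is naturally bounded between 0 and 1:
def jaccard(a, b):
    # |A intersection B| / |A union B|; by convention two empty sets count as identical
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

similarity1 = jaccard(set_sentence1, set_sentence3)  # 3/7  ~ 0.43
similarity2 = jaccard(set_sentence2, set_sentence3)  # 3/11 ~ 0.27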
I have been racking my brain on this for a week.
I want to
run NMF topic modeling
Assign each document a topic by looking at the maximum of weights,
Graph this distribution as a % bar chart using matplotlib (i.e. topics on the x axis, and the % of documents assigned to each topic on the y axis).
Here is some toy data and completing steps 1 and 2:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pandas as pd
# Get data
data = {
    "Documents": ["I am a document",
                  "And me too",
                  "The cat is big",
                  "The dog is big",
                  "My headphones are large",
                  "My monitor has rabies",
                  "My headphones are loud",
                  "The street is loud"]
}
df = pd.DataFrame(data)
# Fit a TFIDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df['Documents'])
# Run NMF
nmf_model = NMF(n_components=4, random_state=1).fit(tfidf)
# Weights
W = nmf_model.transform(tfidf)
# Topics
H = nmf_model.components_
Now here is how I can assign a document to a topic:
# Will return document topics as list like [1, 4, 1...] to
# represent that the first document is topic 1, the second 4, and so on.
topics = pd.DataFrame(W).idxmax(axis=1, skipna=True).tolist()
Alright, now I should be able to get what I want with these two structures, but I am at a loss.
Looks like a use case for a Counter().
I'd write something like this:
from collections import Counter
mylist = [1,1,1,1,2,2,3,1,1,2,3,1,1,1]
mycount = Counter(mylist)
for key, value in mycount.items():
    print(key, value)
This outputs your topics in the following structure:
1 9
2 3
3 2
One thing to note about latent Dirichlet allocation / non-negative matrix factorization is that the entire point is that a sentence is composed of multiple topics. Taking the maximum weight to assign each document to a single topic may defeat the purpose. You may also want to consider how to deal with nonsense sentences, as your algorithm will currently auto-assign them to a topic.
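If you want to go straight from that Counter to the bar chart in step 3, a minimal matplotlib sketch (assuming topics is the list you built with idxmax above):
import matplotlib.pyplot as plt
from collections import Counter

topic_counts = Counter(topics)
total = sum(topic_counts.values())

labels = sorted(topic_counts)                                # topic ids on the x axis
heights = [100.0 * topic_counts[t] / total for t in labels]  # % of documents per topic

plt.bar(labels, heights)
plt.xlabel("Topic")
plt.ylabel("% of documents")
plt.show()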
IIUC, you want to draw a bar chart, so don't convert topics into a list:
import matplotlib.pyplot as plt

topics = pd.DataFrame(W).idxmax(axis=1, skipna=True)
counts = topics.value_counts().sort_index()                      # documents per topic
plt.bar(x=counts.index, height=counts.mul(100) / counts.sum())   # as a percentage
plt.show()
gives a bar chart of the topic distribution.
According to https://code.google.com/archive/p/word2vec/:
It was recently shown that the word vectors capture many linguistic
regularities, for example vector operations vector('Paris') -
vector('France') + vector('Italy') results in a vector that is very
close to vector('Rome'), and vector('king') - vector('man') +
vector('woman') is close to vector('queen') [3, 1]. You can try out a
simple demo by running demo-analogy.sh.
So we can try from the supplied demo script:
+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin
Word: paris Position in vocabulary: 198365
Word: france Position in vocabulary: 225534
Word: berlin Position in vocabulary: 380477
Word Distance
------------------------------------------------------------------------
germany 0.509434
european 0.486505
Please note that paris france berlin is the input hint the demo suggests. The problem is that I'm unable to reproduce this behavior if I open the same word vectors in Gensim and try to compute the vectors myself. For example:
>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]
So, what is the word analogy actually doing? How should I reproduce it?
It should be just element-wise addition and subtraction of vectors, and cosine distance to find the most similar ones.
However, if you use the original word2vec embeddings, there is a difference between "paris" and "Paris" (the strings were not lowercased or lemmatised).
You may also try:
v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']
or
v = word_vectors['Paris'] - word_vectors['France'] + word_vectors['Germany']
because you should compare identical concepts (city - country + country -> another city)
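To make the "cosine distance over the result vector" part concrete, here is a rough sketch of doing the ranking by hand with numpy (assuming word_vectors is the loaded KeyedVectors; index_to_key is the gensim 4.x attribute, older versions call it index2word). Skipping the three query words mirrors what the original demo does:
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']

# Rank the whole vocabulary by cosine similarity to v, skipping the query words.
# (Slow but explicit; fine for a one-off check.)
query = {'France', 'Paris', 'Berlin'}
candidates = (w for w in word_vectors.index_to_key if w not in query)
top10 = sorted(candidates, key=lambda w: cosine(v, word_vectors[w]), reverse=True)[:10]
print(top10)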
You should be clear about exactly which word-vector set you're using: different sets will have a different ability to perform well on analogy tasks. (Those trained on the tiny text8 dataset might be pretty weak; the big GoogleNews set Google released would probably do well, at least under certain conditions like discarding low-frequency words.)
You're doing the wrong arithmetic for the analogy you're trying to solve. For an analogy "A is to B as C is to ?" often written as:
A : B :: C : _?_
You begin with 'B', subtract 'A', then add 'C'. So the example:
France : Paris :: Italy : _?_
...gives the formula in your excerpted text:
wv('Paris') - wv('France') + wv('Italy') = target_coordinates  # close-to wv('Rome')
And to solve instead:
Paris : France :: Berlin : _?_
You would try:
wv('France') - wv('Paris') + wv('Berlin') = target_coordinates
...then see what's closest to target_coordinates. (Note the difference in operation-ordering to your attempt.)
You can think of it as:
start at a country-vector ('France')
subtract the (country&capital)-vector ('Paris'). This leaves you with an interim vector that's, sort-of, "zero" country-ness, and "negative" capital-ness.
add another (country&capital)-vector ('Berlin'). This leaves you with a result vector that's, again sort-of, "one" country-ness, and "zero" capital-ness.
Note also that gensim's most_similar() takes multiple positive and negative word-examples, to do the arithmetic for you. So you can just do:
sims = word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'])
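For what it's worth, when most_similar() is given word keys like this, gensim works with unit-normalized vectors and drops the query words from the results, which is much of why your raw-vector call returned 'berlin' and 'paris' at the top. A quick usage sketch (the actual neighbours depend on which vector set you load):
sims = word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'], topn=5)
for word, score in sims:
    print(word, round(score, 3))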
The code below finds the top 150 words that appear most often in two strings.
import re
from collections import Counter

# p and n are the two input strings
pwords = re.findall(r'\w+', p)
ptop150words = Counter(pwords).most_common(150)
ptop150words = sorted(ptop150words)
nwords = re.findall(r'\w+', n)
ntop150words = Counter(nwords).most_common(150)
ntop150words = sorted(ntop150words)
The code below is meant to remove the common words that appear in both strings.
def new(ntopwords, ptopwords):
    for i in ntopwords[:]:
        if i in potopwords:
            ntopwords.remove(i)
            ptopwords.remove(i)
print(i)
However, there is no output from print(i). What is wrong?
Most likely your indentation.
def new(negativetop150words, positivetop150words):
    for i in negativetop150words[:]:
        if i in positivetop150words:
            negativetop150words.remove(i)
            positivetop150words.remove(i)
            print(i)
You could rely on set methods. Once you have both lists, you convert them to sets. The common subset is the intersection of the 2 sets, and you can simply take the difference from both original sets:
positiveset = set(positivewords)
negativeset = set(negativewords)
commons = positiveset & negativeset
positivewords = sorted(positiveset - commons)
negativewords = sorted(negativeset - commons)
commonwords = sorted(commons)
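One wrinkle if you start from the most_common(150) output: those lists hold (word, count) tuples, so pull the words out before doing the set arithmetic. A small sketch, reusing pwords and nwords from the question:
ptop150 = {word for word, count in Counter(pwords).most_common(150)}
ntop150 = {word for word, count in Counter(nwords).most_common(150)}

commons = ptop150 & ntop150
positive_only = sorted(ptop150 - commons)
negative_only = sorted(ntop150 - commons)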
The code you posted does not call the function new(negativetop150words, positivetop150words). Also, per Jesse's comment, the print(i) command is outside the function. Here's the code that worked for me:
import re
from collections import Counter
def new(negativetop150words, positivetop150words):
    for i in negativetop150words[:]:
        if i in positivetop150words:
            negativetop150words.remove(i)
            positivetop150words.remove(i)
            print(i)
    return negativetop150words, positivetop150words
positive = 'The FDA is already fairly gung-ho about providing this. It receives about 1,000 applications a year and approves all but 1%. The agency makes sure there is sound science behind the request, and no obvious indication that the medicine would harm the patient.'
negative = 'Thankfully these irritating bits of bureaucracy have been duly dispatched. This victory comes courtesy of campaigning work by a libertarian think-tank, the Goldwater Institute, based in Arizona. It has been pushing right-to-try legislation for around four years, and it can now be found in 40 states. Speaking about the impact of these laws on patients, Arthur Caplan, a professor of bioethics at NYU School of Medicine in New York, says he can think of one person who may have been helped.'
positivewords = re.findall(r'\w+', positive)
positivetop150words = Counter(positivewords).most_common(150)
sorted(positivetop150words)
negativewords = re.findall(r'\w+', negative)
negativetop150words = Counter(negativewords).most_common(150)
words = new(negativewords, positivewords)
This prints:
a
the
It
and
about
the
I have a question about using the Natural Language Toolkit (NLTK). I'm trying to make an app that translates a natural-language question into its logical representation and queries a database.
After using the simplify() method from the nltk.sem.logic package, I got the following expression:
exists z2.(owner(fido, z2) & (z0 = z2))
But what I need is to simplify it as follows:
owner(fido, z0)
Is there another method that could reduce the sentence as I want?
In NLTK, simplify() performs beta reduction (according to the book), which is not what you need. What you are asking for is only doable with theorem provers when you apply certain tactics; in this case, you either need to know what you expect to get at the end, or you need to know what kinds of axioms can be applied to reach such a result.
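To see what simplify() does do, here is beta reduction in isolation, which is all it performs; a small sketch with NLTK's logic parser:
from nltk.sem import Expression

read_expr = Expression.fromstring

# simplify() only reduces lambda applications:
print(read_expr(r'(\x.owner(fido, x))(z0)').simplify())
# owner(fido,z0)

# It leaves the existential/equality pattern from the question untouched:
print(read_expr(r'exists z2.(owner(fido, z2) & (z0 = z2))').simplify())
# exists z2.(owner(fido,z2) & (z0 = z2))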
The theorem prover in NLTK is Prover9 which provides tools to check entailment relations. Basically, you can only check if there is a proof with a limited number of steps from a list of expressions (premises) to a goal expression. In your case for example, this was the result:
============================== PROOF =================================
% -------- Comments from original proof --------
% Proof 1 at 0.00 (+ 0.00) seconds.
% Length of proof is 8.
% Level of proof is 4.
% Maximum clause weight is 4.
% Given clauses 0.
1 (exists x (owner(fido,x) & y = x)) # label(non_clause). [assumption].
2 owner(fido,x) # label(non_clause) # label(goal). [goal].
3 owner(fido,f1(x)). [clausify(1)].
4 x = f1(x). [clausify(1)].
5 f1(x) = x. [copy(4),flip(a)].
6 -owner(fido,c1). [deny(2)].
7 owner(fido,x). [back_rewrite(3),rewrite([5(2)])].
8 $F. [resolve(7,a,6,a)].
============================== end of proof ==========================
In NLTK python:
from nltk import Prover9
from nltk.sem import Expression
read_expr = Expression.fromstring
p1 = read_expr('exists z2.(owner(fido, z2) & (z0 = z2))')
c = read_expr('owner(fido, z0)')
result = Prover9().prove(c, [p1])
print(result)
# returns True
UPDATE
In case you insist on using the tools available in Python and want to manually check this particular pattern with regular expressions, you can probably do something like this (I don't approve, but let's try my nasty tactic):
import re

def my_nasty_tactic(exp):
    # pull out the existentially quantified variable (e.g. 'z2')
    parameter = re.findall(r'exists ([^.]*)\..*', exp)
    if len(parameter) == 1:
        parameter = parameter[0]
        # find the variable it is equated with (e.g. 'z0')
        substitution = re.findall(r'&[ ]*\([ ]*([^ ]+)[ ]*=[ ]*' + parameter + r'[ ]*\)', exp)
        if len(substitution) == 1:
            substitution = substitution[0]
            # rewrite 'exists z2.(...)' as a lambda abstraction and drop the equality conjunct
            exp_abs = re.sub(r'exists(?= [^.]*\..*)', r'\ ', exp)
            exp_abs = re.sub(r'&[ ]*\([ ]*' + substitution + '[ ]*=[ ]*' + parameter + r'[ ]*\)', '', exp_abs)
            return read_expr('(%s)(%s)' % (exp_abs, substitution)).simplify()
Then you can use it like this:
my_nasty_tactic('exists z2.(owner(fido, z2) & (z0 = z2))')
# <ApplicationExpression owner(fido,z0)>
I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,
Term              Group
NBA Basketball    1
Basketball NBA    1
Basketball        1
Baseball          2
It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought NLTK might have something along those lines, but I'm only barely familiar with it.
Thanks
You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).
nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.
import sys
import operator

def tokenize(s, glen):
    # all character n-grams of length glen in s
    g2 = set()
    for i in range(len(s) - (glen - 1)):
        g2.add(s[i:i + glen])
    return g2

def dice_grams(g1, g2):
    return (2.0 * len(g1 & g2)) / (len(g1) + len(g2))

def dice(n, s1, s2):
    return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
    GRAM_LEN = 4
    scores = {}
    for i in range(1, len(sys.argv)):
        for j in range(i + 1, len(sys.argv)):
            s1 = sys.argv[i]
            s2 = sys.argv[j]
            score = dice(GRAM_LEN, s1, s2)
            scores[s1 + ":" + s2] = score
    for item in sorted(scores.items(), key=operator.itemgetter(1)):
        print(item)

if __name__ == "__main__":
    main()
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)
At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.
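If you go the threshold route, here is a rough sketch of the grouping step on top of the dice() function above (the 0.5 cutoff is just a guess you would tune):
def group_terms(terms, threshold=0.5, gram_len=4):
    # greedy single pass: join the first group whose representative is similar enough,
    # otherwise start a new group
    groups = []
    for term in terms:
        for group in groups:
            if dice(gram_len, term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

terms = ["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"]
for i, group in enumerate(group_terms(terms), start=1):
    print(i, group)
# 1 ['NBA Basketball', 'Basketball NBA', 'Basketball']
# 2 ['Baseball']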