Is there a way to capture WordNet selectional restrictions (such as +animate, +human, etc.) from synsets through NLTK?
Or is there any other way of providing semantic information about a synset? The closest I could get were hypernym relations.
It depends on what you mean by "selectional restrictions", which I would call semantic features. In classical semantics there is a world of concepts, and to compare concepts we have to find
discriminating features (i.e. features of the concepts that are used to distinguish them from each other) and
similarity features (i.e. features that the concepts have in common).
For example:
Man is [+HUMAN], [+MALE], [+ADULT]
Woman is [+HUMAN], [-MALE], [+ADULT]
[+HUMAN] and [+ADULT] = similarity features
[+-MALE] is the discriminating feature
The common problem of traditional semantics, and of applying this theory in computational semantics, is the question of
"Is there a specific list of features that we can use to compare any concepts?"
"If so, what are the features on this list?"
(see www.acl.ldc.upenn.edu/E/E91/E91-1034.pdf for more details)
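As a toy illustration of that comparison, the features can be modelled as plain sets (the feature names are just the ones from the Man/Woman example above):

man   = {"+HUMAN", "+MALE", "+ADULT"}
woman = {"+HUMAN", "-MALE", "+ADULT"}

similarity_features = man & woman        # features the concepts share
discriminating_features = man ^ woman    # features that tell them apart

print(similarity_features)      # {'+HUMAN', '+ADULT'} (set order may vary)
print(discriminating_features)  # {'+MALE', '-MALE'}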
Getting back to WordNet, I can suggest two methods to resolve the "selectional restrictions".
First, check the hypernyms for discriminating features, but first you must decide what the discriminating features are. To differentiate an animal from a human, let's take the discriminating features as [+-human] and [+-animal].
from nltk.corpus import wordnet as wn

# Concepts to compare
dog_sense = wn.synsets('dog')[0]  # It's http://goo.gl/b9sg9X
jb_sense = wn.synsets('James_Baldwin')[0]  # It's http://goo.gl/CQQIG9

# Access the first hypernym path with hypernym_paths()[0].
# It's a little odd that hypernym_paths() gives a list of lists rather than a list,
# but nevertheless it works.
dog_hypernyms = dog_sense.hypernym_paths()[0]
jb_hypernyms = jb_sense.hypernym_paths()[0]

# Discriminating features in terms of concepts in WordNet
human = wn.synset('person.n.01')   # i.e. [+human]
animal = wn.synset('animal.n.01')  # i.e. [+animal]

try:
    assert human in jb_hypernyms and animal not in jb_hypernyms
    print "James Baldwin is human"
except AssertionError:
    print "James Baldwin is not human"

try:
    assert animal in dog_hypernyms and human not in dog_hypernyms
    print "Dog is an animal"
except AssertionError:
    print "Dog is not an animal"
Second, check similarity measures, as @Jacob suggested.
dog_sense = wn.synsets('dog')[0]  # It's http://goo.gl/b9sg9X
jb_sense = wn.synsets('James_Baldwin')[0]  # It's http://goo.gl/CQQIG9

# Features to check when deciding whether the 'dubious' concept is a human or an animal
human = wn.synset('person.n.01')   # i.e. [+human]
animal = wn.synset('animal.n.01')  # i.e. [+animal]

if dog_sense.wup_similarity(animal) > dog_sense.wup_similarity(human):
    print "Dog is more of an animal than a human"
elif dog_sense.wup_similarity(animal) < dog_sense.wup_similarity(human):
    print "Dog is more of a human than an animal"
You could try using some of the similarity functions with handpicked synsets, and use that to filter. But it's essentially the same as following the hypernym tree - as far as I know, all the WordNet similarity functions use hypernym distance in their calculations. Also, there are a lot of optional attributes of a synset that might be worth exploring, but their presence can be very inconsistent.
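One synset property that is easy to inspect (not mentioned above, just an illustration) is the lexicographer file name, which gives a coarse semantic class such as noun.person or noun.animal. A minimal sketch, using the method names of NLTK 3:

from nltk.corpus import wordnet as wn

for word in ('dog', 'James_Baldwin', 'hammer'):
    sense = wn.synsets(word)[0]
    # lexname() is the WordNet lexicographer file, e.g. 'noun.animal' or 'noun.person'
    print(sense.name(), sense.lexname(), sense.definition())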
I have a bunch of text samples. Each sample has a different length, but all of them are longer than 200 characters. I need to split each sample into substrings of approximately 50 characters. To do so, I found this approach:
import re

def chunkstring(string, length):
    return re.findall('.{%d}' % length, string)
However, it splits the text in the middle of words. For example, the phrase "I have <...> icecream. <...>" can be split into "I have <...> icec" and "ream. <...>".
This is the sample text:
This paper proposes a method that allows non-parallel many-to-many
voice conversion by using a variant of a generative adversarial
network called StarGAN.
I get this result:
['This paper proposes a method that allows non-paral',
'lel many-to-many voice conversion by using a varia',
'nt of a generative adversarial network called Star']
But ideally I would like to get something similar to this result:
['This paper proposes a method that allows non-parallel',
'many-to-many voice conversion by using a variant',
'of a generative adversarial network called StarGAN.']
How could I adjust the above-given code to get the desired result?
To me this sounds like a task for the built-in textwrap module. Example using your data:
import textwrap
text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
print(textwrap.fill(text,55))
output
This paper proposes a method that allows non-parallel
many-to-many voice conversion by using a variant of a
generative adversarial network called StarGAN.
You will probably need a few trials to find the width value that suits your needs best. If you need a list of strs, use textwrap.wrap instead, i.e. textwrap.wrap(text, 55).
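For example, textwrap.wrap returns the chunks as a list, which matches the desired output format from the question:

import textwrap

text = ("This paper proposes a method that allows non-parallel many-to-many "
        "voice conversion by using a variant of a generative adversarial "
        "network called StarGAN.")

print(textwrap.wrap(text, 55))
# ['This paper proposes a method that allows non-parallel',
#  'many-to-many voice conversion by using a variant of a',
#  'generative adversarial network called StarGAN.']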
You can use .{0,50}\S* in order to keep matching any further non-space characters (\S).
I specified 0 as the lower bound since otherwise you'd risk missing the last substring.
See a demo here.
EDIT:
For excluding the trailing empty chunk, use .{1,50}\S*, in order to force it to match at least one character.
If you also want to automatically strip the side spaces, use \s*(.{1,50}\S*).
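A quick sketch of how the suggested patterns could be applied with re.findall, using the example text from the question:

import re

text = ("This paper proposes a method that allows non-parallel many-to-many "
        "voice conversion by using a variant of a generative adversarial "
        "network called StarGAN.")

# \s*(.{1,50}\S*): skip leading spaces, take up to 50 characters,
# then keep matching non-space characters so words are not cut in half.
chunks = re.findall(r'\s*(.{1,50}\S*)', text)
print(chunks)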
def nearestDelimiter(txt, cur):
    # Walk backwards from position cur until a delimiter is found
    delimiters = " ;:.!?-—"
    if txt[cur] in delimiters:
        return cur
    else:
        i = cur
        while i >= 0:
            if txt[i] in delimiters:
                return i
            i = i - 1
        return 0

def splitText(sentence, chunkLength):
    # Cut the sentence into chunks of roughly chunkLength characters,
    # breaking only at delimiters
    cursor = 0
    curlng = chunkLength
    lst = []
    while curlng < len(sentence):
        curlng = nearestDelimiter(sentence, curlng)
        substr = (sentence[cursor:curlng]).strip()
        cursor = curlng
        curlng = (cursor + chunkLength) if (cursor + chunkLength < len(sentence)) else len(sentence)
        lst.append(substr)
    lst.append((sentence[cursor:curlng]).strip())
    return lst

txt = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
cvv = splitText(txt, 50)
for cv in cvv:
    print(cv)
I'm trying to find a way to measure the accuracy of my recommender system. The method I used was to create a KNN model based on a User x Movies matrix (where the entries are the ratings that a given user gave to a given movie). Based on that model, I have a function where I can input a movie title and it returns the K most similar movies to the one I used as input. Having that, I don't know how to measure whether my model is accurate and whether the movies shown are really similar to the one I used as input. Any ideas?
Here is a sample of the dataset I'm using
from scipy import sparse
from sklearn.neighbors import NearestNeighbors

def create_sparse_matrix(df):
    sparse_matrix = sparse.csr_matrix((df["rating"], (df["userId"], df["movieId"])))
    return sparse_matrix

# getting the transpose - data_cf is the DataFrame name that I'm using
user_movie_matrix = create_sparse_matrix(data_cf).transpose()

knn_cf = NearestNeighbors(n_neighbors=N_NEIGHBORS, algorithm='auto', metric='cosine')
knn_cf.fit(user_movie_matrix)
# Creating a function to get movie recommendations based on a movie input.
def get_recommendations_cf(movie_name, model):
    # Getting the ID of the movie based on its title
    movieId = data_cf.loc[data_cf["title"] == movie_name]["movieId"].values[0]

    distances, suggestions = model.kneighbors(
        user_movie_matrix.getrow(movieId).todense().tolist(), n_neighbors=10)

    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recomendações para {0}: \n'.format(movie_name))
        else:
            print('{0}: {1}, com distância de {2}:'.format(
                i,
                data_cf.loc[data_cf["movieId"] == suggestions.flatten()[i]]["title"].values[0],
                distances.flatten()[i]))
    return distances, suggestions
Calling the recommender function and showing the "distance" of each movie recommended
Translating:
"Recomendações para Spider-Man 2: " = "Recommendations for Spider-Man 2: "
"1: Spider-Man, com distância de 0.30051949781903664" = "1: Spider-Man, with distance of 0.30051949781903664"
...
"9: Finding Nemo, com distância de 0.4844064554284505:" = "9: Finding Nemo, with distance of 0.4844064554284505:"
When it comes to recommendation systems, measuring performance is never a straightforward task. That is because there are many desirable characteristics that we are looking for in a recommendation: accuracy, diversity, novelty, and so on, all of which can be measured in some way or another. There are many very helpful articles on the web that cover the topic. I will link a few references that deal specifically with precision:
https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54
https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
Bear in mind that to do any sort of evaluation you need to split your data into a train and a test set. In the case of recommender systems, since all users and all items must be represented in both the train and test sets, you must use a stratified approach. That means that you should set aside a percentage of the movie reviews for each user instead of simply sampling lines of your dataset, as sketched below.
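A minimal sketch of such a per-user split, assuming a pandas DataFrame with userId, movieId and rating columns like data_cf in the question (the 20% test fraction and the helper name per_user_split are arbitrary choices for illustration):

import pandas as pd

def per_user_split(ratings, test_frac=0.2, seed=42):
    # Hold out a fraction of each user's ratings as the test set
    test = (ratings
            .groupby("userId", group_keys=False)
            .apply(lambda user_rows: user_rows.sample(frac=test_frac, random_state=seed)))
    train = ratings.drop(test.index)
    return train, test

# Usage: train_df, test_df = per_user_split(data_cf)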
How can I get the words of each cluster?
I divided the texts into groups:
import gensim
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

all_content_train = []
j = 0
for em in train['KARMA'].values:
    all_content_train.append(LabeledSentence1(em, [j]))
    j += 1
print('Number of texts processed: ', j)

d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500,
                    workers=7, dm=1, alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10,
                start_alpha=0.002, end_alpha=-0.016)

kmeans_model = KMeans(n_clusters=10, init='k-means++', max_iter=100)
X = kmeans_model.fit(d2v_model.docvecs.doctag_syn0)
labels = kmeans_model.labels_.tolist()
l = kmeans_model.fit_predict(d2v_model.docvecs.doctag_syn0)
pca = PCA(n_components=2).fit(d2v_model.docvecs.doctag_syn0)
datapoint = pca.transform(d2v_model.docvecs.doctag_syn0)
I can get the text and its cluster, but how can I learn which words mainly created those groups?
It's not an inherent feature of Doc2Vec to list words most-related to any document or doc-vector. (Other algorithms, such as LDA, will offer that.)
So, you could potentially write your own code, once you've split your documents into clusters, to report the words that are "most over-represented" in each cluster.
For example, calculate every word's frequency in the entire corpus, then each word's frequency in each cluster. For each cluster, report the N words whose in-cluster-frequency is the largest multiple of the full-corpus-frequency. Would this give helpful results on your data, for your needs? You'd have to try it.
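A rough sketch of that idea, assuming each document is available as a list of tokens along with its cluster label (the names docs_tokens and labels are placeholders, not from the question):

from collections import Counter

def top_overrepresented_words(docs_tokens, labels, n=10, min_count=5):
    # For each cluster, report the words whose in-cluster frequency is the
    # largest multiple of their full-corpus frequency.
    corpus_counts = Counter(w for doc in docs_tokens for w in doc)
    total = sum(corpus_counts.values())

    clusters = {}
    for doc, label in zip(docs_tokens, labels):
        clusters.setdefault(label, Counter()).update(doc)

    report = {}
    for label, counts in clusters.items():
        cluster_total = sum(counts.values())
        scores = {w: (c / cluster_total) / (corpus_counts[w] / total)
                  for w, c in counts.items()
                  if corpus_counts[w] >= min_count}  # ignore very rare words
        report[label] = sorted(scores, key=scores.get, reverse=True)[:n]
    return report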
Separately, regarding your use of Doc2Vec:
there's no good reason to alias the existing class TaggedDocument to a strange class name like LabeledSentence1. Just use TaggedDocument directly.
if you supply your corpus, all_content_train, to the object initialization – as your code does – then you don't need to also call train(). Training will have already happened automatically. If you do want more than the default amount of training (epochs=5), just supply a larger epochs value to the initialization.
the learning-rate values you've supplied to train() – start_alpha=0.002, end_alpha=-0.016 – are nonsensical & destructive. Few users should need to tinker with these alpha values at all, but especially, they should never increase from the beginning to end of a training cycle, as these values do.
If you were running with logging enabled at the INFO level, and/or watching the output closely, you would likely see readouts and warnings indicating that excessive training was happening, or problematic values used.
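Putting those notes together, a minimal sketch of the suggested setup, reusing the parameter values from the question (minus the alpha tinkering) and the gensim 3.x-style names the question already uses:

import logging
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(level=logging.INFO)  # show progress readouts and warnings

# Use TaggedDocument directly instead of aliasing it
all_content_train = [TaggedDocument(words=em, tags=[j])
                     for j, em in enumerate(train['KARMA'].values)]

# Supplying the corpus at initialization trains the model; no separate
# train() call and no hand-tuned start_alpha/end_alpha are needed.
d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500,
                    workers=7, dm=1, epochs=10)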
For a project I would like to measure the number of ‘human-centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.
I came up with the following (simple) piece of code:
from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]
hypernyms = word_synsets.hypernym_paths()[0]
for element in hypernyms:
    print element
Results in:
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')
My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:
count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print 'found person hypernym'
        count_humancenteredness += 1
I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the count of human centeredness) when a word does indeed belong to the ‘person’ or ‘human’ synset?
Secondly, is this an efficient approach? I assume that iterating over several texts and over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.
Thanks for your help!
wrt the error message
hypernyms = word_synsets.hypernym_paths() returns a list of lists of Synsets.
Hence
if element == 'person':
tries to compare a Synset object against a string. That kind of comparison is not supported by the Synset class.
Try something like
target_synsets = wn.synsets('person')
if element in target_synsets:
...
or
if u'person' in element.lemma_names():
...
instead.
wrt efficiency
Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.
To speed up the lookup, you can pre-compile a list of "person related" words in advance by making use of the transitive closure over the hyponyms as explained here.
Something like
p = wn.synset('person.n.01')
person_words = set(w for s in p.closure(lambda s: s.hyponyms()) for w in s.lemma_names())
should do the trick. This will return a set of ~ 10,000 words, which is not too much to store in main memory.
A simple version of the word counter then becomes something along the lines of
from collections import Counter

word_count = Counter()
for word in (w.lower() for w in words if w in person_words):
    word_count[word] += 1
You might also need to pre-process the input words using stemming or other morphologic reductions before passing the words on to WordNet, though.
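For instance, a small sketch of such a preprocessing step using NLTK's WordNet lemmatizer (whether it is needed depends on your tokenizer and data):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

raw_words = ['Girlfriends', 'children', 'dogs']
# Reduce plural nouns to their base form before the person_words lookup
words = [lemmatizer.lemmatize(w.lower(), pos='n') for w in raw_words]
print(words)  # ['girlfriend', 'child', 'dog']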
To get all the hyponyms of a synset, you can use the following function (tested with NLTK 3.0.3, dhke's closure trick doesn't work on this version):
def get_hyponyms(synset):
    hyponyms = set()
    for hyponym in synset.hyponyms():
        hyponyms |= set(get_hyponyms(hyponym))
    return hyponyms | set(synset.hyponyms())
Example:
from nltk.corpus import wordnet
food = wordnet.synset('food.n.01')
print(len(get_hyponyms(food))) # returns 1526
I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,
Term            Group
NBA Basketball  1
Basketball NBA  1
Basketball      1
Baseball        2
It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought the nltk may have something along those lines, but I'm only barely familiar with it.
Thanks
You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).
nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.
import sys, operator

def tokenize(s, glen):
    # Build the set of character n-grams of length glen
    g2 = set()
    for i in xrange(len(s) - (glen - 1)):
        g2.add(s[i:i+glen])
    return g2

def dice_grams(g1, g2): return (2.0 * len(g1 & g2)) / (len(g1) + len(g2))

def dice(n, s1, s2): return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
    GRAM_LEN = 4
    scores = {}
    # Score every pair of command-line arguments against each other
    for i in xrange(1, len(sys.argv)):
        for j in xrange(i + 1, len(sys.argv)):
            s1 = sys.argv[i]
            s2 = sys.argv[j]
            score = dice(GRAM_LEN, s1, s2)
            scores[s1 + ":" + s2] = score
    for item in sorted(scores.iteritems(), key=operator.itemgetter(1)):
        print item

if __name__ == "__main__":
    main()
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)
At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.
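As a rough sketch of the threshold idea, reusing the dice() function defined above (the 0.5 cutoff and the greedy first-match grouping are illustrative assumptions, not tuned choices):

def group_terms(terms, threshold=0.5, gram_len=4):
    # Greedily assign each term to the first existing group whose
    # representative term is similar enough, otherwise start a new group.
    groups = []
    for term in terms:
        for group in groups:
            if dice(gram_len, term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

print(group_terms(["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"]))
# e.g. [['NBA Basketball', 'Basketball NBA', 'Basketball'], ['Baseball']]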