Counting occurrences in string column - python

Within a dataframe I have a variable containing different abstracts of academic literature. Below you find an example of the first 3 observations:
abstract = ['Word embeddings are an active topic in the NLP', 'We propose a new shared task for tactical data', 'We evaluate a semantic parser based on a character']
I want to split the sentences in this variable into separate words and remove any periods ('.').
The line of code in this case should return the following list:
abstractwords = ['Word', 'embeddings', 'are', 'an', 'active', 'topic', 'in', 'the', 'NLP', 'We', 'propose', 'a', 'new', 'shared', 'task', 'for', 'tactical', 'data', 'We', 'evaluate', 'a', 'semantic', 'parser', 'based', 'on', 'a', 'character']

You can use nested list comprehension:
abstract = ['Word embeddings are an active topic in the NLP.', 'We propose a new shared task for tactical data.', 'We evaluate a semantic parser based on a character.']
words = [word.strip('.') for sentence in abstract for word in sentence.split()]
print(words)
# ['Word', 'embeddings', 'are', 'an', 'active', 'topic', 'in', 'the', 'NLP', 'We', 'propose', 'a', 'new', 'shared', 'task', 'for', 'tactical', 'data', 'We', 'evaluate', 'a', 'semantic', 'parser', 'based', 'on', 'a', 'character']
If you want to remove '.' in the middle of the words as well, use word.replace('.', '') instead.
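For example, the same comprehension with replace, which removes periods anywhere in a word rather than only at the ends:
words = [word.replace('.', '') for sentence in abstract for word in sentence.split()]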

Use a for-each loop to go through the elements, replace "." with a space, split each sentence, and extend the result list:
abstractwords = []
for sentence in abstract:
    sentence = sentence.replace(".", " ")
    abstractwords.extend(sentence.split())
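For the sample abstract list above this yields the same flat word list as the comprehension approach:
print(abstractwords)
# ['Word', 'embeddings', 'are', 'an', 'active', 'topic', 'in', 'the', 'NLP', ...]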

Related

How can you combine a list of tokens and characters (including punctuation and symbols) into a single sentence string in python?

I'm trying to join a list of words and characters such as the one below (ls) and convert it into a single, correctly formatted sentence string (sentence), for a collection of such lists.
ls = ['"', 'Time', '"', 'magazine', 'said' 'the', 'film', 'was',
'"', 'a', 'multimillion', 'dollar', 'improvisation', 'that',
'does', 'everything', 'but', 'what', 'the', 'title', 'promises',
'"', 'and', 'suggested', 'that', '"', 'writer', 'George',
'Axelrod', '(', '"', 'The', 'Seven', 'Year', 'Itch', '"', ')',
'and', 'director', 'Richard', 'Quine', 'should', 'have', 'taken',
'a', 'hint', 'from', 'Holden', "'s", 'character', 'Richard',
'Benson', 'who', 'writes', 'his', 'movie', ',', 'takes', 'a',
'long', 'sober', 'look', 'at', 'what', 'he', 'has', 'wrought',
',', 'and', 'burns', 'it', '.', '"']
sentence = '"Time" magazine said the film was "a multimillion dollar improvisation that does everything but what the title promises" and suggested that "writer George Axelrod ("The Seven Year Itch") and director Richard Quine should have taken a hint from Holden's character Richard Benson who writes his movie, takes a long sober look at what he has wrought, and burns it."'
I've tried a rule-based approach that adds a space after an element depending on the contents of the next element, but my method ended up as a really long piece of code containing rules for as many cases as I could think of, such as those for parentheses or quotations. Is there a way to join this list into a correctly formatted sentence more efficiently?
I think a simple for loop should do the trick:
sentence = ""
for word in ls:
    if (word == ',' or word == '.') and sentence != '':
        sentence = sentence[:-1]  # removing the space added after the previous word
    sentence += word
    if word != '\"' and word != '(':
        sentence += ' '  # adding a space after each word
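If NLTK is available, its Treebank detokenizer is another option worth a look, since it already encodes many of these spacing rules; a minimal sketch (it may not reproduce the target string exactly, particularly around the bare " quote tokens):
from nltk.tokenize.treebank import TreebankWordDetokenizer

# join the token list back into a sentence using Treebank spacing rules
sentence = TreebankWordDetokenizer().detokenize(ls)
print(sentence)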

Missing last word in a sentence when using regular expression

Code:
import re

def main():
    a = ['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b = word_find(a)
    print(b)

def word_find(sentence_list):
    word_list = []
    word_reg = re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words = re.findall(word_reg, sentence_list[i])
        word_list.append(words)
    return word_list

main()
What I need is to break every word out into a single element of a list.
Right now the output looks like this:
[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]
I found that the last word of the first sentence, 'about', and of the second sentence, 'remarkable', is missing.
There might be some problem in my regular expression:
word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
But if I add a question mark to the last part of this regular expression, like this:
[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?")
the result becomes many single letters instead of words. What can I do about it?
Edit:
The reason why I didn't use string.split is that there might be many ways for people to separate words.
For example: when people input a--b there is no space, but we still have to break it into 'a', 'b'.
Using the right tools is always the winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do just that: break sentences into words.
import nltk
a = ['the mississippi is well worth reading about',
' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
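Note that word_tokenize relies on NLTK's "punkt" tokenizer models; if they are not already installed, a one-time download is needed:
import nltk
nltk.download('punkt')   # fetch the tokenizer models used by word_tokenize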
I suggest a simpler solution:
b = re.split(r"[\W_]", a[0])  # apply to each sentence string in turn
The regex [\W_] matches any single non-word character (non-letter, non-digit and non-underscore) plus the underscore, which is practically enough.
Your current regex requires that the word is followed by one of the characters in your list, but not "end of line", which can be matched with $.
You can use re.split and filter:
filter(None, re.split("[, \-!?:]+", a[0]))
Where I have put the string "[, \-!?:]+", you should put whatever characters are your delimiters. filter will just remove any empty strings caused by leading/trailing separators.
You can either find what you don't want and split on that:
>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
(You may need to filter the '' elements produced by re.split)
Or capture what you do want with re.findall and keep those elements:
>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
Thanks everybody.
From the other answers, the solution is to use re.split(),
and there is a SUPER STAR, NLTK, in the uppermost answer.
def word_find(sentence_list):
    word_list = []
    for i in range(len(sentence_list)):
        word_list.append(re.split(r'\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;', sentence_list[i]))
    return word_list
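As noted in the answers above, re.split can leave empty strings in the result (for example from the leading space in the second sentence); if that matters, a final filtering pass removes them:
words = word_find(a)
words = [[w for w in sentence_words if w] for sentence_words in words]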

getting a grammar to read more than one keyword in the text

I still consider myself a newbie to pyparsing. I threw together two quick grammars and neither succeeds at what I am trying to do. I am trying to come up with a grammar that seems really simple but turns out to be (at least for me) not so trivial. The language has one basic definition: it is broken down into keywords and body text. Bodies can span multiple lines. Keywords are found at the beginning of a line, within the first 20 chars or so, and are terminated with a ';' (no quotes). So I threw together a quick demo program so I could test a couple of grammars. However, when I try to use them, they always match the first keyword but none after that.
I've attached the source code as an example and the output that occurs. Even though this is just test code, out of habit I added documentation. In the example below the two keywords are NOW; and lastly; ideally I wouldn't want the semicolon included in the keyword.
Any ideas what I should do to make this work?
from pyparsing import *

def testString(text, grammar):
    """
    #summary: perform a test of a grammar
    #type text: text
    #param text: text buffer for input (a message to be parsed)
    #type grammar: MatchFirst or equivalent pyparsing construct
    #param grammar: some grammar defined somewhere else
    #type pgm: text
    #param pgm: typically name of the program, which invoked this function.
    #status: 20130802 CODED
    """
    print 'Input Text is %s' % text
    print 'Grammar is %s' % grammar
    tokens = grammar.parseString(text)
    print 'After parse string: %s' % tokens
    tokens.dump()
    tokens.keys()
    return tokens

def getText(msgIndex):
    """
    #summary: make a text string suitable for parsing
    #returns: returns a text buffer
    #type msgIndex: int
    #param msgIndex: a number corresponding to a text buffer to retrieve
    #status: 20130802 CODED
    """
    msg = ["""NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
""",
           '',
           ]
    return msg[msgIndex]

def getGrammar(grammarIndex):
    """
    #summary: make a grammar given an index
    #type grammarIndex: int
    #param grammarIndex: a number corresponding to the grammar to be retrieved
    #Note: a good run will return 2 keys: NOW: and lastly: and each key will have an associated body.
           The body is all words and text up to the next keyword or eof, whichever is first.
    """
    kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
    kw.setDebug(True)
    body1 = delimitedList(OneOrMore(Word(alphas + nums)) + ~kw)('Body')
    body1.setDebug(True)
    g1 = OneOrMore(Group(kw + body1))
    # ok, start defining a new grammar (borrow kw from the first grammar).
    body2 = SkipTo(~kw, include=False)('BODY')
    body2.setDebug(True)
    g2 = OneOrMore(Group(kw + body2))
    grammar = [g1,
               g2,
               ]
    return grammar[grammarIndex]

if __name__ == '__main__':
    # list indices [ text, grammar ]
    tests = {1: [0, 0],
             2: [0, 1],
             }
    check = tests.keys()
    check.sort()
    for testno in check:
        print 'STARTING Test %d' % testno
        text = getText(tests[testno][0])
        grammar = getGrammar(tests[testno][1])
        tokens = testString(text, grammar)
        print 'Tokens found %s' % tokens
        print 'ENDING Test %d' % testno
The output looks like this (using Python 2.7 and pyparsing 2.0.1):
STARTING Test 1
Input Text is NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
Grammar is {Group:({Combine:({W:(abcd...) ";"}) {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]...})}...
Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
Match {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... at loc 4(1,5)
Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
Matched {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... -> ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
After parse string: [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
Tokens found [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
ENDING Test 1
STARTING Test 2
Input Text is NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
Grammar is {Group:({Combine:({W:(abcd...) ";"}) SkipTo:(~{Combine:({W:(abcd...) ";"})})})}...
Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
Match SkipTo:(~{Combine:({W:(abcd...) ";"})}) at loc 4(1,5)
Match Combine:({W:(abcd...) ";"}) at loc 4(1,5)
Exception raised:Expected ";" (at char 7), (line:1, col:8)
Matched SkipTo:(~{Combine:({W:(abcd...) ";"})}) -> ['']
Match Combine:({W:(abcd...) ";"}) at loc 5(1,6)
Exception raised:Expected ";" (at char 7), (line:1, col:8)
After parse string: [['NOW;', '']]
Tokens found [['NOW;', '']]
ENDING Test 2
Process finished with exit code 0
I'm good with TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing just where the grammar is and what's going on with it. If I strip away all the extra machinery, I see your grammar is just:
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
g1 = OneOrMore(Group(kw + body1))
The first issue I see is your definition of body1:
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
You are on the right track with a negative lookahead, but for it to work in pyparsing, you have to put it at the beginning of the expression, not at the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword":
body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')
(Why is this a delimitedList, by the way? delimitedList is usually reserved for true lists with delimiters, such as comma-delimited arguments to a program function. All this does is accept any commas that might be mixed into the body, which should be handled more straightforwardly using a list of punctuation.)
Here is my test version of your code:
from pyparsing import *
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))
msg = [ """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
""",
'',
][0]
result = g1.parseString(msg)
# we expect multiple groups, each containing "KEY" and "Body" names,
# so iterate over groups, and dump the contents of each
for res in result:
    print res.dump()
I still get the same results as you, just the first keyword matches. So to see where the disconnect is happening, I use scanString, which returns not only the matched tokens, but also the start and end of the matched tokens:
result,start,end = next(g1.scanString(msg))
print len(msg),end
Which gives me:
320 161
So I see that we are ending at location 161 in a string whose total length is 320, so I'll add one more print statement:
print msg[end:end+10]
and I get:
.
lastly;
The trailing period in your body text is the culprit. If I remove that from message and try parseString again, I now get:
['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- KEY: NOW;
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- KEY: lastly;
If you want to handle punctuation, I suggest you add something like:
PUNC = oneOf(". , ? ! : & $")
and add it to body1:
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
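Putting those pieces together, a minimal sketch of the full grammar (same pyparsing 2.x setup as above; msg is the same test string, trailing period and all):
from pyparsing import *

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
PUNC = oneOf(". , ? ! : & $")
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
g1 = OneOrMore(Group(kw + body1))

# with punctuation accepted in the body, both keyword groups should come back
for res in g1.parseString(msg):
    print(res.dump())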

Technique to remove common words(and their plural versions) from a string

I am attempting to find tags(keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions and a short blurb.
What do you think would be the most efficient way to remove common words from the tag list?
By common words, I mean words like: 'the', 'at', 'there', 'their' etc.
I have two methodologies I can use. Which do you think is more efficient in terms of speed, and do you know of a better way I could do this?
Methodology 1:
- Determine the number of times each word occurs (using Counter from the collections library)
- Have a list of common words and remove all 'Common Words' from the Counter by attempting to delete each such key if it exists
- Therefore the speed will be determined by the length of the variable delims
from collections import Counter

delims = ['there', 'there\'s', 'theres', 'they', 'they\'re']
# the above will end up being a really long list!
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    del word_freq[delim]
return word_freq.most_common()
Methodology 2:
- For common words that can be plural, look at each word in the recipe string and check if it partially contains the non-plural version of a common word. E.g. for the string "There's a test", check each word to see if it contains "there" and delete it if it does.
delims = ['this', 'at', 'them']  # words that can't be plural
partial_delims = ['there', 'they']  # words that could occur in many forms
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    del word_freq[delim]
# really slow
for delim in set(partial_delims):
    for word in list(word_freq):
        if word.find(delim) != -1:
            del word_freq[word]
return word_freq.most_common()
I'd just do something like this:
from nltk.corpus import stopwords
s = set(stopwords.words('english'))
txt = "a long string of text about him and her"
print filter(lambda w: not w in s, txt.split())
which prints
['long', 'string', 'text']
and in terms of complexity should be O(n) in number of words in the string, if you believe the hashed set lookup is O(1).
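On Python 3, print is a function and filter returns a lazy iterator rather than a list, so the equivalent there would be:
print([w for w in txt.split() if w not in s])
# ['long', 'string', 'text']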
FWIW, my version of NLTK defines 127 stopwords:
'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'
obviously you can provide your own set; I'm in agreement with the comment on your question that it's probably easiest (and fastest) to just provide all the variations you want to eliminate up front, unless you want to eliminate a lot more words than this; but then it becomes more a question of spotting interesting ones than eliminating spurious ones.
Your problem domain is "Natural Language Processing".
If you don't want to reinvent the wheel, use NLTK, search for stemming in the docs.
Given that NLP is one of the hardest subjects in computer science, reinventing this wheel is a lot of work...
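For instance, a minimal sketch of the stemming idea (this assumes NLTK is installed and the 'stopwords' corpus has been downloaded via nltk.download('stopwords'); the helper name candidate_tags is just for illustration):
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def candidate_tags(text):
    # stem each word so singular/plural variants collapse to one form,
    # then drop anything in the stopword list
    stems = (stemmer.stem(w) for w in text.lower().split())
    return [w for w in stems if w not in stop]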
You ask about speed, but you should be more concerned with accuracy. Both your suggestions will make a lot of mistakes, removing either too much or too little (for example, there are a lot of words that contain the substring "at"). I second the suggestion to look into the nltk module. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only tools, but instruction on how to go about it.
Anyway you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.

from a list of strings, how do you get the strangest word/string in python

I have a list of strings:
['twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe', '"beware', 'the', 'jabberwock', 'my', 'son', 'the', 'jaws', 'that', 'bite', 'the', 'claws', 'that', 'catch', 'beware', 'the', 'jubjub', 'bird', 'and', 'shun', 'the', 'frumious', 'bandersnatch', 'he', 'took', 'his', 'vorpal', 'sword', 'in', 'hand', 'long', 'time', 'the', 'manxome', 'foe', 'he', 'sought', 'so', 'rested', 'he', 'by', 'the', 'tumtum', 'tree', 'and', 'stood', 'awhile', 'in', 'thought', 'and', 'as', 'in', 'uffish', 'thought', 'he', 'stood', 'the', 'jabberwock', 'with', 'eyes', 'of', 'flame', 'came', 'whiffling', 'through', 'the', 'tulgey', 'wood', 'and', 'burbled', 'as', 'it', 'came', 'one', 'two', 'one', 'two', 'and', 'through', 'and', 'through', 'the', 'vorpal', 'blade', 'went', 'snicker-snack', 'he', 'left', 'it', 'dead', 'and', 'with', 'its', 'head', 'he', 'went', 'galumphing', 'back', '"and', 'has', 'thou', 'slain', 'the', 'jabberwock', 'come', 'to', 'my', 'arms', 'my', 'beamish', 'boy', 'o', 'frabjous', 'day', 'callooh', 'callay', 'he', 'chortled', 'in', 'his', 'joy', '`twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe']
How do I return a list of the word(s) which are most different from the other words in the list, based on the minimum similarity with all other words, along with the average similarity value (as a float)?
I have absolutely no idea how to do this. I think I need to use the cossim(word1, word2) function, which calculates the similarity between word1 and word2 (we were given this function by our lecturer), but I do not know how to use it.
import math
from collections import defaultdict

def cossim(word1, word2):
    """Calculate the cosine similarity between the two words"""
    # sub-function for constructing a letter vector from argument `word`
    # which returns the tuple `(vec,veclen)`, where `vec` is a dictionary of
    # characters in `word`, and `veclen` is the length of the vector
    def wordvec(word):
        vec = defaultdict(int)  # letter vector
        # count the letters in the word
        for char in word:
            vec[char] += 1
        # calculate the length of the letter vector
        veclen = 0.0
        for char in vec:
            veclen += vec[char]**2
        # return the letter vector and vector length
        return vec, math.sqrt(veclen)
    # calculate a vector,length tuple for each of `word1` and `word2`
    vec1, len1 = wordvec(word1)
    vec2, len2 = wordvec(word2)
    # calculate the dot product between the letter vectors for the two words
    dotprod = 0.0
    for char in vec1:
        dotprod += vec1[char]*vec2[char]
    # divide by the lengths of the two vectors
    if dotprod:
        dotprod /= len1*len2
    return dotprod
The answer I should get from the list above is:
(['my'], 0.088487238234566931)
Any help is greatly appreciated,
Thanks,
Keely
The list of words needs to be deduplicated first before using an approach like the one Robert Rossney suggested. Otherwise the resulting number will be slightly off, because the same w can appear multiple times in one d[word].
One possible way to do this would be to create a set from the list:
set_of_words = set(mylist)
differences = {}
for word in set_of_words:
    differences[word] = [cossim(word, word2) for word2 in set_of_words if word != word2]
This creates a dictionary assigning to each word a list of differences to each other word.
Instead of assigning these lists directly to the dictionary entries, you could also save them in a variable within the loop and calculate the average like agf suggested in Robert's solution, using that variable.
The dictionary method iteritems lets you iterate over (key, value) pairs, and the min function has a special parameter key to specify what to minimize, for example key=lambda x: x[1] to compare by the second element of a tuple or list.
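A minimal sketch of that last step, reusing the differences dictionary built above (items() works here in both Python 2 and 3):
averages = {word: sum(vals) / len(vals) for word, vals in differences.items()}
# the "strangest" word is the one with the lowest average similarity to the rest
strangest, score = min(averages.items(), key=lambda item: item[1])
print([strangest], score)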
For a starting point, you probably want to construct a dictionary whose keys are the words in the list and whose values are all of the other words in the list:
d = {}
for word in mylist:
    d[word] = [w for w in mylist if w != word]
This gives you a quick way of computing the similarity values for each word:
similarities = {}
for word in mylist:
    similarities[word] = [cossim(w, word) for w in d[word]]
From that it's easy to calculate the minimum and average similarities for each word.
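A small sketch of that calculation, continuing from the similarities dictionary above:
min_sim = {word: min(vals) for word, vals in similarities.items()}
avg_sim = {word: sum(vals) / len(vals) for word, vals in similarities.items()}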
So the goal, if I understand correctly, is to find the word with the minimum sum of cossim with all of the other words. For that, the following code would suffice:
/* removed at the reasonable request of agf */
From a high-level perspective, what we're doing is looping through each word in your list, and checking to see how similar it is to all the other words. If it is less similar than any of the other words we've seen thus far, we store it. Our output is then the word with the lowest similarity with all other words.
I think the module Python-Levenshtein (pypi link) may help get the similarity of word1 and word2.
It has two useful functions:
import Levenshtein
str1 = 'abcde'
str2 = 'abcdf'
print(Levenshtein.distance(str1,str2))
# 1
print(Levenshtein.ratio(str1,str2))
# 0.8
That should be enough.
