Extract grocery list out of free text - python

I am looking for a python library / algorithm / paper to extract a list of groceries out of free text.
For example:
"One salad and two beers"
Should be converted to:
{'salad':1, 'beer': 2}

In [1]: from word2number import w2n
In [2]: print(w2n.word_to_num("One"))
1
In [3]: print(w2n.word_to_num("Two"))
2
In [4]: print(w2n.word_to_num("Thirty five"))
35
You can convert words to numbers with this package and implement the rest to fit your needs.
Install the package with:
pip install word2number
Update
You could implement it like this:
from word2number import w2n

result = {}
text = "One salad and two beers"
words = text.split()
for i, word in enumerate(words):
    try:
        number = w2n.word_to_num(word)  # raises ValueError for non-number words
    except ValueError:
        continue
    result[words[i + 1]] = number
Result
{'beers': 2, 'salad': 1}

I suggest using WordNet. You can call it from Java (the JWNL library), etc. Here is the suggestion: for each word, check its hypernyms. For edibles, at the top level of the hypernymy hierarchy you will find "food, nutrient", which is probably what you want. To test this, query the word "beer" in the online version of WordNet, click on the "S", and then click on "inherited hypernym". You will find this somewhere in the hierarchy:
....
S: (n) beverage, drink, drinkable, potable (any liquid suitable for drinking) "may I take your beverage order?"
S: (n) food, nutrient (any substance that can be metabolized by an animal to give energy and build tissue)
....
You can traverse this hierarchy using the programming language of your choice. Once you have flagged all the edibles, you can then catch the number, i.e. the 2 in "2 beers", and you have all the information you need. Note that catching the numbers can be a decent coding task by itself! Hope it helps!
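If you want to stay in Python, the same idea can be sketched with NLTK's WordNet interface instead of JWNL, combined with word2number from the answer above. This is only a rough sketch: the choice of food.n.01 as the anchor synset and the naive "pair each number with the next edible word" rule are assumptions, not a finished solution.
from nltk.corpus import wordnet as wn
from word2number import w2n

FOOD = wn.synset('food.n.01')  # the "food, nutrient" synset mentioned above

def is_edible(word):
    # A word counts as edible if any of its noun senses has food.n.01 on a hypernym path.
    for synset in wn.synsets(word, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            if FOOD in path:
                return True
    return False

def extract_groceries(text):
    # Naive pairing: each number word is attached to the next edible word after it.
    result = {}
    words = text.lower().split()
    for i, word in enumerate(words):
        try:
            count = w2n.word_to_num(word)  # raises ValueError for non-number words
        except ValueError:
            continue
        for later in words[i + 1:]:
            if is_edible(later):
                result[later] = count
                break
    return result

print(extract_groceries("One salad and two beers"))  # {'salad': 1, 'beers': 2}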

Related

Extract text from <ol> after specific <h3> in a specific <div> with BeautifulSoup

I am trying to extract the text from a certain ol on this page using BeautifulSoup. The information I want is under a specific div with a certain class, but I only want the text of the list items that immediately follow a certain h3 containing a span with a class and id.
The output should be:
Verb
1. (transitive) To join or unite (e.g. one thing to another, or as several particulars) so as to increase the number, augment the quantity or enlarge the magnitude, or so as to form into one aggregate.
2. To sum up; to put together mentally.
...
What I have done so far is:
from bs4 import BeautifulSoup
import urllib.request

url = urllib.request.urlopen('https://en.wiktionary.org/wiki/add#English')
content = url.read()
soup = BeautifulSoup(content, 'lxml')
main_div = soup.findAll('div', attrs={"class": "mw-parser-output"})
for x in main_div:
    all_h3 = x.findAll('h3')
    all_ol = x.findAll('ol')
The first answer to this question might be related but I didn't know how to edit it for my task.
Instead of BeautifulSoup, you could use lxml.html and XPath expressions.
Python
import requests
import io
from lxml import html

res = requests.get("https://en.wiktionary.org/wiki/add#English")
tree = html.parse(io.StringIO(res.text))
outputs = []
h3 = tree.xpath("//h3[span[@class = 'mw-headline' and @id = 'Verb']]")[0]
outputs.append(h3.xpath("span")[0].text)
ol = h3.xpath("following::ol[1]")[0]
outputs.append(ol.text_content())
print(outputs)
Output
['Verb',
'(transitive) To join or unite (e.g. one thing to another, or as several particulars) so as to increase the number, augment the quantity, or enlarge the magnitude, or so as to form into one aggregate.\nTo sum up; to put together mentally.\n1689, John Locke, An Essay Concerning Human Understanding\n […] as easily as he can add together the ideas of two days or two years.\nto add numbers\n(transitive) To combine elements of (something) into one quantity.\nto add a column of numbers\n(transitive) To give by way of increased possession (to someone); to bestow (on).\n1611, King James Version, Genesis 30:24:\nThe LORD shall add to me another son.\n1667, John Milton, Paradise Lost:\nBack to thy punishment, False fugitive, and to thy speed add wings.\n(transitive) To append (e,g, a statement); to say further information.\n1855, Thomas Babington Macaulay, The History of England from the Accession of James the Second, volume 3, page 37\xa0[1]:\nHe added that he would willingly consent to the entire abolition of the tax\n1900, L. Frank Baum, The Wonderful Wizard of Oz Chapter 23\n"Bless your dear heart," she said, "I am sure I can tell you of a way to get back to Kansas." Then she added, "But, if I do, you must give me the Golden Cap."\n(intransitive) To make an addition; to augment; to increase.\n1611, King James Version, 1 Kings 12:14:\nI will add to your yoke\n2013 June 29, “A punch in the gut”, in The Economist, volume 407, number 8842, page 72-3:Mostly, the microbiome is beneficial. […] Research over the past few years, however, has implicated it in diseases from atherosclerosis to asthma to autism. Dr Yoshimoto and his colleagues would like to add liver cancer to that list.\nIt adds to our anxiety.\n(intransitive, mathematics) To perform the arithmetical operation of addition.\nHe adds rapidly.\n(intransitive, video games) To summon minions or reinforcements.\nTypically, a hostile mob will add whenever it\'s within the aggro radius of a player.']
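If you want each list item separately, closer to the numbered output shown in the question, one option is to iterate over the ol element's li children instead of taking text_content() of the whole list. A small sketch reusing the tree variable from the snippet above (each item will still include any nested quotations):
h3 = tree.xpath("//h3[span[@class = 'mw-headline' and @id = 'Verb']]")[0]
print(h3.xpath("span")[0].text)  # Verb
ol = h3.xpath("following::ol[1]")[0]
for number, li in enumerate(ol.xpath("li"), start=1):
    # text_content() flattens the nested markup (links, example sentences) of each item
    print("%d. %s" % (number, li.text_content().strip()))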

How to count the top 150 words and remove common words from 2 lists?

The code below finds the 150 most frequent words in each of two strings, p and n.
import re
from collections import Counter

# p and n are the two input strings
pwords = re.findall(r'\w+', p)
ptop150words = Counter(pwords).most_common(150)
sorted(ptop150words)
nwords = re.findall(r'\w+', n)
ntop150words = Counter(nwords).most_common(150)
sorted(ntop150words)
The code below is meant to remove the common words that appear in both lists:
def new(ntopwords, ptopwords):
    for i in ntopwords[:]:
        if i in ptopwords:
            ntopwords.remove(i)
            ptopwords.remove(i)
print(i)
However, there is no output from print(i). What is wrong?
Most likely it is your indentation:
def new(negativetop150words, positivetop150words):
    for i in negativetop150words[:]:
        if i in positivetop150words:
            negativetop150words.remove(i)
            positivetop150words.remove(i)
            print(i)
You could rely on set methods. Once you have both lists, you convert them to sets. The common subset is the intersection of the 2 sets, and you can simply take the difference from both original sets:
positiveset = set(positivewords)
negativeset = set(negativewords)
commons = positiveset & negativeset
positivewords = sorted(positiveset - commons)
negativewords = sorted(negativeset - commons)
commonwords = sorted(commons)
The code you posted does not call the function new(negativetop150words, positivetop150words). Also, per Jesse's comment, the print(i) call is outside the function. Here's the code that worked for me:
import re
from collections import Counter

def new(negativetop150words, positivetop150words):
    for i in negativetop150words[:]:
        if i in positivetop150words:
            negativetop150words.remove(i)
            positivetop150words.remove(i)
            print(i)
    return negativetop150words, positivetop150words

positive = 'The FDA is already fairly gung-ho about providing this. It receives about 1,000 applications a year and approves all but 1%. The agency makes sure there is sound science behind the request, and no obvious indication that the medicine would harm the patient.'
negative = 'Thankfully these irritating bits of bureaucracy have been duly dispatched. This victory comes courtesy of campaigning work by a libertarian think-tank, the Goldwater Institute, based in Arizona. It has been pushing right-to-try legislation for around four years, and it can now be found in 40 states. Speaking about the impact of these laws on patients, Arthur Caplan, a professor of bioethics at NYU School of Medicine in New York, says he can think of one person who may have been helped.'
positivewords = re.findall(r'\w+', positive)
positivetop150words = Counter(positivewords).most_common(150)
sorted(positivetop150words)
negativewords = re.findall(r'\w+', negative)
negativetop150words = Counter(negativewords).most_common(150)
words = new(negativewords, positivewords)
This prints:
a
the
It
and
about
the

WordNet: Iterate over synsets

For a project I would like to measure the amount of ‘human centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.
I came up with the following (simple) piece of code:
from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]
hypernyms = word_synsets.hypernym_paths()[0]
for element in hypernyms:
    print(element)
Results in:
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')
My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:
count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print('found person hypernym')
        count_humancenteredness += 1
I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the count of human centeredness) when a word does indeed belong to the ‘person’ or ‘human’ synset?
Secondly, is this an efficient approach? I assume that iterating over several texts and over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.
Thanks for your help!
Regarding the error message:
hypernyms = word_synsets.hypernym_paths() returns a list of lists of Synsets.
Hence
if element == 'person':
tries to compare a Synset object against a string; that kind of comparison is not supported by Synset.
Try something like
target_synsets = wn.synsets('person')
if element in target_synsets:
    ...
or
if u'person' in element.lemma_names():
    ...
instead.
Regarding efficiency:
Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.
To speed up the lookup, you can pre-compile a list of "person related" words in advance by making use of the transitive closure over the hyponyms as explained here.
Something like
p = wn.synset('person.n.01')
person_words = set(w for s in p.closure(lambda s: s.hyponyms()) for w in s.lemma_names())
should do the trick. This will return a set of roughly 10,000 words, which is not too much to store in main memory.
A simple version of the word counter then becomes something along the lines of:
from collections import Counter

word_count = Counter()
for word in (w.lower() for w in words if w in person_words):
    word_count[word] += 1
You might also need to pre-process the input words using stemming or other morphological reductions before passing them on to WordNet, though.
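For that pre-processing step, a minimal sketch with NLTK's WordNet lemmatizer (the word list here is just an example):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
raw_words = ['girlfriends', 'teachers', 'children']  # example input
words = [lemmatizer.lemmatize(w.lower(), pos='n') for w in raw_words]
print(words)  # ['girlfriend', 'teacher', 'child']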
To get all the hyponyms of a synset, you can use the following function (tested with NLTK 3.0.3, dhke's closure trick doesn't work on this version):
def get_hyponyms(synset):
    hyponyms = set()
    for hyponym in synset.hyponyms():
        hyponyms |= set(get_hyponyms(hyponym))
    return hyponyms | set(synset.hyponyms())
Example:
from nltk.corpus import wordnet
food = wordnet.synset('food.n.01')
print(len(get_hyponyms(food))) # returns 1526

Grouping Similar Strings

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,
Term            Group
NBA Basketball  1
Basketball NBA  1
Basketball      1
Baseball        2
It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought the nltk may have something along those lines, but I'm only barely familiar with it.
Thanks
You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).
nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.
import sys
import operator

def tokenize(s, glen):
    g2 = set()
    for i in range(len(s) - (glen - 1)):
        g2.add(s[i:i + glen])
    return g2

def dice_grams(g1, g2):
    return (2.0 * len(g1 & g2)) / (len(g1) + len(g2))

def dice(n, s1, s2):
    return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
    GRAM_LEN = 4
    scores = {}
    for i in range(1, len(sys.argv)):
        for j in range(i + 1, len(sys.argv)):
            s1 = sys.argv[i]
            s2 = sys.argv[j]
            score = dice(GRAM_LEN, s1, s2)
            scores[s1 + ":" + s2] = score
    for item in sorted(scores.items(), key=operator.itemgetter(1)):
        print(item)

if __name__ == '__main__':
    main()
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)
At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.
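For the threshold idea, here is a rough sketch that reuses the dice function above; the 0.5 cutoff and the greedy first-match grouping are arbitrary choices you would want to tune, not part of the original answer:
def group_terms(terms, glen=4, threshold=0.5):
    # Greedy grouping: put each term into the first group whose representative
    # (its first member) is similar enough, otherwise start a new group.
    groups = []
    for term in terms:
        for group in groups:
            if dice(glen, term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

print(group_terms(["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"]))
# [['NBA Basketball', 'Basketball NBA', 'Basketball'], ['Baseball']]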

Wordnet selectional restrictions in NLTK

Is there a way to capture WordNet selectional restrictions (such as +animate, +human, etc.) from synsets through NLTK?
Or is there any other way of providing semantic information about synset? The closest I could get to it were hypernym relations.
It depends on what you mean by "selectional restrictions", or what I would call semantic features. In classic semantics there exists a world of concepts, and to compare concepts we have to find
discriminating features (i.e. features of the concepts that are used to distinguish them from each other) and
similarity features (i.e. features that the concepts share and that highlight the need to differentiate them)
For example:
Man is [+HUMAN], [+MALE], [+ADULT]
Woman is [+HUMAN], [-MALE], [+ADULT]
[+HUMAN] and [+ADULT] = similarity features
[+-MALE] is the discriminating feature
The common problem of traditional semantics, and of applying this theory in computational semantics, is the question of
"Is there a specific list of features that we can use to compare any concepts?"
"If so, what are the features on this list?"
(see www.acl.ldc.upenn.edu/E/E91/E91-1034.pdf for more details)
Getting back to WordNet, I can suggest two methods to resolve the "selectional restrictions".
First, check the hypernyms for discriminating features, but you must decide in advance what the discriminating features are. To differentiate an animal from a human, let's take the discriminating features as [+-human] and [+-animal].
from nltk.corpus import wordnet as wn

# Concepts to compare
dog_sense = wn.synsets('dog')[0]  # It's http://goo.gl/b9sg9X
jb_sense = wn.synsets('James_Baldwin')[0]  # It's http://goo.gl/CQQIG9

# Access hypernym_paths()[0].
# It is a little odd that hypernym_paths() gives a list of lists rather than a list, but it works.
dog_hypernyms = dog_sense.hypernym_paths()[0]
jb_hypernyms = jb_sense.hypernym_paths()[0]

# Discriminating features in terms of concepts in WordNet
human = wn.synset('person.n.01')   # i.e. [+human]
animal = wn.synset('animal.n.01')  # i.e. [+animal]

try:
    assert human in jb_hypernyms and animal not in jb_hypernyms
    print("James Baldwin is human")
except AssertionError:
    print("James Baldwin is not human")

try:
    assert animal in dog_hypernyms and human not in dog_hypernyms
    print("Dog is an animal")
except AssertionError:
    print("Dog is not an animal")
Second, check similarity measures, as @Jacob suggested.
dog_sense = wn.synsets('dog')[0]  # It's http://goo.gl/b9sg9X
jb_sense = wn.synsets('James_Baldwin')[0]  # It's http://goo.gl/CQQIG9

# Features to check whether the 'dubious' concept is a human or an animal
human = wn.synset('person.n.01')   # i.e. [+human]
animal = wn.synset('animal.n.01')  # i.e. [+animal]

if dog_sense.wup_similarity(animal) > dog_sense.wup_similarity(human):
    print("Dog is more of an animal than a human")
elif dog_sense.wup_similarity(animal) < dog_sense.wup_similarity(human):
    print("Dog is more of a human than an animal")
You could try using some of the similarity functions with handpicked synsets, and use that to filter. But it's essentially the same as following the hypernym tree - as far as I know, all the WordNet similarity functions use hypernym distance in their calculations. Also, there are a lot of optional attributes of a synset that might be worth exploring, but their presence can be very inconsistent.
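A small sketch of that filtering idea, using wup_similarity against a handpicked person.n.01 synset; both the anchor synset and the 0.5 cutoff are assumptions to adjust for your data:
from nltk.corpus import wordnet as wn

person = wn.synset('person.n.01')

def looks_human_centered(word, threshold=0.5):
    # Flag a word if any of its noun senses is sufficiently similar to person.n.01.
    return any(
        (s.wup_similarity(person) or 0) >= threshold
        for s in wn.synsets(word, pos=wn.NOUN)
    )

for w in ['girlfriend', 'teacher', 'table']:
    print(w, looks_human_centered(w))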
