Suggest a category for a piece of text - Python

I've been searching for an open-source solution to suggest a category for a given question or text.
For example, "who is Lady Gaga?" would probably return 'Entertainment', 'Music', or 'Celebrity'.
"How many strike out there are for baseball?" would give me 'Baseball', or 'Sport'.
The categorization doesn't have to be perfect but should be some what close.
Also is there anywhere I can get a list of popular categories?

This is a document classification problem - your "document" is simply the query or text.
You'll first need to decide what the list of possible categories is. "Who is Lady Gaga?" could be Entertainment, Celebrity, Questions-In-English, Biography, People, etc. Next you'll apply a decision framework to assign a score for each category to the text. The highest score is its category - as long as it's above a noise threshold and there isn't a second-place category that's too close to differentiate. Decision frameworks can include approaches like a Bayesian network or a set of custom rules.
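As a hedged sketch of the Bayesian route in Python, assuming scikit-learn is available and using a tiny made-up training set in place of real labeled data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled examples -- in practice you would need many more per category.
train_texts = [
    "who is the lead singer of this band",
    "latest celebrity gossip and movie news",
    "how many innings are in a baseball game",
    "which team won the world series",
]
train_labels = ["Entertainment", "Entertainment", "Sport", "Sport"]

# Bag-of-words features plus multinomial Naive Bayes as the decision framework.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

query = "How many strikeouts are there in baseball?"
for label, score in zip(model.classes_, model.predict_proba([query])[0]):
    print(label, round(score, 3))
print("Best guess:", model.predict([query])[0])

With a realistic amount of training data per category, the per-class probabilities give you the scores to threshold against noise and against a too-close second place.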
Some open source projects that implement classifiers include:
Classifier4J
Matlab Classification Toolbox
POPFile (for email)
OpenNLP Maximum Entropy Package

Screen-scrape Wolfram Alpha.
http://www.wolframalpha.com/input/?i=Who+is+lady+gaga
http://www.wolframalpha.com/input/?i=What+is+baseball
You can probably get a good list of categories from dmoz.

Not much of an answer, but perhaps this categorized dictionary would help:
http://www.provalisresearch.com/wordstat/WordNet.html
I imagine you could extract the uncommon words from the string, look them up in the categorized dictionary, and return the categories that get the most matches on your terms. It'll be tricky to deal with pop culture references like "Lady Gaga", though...maybe you could do a Google search and analyze the results of that.

Others have done quite a bit of work on your behalf, so I'd suggest just using something like the OpenCalais API. There's a python wrapper to the API at http://code.google.com/p/python-calais/.
"Who is Lady Gaga?" seems to be too short a piece of text for them to give a decent response. However, if you took the trouble to do a two step process and grab the first paragraph from wikipedia for Lady Gaga, and then supply that to the OpenCalais API you get very good results.
You can check it out quickly by just cut and pasting the first paragraph from wikipedia into the OpenCalais viewer. The result is a classification into the topic "Entertainment Culture" with a 100% confidence estimate.
Similarly, the baseball example returns "sports" as the topic with further social tags of "recreation", "baseball" etc.
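A hedged sketch of that two-step process in Python, using the python-calais wrapper mentioned above (the method names follow the wrapper's examples as I recall them and may differ between versions; the API key is a placeholder, and the Wikipedia REST summary endpoint is just one convenient way to fetch the first paragraph):

import requests
from calais import Calais  # python-calais wrapper

API_KEY = "your-opencalais-api-key"  # placeholder -- register with OpenCalais for a real key

# Step 1: grab a summary paragraph for the entity from Wikipedia.
resp = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Lady_Gaga")
paragraph = resp.json()["extract"]

# Step 2: hand the longer text to OpenCalais via the wrapper.
calais = Calais(API_KEY, submitter="category-suggestion-demo")
result = calais.analyze(paragraph)
result.print_topics()    # e.g. "Entertainment_Culture"
result.print_entities()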
Edit: Here's another thought prompted by Calais' use of social tags: sending the Wikipedia URL for Lady Gaga to the delicious API with
curl -k "https://user:password@api.del.icio.us/v1/posts/suggest?url=http://en.wikipedia.org/wiki/Lady_gaga"
returns
<?xml version="1.0" encoding="UTF-8"?>
<suggest>
<recommended>music</recommended>
<recommended>wikipedia</recommended>
<recommended>wiki</recommended>
<recommended>people</recommended>
<recommended>bio</recommended>
<recommended>cool</recommended>
<recommended>facts</recommended>
<popular>music</popular>
<popular>gaga</popular>
<popular>ladygaga</popular>
<popular>wikipedia</popular>
<popular>lady</popular>
etc. Should be easy enough to ignore the wikipedia/wiki type entries.
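A quick Python version of that call and filtering step (USER and PASSWORD are placeholders for your del.icio.us credentials):

import requests
import xml.etree.ElementTree as ET

# Mirrors the curl call above.
resp = requests.get(
    "https://api.del.icio.us/v1/posts/suggest",
    params={"url": "http://en.wikipedia.org/wiki/Lady_gaga"},
    auth=("USER", "PASSWORD"),
)

root = ET.fromstring(resp.text)
ignore = {"wikipedia", "wiki"}
tags = [el.text for el in root
        if el.tag in ("recommended", "popular") and el.text not in ignore]
print(tags)   # e.g. ['music', 'people', 'bio', ...]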

Related

Web scraping 40+ websites in search of opportunities in python

I am trying to automate the task of searching for opportunities (tenders) in 40+ websites, for a company. The opportunities are usually displayed in table format. They have a title, date published, and a clickable link that takes you to a detailed description of what the opportunity is.
One website example is:
http://www.eib.org/en/about/procurement/index.htm
The goal would be to retrieve the new opportunities that are posted everyday and that fit specific criteria. So I need to look at specific keywords within the opportunities' title. These keywords are the fields and regions in which the company had previous experience.
My question is: After I extract these tables, with the tenders' titles, in a dataframe format, how do I search for the right opportunities and sort them by relevance (given a list of keywords)? Do I use NLP in this case and turn the words in the titles into binary code (0s and 1s)? Or are there other simpler methods I should be looking at?
Thanks in advance!
To sort the tenders by relevance, you first need to define relevance.
In this case you could count the number of occurrences of your keywords in each tender; that count would be your relevance score. You can then keep only the tenders that contain at least one keyword.
This is a first attempt; you can improve it by adding keywords, or by assigning a higher score when the keyword appears in the title rather than in the detailed description...
The task you might be trying to solve here is information retrieval: rank documents (the tenders) by their relevance to a query (your keywords).
You can then use weighting schemes like TF-IDF or BM25, etc. But it depends on your needs; maybe counting keywords is more than enough!
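A minimal sketch of the keyword-count scoring in Python, assuming the scraped titles already sit in a pandas DataFrame (the titles and keywords below are invented for illustration):

import pandas as pd

# Assumed input: a DataFrame of scraped tenders with a 'title' column.
tenders = pd.DataFrame({
    "title": [
        "Consultancy services for water infrastructure in Morocco",
        "Supply of office furniture",
        "Road rehabilitation works in West Africa",
    ]
})

keywords = ["water", "infrastructure", "road", "africa"]

def relevance(title, keywords):
    """Count keyword occurrences in the title (case-insensitive)."""
    text = title.lower()
    return sum(text.count(kw.lower()) for kw in keywords)

tenders["score"] = tenders["title"].apply(lambda t: relevance(t, keywords))

# Keep only tenders with at least one keyword hit, most relevant first.
relevant = tenders[tenders["score"] > 0].sort_values("score", ascending=False)
print(relevant)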

Contextual Named Entity Recognition with spaCy - how to?

For a new project I need to extract information from web pages, more precisely imprint information. I use brat to label the documents and have started first experiments with spaCy and NER. There are many videos and tutorials about this, but some basic questions remain.
Is it possible to include the context of an entity?
Example text:
Responsible for the content:
The Good Company GmbH 0331 Berlin
You can contact us via +49 123 123 123.
This website was created by good design GmbH, contact +49 12314 453 5.
Well, spaCy is very good at extracting the phone numbers. According to my latest tests, the error rate is less than two percent. I achieved this after only 250 labeled documents; in the meantime I have labeled 450, and my goal is about 5000.
Now to the actual point: only the phone numbers that appear in the context of the sentence "Responsible for the content" are relevant; the other phone numbers are not.
I could imagine training these introductory sentences as entities, since they are always somewhat similar. But how can I capture the context? Are there perhaps existing NER-based models that do just that?
Maybe someone has already come across some hints about this somewhere? As a beginner the hurdle is relatively high, because the material is really deep (a little play on words).
Greetings from Germany!
If I understand your question and use-case correctly, I would advise the following approach:
Train/design some system that recognizes all phone numbers - it looks like you've already got that
Train a text classifier to recognize the "responsible for content" sentences.
Implement some heuristics (can probably be rule-based?) to determine whether or not any recognized phone number is connected to any of the predicted "responsible for content" sentences - probably using straightforward features such as number of sentences in between, taking the first phone number after the sentence, etc.
So basically I would advise solving each NLP challenge separately, and then connecting the information across the document.
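A hedged sketch of step 3 in Python, assuming a spaCy pipeline trained as above with a custom "PHONE" entity label and a text classifier category called "RESPONSIBLE" (the model name and both label names are placeholders for whatever you used in your own training):

import spacy

# Hypothetical model name; replace with the path to your trained pipeline.
nlp = spacy.load("my_imprint_model")

def responsible_phone(text, threshold=0.5):
    doc = nlp(text)
    trigger_start = None
    for sent in doc.sents:
        # Re-run the pipeline on the single sentence to get textcat scores.
        if nlp(sent.text).cats.get("RESPONSIBLE", 0.0) >= threshold:
            trigger_start = sent.start_char
            break
    if trigger_start is None:
        return None
    # Heuristic: first PHONE entity at or after the trigger sentence.
    for ent in doc.ents:
        if ent.label_ == "PHONE" and ent.start_char >= trigger_start:
            return ent.text
    return None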

Assigning different weights to search keywords

I'm implementing a search engine, and so far I am done with the web crawling, storing the results in the index, and retrieving results for the search keywords entered by the user. However, I would like the search results to be more specific. Let's say I'm searching for "Shoe shops in Hyderabad". Is there any NLP library in Python that can process the text and assign higher weights to important words, like "shoe" and "Hyderabad" in this case?
Thanks.
I don't think one approach is going to solve the entire problem here. Your question is broad and it will take multiple steps to get best results. Here is how I would approach the problem
Create an n-gram analyzer with Lucene and query it. Lucene also allows phrase queries; "Shoe shops in Hyderabad" is a good fit for that.
Use cosine similarity to treat Shoe shops in Hyderabad and Footwear shops in Hyderabad similarly.
Also consider a linguistic angle. Simple POS tagging and a role-based rule engine can help you get much smarter results for queries like "Shoe shops in Hyderabad" or "Shoes under 500 bucks", where a very finite word set (in, under, on, etc.) can be assigned rules for location, comparison, and so on. This point assumes you are working with English; you will have to build this layer separately for each language.
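As a rough Python illustration of point 2, with scikit-learn's TF-IDF vectorizer standing in for Lucene (note that plain TF-IDF cosine only captures shared surface terms, so treating "shoe" and "footwear" as similar would additionally need synonym expansion or embeddings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Shoe shops in Hyderabad",
    "Footwear shops in Hyderabad",
    "Pizza delivery in Mumbai",
]

# TF-IDF vectors; cosine similarity then scores how close each document is to the query.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

query_vec = vectorizer.transform(["Shoe shops in Hyderabad"])
print(cosine_similarity(query_vec, matrix)[0])
# The second document still scores high because of the shared terms
# "shops" and "Hyderabad", even though "footwear" is not in the query.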
Hope this helps.
I think the question is a good one (I was looking into something similar last week), though of course, as other people mention, it is too broad. But I think you can approach it with an information retrieval system. I can recommend the Lemur project and specifically Indri; it includes a lot of customization features for queries, so you can weight using n-grams, TF-IDF (as the previous two answers suggest), or your own criteria. If you want to use Indri, check this tutorial/introduction; the part about weighting is at page 56.
Good luck!

Defining the context of a word - Python

I think this is an interesting question, at least for me.
I have a list of words, let's say:
photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet
and I have a list of contexts:
Programming
World news
Technology
Web Design
I need to try and match words with the appropriate context/contexts if possible.
Maybe discovering word relationships in some way.
Any ideas?
Help would be much appreciated!
This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.
I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser, try NLTK.
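For example, a minimal NLTK/WordNet lookup that walks up the hypernym hierarchy to find broader, context-like concepts (the WordNet corpus must be downloaded via nltk.download first):

from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Look up synsets for a word and climb the hypernym hierarchy.
for synset in wn.synsets("photo"):
    print(synset.name(), "-", synset.definition())
    for path in synset.hypernym_paths():
        print("  ", " > ".join(s.name().split(".")[0] for s in path))

You would still need some mapping from the higher-level WordNet concepts to your own contexts like "Web Design".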
Where do these words come from? Do they come from real texts? If so, this is a classic data mining problem. What you need to do is turn your set of documents into a matrix where the rows represent the documents and the columns represent the words in them.
For example if you have two documents like this:
D1: Need to find meaning.
D2: Need to separate Apples from oranges
your matrix will look like this:
      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:    1    1    1      1        0       0         0       0
D2:    1    1    0      0        1       1         1       1
This is called a term-by-document matrix.
Having collected these statistics, you can use algorithms like k-means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-means is a slow algorithm, so you can try to optimize it using techniques such as SVD.
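A minimal sketch of that pipeline in Python, assuming scikit-learn (which provides the vectorizer, the SVD, and the k-means used here); the documents are toy examples:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "Need to find meaning",
    "Need to separate apples from oranges",
    "css3 and css tutorials for webdesign",
    "china internet censorship politics",
]

# Build the term-by-document matrix (here oriented as documents x terms).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Optional: reduce dimensionality with SVD (a.k.a. LSA) before clustering.
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# Cluster into as many groups as you have target concepts.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_reduced)
print(labels)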
I just found this a couple days ago: ConceptNet
It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.
If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.
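ConceptNet has evolved since this was written; as a hedged sketch, more recent versions expose a public REST API you can query directly (check the current ConceptNet documentation for the exact endpoint and response format):

import requests

# Query the public ConceptNet REST API for edges about a term.
# (Newer ConceptNet versions expose this web API; older ones shipped a
# downloadable database and Python bindings instead.)
resp = requests.get("http://api.conceptnet.io/c/en/photo", params={"limit": 20})
for edge in resp.json().get("edges", []):
    rel = edge["rel"]["label"]
    start = edge["start"]["label"]
    end = edge["end"]["label"]
    print(f"{start} --{rel}--> {end}")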
The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this you need to figure out how fine-grained the concepts should be. WordNet, as it has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes which differentiate, e.g. (going up the hierarchy) human from animal, animates from plants, substances from solids, concrete from abstract things, etc.
Another kind of taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea; there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a Wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (e.g. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do, this last step (choosing the best of several given categories) may or may not be difficult.
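In Python, a rough equivalent of that lookup step can be done against the MediaWiki API instead of the Java Wikipedia Library (the endpoint and parameters below are the standard MediaWiki query API; treat this as a sketch):

import requests

# Fetch the Wikipedia categories for a term via the MediaWiki API.
def wikipedia_categories(term):
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": term,
            "prop": "categories",
            "cllimit": "max",
            "format": "json",
        },
    )
    pages = resp.json()["query"]["pages"]
    cats = []
    for page in pages.values():
        for cat in page.get("categories", []):
            cats.append(cat["title"].replace("Category:", ""))
    return cats

print(wikipedia_categories("CSS3"))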
See here for a list of other ontologies / knowledge bases you could use.

How do content discovery engines, like Zemanta and Open Calais work?

I was wondering how a semantic service like OpenCalais figures out the names of companies, people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know which images to suggest for a piece of text, for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (@OpenCalais) or to email us at team@opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and completeness.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms along with some of the other entities in context and then use the results of that search to determine how confident you are that the extracted term is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g. the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text: given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation plays an important part in language understanding).
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
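For instance, a minimal NER run with NLTK looks roughly like this (the example sentence is invented, and the tokenizer, tagger, and chunker models may need to be downloaded via nltk.download first):

import nltk

# Tokenize -> POS tag -> chunk named entities with NLTK's built-in NER.
sentence = "Lady Gaga performed at Madison Square Garden in New York."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

for subtree in tree:
    if hasattr(subtree, "label"):   # named-entity chunks are subtrees
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), entity)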
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
OpenCalais probably uses language parsing technology and language statistics to guess which words or phrases are names, places, companies, etc. Then it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to return related results.
It certainly isn't easy.
