Semantic similarity of strings - poor results

My goal is to create a basic program which semantically compares strings and decides which is more similar (in terms of semantics) to which. For now I did not want to build a new (doc2vec?) model from scratch in NLTK, scikit-learn, or Gensim; I wanted to test the existing APIs which can do semantic analysis.
Specifically, I chose to test the ParallelDots AI API, and for this reason I wrote the following program in Python:
import paralleldots
api_key = "*******************************************"
paralleldots.set_api_key(api_key)
phrase1 = "I have a swelling on my eyelid"
phrase2 = "I have a lump on my hand"
phrase3 = "I have a lump on my lid"
print(phrase1, " VS ", phrase3, "\n")
print(paralleldots.similarity(phrase1, phrase3), "\n\n")
print(phrase2, " VS ", phrase3, "\n")
print(paralleldots.similarity(phrase2, phrase3))
This is the response I get from the API:
I have a swelling on my eyelid VS I have a lump on my lid
{'normalized_score': 1.38954, 'usage': 'By accessing ParallelDots API or using information generated by ParallelDots API, you are agreeing to be bound by the ParallelDots API Terms of Use: http://www.paralleldots.com/terms-and-conditions', 'actual_score': 0.114657, 'code': 200}
I have a lump on my hand VS I have a lump on my lid
{'normalized_score': 3.183968, 'usage': 'By accessing ParallelDots API or using information generated by ParallelDots API, you are agreeing to be bound by the ParallelDots API Terms of Use: http://www.paralleldots.com/terms-and-conditions', 'actual_score': 0.323857, 'code': 200}
This response is rather disappointing for me. It is obvious that the phrase
I have a lump on my lid
is almost semantically identical to the phrase
I have a swelling on my eyelid
and it is also related to the phrase
I have a lump on my hand
as they both refer to lumps, but obviously not nearly as closely as to the former. However, the ParallelDots AI API outputs almost the exact opposite.
If I am right, the ParallelDots AI API is one of the most popular APIs for semantic analysis, along with others such as the Dandelion API, yet it returns such disappointing results. I expected these APIs to use rich databases of synonyms. I have also tested the Dandelion API with these three phrases, but the results are poor too (actually, they are even worse).
What can I fix in my program above to retrieve more reasonable results?
Is there any other faster way to semantically compare strings?

I am one of the data scientists at ParallelDots. While it is unfortunate that you did not get the desired results, please be aware that general models available via an API are trained on publicly available datasets like news and Twitter.
In the case of semantic similarity API, we have trained it on news corpus, and it is highly unlikely that the relatedness of lump and swelling can be picked up if one reads general news articles.
Just the word lump has so many different meanings in English under different context that it makes the model highly data dependent. For example, in the financial world lump is closer to sum, investments, etc.
If you are trying to find semantic relatedness on your domain data, I would advise you to utilize ParallelDots Enterprise services to customize the semantic relatedness model on your data. You need a large corpus of unlabelled data to do the customization, and you benefit from higher accuracy which only improves as more data gets iteratively added to the model.
I will end the answer with a general note on something I have observed when software developers turn to services like AI, which are probabilistic in nature. As software engineers, we are trained to troubleshoot when the results for a set of inputs are not what we expect. In machine learning, however, we test a model on a sufficiently large sample before trying to debug it or reach a conclusion. I would encourage software developers to build a test set and run the AI models on it to find the accuracy metrics and evaluate whether the model is useful for their dataset.
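As a rough illustration of what such a test set could look like, here is a minimal sketch. It assumes the test cases are hand-labelled triples (an anchor phrase, the phrase expected to be closer, and the phrase expected to be farther). The names `triplet_accuracy` and `jaccard_similarity` are made up for this example, and the word-overlap scorer is only a stand-in so the code runs without an API key; you would replace it with a wrapper around whichever similarity service you are evaluating.

```python
from typing import Callable, List, Tuple

# Each test case: (anchor, phrase expected to be closer, phrase expected to be farther).
Triple = Tuple[str, str, str]


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in scorer (plain word overlap); swap in the API you want to evaluate."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def triplet_accuracy(score: Callable[[str, str], float], cases: List[Triple]) -> float:
    """Fraction of cases where the scorer ranks the expected phrase higher."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(
        1 for anchor, near, far in cases if score(anchor, near) > score(anchor, far)
    )
    return correct / len(cases)


# Hypothetical hand-labelled cases; a real test set should be much larger.
test_cases = [
    ("I have a lump on my lid", "I have a swelling on my eyelid", "I have a lump on my hand"),
    ("My head hurts", "I have a headache", "My car is broken"),
]

print(f"accuracy: {triplet_accuracy(jaccard_similarity, test_cases):.2f}")
```

The word-overlap stand-in gets both of these cases wrong, which is the same failure mode the question describes. With a few hundred cases from your own domain, the resulting number tells you far more than two hand-picked comparisons.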

Related

Recommendation for ML algorithm to differentiate between ban or allowance

I am new to Machine Learning and wanted to see if any of you could recommend an algorithm I could apply to a project I'm doing. Basically I want to scrape popular housing websites and look at their descriptions to see if they allow/disallow something, for example pets. The problem is that a simple search for pets leads to contradictory results: 'pets allowed' and 'no additional cost for pets' or 'no pets' or 'I don't accept at this time'. As seen from these examples, negative keywords like 'no' are often used to indicate pets are allowed, whereas positive keywords like 'accept' are used to indicate a ban. As such, I was wondering if there was any algorithm I could use (preferably in Python) to differentiate between the two. (Note: I can't run training data to generate an algorithm myself as the thing I am actually looking for is quite niche).
Thank you very much for your help!!

The keyword you're looking for is "document classification". This is a document classification problem. You start with documents (i.e. webpages) and you want to classify them as "allows pets" or "doesn't allow pets" (or whatever). There are a lot of good tutorials out there for performing document classification but a full explanation is beyond the scope of a StackOverflow answer.
You won't be able to do this for your particular niche case without providing at least some training data but you could gather, say 30 example websites, extract their text, manually add labels ("does fit my niche" vs "doesn't fit my niche"), and then run through a standard document classification system and see if that gets you the accuracy you want. Also in order for this to work with a small amount of training data (like your 30 documents), you'll need to start from a pretrained model.
Good luck!

New Google Natural Language API

I've recently used the Language API to gather sentiment predictions for a work project. I had about 1,300 unlabeled documents and we initially used NLTK's tools, which were based on a dictionary of terms with polarity estimates for each word in the dictionary. I then turned to the API, and after reviewing the predictions, I found that the API produced much better results than NLTK.
I understand that the engineers probably won't want to release the details of the prediction engine, but I am curious how it works at a high level. If anybody could enlighten me or point me in the right direction, I'd appreciate it. For example, "it uses a Neural Network, trained on billions of observations," would be a reasonable answer.
Again, I'm using this for a work project and I'd like to be able to give a brief justification of why I switched from NLTK to the API (the improved results should speak for themselves, but I will definitely get "well, how does it work?").

The language API is a pipeline of state-of-the-art machine-learned systems that are trained on a combination of public data (like the Penn Treebank) and proprietary data annotated by Google's linguists.
Performance improvements compared to something like NLTK come from a combination of more and better data for training, as well as cutting edge machine learning algorithms, including but not limited to neural networks.
Related links that discuss some of the algorithms:
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html (Parsing algorithms)
https://www.wired.com/2016/11/googles-search-engine-can-now-answer-questions-human-help/ (Mentions the linguist team)
https://research.googleblog.com/2016/08/acl-2016-research-at-google.html (Recent publications from the NLP research team)

How to programmatically classify a list of objects

I'm trying to take a long list of objects (in this case, applications from the iTunes App Store) and classify them more specifically. For instance, there are a bunch of applications currently classified as "Education," but I'd like to label them as Biology, English, Math, etc.
Is this an AI/Machine Learning problem? I have no background in that area whatsoever but would like some resources or ideas on where to start for this sort of thing.

Yes, you are correct. Classification is a machine learning problem, and classifying stuff based on text data involves natural language processing.
The canonical classification problem is spam detection using a Naive Bayes classifier, which is very simple. The idea is as follows:
Gather a bunch of data (emails), and label them by class (spam, or not spam)
For each email, remove stopwords, and get a list of the unique words in that email
Now, for each word, calculate the probability it appears in a spam email vs a non-spam email (i.e. count occurrences in spam vs non-spam)
Now you have a model: the probability of an email being spam, given it contains a word. However, an email contains many words. In Naive Bayes, you assume the words occur independently of each other (which turns out to be an ok assumption), and multiply the probabilities of all words in the email against each other.
You usually divide data into training and testing, so you'll have a set of emails you train your model on, and then a set of labeled stuff you test against where you calculate precision and recall.
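The steps above can be sketched in plain Python roughly as follows. This is a toy version, not a production classifier: the stopword list and the four training emails are invented for illustration, and it adds two things the description above leaves out, add-one smoothing (so a word never seen in a class does not zero out the product) and summing log-probabilities instead of multiplying raw probabilities (to avoid underflow on long emails).

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "you", "for", "at"}


def unique_words(text: str) -> Set[str]:
    """Lowercase, split on whitespace, drop stopwords, keep unique words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def train(examples: List[Tuple[str, str]]):
    """Count, per class, how many documents contain each word."""
    doc_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = {}
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(unique_words(text))
    return doc_counts, word_counts


def classify(text: str, doc_counts, word_counts) -> str:
    """Pick the class with the highest log-probability."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        for word in unique_words(text):
            # Add-one smoothing so unseen words don't zero out the product.
            p = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)  # summing logs == multiplying probabilities
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, just enough to exercise the code.
examples = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
doc_counts, word_counts = train(examples)

print(classify("claim free money", doc_counts, word_counts))
print(classify("team meeting tomorrow", doc_counts, word_counts))
```

For the app-store case you would replace the spam/ham labels with your subject categories and the emails with app descriptions.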
I'd highly recommend playing around with NLTK, a Python machine learning and NLP library. It's very user friendly and has good docs and tutorials, and is a good way to get acquainted with the field.
EDIT: Here's an explanation of how to build a simple NB classifier with code.

Probably not. You'd need to do a fair bit of work to extract data in some usable form (such as names), and at the end of the day, there are probably few enough categories that it would simply be easier to manually identify a list of keywords for each category and set a parser loose on titles/descriptions.
For example, you could look through half a dozen biology apps, and realize that in the names/descriptions/whatever you have access to, the words "cell," "life," and "grow" appear fairly often - not as a result of some machine learning, but as a result of your own human intuition. So build a parser to classify everything with those words as biology apps, and do similar things for other categories.
Unless you're trying to classify the entire iTunes app store, that should be sufficient, and it would be a relatively small task for you to manually check any apps with multiple classifications or no classifications. The labor involved with using a simple parser + checking anomalies manually is probably far less than the labor involved with building a more complex parser to aid machine learning, setting up machine learning, and then checking everything again, because machine learning is not 100% accurate.
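A minimal sketch of that keyword approach might look like this. The category names and keyword lists are placeholders you would pick by hand, and `categorize` and `needs_manual_review` are hypothetical helper names, not part of any library.

```python
import re
from typing import Dict, List, Set

# Hypothetical keyword lists, chosen by hand after skimming a few apps per category.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "Biology": {"cell", "life", "grow"},
    "Math": {"algebra", "fraction", "geometry"},
    "English": {"grammar", "spelling", "vocabulary"},
}


def categorize(description: str) -> List[str]:
    """Return every category with at least one keyword in the description."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return sorted(cat for cat, keys in CATEGORY_KEYWORDS.items() if words & keys)


def needs_manual_review(description: str) -> bool:
    """Flag apps with no match or more than one match for a human to check."""
    return len(categorize(description)) != 1


for text in [
    "Watch a cell divide and grow",
    "Spelling and fraction drills",
    "Learn to play chess",
]:
    print(text, "->", categorize(text), "review:", needs_manual_review(text))
```

Note that this matches whole words only, so "cells" would not hit the keyword "cell"; you would either list the variants or stem the words first.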

Neural Network based ranking of documents

I'm planning to implement a document ranker which uses neural networks. How can one rate a document by taking into consideration the ratings of similar articles? Are there any good Python libraries for doing this? Can anyone recommend a good book on AI with Python code?
EDIT
I'm planning to make a recommendation engine which would make recommendations from similar users as well as using the data clustered using tags. User would be given chance to vote for articles. There will be about hundred thousand articles. Documents would be clustered based on their tags. Given a keyword articles would be fetched based on their tags and passed through a neural network for ranking.

The problem you are trying to solve is called "collaborative filtering".
Neural Networks
One state-of-the-art neural network method is Deep Belief Networks and Restricted Boltzmann Machines. For a fast Python implementation for a GPU (CUDA) see here. Another option is PyBrain.
Academic papers on your specific problem:
This is probably the state-of-the-art of neural networks and collaborative filtering (of movies):
Salakhutdinov, R., Mnih, A., Hinton, G. Restricted Boltzmann Machines for Collaborative Filtering. To appear in Proceedings of the 24th International Conference on Machine Learning, 2007.
PDF
A Hopfield network implemented in Python:
Huang, Z., Chen, H., Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1), 116-142, 2004. PDF
A thesis on collaborative filtering with Restricted Boltzmann Machines (they say Python is not practical for the job):
Louppe, G. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master's thesis, Universite de Liege, 2010.
PDF
Neural networks are not currently the state of the art in collaborative filtering, and they are not the simplest or most widespread solutions. Regarding your comment that the reason for using NNs is having too little data: neural networks don't have an inherent advantage or disadvantage in that case. Therefore, you might want to consider simpler Machine Learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
If you are locked on Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (primarily aimed at applications in collaborative filtering, in particular the Netflix competition).
If you are open to try other open source technologies look at: Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.

Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
Storage vs Processing
You mention in your question that:
...articles would be fetched based on their tags and passed through a neural network for ranking.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
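To put rough numbers on that trade-off, here is a small sketch. The 4 bytes per score assumes 32-bit floats (double it for 64-bit), and `toy_score` and `rank_once` are invented placeholders standing in for whatever model you end up using.

```python
from typing import Callable, Dict, List

N_ARTICLES = 100_000   # from the question
N_TAGS = 1_000         # the round number assumed above
BYTES_PER_FLOAT = 4    # assumption: 32-bit floats

n_scores = N_ARTICLES * N_TAGS
size_mb = n_scores * BYTES_PER_FLOAT / 1_000_000
print(f"{n_scores:,} scores, about {size_mb:,.0f} MB")


def toy_score(article: str, tag: str) -> float:
    """Placeholder for the expensive model: counts occurrences of the tag."""
    return float(article.lower().split().count(tag))


def rank_once(
    articles: List[str], tags: List[str], score: Callable[[str, str], float]
) -> Dict[str, List[str]]:
    """Score every article for every tag once; later lookups are dict reads."""
    return {
        tag: sorted(articles, key=lambda a: score(a, tag), reverse=True)
        for tag in tags
    }


articles = ["python python tips", "cooking tips", "python news"]
ranking = rank_once(articles, ["python", "tips"], toy_score)
print(ranking["python"])  # served from the stored ranking, no re-scoring
```

A few hundred megabytes fits comfortably on disk or in a database, which is why precomputing is usually the safer default when the scoring step is slow.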
Good luck!

If I understand correctly, your task is related to collaborative filtering. There are many possible approaches to this problem; I suggest you read the Wikipedia page for an overview of the main approaches you can choose from.
For your project work I can suggest looking at Python based intro to Neural Networks with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.

You might want to check out PyBrain.

The FANN library also looks promising.

I am not really sure if neural networks are the best way to solve this. I think Euclidean Distance Score or Pearson Correlation Score combined with item or user based filtering would be a good start.
An excellent book on the topic is: Programming Collective Intelligence from Toby Segaran
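For a feel of what those two scores look like, here is a minimal sketch. It assumes each user's votes are stored as a dict from article id to a numeric rating; the function names and the three sample users are made up for illustration.

```python
from math import sqrt
from typing import Callable, Dict

Ratings = Dict[str, float]  # article id -> vote


def euclidean_score(a: Ratings, b: Ratings) -> float:
    """1 / (1 + distance) over shared items; 1.0 means identical, 0.0 if none shared."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    distance = sqrt(sum((a[i] - b[i]) ** 2 for i in shared))
    return 1 / (1 + distance)


def pearson_score(a: Ratings, b: Ratings) -> float:
    """Pearson correlation over shared items, in [-1, 1]; 0.0 if undefined."""
    shared = sorted(a.keys() & b.keys())
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in shared) / n
    mean_b = sum(b[i] for i in shared) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def most_similar_user(
    target: str,
    users: Dict[str, Ratings],
    score: Callable[[Ratings, Ratings], float] = pearson_score,
) -> str:
    """User-based filtering step: find the neighbour closest to `target`."""
    others = (u for u in users if u != target)
    return max(others, key=lambda u: score(users[target], users[u]))


# Made-up votes for three users on three articles.
users = {
    "ann": {"a1": 5, "a2": 3, "a3": 1},
    "bob": {"a1": 4, "a2": 2, "a3": 0},
    "cat": {"a1": 1, "a2": 3, "a3": 5},
}
print(most_similar_user("ann", users))
```

Pearson treats bob as a perfect match for ann even though he rates everything one point lower, which is the usual argument for preferring it when users grade on different scales. Recommendations then come from articles the nearest neighbours rated highly that the target user has not seen.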

How do content discovery engines, like Zemanta and Open Calais work?

I was wondering how a semantic service like Open Calais figures out the names of companies, people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?

Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (@OpenCalais) or to email us at team@opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal

I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms with some of the other entities in context and then use the results of that search to determine how confident you are that the term extracted is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g. the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text. Given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation can play an important part in language understanding).
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a Python toolkit with similar capabilities called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# versions of OpenNLP, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
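To see how quickly the errors compound, here is a back-of-the-envelope sketch. The 95% per-stage accuracy is an illustrative assumption, not a measured figure, and the calculation further assumes that stage errors are independent and that a mistake at any stage spoils the final result. `pipeline_accuracy` is a name made up for this example.

```python
import math
from typing import Sequence

STAGES = ["tokenize", "sentence detect", "POS tag", "WSD", "NER"]
PER_STAGE_ACCURACY = 0.95  # illustrative assumption, not a measured figure


def pipeline_accuracy(stage_accuracies: Sequence[float]) -> float:
    """End-to-end accuracy if errors are independent and any error is fatal."""
    return math.prod(stage_accuracies)


running = []
for stage in STAGES:
    running.append(PER_STAGE_ACCURACY)
    print(f"after {stage:<16}{pipeline_accuracy(running):.3f}")
```

Under those assumptions, five stages at 95% each leave you at roughly 77% end to end, so improving the weakest early stage often pays off more than tuning the final one.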

Open Calais probably uses language parsing technology and language statistics to guess which words or phrases are Names, Places, Companies, etc. Then, it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to acquire related results.
It certainly isn't easy.
