I've recently used the Language API to gather sentiment predictions for a work project. I had about 1,300 unlabeled documents, and we initially used NLTK's sentiment tools, which are based on a dictionary of terms with a polarity estimate for each word. I turned to the API, and after reviewing the predictions, it produced much better results than NLTK.
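For reference, the dictionary-based NLTK approach we started with looks roughly like the sketch below (NLTK's VADER analyzer; this is an illustration, not our exact code):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # the dictionary of terms with polarity estimates

sia = SentimentIntensityAnalyzer()
# Each word's polarity is looked up in the lexicon and combined with a handful of rules.
print(sia.polarity_scores("The new release is surprisingly good, but setup was painful."))
# -> a dict with 'neg', 'neu', 'pos' and 'compound' scores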
I understand that the engineers probably won't want to release the details of the prediction engine, but I am curious how it works at a high level. If anybody could enlighten me or point me in the right direction, I'd appreciate it. For example, "it uses a Neural Network, trained on billions of observations," would be a reasonable answer.
Again, I'm using this for a work project and I'd like to be able to give a brief justification of why I switched from NLTK to the API (the improved results should speak for themselves, but I will definitely get "well, how does it work?").
The language API is a pipeline of state-of-the-art machine-learned systems that are trained on a combination of public data (like the Penn Treebank) and proprietary data annotated by Google's linguists.
Performance improvements compared to something like NLTK come from a combination of more and better data for training, as well as cutting edge machine learning algorithms, including but not limited to neural networks.
Related links that discuss some of the algorithms:
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html (Parsing algorithms)
https://www.wired.com/2016/11/googles-search-engine-can-now-answer-questions-human-help/ (Mentions the linguist team)
https://research.googleblog.com/2016/08/acl-2016-research-at-google.html (Recent publications from the NLP research team)
I'm trying to learn NLP with Python. Although I work with a variety of programming languages, I'm looking for some kind of from-the-ground-up solution I can put together to build a product with a high standard of spelling and grammar, like Grammarly.
I've tried some approaches with Python, for example the inflect package (https://pypi.org/project/inflect/) and spaCy for part-of-speech tagging.
Could someone point me in the direction of some kind of fully fledged API that I can pull apart to work out how to get to a decent standard of English, like Grammarly?
Many thanks,
Vince.
I would suggest checking out Stanford CoreNLP and NLTK as a start. But if you want to train your own NER and do semantic modeling, look at Gensim.
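As a rough starting point, a minimal sketch of tokenization, part-of-speech tagging, and named-entity chunking with NLTK (the download calls fetch the required models on first run):

import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Grammarly checks spelling and grammar for writers in London."
tokens = nltk.word_tokenize(sentence)   # split into word tokens
tagged = nltk.pos_tag(tokens)           # e.g. ('Grammarly', 'NNP'), ('checks', 'VBZ')
entities = nltk.ne_chunk(tagged)        # group tagged tokens into named entities (PERSON, GPE, ...)

print(tagged)
print(entities)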
I'm developing a script that detects peaks in signal data from a biological source. I want to create a semi-automated model that helps predict which peaks are the correct ones. The script improves as the user manually selects a few of these peaks to teach the model which ones are correct.
The workflow I'm trying to attain is this:
1. User manually selects data
2. Script obtains the correct data and fits it into the model
3. Use the model to predict the likelihood that a given peak is correct.
4. Hopefully with enough data and training, it could be automated to run through the rest.
I also don't know the name of the general topic and I'm struggling to find what to google.
I've tried fitting a linear regression model in scikit-learn, but I don't have enough data (since the model only learns from the user's initial intervention). Is what I'm doing possible?
Sorry for the general-ness of this answer but the OP asked for general topics.
It sounds like semi-supervised learning may work; see scikit-learn's semi-supervised learning module for details.
There is no labeled data to start. A manual process is used to gain some labeled data. Soon, semi-supervised learning can kick in and take over, with a process in place to measure its accuracy. That matches your situation and is a good place to start.
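A minimal sketch of that idea with scikit-learn's LabelSpreading: peaks the user has confirmed or rejected get labels 1/0, and every other peak gets -1 (scikit-learn's marker for "unlabeled"). The feature values here are made up.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Each row describes a candidate peak with hypothetical features (e.g. height, width);
# -1 marks peaks the user has not labeled yet.
X = np.array([[10.0, 1.2], [9.5, 1.0], [2.1, 0.3], [1.8, 0.4], [8.7, 1.1], [2.5, 0.2]])
y = np.array([1, 1, 0, -1, -1, -1])

model = LabelSpreading(kernel="knn", n_neighbors=3)
model.fit(X, y)

print(model.transduction_)          # inferred labels for every peak
print(model.label_distributions_)   # per-class probabilities, usable as "likelihood of being correct"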
Eventually you may have "enough" correctly labeled data that you can investigate fitting a classic algorithm to predict the remainder. "Enough" being relative to how hard the problem is. Could be tens, hundreds, thousands, ...
Depending on other details of your situation, reinforcement learning may work. As you described the situation it may not apply, but there may be other details in your environment that let you leverage this family of methods.
Word of warning: machine learning, and semi-supervised learning in particular, may not work well for every problem. Measure, measure, measure.
Thank you everyone for all your help. I was talking to a colleague and he referred me to online machine learning. I think this is what I was looking for. Although I would not be handling time-series data or streaming data from online sources, I think the method is sufficient for my needs. It allows the model to be trained on data points one at a time rather than as a batch. I think scikit-learn currently does not offer out-of-the-box online machine learning.
This, I think, gives a great rundown of the strengths of online machine learning (and also showcases the creme Python library).
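As an aside, a minimal sketch of the train-one-example-at-a-time idea using scikit-learn's partial_fit interface (available on a few estimators such as SGDClassifier); the features and labels below are made up:

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()  # a linear classifier trained by stochastic gradient descent

stream = [([10.0, 1.2], 1), ([2.1, 0.3], 0), ([8.7, 1.1], 1), ([2.5, 0.2], 0)]

for features, label in stream:
    # Update the model with one example at a time instead of a full batch.
    model.partial_fit(np.array([features]), [label], classes=[0, 1])

print(model.predict(np.array([[9.0, 1.0]])))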
Thanks again!
My goal is to create a basic program which semantically compares strings and decides which is more similar (in terms of semantics) to which. For now I did not want to build a new (doc2vec?) model from scratch in NLTK, scikit-learn, or Gensim; instead I wanted to test the existing APIs that can do semantic analysis.
Specifically, I chose to test ParallelDots AI API and for this reason I wrote the following program in python:
import paralleldots

# Authenticate with the ParallelDots API (key redacted).
api_key = "*******************************************"
paralleldots.set_api_key(api_key)

# Three test phrases: the first and third are near-paraphrases,
# while the second is only loosely related to the third.
phrase1 = "I have a swelling on my eyelid"
phrase2 = "I have a lump on my hand"
phrase3 = "I have a lump on my lid"

print(phrase1, " VS ", phrase3, "\n")
print(paralleldots.similarity(phrase1, phrase3), "\n\n")

print(phrase2, " VS ", phrase3, "\n")
print(paralleldots.similarity(phrase2, phrase3))
This is the response I get from the API:
I have a swelling on my eyelid VS I have a lump on my lid
{'normalized_score': 1.38954, 'usage': 'By accessing ParallelDots API or using information generated by ParallelDots API, you are agreeing to be bound by the ParallelDots API Terms of Use: http://www.paralleldots.com/terms-and-conditions', 'actual_score': 0.114657, 'code': 200}
I have a lump on my hand VS I have a lump on my lid
{'normalized_score': 3.183968, 'usage': 'By accessing ParallelDots API or using information generated by ParallelDots API, you are agreeing to be bound by the ParallelDots API Terms of Use: http://www.paralleldots.com/terms-and-conditions', 'actual_score': 0.323857, 'code': 200}
This response is rather disappointing for me. It is obvious that the phrase
I have a lump on my lid
is almost semantically identical to the phrase
I have a swelling on my eyelid
and it is also related to the phrase
I have a lump on my hand
since both refer to lumps, but it is obviously not nearly as close to it as the former one. However, the ParallelDots AI API outputs almost exactly the opposite.
If I am right, the ParallelDots AI API is one of the most popular APIs for semantic analysis, along with others such as the Dandelion API, yet it returns such disappointing results. I expected these APIs to use rich databases of synonyms. I have also tested the Dandelion API with these three phrases, and the results are poor too (actually, they are even worse).
What can I fix in my program above to retrieve more reasonable results?
Is there any other, faster way to semantically compare strings?
I am one of the data scientists at ParallelDots. While it is unfortunate that you did not get the desired results, please be aware that the general models available via an API are trained on publicly available datasets like news and Twitter.
In the case of the semantic similarity API, we trained it on a news corpus, and it is highly unlikely that the relatedness of lump and swelling can be picked up by reading general news articles.
The word lump alone has so many different meanings in English, depending on context, that it makes the model highly data dependent. For example, in the financial world lump is closer to sum, investments, etc.
If you are trying to find semantic relatedness on your domain data, I would advise you to utilize ParallelDots Enterprise services to customize the semantic relatedness model on your data. You need a large corpus of unlabelled data to do the customization, and you benefit from higher accuracy which only improves as more data gets iteratively added to the model.
I will end with a general note about how software developers tend to approach probabilistic services like AI APIs. As software engineers, we are trained to troubleshoot when results are not what we expect for a given set of inputs; in machine learning, however, we should test a model on a sufficiently large sample before trying to debug it and drawing a conclusion. I encourage software developers to build a test set, run the AI models on it to compute accuracy metrics, and evaluate whether the model is useful for their dataset.
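A minimal sketch of that evaluation idea, reusing the similarity call and the actual_score field from the question above; the phrase pairs and human judgments are made up, and the correlation comes from SciPy:

import paralleldots
from scipy.stats import spearmanr

paralleldots.set_api_key("YOUR_API_KEY")

# Small hand-made test set: (phrase A, phrase B, human-judged similarity in [0, 1]).
test_set = [
    ("I have a swelling on my eyelid", "I have a lump on my lid", 0.9),
    ("I have a lump on my hand", "I have a lump on my lid", 0.5),
    ("I have a lump on my hand", "The invoice is overdue", 0.0),
]

human_scores = [score for _, _, score in test_set]
model_scores = [paralleldots.similarity(a, b)["actual_score"] for a, b, _ in test_set]

rho, _ = spearmanr(human_scores, model_scores)
print("Spearman correlation with human judgments:", rho)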
I'm learning statistical learning these days using Python's pandas and scikit-learn libraries, and they're fantastic tools for me.
I've been able to learn classification, regression, and clustering with them, of course.
But I can't figure out how to get started with them when I want to build a recommendation model. For example, suppose I have a customer purchase dataset containing the date, product name, product maker, price, ordering device, etc.
What type of problem is recommendation? Classification, regression, or something else?
In fact, I found that there are well-known algorithms such as collaborative filtering for this kind of problem.
If so, can I use those algorithms with scikit-learn, or do I have to learn other ML libraries?
Regards
Scikit-learn does not offer any recommendation-system tools. You could take a look at Mahout, which gives you a really easy way to get started, or at Spark.
However, recommendation is a problem in its own right in the machine learning world. It can be regression if you are trying to predict the rating a user would give to a movie, for instance, or classification if you want to know whether a user will like the movie or not (a binary choice).
The important thing is that recommendation uses tools and algorithms dedicated to this problem, such as item-based or content-based recommendation. These concepts are actually quite simple to understand, and implementing a little recommendation engine yourself might be the best way to learn; see the sketch below.
I recommend the book Mahout in Action, which is a great introduction to recommendation concepts.
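If you want to try implementing one yourself, here is a minimal sketch of item-based collaborative filtering on a toy user-item rating matrix (all values made up):

import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / (np.outer(norms, norms) + 1e-9)

def predict(user, item):
    # Similarity-weighted average of the items this user has already rated.
    rated = ratings[user] > 0
    weights = item_sim[item, rated]
    return np.dot(weights, ratings[user, rated]) / (weights.sum() + 1e-9)

print(predict(user=1, item=2))  # estimate user 1's rating for item 2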
How about Crab (https://github.com/python-recsys/crab), a Python framework for building recommender engines integrated with the world of scientific Python packages (NumPy, SciPy, matplotlib)?
I have not used this framework; I just found it. It seems there is only a version 0.1, and Crab hasn't been updated in years, so I doubt it is well documented. In any case, if you decide to try Crab, please give us feedback afterwards. :)
I'm planning to implement a document ranker which uses neural networks. How can one rate a document by taking into consideration the ratings of similar articles? Are there any good Python libraries for doing this? Can anyone recommend a good book on AI with Python code?
EDIT
I'm planning to make a recommendation engine which would make recommendations from similar users as well as from data clustered using tags. Users would be given the chance to vote for articles. There will be about a hundred thousand articles. Documents would be clustered based on their tags. Given a keyword, articles would be fetched based on their tags and passed through a neural network for ranking.
The problem you are trying to solve is called "collaborative filtering".
Neural Networks
State-of-the-art neural network methods include Deep Belief Networks and Restricted Boltzmann Machines. For a fast Python implementation for a GPU (CUDA), see here. Another option is PyBrain.
Academic papers on your specific problem:
This is probably the state of the art in neural networks for collaborative filtering (of movies):
Salakhutdinov, R., Mnih, A., and Hinton, G. Restricted Boltzmann Machines for Collaborative Filtering. In Proceedings of the 24th International Conference on Machine Learning, 2007. PDF
A Hopfield network implemented in Python:
Huang, Z., Chen, H., and Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1), 116-142, 2004, ACM. PDF
A thesis on collaborative filtering with Restricted Boltzmann Machines (they say Python is not practical for the job):
G. Louppe. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master's thesis, Université de Liège, 2010. PDF
Neural networks are not currently the state of the art in collaborative filtering, nor are they the simplest or most widespread solution. Regarding your comment that your reason for using NNs is having too little data: neural networks have no inherent advantage or disadvantage in that case. Therefore, you might want to consider simpler machine learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
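As a rough illustration of the matrix factorization part, a minimal sketch trained with plain stochastic gradient descent on a toy rating matrix (all values made up):

import numpy as np

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)  # 0 = unknown rating

n_users, n_items = R.shape
k, lr, reg = 2, 0.01, 0.02                     # latent dimension, learning rate, L2 penalty
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

for _ in range(2000):
    for u, i in zip(*np.nonzero(R)):           # iterate over observed ratings only
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P @ Q.T)  # predicted ratings, including the previously unknown entries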
If you are locked into Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (aimed primarily at collaborative filtering applications, in particular the Netflix competition).
If you are open to trying other open source technologies, look at these Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.
Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
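For example, a minimal sketch of that lightweight approach applied to finding similar documents, using tf-idf vectors and cosine similarity from scikit-learn (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Neural networks for collaborative filtering of movie ratings",
    "Collaborative filtering predicts movie ratings from similar users",
    "A guide to gardening tomatoes in raised beds",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(tfidf)  # sim[i, j] = similarity between documents i and j

print(sim.round(2))  # the first two documents score much closer to each other than to the third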
Storage vs Processing
You mention in your question that:
...articles would be fetched based on their tags and passed through a neural network for ranking.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
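A tiny sketch of the compute-once-and-store idea; score_document is a hypothetical stand-in for your ranker, and the sizes are toy values (the real case would be roughly 100k docs by 1k tags):

import numpy as np

n_docs, n_tags = 500, 20

def score_document(doc_id, tag_id):
    # Hypothetical placeholder for an expensive ranking model.
    return ((doc_id * 31 + tag_id) % 7) / 7.0

scores = np.array([[score_document(d, t) for t in range(n_tags)]
                   for d in range(n_docs)], dtype=np.float32)

np.save("doc_tag_scores.npy", scores)                   # pay the ranking cost once
scores = np.load("doc_tag_scores.npy", mmap_mode="r")   # cheap lookups from then on
print(scores[42, 7])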
Good luck!
If I understand correctly, your task is something related to Collaborative filtering. There are many possible approaches to this problem; I suggest you follow the wikipedia page to have an overview of the main approaches you can choose.
For your project work I can suggest looking at Python based intro to Neural Networks with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.
You might want to check out PyBrain.
The FANN library also looks promising.
I am not really sure that neural networks are the best way to solve this. I think a Euclidean distance score or a Pearson correlation score, combined with item-based or user-based filtering, would be a good start.
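For instance, a minimal sketch of the user-based variant, scoring user similarity with Pearson correlation on co-rated items (the ratings are made up):

from scipy.stats import pearsonr

ratings = {
    "alice": {"doc1": 5, "doc2": 3, "doc3": 4},
    "bob":   {"doc1": 4, "doc2": 2, "doc3": 5},
    "carol": {"doc1": 1, "doc2": 5, "doc3": 2},
}

def similarity(a, b):
    # Pearson correlation over the items both users have rated.
    common = sorted(set(ratings[a]) & set(ratings[b]))
    if len(common) < 2:
        return 0.0
    r, _ = pearsonr([ratings[a][d] for d in common], [ratings[b][d] for d in common])
    return r

print(similarity("alice", "bob"))    # positive: similar tastes
print(similarity("alice", "carol"))  # negative: opposite tastes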
An excellent book on the topic is Programming Collective Intelligence by Toby Segaran.