I am using Latent Dirichlet Allocation with a corpus of news data from six different sources. I am interested in topic evolution and emergence, and I want to compare how the sources are alike and different from each other over time. I know that there are a number of modified LDA algorithms, such as the Author-Topic model, Topics Over Time, and so on.
My issue is that very few of these alternate model specifications are implemented in any standard format. A few are available in Java, but most exist as conference papers only. What is the best way to go about implementing some of these algorithms on my own? I am fairly proficient in R and JAGS, and can stumble around in Python when given long enough. I am willing to write the code, but I don't really know where to start, and I don't know C or Java. Can I build a model in JAGS or Python just from the formulas in the manuscript? If so, can someone point me to an example of doing this? Thanks.
My friend's response is below, pardon the language please.
First I wrote up a Python implementation of the collapsed Gibbs sampler seen here (http://www.pnas.org/content/101/suppl.1/5228.full.pdf+html) and fleshed out here (http://cxwangyi.files.wordpress.com/2012/01/llt.pdf). This was slow as balls.
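For a sense of what that involves, the core updates of such a collapsed Gibbs sampler fit in a screenful of Python. This is only a rough sketch with a made-up toy corpus and my own variable names; a real implementation would add convergence checks and estimate the topic distributions from the count tables:

    # A minimal sketch of the collapsed Gibbs updates for LDA (Griffiths & Steyvers
    # style); `docs` is a toy corpus of word indices, all other names are mine.
    import numpy as np

    def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        ndk = np.zeros((len(docs), K))        # document-topic counts
        nkw = np.zeros((K, V))                # topic-word counts
        nk = np.zeros(K)                      # words assigned to each topic
        z = [rng.integers(K, size=len(d)) for d in docs]   # random init
        for d, doc in enumerate(docs):        # seed the count tables
            for i, w in enumerate(doc):
                ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]               # remove the current assignment
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    # full conditional p(z_i = k | everything else)
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][i] = k               # resample and restore the counts
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        return ndk, nkw

    docs = [[0, 1, 2, 1], [3, 4, 3, 2], [0, 0, 1, 4]]   # toy word-index documents
    ndk, nkw = lda_gibbs(docs, K=2, V=5)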
Then I used a Python wrapper of a C implementation of this paper (http://books.nips.cc/papers/files/nips19/NIPS2006_0511.pdf), which is fast as f*ck, but the results are not as great as one would see with NMF.
But the NMF implementations I've seen, in scikits, and even in the recently released, scipy-sparse-compatible NIMFA library, all blow the f*ck up on any sizable corpus. My new white whale is a sliced, distributed implementation of the thing. This'll be non-trivial.
In Python, do you know of PyMC? It's flexible in specifying both the model and the fitting algorithm.
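As a rough illustration of how directly the generative story translates into PyMC (assuming a recent PyMC version; the toy corpus and dimensions below are made up), a tiny LDA can be written like this. Sampling the discrete topic assignments this way mixes poorly, so treat it only as a starting point:

    # A rough sketch of vanilla LDA written directly in PyMC (assuming PyMC >= 4);
    # `docs`, K and V are toy placeholders.  Sampling discrete z is slow; in
    # practice you would marginalise z out or use a variational approximation.
    import numpy as np
    import pymc as pm

    docs = [[0, 2, 1, 2, 4], [3, 3, 1, 0, 4]]   # toy corpus of word indices
    K, V = 2, 5                                  # topics, vocabulary size

    with pm.Model():
        phi = pm.Dirichlet("phi", a=np.ones(V), shape=(K, V))              # topic-word
        theta = pm.Dirichlet("theta", a=np.ones(K), shape=(len(docs), K))  # doc-topic
        for d, doc in enumerate(docs):
            z = pm.Categorical(f"z_{d}", p=theta[d], shape=len(doc))       # per-word topics
            pm.Categorical(f"w_{d}", p=phi[z], observed=doc)               # observed words
        idata = pm.sample(500, tune=500)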
Also, when starting with R and JAGS, there is this tutorial on "Using JAGS in R with the rjags Package" together with a collection of examples.
Related
Basically, I want to recreate the conceptual results from the paper "Learning to Branch in Mixed Integer Programming" by Khalil et al., while avoiding, if possible:
1) The necessity of obtaining an academic license for CPLEX (which was used in the paper) or a similar serious commercial solver
2) The necessity of using a C-based API. This is not a strict requirement, but Python has the benefit of good and very accessible ML libraries, which seems like a great advantage for this specific goal.
I am aware that there are a great number of open-source, Python-based MILP solvers, but a lot of them focus in their presentation on end-to-end solutions of relatively simple problems, and, considering that a lot of them (if not all) hook into other C-based solvers, it is highly non-obvious which ones actually have the needed customization potential.
So, if anyone has more in-depth experience with trying to customize Python solvers for their highly specific needs, I would appreciate the advice.
I'm afraid you will hit a roadblock at some point there. It's really hard to do that without doing C/C++ work (imho).
Python-way
I only know three projects with some low-level functionality (and it's still hard to say if those fit your needs).
https://github.com/coin-or/python-mip
relatively new
promises interactive cut-gen
has a chapter Developing Customized Branch-&-Cut algorithms
but I'm not sure if there is enough freedom for your task (it seems to focus on cuts for now)
built around the open-source solver Cbc/Clp (besides Gurobi)
https://github.com/coin-or/CyLP
not much development for years now
the whole Python 3 transition was sad (see the issues; a pull request went unprocessed for years; it's a resource problem: the maintainers are nice people!)
was designed to research pivoting
but it also says: "For example, you may ... define cut generators, branch-and-bound strategies"
hard to see how to achieve what you are looking for, beyond an abstract LP-relax / fix / re-solve loop
might be hard to control specifics (warm-start vs. hot-start)
built around the open-source Cbc/Clp
https://github.com/SCIP-Interfaces/PySCIPOpt
basic docs show more high-level usage
but its internal code at least has entries for branchexeclp and co. (a rough sketch of such a branching rule follows after this list)
maybe it's ready to use (maybe not)
raw list of interface classes
as those things (maybe) wrap the original C API, there is a lot of good documentation in the parent project!
built around the open-source solver SCIP
easier to obtain the solver in an academic setting, but by no means free (I'm not a lawyer and won't try to find the right words)
at least one of its developers is active on Stack Overflow
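For reference, here is a rough sketch of what a custom branching rule looks like in PySCIPOpt. The toy rule below just branches on the first fractional LP candidate (a learned policy would go where the candidate is chosen); the class and method names follow my reading of the interface, so double-check them against the PySCIPOpt/SCIP documentation:

    # A rough sketch of a custom branching rule in PySCIPOpt, the kind of hook a
    # learned branching policy would plug into.  This toy rule simply branches on
    # the first fractional candidate of the current LP relaxation.
    from pyscipopt import Model, Branchrule, SCIP_RESULT

    class FirstFracBranching(Branchrule):
        def branchexeclp(self, allowaddcons):
            cands, *_ = self.model.getLPBranchCands()   # fractional LP candidates
            self.model.branchVar(cands[0])               # create the two child nodes
            return {"result": SCIP_RESULT.BRANCHED}

    m = Model()
    x = [m.addVar(vtype="I", ub=10) for _ in range(3)]
    m.setObjective(x[0] + 2 * x[1] + 3 * x[2], "maximize")
    m.addCons(2 * x[0] + 3 * x[1] + 4 * x[2] <= 17)
    m.includeBranchrule(FirstFracBranching(), "firstfrac",
                        "branch on the first fractional variable",
                        priority=100000, maxdepth=-1, maxbounddist=1.0)
    m.optimize()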
Alternative: C++
If you are trying to get full control, which might be needed, with a minimal need to understand the underlying solver in all its details, you probably want to use C/C++ within Coin OSI. Sadly, the Cbc part is unmaintained, but depending on your exact task, you might only need Clp, for example.
Alternative: Julia
I did not follow the recent developments there, but the language did have a strong early focus on mathematical optimization (driven by a big group of people), surpassing Python even in its early days (imho!).
But I'm not sure if MathOptInterface is fine-grained enough for your task.
I'm learning statistical learning these days using Python's pandas and scikit-learn libraries, and they're fantastic tools for me.
I have been able to learn classification, regression, and also clustering with them, of course.
But I cannot figure out how to get started with them when I want to build a recommendation model. For example, suppose I have a customer purchase dataset containing date, product name, product maker, price, order device, etc.
What type of problem is recommendation? Classification, regression, or something else?
In fact, I have found that there are very famous algorithms, like collaborative filtering, for solving this problem.
If so, can I use those algorithms with scikit-learn, or do I have to learn other ML libraries?
Regards
Scikit-learn does not offer any recommendation-system tools. You can take a look at Mahout, which offers a really easy way to get started, or at Spark.
However, recommendation is a problem in itself in the machine learning world. It can be framed as regression if you are trying to predict the rating a user would give to a movie, for instance, or as classification if you want to know whether a user will like the movie or not (a binary choice).
The important thing is that recommendation uses tools and algorithms dedicated to this problem, like item-based or content-based recommendation. These concepts are actually quite simple to understand, and implementing a little recommendation engine yourself might be the best way to learn.
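To illustrate how little code a toy engine needs, here is a minimal user-based sketch (the small ratings matrix is invented purely for illustration):

    # A toy user-based collaborative filter: cosine similarity between users'
    # rating vectors, then a similarity-weighted average to score unseen items.
    import numpy as np

    ratings = np.array([[5, 4, 0, 1],     # rows: users, cols: items, 0 = unrated
                        [4, 5, 0, 0],
                        [1, 0, 5, 4],
                        [0, 1, 4, 5]], dtype=float)

    def recommend(ratings, user, k=2):
        norms = np.linalg.norm(ratings, axis=1, keepdims=True)
        sims = (ratings @ ratings[user]) / (norms.ravel() * norms[user] + 1e-9)
        sims[user] = 0.0                                  # ignore self-similarity
        neighbours = np.argsort(sims)[-k:]                # k most similar users
        scores = sims[neighbours] @ ratings[neighbours]   # weighted item scores
        scores[ratings[user] > 0] = -np.inf               # drop already-rated items
        return np.argsort(scores)[::-1]

    print(recommend(ratings, user=0))   # item indices, best first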
I recommend the book Mahout in Action, which is a great introduction to recommendation concepts.
How about Crab (https://github.com/python-recsys/crab), which is a Python framework for building recommender engines, integrated with the world of scientific Python packages (numpy, scipy, matplotlib)?
I have not used this framework; I just found it. It seems there is only a version 0.1, and Crab hasn't been updated for years, so I doubt whether it is well documented. Anyway, if you decide to try Crab, please give us feedback afterwards. :)
I'm planning to implement a document ranker that uses neural networks. How can one rate a document by taking into consideration the ratings of similar articles? Are there any good Python libraries for doing this? Can anyone recommend a good book on AI with Python code?
EDIT
I'm planning to make a recommendation engine that would make recommendations from similar users as well as from the data clustered using tags. Users would be given the chance to vote for articles. There will be about a hundred thousand articles. Documents would be clustered based on their tags. Given a keyword, articles would be fetched based on their tags and passed through a neural network for ranking.
The problem you are trying to solve is called "collaborative filtering".
Neural Networks
One state-of-the-art family of neural network methods is Deep Belief Networks and Restricted Boltzmann Machines. For a fast Python implementation for a GPU (CUDA), see here. Another option is PyBrain.
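scikit-learn also ships a plain BernoulliRBM, which is enough to get a feel for the approach, though it is the generic RBM rather than the conditional RBM of the paper cited below; the binary user-item matrix here is made up:

    # A quick feel for RBMs on implicit feedback using scikit-learn's BernoulliRBM.
    import numpy as np
    from sklearn.neural_network import BernoulliRBM

    likes = np.array([[1, 1, 0, 0, 1],     # rows: users, cols: articles (1 = liked)
                      [1, 0, 0, 1, 1],
                      [0, 0, 1, 1, 0],
                      [0, 1, 1, 1, 0]])

    rbm = BernoulliRBM(n_components=2, learning_rate=0.05, n_iter=200, random_state=0)
    rbm.fit(likes)

    # "Reconstruct" a user's row through the hidden layer to score unseen articles.
    hidden = rbm.transform(likes[:1])                        # P(h | v) for user 0
    scores = hidden @ rbm.components_ + rbm.intercept_visible_
    print(scores)                                            # higher = more likely liked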
Academic papers on your specific problem:
This is probably the state-of-the-art of neural networks and collaborative filtering (of movies):
Salakhutdinov, R., Mnih, A., and Hinton, G. Restricted Boltzmann Machines for Collaborative Filtering. In Proceedings of the 24th International Conference on Machine Learning, 2007. PDF
A Hopfield network implemented in Python:
Huang, Z., Chen, H., and Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1), 116-142, 2004, ACM. PDF
A thesis on collaborative filtering with Restricted Boltzmann Machines (they say Python is not practical for the job):
G. Louppe. Collaborative filtering: Scalable approaches using restricted Boltzmann machines. Master's thesis, Université de Liège, 2010. PDF
Neural networks are not currently the state of the art in collaborative filtering, and they are not the simplest, most widespread solutions. Regarding your comment that your reason for using NNs is having too little data: neural networks don't have an inherent advantage/disadvantage in that case. Therefore, you might want to consider simpler machine learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
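To make the matrix-factorization part concrete, a bare-bones SGD factorization (Netflix-prize style) looks roughly like this; the ratings and hyper-parameters are invented for illustration:

    # A bare-bones SGD matrix factorization: learn user and item factor vectors so
    # that their dot product approximates observed ratings.
    import numpy as np

    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 1.0), (2, 1, 2.0)]
    n_users, n_items, n_factors = 3, 3, 2
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, n_factors))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))   # item factors

    lr, reg = 0.02, 0.05
    for epoch in range(200):
        for u, i, r in ratings:
            p_u = P[u].copy()                        # keep old value for Q's update
            err = r - p_u @ Q[i]                     # prediction error
            P[u] += lr * (err * Q[i] - reg * p_u)    # gradient steps with L2 shrinkage
            Q[i] += lr * (err * p_u - reg * Q[i])

    print(P @ Q.T)   # reconstructed matrix; unobserved cells are the predictions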
If you are locked on Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (primarily aimed at applications in collaborative filtering, in particular the Netflix competition).
If you are open to trying other open-source technologies, look at: Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.
Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
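That kind of pipeline is easy to reproduce with scikit-learn (not necessarily what was used above); a rough sketch with made-up toy data and labels:

    # A rough scikit-learn take on "chi-squared feature selection plus cosine
    # matching" for language identification; training snippets and labels are toy.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.metrics.pairwise import cosine_similarity

    train_docs = ["the cat sat on the mat", "le chat est sur le tapis", "der hund im haus"]
    labels = ["en", "fr", "de"]

    vec = CountVectorizer(analyzer="char", ngram_range=(1, 2))
    X = vec.fit_transform(train_docs)
    selector = SelectKBest(chi2, k=20).fit(X, labels)   # keep the 20 most informative n-grams
    X_sel = selector.transform(X)

    query = vec.transform(["the dog in the house"])
    sims = cosine_similarity(selector.transform(query), X_sel)
    print(labels[sims.argmax()])                         # nearest training doc's label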
Storage vs Processing
You mention in your question that:
...articles would be fetched based on their tags and passed through a neural network for ranking.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
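A quick back-of-the-envelope check of the storage side of that trade-off (100,000 articles x 1,000 tags, one 8-byte float per pair):

    # Storage needed to precompute one ranking score per (document, tag) pair.
    n_docs, n_tags, bytes_per_float = 100_000, 1_000, 8
    total_bytes = n_docs * n_tags * bytes_per_float
    print(f"{n_docs * n_tags:,} floats ~= {total_bytes / 1e9:.1f} GB")   # 100,000,000 floats ~= 0.8 GB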
Good luck!
If I understand correctly, your task is something related to collaborative filtering. There are many possible approaches to this problem; I suggest you follow the Wikipedia page to get an overview of the main approaches you can choose from.
For your project work, I can suggest looking at Python based intro to Neural Networks, with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.
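In case that link goes stale, the core of such an example is small; below is a minimal numpy sketch of a one-hidden-layer backprop classifier on toy XOR data (my own code, not the linked tutorial's):

    # A minimal one-hidden-layer backprop network in numpy, trained on XOR.
    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)   # input -> hidden
    W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)   # hidden -> output
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)                 # forward pass
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)      # backward pass (squared-error loss)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
        W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

    print(out.round(3))   # should approach [0, 1, 1, 0] after training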
You might want to check out PyBrain.
The FANN library also looks promising.
I am not really sure that neural networks are the best way to solve this. I think a Euclidean distance score or Pearson correlation score combined with item-based or user-based filtering would be a good start.
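Concretely, the Pearson score between two users is just a correlation over the items they have both rated; a tiny illustration with made-up rating vectors:

    # Pearson correlation as a user-user similarity score (toy rating vectors);
    # restrict to the items both users have actually rated before correlating.
    import numpy as np

    alice = np.array([5, 3, 0, 4, 4])   # 0 means "not rated"
    bob   = np.array([4, 0, 4, 3, 5])

    both = (alice > 0) & (bob > 0)
    similarity = np.corrcoef(alice[both], bob[both])[0, 1]
    print(similarity)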
An excellent book on the topic is Programming Collective Intelligence by Toby Segaran.
Is there a Python library out there that solves for the Nash equilibrium of two-person zero-sum games? I know the solution can be written down in terms of linear constraints and, in theory, scipy should be able to optimize it. However, for two-person zero-sum games the solution is exact and unique, yet some of the solvers fail to converge for certain problems.
Rather than listing any of the libraries from the Linear Programming page of the Python website, I would like to know which library would be most effective in terms of ease of use and speed.
Raymond Hettinger wrote a recipe for solving zero-sum payoff matrices. It should serve your purposes alright.
As for a more general library for solving game-theory problems, there's nothing specifically designed for that. But, like you said, scipy can tackle optimization problems like this. You might be able to do something with GarlicSim, which claims to be for "any kind of simulation: Physics, game theory..." but I've never used it before, so I can't recommend it.
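For the two-person zero-sum case specifically, the LP is small enough that scipy.optimize.linprog handles it directly; here is a minimal sketch using rock-paper-scissors as the payoff matrix (the formulation is the standard one, the code is mine):

    # Value and optimal mixed strategy of the row player for a zero-sum payoff
    # matrix A, via the standard LP formulation and scipy.optimize.linprog.
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[0, -1, 1],
                  [1, 0, -1],
                  [-1, 1, 0]])          # rock-paper-scissors payoffs (row player)
    m, n = A.shape

    # variables z = (x_1..x_m, v); maximize v  <=>  minimize -v
    c = np.r_[np.zeros(m), -1.0]
    # for every column j: sum_i A[i, j] * x_i >= v  <=>  -A^T x + v <= 0
    A_ub = np.c_[-A.T, np.ones(n)]
    b_ub = np.zeros(n)
    A_eq = np.c_[np.ones((1, m)), 0.0]   # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]   # x >= 0, v free

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    x, v = res.x[:m], res.x[m]
    print(x, v)   # ~[1/3, 1/3, 1/3], game value 0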
There is Gambit, which is a little difficult to set up, but has a python API.
I've just started putting together some game theory python code: http://drvinceknight.github.com/Gamepy/
There's code which:
solves matching games,
calculates shapley values in cooperative games,
runs agent based simulations to identify emergent behaviour in normal form games,
(clumsily - my Python-fu is still growing) uses the lrs library (written in C: http://cgm.cs.mcgill.ca/~avis/C/lrs.html) to calculate the solutions to normal form games (this is, I believe, what you want).
The code is all available on github and that site (the first link at the beginning of this answer) explains how the code works and gives user examples.
You might also want to check out 'Gambit' which I've never used.
I am looking for references (tutorials, books, academic literature) concerning structuring unstructured text in a manner similar to the google calendar quick add button.
I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"
to: Brand: Levi, Size: 32, Category: Jeans, code: A0b293
I imagine it would be some combination of lexical parsing and machine learning techniques.
I am rather language agnostic, but if pushed would prefer Python, MATLAB, or C++ references.
Thanks
You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...
Assuming the worst-case scenario, you need to start learning NLP. A very good free book is the documentation of NLTK: http://www.nltk.org/book . It is also a very good introduction to Python, and the software is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is nowhere near where you imagine it is.
Assuming a better scenario (your text is semi-structured), a good free tool is pyparsing. There is a book, plenty of examples, and the resulting code is extremely readable.
I hope this helps...
Possibly look at "Collective Intelligence" by Toby Segaran. I seem to remember it addressing the basics of this in one chapter.
After some research, I have found that this problem is commonly referred to as Information Extraction, and I have amassed a few papers and stored them in a Mendeley collection:
http://www.mendeley.com/research-papers/collections/3237331/Information-Extraction/
Also, as Tai Weiss noted, NLTK for Python is a good starting point, and this chapter of the book looks specifically at information extraction.
If you are only working with cases like the example you cited, you are better off using a manual, rule-based approach that is 100% predictable and covers 90% of the cases it might encounter in production.
You could enumerate lists of all possible brands and categories and detect which is which in an input string, since there's usually very little intersection between these two lists.
The other two could easily be detected and extracted using regular expressions (1-3 digit numbers are always sizes, etc.).
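A toy sketch of that rule-based idea, with made-up brand/category lists and rough regular expressions:

    # Dictionary lookup for brand/category, regular expressions for size and
    # product code.  The lists and patterns are invented for illustration.
    import re

    BRANDS = {"levi", "wrangler", "diesel"}
    CATEGORIES = {"jeans", "shirt", "jacket"}

    def extract(text):
        fields = {}
        for token in text.split():
            low = token.lower()
            if low in BRANDS:
                fields["Brand"] = token
            elif low in CATEGORIES:
                fields["Category"] = token
            elif re.fullmatch(r"\d{1,3}", token):                  # 1-3 digit numbers are sizes
                fields["Size"] = token
            elif re.fullmatch(r"[A-Za-z]\d[A-Za-z]?\w+", token):   # rough product-code pattern
                fields["Code"] = token
        return fields

    print(extract("Levi jeans size 32 A0b293"))
    # {'Brand': 'Levi', 'Category': 'jeans', 'Size': '32', 'Code': 'A0b293'}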
Your problem domain doesn't seem big enough to warrant a more heavy duty approach such as statistical learning.