I'm building a corpus of texts harvested alongside some metadata from HTML with BeautifulSoup. It would be really helpful if I could call Mallet from within Python, and have it model topics from Python strings, rather than from text files in a directory. That way I could put the n keywords located by Mallet into each file.
I get a message saying that Mallet has been recognised when I run:
from nltk.classify import mallet
from subprocess import call
mallet.config_mallet("malletdir/mallet-2.0.7/bin")
But I haven't had any luck with the next steps, and am not even sure if Mallet accepts anything other than saved files.
I have not been able to turn up any documentation that I can really understand. Has anybody seen digestible documentation for this? (The NLTK book doesn't get into Mallet.) I would also be happy to learn of any other means of topic modelling within Python that I could operationalise without a really deep knowledge of Python.
Sorry, this is my first rodeo.
In case you are still looking for a solution: Gensim (a Python topic modelling/machine learning package) has a wrapper for Mallet which is easy to use and well documented. Here are some Gensim tutorials and a specific tutorial for the Mallet wrapper. You may also want to read some installation instructions (mostly the part about setting Java memory) here, and then you'd be ready to go.
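For reference, a minimal sketch of the wrapper in use might look something like this (the texts and the Mallet path are placeholders, and it assumes a gensim version that still ships gensim.models.wrappers):

from gensim import corpora
from gensim.models.wrappers import LdaMallet

# Placeholder toy documents; in practice these would be your harvested texts.
texts = [["topic", "modelling", "corpus"], ["python", "strings", "keywords"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The path to the Mallet launcher script is an assumption; point it at your own install.
mallet_path = "malletdir/mallet-2.0.7/bin/mallet"
model = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary)
print(model.show_topics(num_topics=2, num_words=5))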
I once tried implementing Mallet in an NLTK project and I too ran into dead end after dead end. I think the main thing to keep in mind here is that Mallet is Java-based while NLTK is written in Python.
You already knew that, but my point is that I personally struggled with mixing the two technologies because I do not have a strong background in Java. I've received the same feedback from coworkers about using Mallet with Python: "Be ready to spend a lot of time debugging."
Since then I've been using the sklearn library for Python. It is aimed at machine learning more generally rather than at NLP specifically, but it can be used for NLP quite nicely. It comes with a very large selection of modelling tools, and most of it seems to rely on NumPy, so it should be pretty fast. I've used it quite a bit and can say that it is very well written and documented.
I don't want to discourage you from using Mallet, especially just because I said so. But if you are open to alternatives, I think you will find that when building projects with NLTK it's far easier to use Python modules, since NLTK itself is written in Python. I hope this helps!
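To give a feel for it, a minimal sketch of topic modelling with sklearn might look like this (the documents and parameters are made-up placeholders, and get_feature_names_out assumes a reasonably recent sklearn release):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up example documents; replace with your own corpus.
docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets rose today"]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic (use get_feature_names() on older sklearn versions).
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print("topic", i, ":", top)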
I'm using Stephen Hansen's topicmodels package to create a topic model, starting with the tutorial data available at that link. The tutorial uses Python 2, but it only requires changing xrange() to range() in order to work properly with Python 3. I was able to run through the tutorial fully, but I'm struggling to visually represent the results. First, does anyone have any resources on visualizing LDA models using this package? I haven't been able to locate any.
Primarily, I'm trying to create two kinds of graphics. First, I'm trying to create a word cloud by topic. The second is something that looks like this: (full paper available here)
Running the tutorial gives me a CSV document that contains all the information necessary to do this; it looks like the following:
The issue is that I'm unsure how to go about creating that figure from it. I'm also unsure how I could use the CSV to create word clouds by topic. There isn't much use of topicmodels in Python that I can find, and I don't have much experience creating graphics with Python. I'm lost trying to figure out how to convert my output into something I can use with word clouds or any other visual.
If anyone wants to replicate this data, you just need a C++ compiler installed on your computer; download the tutorial, adjust xrange() as described above, and run it through.
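For concreteness, the kind of thing I imagine might work for the word clouds (a rough sketch only; the filename and column names are placeholders, not my actual CSV layout) is feeding per-topic word weights into the wordcloud package:

import pandas as pd
from wordcloud import WordCloud

# Assumed CSV layout: one row per (topic, word, weight) triple.
df = pd.read_csv("topic_word_weights.csv")  # hypothetical filename
for topic_id, group in df.groupby("topic"):
    freqs = dict(zip(group["word"], group["weight"]))
    cloud = WordCloud(width=800, height=400).generate_from_frequencies(freqs)
    cloud.to_file("topic_%d.png" % topic_id)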
I want to read and understand the code of some of the basic machine learning models, like linear regression, from the Python scikit-learn package, but it is too confusing at the start. Can someone tell me where (which class) to start?
If you know the basics and want to learn about the internals (not just how to use it), then I would start with the sklearn Developer's Guide.
It explains the idea behind the API, covers some of the utility functions that are often found in the code (e.g. for checking input), and explains how C/C++ and Cython are used within the source for speed. This confused me at the beginning, as I thought it was all pure Python and did not know about Cython.
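As a toy illustration of those conventions (this is not actual sklearn source, just a sketch of the estimator API and the validation utilities mentioned above):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class MeanRegressor(BaseEstimator, RegressorMixin):
    # Toy model: always predicts the mean of the training targets.
    def fit(self, X, y):
        X, y = check_X_y(X, y)          # the input-checking utility mentioned above
        self.mean_ = np.mean(y)         # fitted attributes end with an underscore
        return self

    def predict(self, X):
        check_is_fitted(self, "mean_")  # raises an error if fit() has not been called
        X = check_array(X)
        return np.full(X.shape[0], self.mean_)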
I would suggest going through some online courses first; for example, there is an Econometrics course on Coursera that deals with those concepts:
https://www.coursera.org/learn/erasmus-econometrics/home/welcome
Most of the time, using a full course to understand a concept feels like overkill, but in general it is really worth it. The courses I posted were the ones that helped me through my PhD difficulties, and they gave a very good overview not only of the technicalities of the given package, but also of the purpose it is meant to serve.
I'm trying to model Twitter stream data with topic models. Gensim, being an easy-to-use solution, is impressive in its simplicity. It has a truly online implementation for LSI, but not for LDA. For a changing content stream like Twitter, Dynamic Topic Models are ideal. Is there any way, or even a hack - an implementation or even a strategy - by which I can use Gensim for this purpose?
Are there any other Python implementations, derived (preferably) from Gensim or independent? I prefer Python, since I want to get started ASAP, but if there is a better solution that requires some work, please mention it.
Thanks.
Gensim (http://radimrehurek.com/gensim/models/dtmmodel.html) has a Python wrapper for the original C++ code.
The DTM wrapper in Gensim is working, but none of the documentation is particularly complete at this time. On the Gensim side, the most useful thing to look at is the DTM example buried in docs/notebooks. This shows you what all of the input variables need to look like. A couple of things to note:
the DTM model has been moved into gensim.models.wrappers.dtmmodel
initialize_lda=True must be set because of a bug in the DTM code (this will be the default in future -- PR #676)
You'll also need a working compiled version of DTM itself (you provide the path to that executable). You can try using the appropriate executable from a GitHub repo, but if that doesn't work you'll probably need to compile the original code by running the included makefile.
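Putting those notes together, a call to the wrapper might look roughly like this (the corpus, the time slices, and the path to the compiled DTM binary are all placeholders):

from gensim import corpora
from gensim.models.wrappers.dtmmodel import DtmModel

# Placeholder corpus: two time slices of two documents each.
texts = [["stream", "twitter", "topic"], ["model", "topic", "twitter"],
         ["stream", "dynamic", "model"], ["dynamic", "topic", "twitter"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
time_slices = [2, 2]  # number of documents in each time period

# The path to the compiled dtm executable is an assumption; point it at your build.
model = DtmModel("path/to/dtm-binary", corpus=corpus, time_slices=time_slices,
                 num_topics=2, id2word=dictionary, initialize_lda=True)
print(model.show_topic(0, 1, 5))  # top 5 words for topic 0 in time slice 1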
I've talked with David Blei and John Lafferty about exactly this, and the answer right now is no, there aren't.
Sean Gerrish's DTM implementation has a documented memory leak, but it works on manageable collections.
I've started to learn Python and programming from scratch. I have not programmed before, so it's a new experience. I do seem to grasp most of the concepts, from variables to function definitions and modules, but I still need to learn a lot more about what the different libraries and modules do, and I lack knowledge of OOP and classes in Python.
I see people who just program in Python like that's all they have ever done and I am still just coming to grips with it.
Is there a way - some tools, or a logical methodology - that would give me an overview or a good grasp of how to handle programming problems?
For instance, I'm trying to create a parser which we need at the office. I also need to create a spider that would collect links from various websites.
Is there a sensible way of studying the various modules to see what is needed? Or is it just nose to the grindstone, working through what the documentation says?
Sorry for the lengthy question.
The MIT Intro to Computer Science course on the MIT OpenCourseWare website was taught using Python. There are 24 lectures available as videos that you can watch for free.
It's kind of academic to be sure, but it would give you a very solid foundation to start from.
Start working your way through the Essential Python Reading List, which has articles on how to code in Python and how to do it well.
If you like a more academic approach, try Learning Python by Mark Lutz.
For the use of the standard libraries, the official docs are very good. More hands-on descriptions can also be found in PyMOTW (Python Module of the Week) by Doug Hellmann.
It might be useful to get some information on Object Oriented programming (just what is the whole class thing about, and how do you tell if your classes are good/poor/indifferent). Mark Lutz' book Learning Python has an entire Part (several chapters) on OO. If this stuff is new to you, it might be helpful to take a look. Two other books I have found quite useful: The Python Cookbook (Alex Martelli, a prolific contributor here), and the Python Essential Reference (David Beazley).
Just do your project, learning what you need to along the way. By the time you do that a couple times, you'll "get" it. And you'll only improve from there.
You can also read other people's code: download X that looks interesting and read through its code to understand how it works.
Those two tips will help you learn any language. Aside from that, Dive Into Python is a great resource for learning a lot about Python.
I am trying to learn Python and am referencing the documentation for the standard Python library from the Python website. I was wondering whether this is really the only library and documentation I will need, or is there more? I do not plan to program advanced 3D graphics or anything advanced at the moment.
Edit:
Thanks very much for the responses, they were very useful. My problem is where to start on a script I have been thinking of. I want to write a script that converts images into a web format but I am not completely sure where to begin. Thanks for any more help you can provide.
For the basics, yes, the standard Python library is probably all you'll need. But as you continue programming in Python, eventually you will need some other library for some task -- for instance, I recently needed to generate a tone at a specific, but differing, frequency for an application, and pyAudiere did the job just right.
A lot of the other libraries out there generate their documentation differently from the core Python style -- it's just visually different, the content is the same. Some only have docstrings, and you'll be best off reading them in a console, perhaps.
Regardless of how the other documentation is generated, get used to looking through the Python APIs to find the functions/classes/methods you need. When the time comes for you to use non-core libraries, you'll know what you want to do, but you'll have to find how to do it.
For the future, it wouldn't hurt to be familiar with C, either. There are a number of Python libraries that are actually just wrappers around C libraries, and the documentation for the Python libraries is just the same as the documentation for the C libraries. PyOpenGL comes to mind, but it's been a while since I've personally used it.
As others have said, it depends on what you're into. The package index at http://pypi.python.org/pypi/ has categories and summaries that are helpful in seeing what other libraries are available for different purposes. (Select "Browse packages" on the left to see the categories.)
One very common library, which should also fit your current needs, is the Python Imaging Library (PIL).
Note: the latest version is still in beta, and is available only at the Effbot site.
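For the image-conversion script mentioned in your edit, PIL keeps the core very short. A rough sketch (the filenames are placeholders, and the import line differs between old PIL and the newer Pillow fork):

from PIL import Image   # on old PIL installs this may be plain "import Image"

# Open an image in whatever format it came in and save it as JPEG for the web.
img = Image.open("input.bmp")        # placeholder filename
img = img.convert("RGB")             # JPEG has no alpha channel
img.save("output.jpg", "JPEG", quality=85)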
If you're just beginning, all you'll need to know is the stuff you can get from the Python website. Failing that, a quick Google search is the fastest way to get (most) Python answers these days.
As you develop your skills and become more advanced, you'll start looking for more exciting things to do, at which point you'll naturally start coming across other libraries (for example, pygame) that you can use for your more advanced projects.
It's very hard to answer this without knowing what you're planning on using Python for. I recommend Dive Into Python as a useful resource for learning Python.
In terms of popular third-party frameworks, for web applications there's the Django framework and its associated documentation; for network stuff there's Twisted... the list goes on. It really depends on what you're hoping to do!
When the standard library doesn't provide what we need, and we don't have the time or the knowledge to implement the code ourselves, we reuse third-party libraries.
This is a common attitude regardless of the programming language.
If there's a chance that someone else has ever wanted to do what you want to do, there's a chance that someone has created a library for it. A few minutes spent Googling something like "python image library" will find you what you need, or let you know that no one has created a library for your purposes yet.