I study literature and am trying to work out how to import a series of novels from .txt or other formats into Python so I can play around with different word frequencies, similarities, etc. I hope to establish some quantitative ways to define a genre beyond just subject matter.
I particularly want to see if certain word strings, concepts, and locations occur in each of these novels. Something like this: (http://web.uvic.ca/~mvp1922/modmac/). I would then like to focus on one novel, using the earlier data for comparison and also analyzing it separately for character movement and interactions with other characters.
I am very sorry if this is vague, unclear, or just a stupid question. I am just starting out.
Welcome to StackOverflow!
This is a really, really big topic. If you're just getting started, I would recommend this book, which walks you through some of the basics of NLP using Python's nltk library. (If you're already experienced with Python and just not NLP, some parts of the book will be a bit elementary.) I've used this book in teaching university-level courses and had a good experience with it.
Once you've got the fundamentals under your belt, it sounds like you basically have a text classification (or possibly clustering) problem. There are a lot of good tutorials out there on this topic, including many that use Python libraries, such as scikit-learn. For more efficient Googling, other topics you'll want to explore are "bag of words" (analysis that ignores sentence structure, most likely the approach you'll start with) and "named-entity recognition" (if you want to identify characters, locations, etc.).
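To give you a feel for a bag-of-words approach, here is a minimal scikit-learn sketch; it assumes your novels are plain UTF-8 .txt files, and the file names are placeholders:

```python
# A minimal bag-of-words sketch with scikit-learn: each novel becomes a vector
# of word counts, and cosine similarity compares the two vectors.
# The file names below are placeholders for your own .txt files.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = []
for path in ("novel_a.txt", "novel_b.txt"):
    with open(path, encoding="utf-8") as f:
        texts.append(f.read())

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)      # one row of word counts per novel

print(cosine_similarity(counts)[0, 1])        # similarity between the two novels
```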
For future questions, the best way to get helpful answers on SO is to post specific examples of code that you're struggling with - this is a good resource on how to do that. Many users will avoid open-ended questions but jump all over puzzles with a clear, specific problem to solve.
Happy learning!
Related
I'm a complete beginner when it comes to NLP. Just looking for someone to point me in the right direction.
I have documents that contain lots of multiple-choice questions, the choices, and their answers (like the picture below).
I would like to build a program that is able to get each question, its choices, and the answer. The problem is that not every document follows the exact same format/spacing, so I want to build an all-encompassing program that is able to account for the various formats. Is there anything within NLTK, scikit-learn, or TensorFlow that can help me do this?
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
I am working on a project wherein I have to extract the following information from a set of articles (the articles could be on anything):
People: find the names of any people present, like "Barack Obama"
Topic: related tags of the article, like "Parliament" or "World Energy"
Company/Organisation: the names of any companies or organisations mentioned, like "Apple" or "Google"
Is there an NLP framework/library of this sort available in Python which would help me accomplish this task?
#sel and #3kt gave really good answers. OP, you are looking for entity extraction, commonly referred to as Named Entity Recognition (NER). There are many APIs that perform this, but the first question you need to ask yourself is:
What is the structure of my DATA? or rather,
Are my sentences good English sentences?
That is, figure out whether the data you are working with is consistently grammatical, well capitalized, and well structured. These factors are paramount when it comes to extracting entities. The data I worked with were tweets. ABSOLUTE NIGHTMARE!! I performed a detailed analysis of how various APIs perform on entity extraction, and I shall share with you what I found.
Here are the APIs that perform fabulous entity extraction:
NLTK has a handy reference book which talks in depth about its functions, with multiple examples. NLTK does not perform well on noisy data (tweets) because it has been trained on structured data. NLTK is absolute garbage for badly capitalized words (e.g., DUCK, Verb, CHAIR). Moreover, it is slightly less precise when compared to other APIs. It is great for structured data or curated data from news articles and scholarly reports. It is a great learning tool for beginners.
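For a sense of what the NLTK pipeline looks like on a well-formed sentence, here is a minimal sketch (the example sentence is mine, and the exact download names can vary between NLTK releases):

```python
# A small NLTK sketch of that pipeline on a well-formed sentence: tokenize,
# POS-tag, then chunk named entities.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Barack Obama spoke to Parliament about Apple and Google."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)   # tree with PERSON / ORGANIZATION / GPE subtrees

for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```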
Alchemy is simpler to implement and performs very well in categorizing the named entities. It has great precision when compared to the APIs I have mentioned. However, it has a certain transaction cost: you can only perform 1000 queries in a day! It identifies Twitter handles and can handle awkward capitalization.
IMHO the spaCy API is probably the best. It's open source. It outperforms the Alchemy API, though it is not quite as precise, and it categorizes entities almost as well as Alchemy.
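A minimal spaCy sketch, assuming you have already installed the small English model with python -m spacy download en_core_web_sm:

```python
# A minimal spaCy NER sketch; assumes the small English model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama met executives from Apple and Google in London.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Barack Obama" PERSON, "Apple" ORG
```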
Choosing an API should be a simple problem for you now that you know how each one is likely to behave on the data you have.
EXTRA -
POLYGLOT is yet another API.
Here is a blog post that demonstrates entity extraction in NLTK.
There is a beautiful paper by Alan Ritter that might go over your head. But it is the standard for entity extraction (particularly in noisy data) at a professional level. You could refer to it every now and then to understand complex concepts like LDA or SVM for capitalisation.
What you are actually looking for is called, in the literature, 'Named Entity Recognition' or NER.
You might like to take a look at this tutorial:
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
One easy way of partially solving this problem is using regular expressions to extract words matching the patterns you can find in this paper for extracting people's names. This might, of course, also lead to extracting the other categories you are looking for, i.e. the topics and the companies' names as well.
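For illustration only (this pattern is my own, not the one from the paper), a crude regex for runs of capitalized words might look like this:

```python
# Treats runs of two or more capitalized words as candidate names.
import re

text = "Barack Obama met Tim Cook while visiting Apple Park in California."
candidates = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)
print(candidates)   # ['Barack Obama', 'Tim Cook', 'Apple Park']
```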
There is also an API you can use that actually gives the results you are looking for, called Alchemy. Unfortunately, no documentation is available to explain the method they use to extract the topics or the people's names.
Hope this helps.
You should take a look at NLTK.
Finding names and companies can be achieved by tagging the recovered text and extracting proper nouns (tagged NNP). Finding the topic is a bit more tricky, and may require some machine learning on a given set of articles.
Also, since we're talking about articles, I recommend the newspaper module, which can retrieve them from their URLs and do some basic NLP operations (summary, keywords).
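Here is a rough sketch tying the two suggestions together; the URL is a placeholder, and it assumes the newspaper3k package and NLTK are installed:

```python
# Fetch an article with newspaper (placeholder URL) and keep the proper nouns.
import nltk
from newspaper import Article

for pkg in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

article = Article("https://example.com/some-news-story")
article.download()
article.parse()
article.nlp()                         # fills in article.summary and article.keywords

tokens = nltk.word_tokenize(article.text)
proper_nouns = [w for w, tag in nltk.pos_tag(tokens) if tag in ("NNP", "NNPS")]

print(article.keywords)
print(proper_nouns)                   # candidate people, companies, places, etc.
```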
I am wondering if there is a way to use NLP (specifically the nltk module in python) to find similarities between the subjects within sentences. The problem is that the texts refer back to subjects within a separate sentence, and don't specifically refer to them by name (E.g. www.legaltips.org/Alabama/alabama_code/2-2-30.aspx). Any ideas or experience with this would be super helpful.
The short answer to your question is yes. :)
It sounds like the problem you are trying to solve is what we call anaphora or co-reference resolution in NLP - although that only refers to tracking the same referent through different sentences. You can try getting started here: http://nlp.stanford.edu/software/dcoref.shtml
If you want to simply find similarities then that is a different problem entirely. You should let people know what kind of similarities you are talking about (semantic, syntactic, etc.) and then you can get an answer (if that is your problem).
I want to make something like that, but much simpler. I am using web2py and Python.
For example:
I make two profiles for two individuals. I want each of them to be able to choose 10 movies they like. After that, the program compares the movies and outputs the percentage of similarity.
Again, something like how OkCupid makes you answer questions, then compares your answers to someone else's and tells you how well your answers match theirs.
I am a beginner and need help, or at least pointers on what to look into and study. A basic, generic example would be great.
The best introduction to this area that I have seen is Programming Collective Intelligence by Toby Segaran. There is a section in the introductory chapter on 'Finding similar users' that does exactly what you are intending, and it is even in Python.
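As a very small taste of that 'finding similar users' idea, here is a hedged sketch (not the book's code) that scores two users by the overlap of their ten-movie lists; the profiles and titles are made up:

```python
# Score two users by the overlap of their liked-movie lists
# (Jaccard similarity, expressed as a percentage).
def similarity_percent(movies_a, movies_b):
    a, b = set(movies_a), set(movies_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)

alice = ["Alien", "Heat", "Clue", "Seven", "Fargo", "Big", "Jaws", "Up", "Her", "Brazil"]
bob = ["Alien", "Heat", "Jaws", "Up", "Rocky", "Speed", "Ghost", "Antz", "Babe", "Elf"]

print(similarity_percent(alice, bob))   # 4 shared out of 16 distinct titles -> 25.0
```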
If you want to play with a recommendation system (a user likes these ten movies, which other movies might she also like?), then you may want to try the python-recsys library which has an example using the movielens dataset.
Finally, welcome to SO and a recommendation for best participation in this community: try to come back with specific code-related questions, something along the lines of: "I have such-and-such a coding problem. This is what I have tried, can you help me move it further?"
I am looking for references (tutorials, books, academic literature) concerning structuring unstructured text in a manner similar to the google calendar quick add button.
I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"
to: Brand: Levi, Size: 32, Category: Jeans, Code: A0b293
I imagine it would be some combination of lexical parsing and machine learning techniques.
I am rather language agnostic, but if pushed would prefer Python, MATLAB, or C++ references.
Thanks
You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...
Assuming the worst-case scenario, you need to start learning NLP. A very good free book is the documentation of NLTK: http://www.nltk.org/book . It is also a very good introduction to Python, and the software is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is nowhere near where you imagine it is.
Assuming a better scenario (your text is semi-structured), a good free tool is pyparsing. There is a book, plenty of examples, and the resulting code is extremely attractive.
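To give a feel for it, here is a hedged pyparsing sketch for the example string from your question; the grammar is invented for illustration and would need adjusting for real data:

```python
# A toy grammar for strings like "Levi jeans size 32 A0b293".
from pyparsing import CaselessKeyword, Suppress, Word, alphanums, alphas, nums

brand = Word(alphas)("brand")
category = Word(alphas)("category")
size = Suppress(CaselessKeyword("size")) + Word(nums)("size")
code = Word(alphanums, exact=6)("code")

item = brand + category + size + code

result = item.parseString("Levi jeans size 32 A0b293")
print(result.brand, result.category, result.size, result.code)
```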
I hope this helps...
Possibly look at "Collective Intelligence" by Toby Segaran. I seem to remember it addressing the basics of this in one chapter.
After some research I have found that this problem is commonly referred to as Information Extraction, and I have amassed a few papers and stored them in a Mendeley collection:
http://www.mendeley.com/research-papers/collections/3237331/Information-Extraction/
Also, as Tai Weiss noted, NLTK for Python is a good starting point, and this chapter of the book looks specifically at information extraction.
If you are only working on cases like the example you cited, you are better off using a manual, rule-based approach that is 100% predictable and covers 90% of the cases it might encounter in production.
You could enumerate lists of all possible brands and categories and detect which is which in an input string, since there's usually very little intersection between these two lists.
The other two could easily be detected and extracted using regular expressions (1-3 digit numbers are always sizes, etc.).
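A rough sketch of what that rule-based approach could look like (the brand/category lists and the regexes are made up for illustration):

```python
# Known brand/category lists plus simple regexes for the size and code.
import re

BRANDS = {"levi", "wrangler", "diesel"}
CATEGORIES = {"jeans", "shirt", "jacket"}

def extract(text):
    info = {}
    for token in text.split():
        low = token.lower()
        if low in BRANDS:
            info["brand"] = token
        elif low in CATEGORIES:
            info["category"] = token
    size = re.search(r"\b(\d{1,3})\b", text)         # 1-3 digit numbers are sizes
    code = re.search(r"\b([A-Za-z]\w{5,})\b", text)  # letter plus 5+ word characters
    if size:
        info["size"] = size.group(1)
    if code:
        info["code"] = code.group(1)
    return info

print(extract("Levi jeans size 32 A0b293"))
# {'brand': 'Levi', 'category': 'jeans', 'size': '32', 'code': 'A0b293'}
```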
Your problem domain doesn't seem big enough to warrant a more heavy duty approach such as statistical learning.