Using NLP to link the subjects in sentences - python

I am wondering if there is a way to use NLP (specifically the nltk module in python) to find similarities between the subjects within sentences. The problem is that the texts refer back to subjects within a separate sentence, and don't specifically refer to them by name (E.g. www.legaltips.org/Alabama/alabama_code/2-2-30.aspx). Any ideas or experience with this would be super helpful.

The short answer to your question is yes. :)
It sounds like the problem you are trying to solve is what we call anaphora or co-reference resolution in NLP - although that only refers to tracking the same referent through different sentences. You can try getting started here: http://nlp.stanford.edu/software/dcoref.shtml
If you want to find simply similarities then this is a different problem entirely - you should let people know what kind of similarities you are talking about - semantic, syntatic, etc... and then you can get an answer (if that is your problem).

Related

Import Novels/Non-Fiction from txt Files

I study literature and am trying to work out how I would go about importing a series of novels from .txt or other formats into python to play around with different word frequencies, similarities, etc. I hope to try and establish some quantitative ways to define a genre beyond just subject matter.
I particularly want to see if certain word strings, concepts, and locations occur in each of these novels. Something like this: (http://web.uvic.ca/~mvp1922/modmac/). I would then like to focus in on one novel, using the past data as comparison and also analyzing it separately for character movement and interactions with other characters.
I am very sorry if this is vague, unclear, or just a stupid question. I am just starting out.
Welcome to StackOverflow!
This is a really, really big topic. If you're just getting started, I would recommend this book, which walks you through some of the basics of NLP using Python's nltk library. (If you're already experienced with Python and just not NLP, some parts of the book will be a bit elementary.) I've used this book in teaching university-level courses and had a good experience with it.
Once you've got the fundamentals under your belt, it sounds like you basically have a text classification (or possibly clustering) problem. There are a lot of good tutorials out there on this topic, including many that use Python libraries, such as scikit-learn. For more efficient Googling, other topics you'll want to explore are "bag of words" (analysis that ignores sentence structure, most likely the approach you'll start with) and "named-entity recognition" (if you want to identify characters, locations, etc.).
For future questions, the best way to get helpful answers on SO is to post specific examples of code that you're struggling with - this is a good resource on how to do that. Many users will avoid open-ended questions but jump all over puzzles with a clear, specific problem to solve.
Happy learning!

String similarity (semantic meaning) in python

How can I calculate the string similarity (semantic meaning) between 2 string?
For example if I have 2 string like "Display" and "Screen" the string similarity must be close to 100%
If I have "Display" and "Color" the screen similarity must be close to 0%
I'm writing my script in Python... My question is if exists some library or framework to do this kind or think... In alternative can someone suggest me a good approach?
Based on your examples, I think you are looking for semantical similarity. You can do this for instance by using WordNet, but you will have to add for instance that you are working with nouns and possible iterate over the different meanings of the word. The link shows two examples that calculate the similarity according to various implementations.
Most implementations are however computationally expensive: they make use of a large amount of text to calculate how often two words are close to each other, etc.
What you're looking to solve is an NLP problem; which, if you're not familiar with, can be a hassle. The most popular library out there is NTLK, which has a lot of AI tools. A quick google of what you're looking for yields logic of semantics: http://www.nltk.org/book/ch10.html
This is a computationally heavy process, since it involves loading a dictionary of the entire English language. If you have a small subset of examples, you might be better off creating a mapping yourself.
I am not good at in NPL, but I think Levenshtein Distance Algorithm can help you solve this problem.Becuase I use this algorithm to calculate the similarity between to strings. And the preformance is not bad.
The following are my CPP code, click the link, maybe you can transform the code to Python.I will post the Python code later.
If you understance Dynamic Programming, I think you can understande it.
enter link description here
Have a look in following libraries:
https://simlibrary.wordpress.com/
http://sourceforge.net/projects/semantics/
http://radimrehurek.com/gensim/
Check out word2vec as implemented in the Gensim library. One of its features is to compute word similarity.
https://radimrehurek.com/gensim/models/word2vec.html
More details and demos can be found here.
I believe this is the state of the art right now.
As another user suggested, the Gensim library can do this using the word2vec technique. The below example is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download training data from Google News, and load it up
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. From your examples:
word_vectors.similarity('display', 'color')
>>> 0.3068566
word_vectors.similarity('display', 'screen')
>>> 0.32314363
Compare those resulting numbers and you will see the words display and screen are more similar than display and color are.

Generate Random Sentence From Grammar or Ngrams?

I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, and then generate all valid sentences from this list, then selecting a random answer.
Load in corpora to train ngrams, which then can be used to construct a sentence.
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm getting it right and if the purpose is to test yourself on the vocabulary you already have learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use ngrams. They stick together words at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There's any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.

Guess tags of a paragraph programmatically using python

I've trying to read about NLP in general and nltk in specific to use with python. I don't know for sure if what am looking for exists out there, or if I perhaps need to develop it.
I have a program that collect text from different files, the text is extremely random and talks about different things. Each file contains a paragraph or 3 maximum, my program opens the files and store them into a table.
My question is, can i guess tags of what the paragraph is about? if anyone knows of an existing technology or approach, I would really appreciate it.
Thanks,
Your task is called "document classification", and the nltk book has a whole chapter on it. I'd start with that.
It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (select the N most important words or phrases in the text)?
You should train a classifier, the easiest one to develop (and you don't really need to develop it as NLTK provides one) is the naive baesian. The problem is that you'll need to classify manually a corpus of observations and then have the program guess what tag best fits a given paragraph (needless to say that the bigger the training corpus the more precise will be your classifier, IMHO you can reach a 80-85% of correctness). Take a look at the docs.

How do I approach this named-entity classification task?

I am asking a related question here but this question is more general. I have taken a large corpora and annotated some words with their named-entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named-entities. For example, following is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there anyway I can train a classifier to recognize new named-entities? For instance, given a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of 2 approaches.
First is pattern matching over words in sentence. Something like this (pseudocode, though it is similar to NLTK chunk parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These 2 patterns can (roughly) catch 2 parts of your first sentence. This is a good choice if you have not very much kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. Drawback is that this model is hard to extend/modify.
Another approach is to mine dependencies between words in sentence, for example, with Stanford Dependency Parser. Among other things, it allows to mine object, subject and predicate, that seems very similar to what you want: in your first sentence "robot" is subject, "had" is predicate and "glitch" is object.
You could try object role modeling at http://www.ormfoundation.com/ which looks at the semantics(facts) between one or more entities or names and their relationships with other objects. There are also tools to convert the orm models into xml and other languages and vice versa. See http://orm.sourceforge.net/

Categories