I am new to NLTK. I would like to create the negative of a sentence (which will usually be in the present tense). For example, is there a function to allow me to convert:
'I run' to 'I do not run'
or
'She runs' to 'She does not run'.
I suppose I could use POS tagging to detect the verb and its preceding pronoun, but I just wondered if there was a simpler built-in function.
No, there is not. More importantly, it is quite a complex problem, one that can be a topic of research, not something a "simple built-in function" could solve. Such an operation requires semantic analysis of the sentence. Think, for example, about "I think that I could run faster": which of the 3 verbs should be negated? We know it is "think", but to an algorithm they all look the same. Even deciding whether you should use "do" or "does" is not so easy. Consider "Mary and Jane walked down the road" and "Jane walked down the road": without a parse tree you won't be able to resolve the singular/plural problem. To sum up, there is no simple solution, and there cannot be one. You can design any kind of heuristic you want (the proposed POS-based negation is one such), and if it fails, start doing research in this area.
You should use a parser to find the head (verb) of the predicate of the sentence.
In case you assume that the original sentence is grammatically correct you can overcome the agreement issue (don't vs. doesn't) by relying on the properties of the original head-verb.
If it's an auxiliary¹, replace it with its negative counterpart (was > wasn't, will > won't, have > haven't, etc.). If it's not an auxiliary, add the correct negative form of supportive-do: didn't if the head verb is in the past form (e.g., walked), don't if it is in the non-3rd-person-singular present form (e.g., think), and doesn't if it is in the 3rd-person-singular present form (e.g., runs). Immediately after the supportive-do, use the base form of the original head verb (walk, think, run).
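As a rough sketch of that rule (assuming, for illustration, that the head verb is simply the first verb the tagger finds rather than one identified by a real parser, and using a small hand-made table of auxiliaries):

import nltk  # needs the punkt, averaged_perceptron_tagger and wordnet data packages

AUX = {"is": "isn't", "are": "aren't", "was": "wasn't", "were": "weren't",
       "will": "won't", "can": "can't", "has": "hasn't", "have": "haven't", "had": "hadn't"}

def negate(sentence):
    tokens = nltk.word_tokenize(sentence)
    for i, (word, tag) in enumerate(nltk.pos_tag(tokens)):
        if word.lower() in AUX:
            # auxiliary: swap in its negative counterpart
            # (beware of lexical uses, as in "She has a dog" -- see the footnote)
            return " ".join(tokens[:i] + [AUX[word.lower()]] + tokens[i + 1:])
        if tag.startswith("VB"):
            # lexical verb: insert the right form of supportive-do, then the base form
            do = {"VBD": "didn't", "VBZ": "doesn't"}.get(tag, "don't")
            base = nltk.stem.WordNetLemmatizer().lemmatize(word, "v")
            return " ".join(tokens[:i] + [do, base] + tokens[i + 1:])
    return sentence

print(negate("She runs"))  # She doesn't run
print(negate("I run"))     # I don't run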
A harder issue to solve is what ShaiCohen is discussing in his answer. Notice that you don't always have to replace these items. There are many cases where you shouldn't. For example: I am the one who saw someone at the office > I'm not the one who saw someone at the office.
Have a look at the Contextors API.
¹ Be careful of lexical verbs that look like auxiliaries: She has a dog...
In addition to the challenges discussed in the previous answer, there is the challenge posed by negative polarity items: lexical items that require a preceding non-affirmative element. Consider the following sentences:
a. I didn’t see anyone at the office
b. * I saw anyone at the office
c. I saw someone at the office
The positive form of (a) is not (b) but (c), where anyone is replaced by someone.
Negative polarity items also present a challenge in the context of paraphrasing tasks like changing the voice of a sentence from active to passive and vice versa. You can read more about this topic in the post: Voice Alternation and Negative Polarity Items.
I'm currently working on a neural network that evaluates students' answers to exam questions, so I need to preprocess the corpora for a Word2Vec network. Hyphenation in German texts is quite common. There are mainly two different types of hyphenation:
1) End of line:
The text reaches the end of the line so the last word is sepa-
rated.
2) Short form of enumeration:
in case of two "elements":
Geistes- und Sozialwissenschaften
more "elements":
Wirtschafts-, Geistes- und Sozialwissenschaften
The de-hyphenated form of these enumerations should be:
Geisteswissenschaften und Sozialwissenschaften
Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften
I need to remove all hyphenations and put the words back together. I already found several solutions for the first problem.
But I have absolutely no clue how to get the second part (in the example above, "wissenschaften") of the words in the enumeration problem. I don't even know if it is possible at all.
I hope that I have pointed out my problem properly.
So, does anyone have an idea how to solve this problem?
Thank you very much in advance!
It's surely possible, as the pattern seems fairly regular. (Something vaguely analogous is sometimes seen in English. For example: The new requirements applied to under-, over-, and average-performing employees.)
The rule seems to be roughly, "when you see word-fragments with a trailing hyphen, and then an und, look for known words that begin with the word-fragments, and end the same as the terminal-word-after-und – and replace the word-fragments with the longer words".
Not being a German speaker, and without language-specific knowledge, I can't know exactly where the breaks are appropriate. That is, in your Geistes- und Sozialwissenschaften example, it's unclear whether the first fragment should become Geisteszialwissenschaften or Geisteswissenschaften or Geistesenschaften or Geistesaften or any other shared suffix with Sozialwissenschaften. But if you've got a dictionary of word fragments, or word-frequency info from other text that uses the same full-length word(s) without this particular enumeration-hyphenation, that could help choose.
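As a minimal sketch of that heuristic, assuming you already have some German vocabulary with frequency counts (the tiny vocab dict below is made up for illustration):

# Toy vocabulary; in practice, collect counts from the non-hyphenated parts of your corpus.
vocab = {"Geisteswissenschaften": 12, "Sozialwissenschaften": 20, "Wirtschaftswissenschaften": 8}

def expand_fragments(phrase, vocab):
    tokens = phrase.split()
    tail = tokens[-1]  # the full compound after "und", e.g. "Sozialwissenschaften"
    out = []
    for tok in tokens:
        word, punct = (tok[:-1], ",") if tok.endswith(",") else (tok, "")
        if word.endswith("-"):
            frag = word[:-1]  # e.g. "Geistes"
            # known words that start with the fragment and share a suffix with the terminal word
            candidates = [w for w in vocab
                          if w.startswith(frag) and len(w) > len(frag)
                          and tail.endswith(w[len(frag):])]
            if candidates:
                word = max(candidates, key=vocab.get)  # most frequent candidate wins
        out.append(word + punct)
    return " ".join(out)

print(expand_fragments("Wirtschafts-, Geistes- und Sozialwissenschaften", vocab))
# Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften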
(If there's more than one plausible suffix based on known words, this might even be a possible application of word2vec: the best suffix to choose might well be the one that creates a known-word that is closest to the terminal-word in word-vector-space.)
Since this seems a very German-specific issue, I'd try asking in forums specific to German natural-language-processing, or to libraries with specific German support. (Maybe, NLTK or Spacy?)
But also, knowing word2vec, this sort of patch-up may not actually be that important to your end-goals. Training without this logical-reassembly of the intended full words may still let the fragments achieve useful vectors, and the corresponding full words may achieve useful vectors from other usages. The fragments may wind up close enough to the full compound words that they're "good enough" for whatever your next regression/classifier step does. So if this seems a blocker, don't be afraid to just try ignoring it as a non-problem. (Then if you later find an adequate de-hyphenation approach, you can test whether it really helped or not.)
I'm writing a program that corrects 'a/an' vs. 'the' article usage. I've been able to handle the plural case (the article is always 'the' when the corresponding noun is plural).
I'm stumped on how to solve this issue for singular nouns. Without context, both "an apple" and "the apple" are correct. How would I approach such cases?
I don't think this is something you will be able to get 100% accuracy on, but it seems to me that one of the most important cues is previous mention. If no apple has been mentioned before, then it is a little odd to say 'the apple'.
A very cheap (and less accurate) approach is to literally check for a token 'apple' in the preceding context and use that as a feature, possibly in conjunction with many other features, such as:
position in text (definiteness becomes likelier as the text progresses)
grammatical function via a dependency parse (grammatical subjects more likely to be definite)
phrase length (definite mentions are typically shorter, fewer adjectives)
etc. etc.
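A tiny sketch of what the string-lookup feature (plus one of the other cues) could look like; the feature names are just illustrative, and in practice these would feed a classifier such as nltk.NaiveBayesClassifier:

import nltk  # needs the punkt tokenizer data

def definiteness_features(tokens, noun_index):
    # Toy features for choosing 'the' vs. 'a/an' for the noun at noun_index.
    noun = tokens[noun_index].lower()
    return {
        "previously_mentioned": noun in (t.lower() for t in tokens[:noun_index]),
        "relative_position": noun_index / float(len(tokens)),  # definiteness likelier later on
    }

tokens = nltk.word_tokenize("John bought an apple. Later he ate the apple.")
print(definiteness_features(tokens, len(tokens) - 2))  # the second 'apple'
# {'previously_mentioned': True, 'relative_position': ...}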
A better but more complex approach would be to insert "the" and then use a coreference resolution component to attempt to find a previous mention. Although automatic coreference resolution is not perfect, it is the best way to determine whether there is a previous mention using NLP, and most systems will also attempt to resolve non-trivial cases, such as "John has Malaria ... the disease", which a simple string lookup will miss, as well as distinguishing non-co-referring mentions: a red apple ... != a green apple.
Finally, there is a large class of nouns that can appear with a definite article despite not being mentioned previously, including names ("the Olympic Games"), generics ("the common ant"), contextually inferable words ("pass the salt") and uniquely identifiable entities ("the sun"). All of these could be learned from a training corpus, but that would probably require a separate classifier.
Hope this helps!
I am working on a school project and have a function that recognizes a comment, finds the information in the comment, and writes it to a file. Now, how could I check an input string against a list of strings of information? For example, if I have the input
input = "How many fingers do I have?"
How do I check which of these is closest to it?
fingers = "You have 10."
pigs = "yummy"
I want it to respond with fingers. I want to match it with the variable name and not the variable's value.
I suggest you read this chapter.
This is a chapter from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper.
Detecting patterns is a central part of Natural Language Processing. Words ending in -ed tend to be past tense verbs (5). Frequent use of will is indicative of news text (3). These observable patterns — word structure and word frequency — happen to correlate with particular aspects of meaning, such as tense and topic. But how did we know where to start looking, which aspects of form to associate with which aspects of meaning?
The goal of this chapter is to answer the following questions: How can we identify particular features of language data that are salient for classifying it? How can we construct models of language that can be used to perform language processing tasks automatically? What can we learn about language from these models?
It's all described in Python, and it works very well.
http://www.nltk.org/book/ch06.html
Also, processing the text by matching a keyword against a variable name is neither good practice nor efficient; I wouldn't recommend it.
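As a hedged sketch of the chapter's approach applied to your example (the extra training sentences and the label names are made up; you'd want many paraphrases per label in practice):

import nltk  # needs the punkt tokenizer data

def features(sentence):
    # bag-of-words features, as in chapter 6 of the NLTK book
    return {word.lower(): True for word in nltk.word_tokenize(sentence)}

train = [("How many fingers do I have?", "fingers"),
         ("Count my fingers", "fingers"),
         ("Are pigs tasty?", "pigs"),
         ("Tell me something about pigs", "pigs")]
classifier = nltk.NaiveBayesClassifier.train([(features(s), label) for s, label in train])

responses = {"fingers": "You have 10.", "pigs": "yummy"}
label = classifier.classify(features("How many fingers do I have?"))
print(responses[label])  # You have 10.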
Python provides the NLTK library, which is a vast resource of texts and corpora, along with a slew of text mining and processing methods. Is there any way we can compare sentences based on the meaning they convey, for a possible match? That is, an intelligent sentence matcher?
For example, take giggling at bad jokes and I like to laugh myself silly at poor jokes. Both convey the same meaning, but the sentences don't remotely match (the words are different, and Levenshtein distance would fail badly!).
Now imagine we have an API which exposes functionality such as found here. Based on that, we have mechanisms to find out that the words giggle and laugh match in the meaning they convey. Bad won't match up to poor, so we may need to add further layers (they match in the context of words like joke, since a bad joke is generally the same as a poor joke, although a bad person is not the same as a poor person!).
A major challenge would be to discard parts that don't much alter the meaning of the sentence. So the algorithm should return the same degree of match between the first sentence and this: I like to laugh myself silly at poor jokes, even though they are completely senseless, full of crap and serious chances of heart-attack!
So, with that available, is there any algorithm like this that has been conceived yet? Or do I have to reinvent the wheel?
You will need a more advanced topic modeling algorithm, and of course some corpora to train your model, so that you can easily handle synonyms like giggle and laugh!
In Python, you can try this package: http://radimrehurek.com/gensim/
I've never used it, but it includes classic semantic vector space methods like LSA/LSI, random projection, and even LDA.
My personal favourite is random projection, because it is faster and still very effective (though I'm doing it in Java with another library).
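A minimal gensim sketch along those lines, using LSI on a toy corpus (with only a handful of sentences the similarities won't be meaningful; giggle and laugh only end up close if the training corpus supports it):

from gensim import corpora, models, similarities

documents = ["giggling at bad jokes",
             "I like to laugh myself silly at poor jokes",
             "the stock market crashed today"]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # tiny num_topics for a toy corpus
index = similarities.MatrixSimilarity(lsi[corpus])

query = dictionary.doc2bow("laughing at poor jokes".split())
print(list(index[lsi[query]]))  # similarity of the query to each document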
Background (TL;DR, provided for the sake of completeness)
Seeking advice on an optimal solution to an odd requirement.
I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.
Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book.
I'm working with fragmentary, atomic language. Users input words and sentence fragments, and WordNet is used to find connections between the inputs, and generate new words and sentences/fragments. My question is about turning an uninflected word from WordNet (a synset) into something that makes sense contextually.
The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.)
Example scenario
Let's assume we have a chunk of a poem, to which users are adding new inputs to. The new results need to be inflected in a grammatically sensible way.
The river bears no empty bottles, sandwich papers,
Silk handkerchiefs, cardboard boxes, cigarette ends
Or other testimony of summer nights. The sprites
Let's say now, it needs to print 1 of 4 possible next words/synsets: ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The sprites blue' seems grammatically odd/unlikely. From there it could use either of these verbs.
If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The sprites have' and the sensibly-inflected result will provide better context for future results ...)
I'd like for 'departure' to be a valid possibility in this case; while 'The sprites departure' doesn't make sense (it's not "sprites'"), 'The sprites departed' (or other verb conjugations) would.
Seemingly 'The sprites quick' wouldn't make sense, but something like 'The sprites quickly [...]' or 'The sprites quicken' could, so 'quick' is also a possibility for sensible inflection.
Breaking down the tasks
Tag part of speech, plurality, tense, etc. -- of original inputs. Taking note of this could help to select from the several possibilities (i.e. choosing between had/have/having could be more directed than random if a user had inputted 'having' rather than some other tense). I've heard the Stanford POS tagger is good, which has an implementation in NLTK. I am not sure how to handle tense detection here.
Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The sprites' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparison of the current stuff with sequences in tagged corpora (and/or Markov chains?) -- or consultation of grammar-checking functions -- could provide a solution for this.
Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g., the picked word is a verb, but a few tenses are possible), select randomly. Regardless, I'm going to need to morph/inflect the word.
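For steps 1 and 5, a rough sketch of what I have in mind. nltk.pos_tag handles the tagging; for the actual inflection I'm assuming the third-party pattern.en library (not part of NLTK), since WordNet itself only stores uninflected lemmas:

import nltk                                    # needs the averaged_perceptron_tagger data
from pattern.en import conjugate, pluralize    # third-party 'pattern' library, assumed here

# Step 1: tag the context to note part of speech / plurality
print(nltk.pos_tag(nltk.word_tokenize("The sprites")))  # e.g. [('The', 'DT'), ('sprites', 'NNS')]

# Step 5: inflect a chosen root word to agree with the plural subject
print(conjugate("have", tense="present", person=3, number="plural"))  # 'have', not 'has'
print(conjugate("depart", tense="past"))                              # 'departed'
print(pluralize("bottle"))                                            # 'bottles'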
I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.
I think that the comment above about n-gram language models fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context for the target word (i.e., you don't have the rest of the sentence available at query time). Language models, on the other hand, handle the past (left context) efficiently, especially for windows of up to 5 words. The problem with n-grams is that they don't model long-distance dependencies (longer than n words).
NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html. A tag lexicon may help you smooth the model more.
The steps as I see them:
1. Get a set of words from the users.
2. Create a larger set of all possible inflections of the words.
3. Ask the model which inflected word is most probable.
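The nltk.model link above points at an older NLTK; as a hedged stand-in for step 3, here is a toy bigram model built from the Brown corpus that scores candidate inflections given the previous word (a real setup would smooth the counts and use a longer context):

import nltk
from nltk.corpus import brown  # needs the 'brown' corpus data package

# conditional frequency of each word given the previous word
cfd = nltk.ConditionalFreqDist(nltk.bigrams(w.lower() for w in brown.words()))

previous_word = "she"
candidates = ["have", "has", "had", "having"]
scores = {w: cfd[previous_word][w] for w in candidates}
print(scores, "->", max(scores, key=scores.get))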