Python provides the NLTK library, a vast resource of corpora along with a slew of text-mining and processing methods. Is there any way we can compare sentences based on the meaning they convey, for a possible match? That is, an intelligent sentence matcher?
For example, take a sentence like giggling at bad jokes and I like to laugh myself silly at poor jokes. Both convey the same meaning, but the sentences don't remotely match (the words are different, so Levenshtein distance would fail badly!).
Now imagine we have an API which exposes functionality such as found here. Based on that, we have mechanisms to find out that the words giggle and laugh do match in the meaning they convey. Bad won't match up to poor, so we may need to add further layers (like matching in the context of surrounding words such as joke, since a bad joke is generally the same as a poor joke, although a bad person is not the same as a poor person!).
A major challenge would be to discard content that doesn't much alter the meaning of the sentence. So, the algorithm should return the same degree of match between the first sentence and this: I like to laugh myself silly at poor jokes, even though they are completely senseless, full of crap and serious chances of heart-attack!
So, has any algorithm like this been conceived yet? Or do I have to reinvent the wheel?
You will need a more advanced topic-modeling algorithm, and of course some corpora to train your model, so that you can easily handle synonyms like giggle and laugh!
In Python, you can try this package: http://radimrehurek.com/gensim/
I've never used it, but it includes classic semantic vector-space methods like LSA/LSI, random projections, and even LDA.
My personal favourite is random projections, because it is faster and still very efficient (I'm doing it in Java with another library, though).
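As a rough illustration only (a minimal sketch, not the answerer's own code; the toy documents and the topic count are placeholders), a gensim pipeline for semantic similarity might look like this:

from gensim import corpora, models, similarities

# toy corpus; real use needs a large training corpus
docs = [["giggle", "bad", "joke"],
        ["laugh", "silly", "poor", "joke"],
        ["serious", "chance", "heart", "attack"]]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# LSI maps bag-of-words vectors into a low-dimensional semantic space
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow_corpus])

query = dictionary.doc2bow(["laugh", "poor", "joke"])
print(index[lsi[query]])  # cosine similarity to each training document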
I would like to know if it is possible to group together similar words in the LDA output, i.e. the words generated by
doc_lda = lda_model[corpus]
for example
[(0,
'0.084*"tourism" + 0.013*"touristic" + 0.013*"Madrid" + '
'0.010*"travel" + 0.008*"half" + 0.007*"piare" + '
'0.007*"turism"')]
I would like to group tourism, touristic and turism (misspelled) together.
Would it be possible?
This is some relevant previous code:
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,                     # bag-of-words corpus
    id2word=id2word,                   # gensim Dictionary mapping ids to words
    num_topics=num_topics,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha=[0.01] * num_topics,         # sparse document-topic prior
    per_word_topics=True,
    eta=[0.01] * len(id2word.keys()))  # sparse topic-word prior
Thank you
The key thing to understand is that LDA, unlike say linear regression, requires a great deal of tuning and iteration to work properly. But it can be useful for a certain set of problems.
Your intuition is right in that 'tourism', 'touristic' and 'turism' should all be one word. The fix, however, is not at the end, where you are presented with their respective loadings, but rather early on: stemming and lemmatization (aka, stemming and lemming), adding unwanted words to your stopwords list, and some degree of other preprocessing. I'll address the first two below; the general preprocessing point I think is fairly obvious. Also, because you only gave the one set of words and loadings, it's not really fruitful to go into choosing the number of topics, as you may be doing that just fine.
Stemming/Lemming (Pick One)
This is where the science and experience part starts, as well as the frustration. But this is also where you'll make the biggest and easiest gains. It seems like 'tourism' and 'touristic' might be best combined by stemming (as tour). The truth is a lot less clear-cut, as there are cases where one beats the other. In the example below, the PorterStemmer suffers from producing non-word stems ('studi', 'cri'), while lemmatizing fails to catch that 'studies' and 'studying' are the same, though it accurately handles 'cries'/'cry'.
Using PorterStemmer
studies is studi
studying is studi
cries is cri
cry is cri
Lemmatize
studies is study
studying is studying
cries is cry
cry is cry
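A minimal sketch that reproduces the comparison above with NLTK (the wordnet data may need to be downloaded first):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet') may be required on first run
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "cries", "cry"]:
    print(word, "is", stemmer.stem(word), "/", lemmatizer.lemmatize(word))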
There are multiple stemmers such as Porter2, Snowball, Hunspell, and Paice-Husk. So, the obvious first step would be to see if any of these is more useful out of the box.
As mentioned above, lemmatization will get you a similar -- but somewhat different -- set of results.
There is no substitute for the work here. This is what separates a data scientist from a hobbyist or a data analyst with a plussed-up title. The best time to do this was in the past, so you would already have an intuition of what works best for this sort of corpus; the second-best time is now.
Iterate But Satisfice
I presume you don't have infinite resources; you have to satisfice. For the above, you might consider preprocessing your text to correct or remove misspelled words. What to do with non-English words is trickier. The easiest option is to remove them or add them to your stopwords list, but that may not be the best solution. Customizing your dictionary is an option too.
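For instance, a minimal sketch of extending NLTK's stopword list with domain-specific junk (the added words here are just placeholders):

from nltk.corpus import stopwords

# nltk.download('stopwords') may be required on first run
stop_words = set(stopwords.words('english'))
stop_words.update({'piare', 'turism'})  # hypothetical domain-specific additions

tokens = ['tourism', 'piare', 'travel', 'turism']
print([t for t in tokens if t not in stop_words])  # ['tourism', 'travel']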
Know The Current Limits
As of 2020, no one is doing a good job with code-switching; certainly no free and open-source resource is. Gridspace is about the best I know of, and while their demo is pretty amazing, they can't handle code-switching well. Now, I'm doing some induction here because I'm assuming 'piare' is Spanish for 'I will', or at least that's what Google Translate says. If that's the case, your results will be confounded. But when you look at the loading (.007), fixing it seems like more work than it would be worth.
For my project at work, I am tasked with going through a bunch of user-generated text. Some of that text contains reasons for cancelling internet service, and I need to find out how often each reason occurs. It could be that they are moving, just don't like it, bad service, etc.
While this may not necessarily be a Python question, I am wondering if there is some way I can use NLTK or Textblob in some way to determine reasons for cancellation. I highly doubt there is anything automated for such a specialized task and I realize that I may have to build a neural net, but any suggestions on how to tackle this problem would be appreciated.
This is what I have thought about so far:
1) Use stemming and tokenization and tally up the most frequent words. Easy method, not that accurate (a rough sketch of this appears after this list).
2) n-grams. Computationally intensive, but may hold some promise.
3) POS tagging and chunking, maybe find words which follow conjunctions such as "because".
4) Go through all text fields manually and keep a note of reasons for cancellation. Not efficient, defeats the whole purpose of finding some algorithm.
5) A neural net; I have absolutely no idea how, or whether it is feasible.
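A minimal sketch of option 1 above (the example text is made up):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('punkt') and nltk.download('stopwords') may be required first
text = "Cancelled because we are moving. The service was bad and too slow."
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

tokens = [stemmer.stem(t) for t in nltk.word_tokenize(text.lower())
          if t.isalpha() and t not in stop_words]
print(nltk.FreqDist(tokens).most_common(5))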
I would really appreciate any advice on this.
Don't worry if this answer is too general or you can't understand something - this is academic stuff and needs some basic preparation. Feel free to contact me with questions if you want (ask for my mail in a comment or something; we'll figure something out).
I think that this question is more suited for CrossValidated.
Anyway, the first thing you need to do is create a training set. You need to find as many documents with reasons as you can and annotate them, marking the phrases that specify a reason. The more documents, the better.
If you're going to work with user reports, use example reports, so that the training data and the real data come from the same source.
This is how you'll build some kind of corpus for your processing.
Then you have to specify which features you'll need. These may be POS tags, n-gram features, lemmas/stems, etc. This takes experimentation and some practice. Here I'd use some n-gram features (probably 2-grams or 3-grams) and maybe some knowledge based on WordNet.
The last step is building your chunker or annotator. It is a component that takes your training set, analyses it and learns what it should mark.
You'll meet something called the "semantic gap" - this term describes the situation where your program "learned" something other than what you wanted (it's a simplification). For example, you might use a set of features such that your chunker learns to find "I don't" phrases instead of reason phrases. It really depends on your training set and feature set.
If that happens, you should try changing your feature set; if after a while that doesn't help, work on the training set, as it may not be representative.
How do you build such a chunker? For your case I'd use an HMM (Hidden Markov Model) or - even better - a CRF (Conditional Random Field). These two are statistical methods commonly used for stream annotation, and your text is basically a stream of tokens. Another approach would be to take any "standard" classifier (from Naive Bayes, through decision trees and NNs, to SVMs) and use it on every n-gram in the text.
Of course, the choice of feature set is highly dependent on the chosen method, so read up on them and choose wisely.
PS. This is an oversimplified answer, missing many important things about training-set preparation, feature selection, preprocessing your corpora, finding sources for them, etc. This is not a walk-through - these are basic steps that you should explore yourself.
PPS. Not sure, but NLTK may have some CRF or HMM implementation. If not, I can recommend scikit-learn for Markov models and CRF++ for CRFs. Look out - the latter is powerful, but it is a b*tch to install and to use from Java or Python.
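For what it's worth, NLTK does ship an HMM tagger. A minimal sketch of supervised training on a toy reason-annotation scheme (the B-REASON/I-REASON/O labels and the example sentences are invented for illustration):

from nltk.probability import LidstoneProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

# toy training set: each sentence is a list of (token, label) pairs
train = [
    [("i", "O"), ("cancelled", "O"), ("because", "O"),
     ("i", "O"), ("am", "B-REASON"), ("moving", "I-REASON")],
    [("the", "O"), ("service", "B-REASON"), ("was", "I-REASON"),
     ("bad", "I-REASON")],
]

# Lidstone smoothing so unseen events don't get zero probability
trainer = HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(
    train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
print(tagger.tag("the service was bad".split()))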
==EDIT==
Briefly, about features:
First, what kinds of features can we imagine?
lemma/stem - you find the stem or lemma of each word in your corpus, choose the most important ones (usually those with the highest frequency, or at least you'll start there), and then represent each word/n-gram as a binary vector stating whether the represented word or sequence, after stemming/lemmatization, contains that feature lemma/stem
n-grams - similar to the above, but instead of single words you choose the most important sequences of length n. "n-gram" means "sequence of length n", so e.g. the bigrams (2-grams) for "I sat on a bench" will be: "I sat", "sat on", "on a", "a bench".
skipgrams - similar to n-grams, but containing "gaps" in the original sentence. For example, the biskipgrams with gap size 3 for "Quick brown fox jumped over something" (sorry, I can't remember this phrase right now :P ) will be: ["Quick", "over"], ["brown", "something"]. In general, n-skipgrams with gap size m are obtained by taking a word, skipping m, taking a word, etc., until you have n words.
POS tags - I've always mistaken them for "positional" tags, but this is an acronym for "Part Of Speech". It is useful when you need to find phrases that share a common grammatical structure rather than common words.
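A quick sketch of the n-gram and skipgram features above using helpers from nltk.util (note that NLTK's skipgrams yields all pairs with gaps up to k, not only the maximal gap):

from nltk.util import ngrams, skipgrams

tokens = "Quick brown fox jumped over something".split()
print(list(ngrams(tokens, 2)))        # bigrams: ('Quick', 'brown'), ('brown', 'fox'), ...
print(list(skipgrams(tokens, 2, 3)))  # 2-skip-grams with gaps of up to 3 words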
Of course you can combine them - for example use skipgrams of lemmas, or POS tags of lemmas, or even *-grams (choose your favourite :P) of POS-tags of lemmas.
What would be the sense of using the POS tag of a lemma? It would describe the part of speech of the word's basic form, simplifying your feature to facts like "this is a noun" instead of "this is a plural feminine noun".
Remember that choosing features is one of the most important parts of the whole process (the other is data preparation, but that deserves a whole semester of courses, while feature selection can be handled in 3-4 lectures, so I'm only putting the basics here).
You need some kind of intuition while "hunting" for chunks - for example, if I wanted to find all expressions about colors, I'd probably try using 2- or 3-grams of words, represented as binary vectors describing whether such an n-gram contains the most popular color names and modifiers (like "light", "dark", etc.) and POS tags. Even if you missed some colors (say, "magenta"), you could still find them in text if your method (I'd go with CRF again; it's a wonderful tool for this kind of task) generalized the learned knowledge enough.
While FilipMalczak's answer describes the state-of-the-art method for solving your problem, a simpler solution (or maybe a preliminary first step) would be to do simple document clustering. Done right, this should cluster together responses that contain similar reasons. Also, you don't need any training data for it. The following article would be a good place to start: http://brandonrose.org/clustering
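As a rough sketch of that idea with scikit-learn (the example responses and the cluster count are invented):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = ["I am moving to another city",
             "the service is too slow",
             "relocating for a new job",
             "constant outages and bad service"]

# TF-IDF vectors, then k-means into 2 clusters (moving vs. bad service)
X = TfidfVectorizer(stop_words="english").fit_transform(responses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(sorted(zip(labels, responses)))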
I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, generate all valid sentences from it, and then select a random one (a sketch of this appears after this list).
Load in corpora to train n-grams, which can then be used to construct a sentence.
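A minimal sketch of the first approach with NLTK's CFG tools (the toy grammar stands in for your learned vocabulary):

import random
from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'ball'
    V -> 'sees' | 'throws'
""")

# enumerate all sentences up to a depth bound, then pick one at random
sentences = [" ".join(s) for s in generate(grammar, depth=5)]
print(random.choice(sentences))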
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm getting it right, and the purpose is to test yourself on the vocabulary you have already learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language-learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use n-grams. They stick words together at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting in, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There are any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
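A minimal sketch of the template idea (the templates and word lists are invented):

import random

templates = ["The {noun} {verb}s the {noun2}.",
             "A {noun} never {verb}s."]
nouns = ["river", "sprite", "bottle"]
verbs = ["see", "throw", "chase"]  # regular verbs, so appending 's' stays grammatical

template = random.choice(templates)
print(template.format(noun=random.choice(nouns),
                      verb=random.choice(verbs),
                      noun2=random.choice(nouns)))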
Background (TL;DR; provided for the sake of completeness)
Seeking advice on an optimal solution to an odd requirement.
I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.
Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book.
I'm working with fragmentary, atomic language. Users input words and sentence fragments, and WordNet is used to find connections between the inputs, and generate new words and sentences/fragments. My question is about turning an uninflected word from WordNet (a synset) into something that makes sense contextually.
The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.)
Example scenario
Let's assume we have a chunk of a poem to which users are adding new inputs. The new results need to be inflected in a grammatically sensible way.
The river bears no empty bottles, sandwich papers,
Silk handkerchiefs, cardboard boxes, cigarette ends
Or other testimony of summer nights. The sprites
Let's say now that it needs to print one of four possible next words/synsets: ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The sprites blue' seems grammatically odd/unlikely. From there it could use either of these verbs.
If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The sprites have' and the sensibly-inflected result will provide better context for future results ...)
I'd like for 'departure' to be a valid possibility in this case; while 'The sprites departure' doesn't make sense (it's not "sprites'"), 'The sprites departed' (or another verb conjugation) would.
Seemingly 'The sprites quick' wouldn't make sense, but something like 'The sprites quickly [...]' or 'The sprites quicken' could, so 'quick' is also a possibility for sensible inflection.
Breaking down the tasks
Tag part of speech, plurality, tense, etc. of the original inputs. Taking note of this could help in selecting among the several possibilities (i.e., choosing between had/have/having could be more directed than random if a user had input 'having' rather than some other tense). I've heard the Stanford POS tagger is good, and it has an interface in NLTK. I am not sure how to handle tense detection here (a small tagging sketch follows these steps).
Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The sprites' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparison of the current stuff with sequences in tagged corpora (and/or Markov chains?) -- or consultation of grammar-checking functions -- could provide a solution for this.
Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g. the picked word is a verb, but a few tenses are possible), select randomly. Regardless, I'm going to need to morph/inflect the word.
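A minimal sketch of the tagging in step 1 with NLTK's built-in tagger (the example line comes from the scenario above; plurality and tense would still need rules on top of the tags):

import nltk

# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# may be required on first run
tokens = nltk.word_tokenize("The sprites have departed quickly")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('sprites', 'NNS'), ('have', 'VBP'),
#       ('departed', 'VBN'), ('quickly', 'RB')]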
I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.
I think that the comment above about an n-gram language model fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context for the target word (i.e., you don't have the rest of the sentence available at the time of the query). Language models, on the other hand, consider the past (left context) efficiently, especially for windows of up to 5 words. The problem with n-grams is that they don't model long-distance dependencies (more than n words apart).
NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html . A tag lexicon may help you smooth the model more.
The steps as I see them:
1. Get a set of words from the users.
2. Create a larger set of all possible inflections of the words.
3. Ask the model which inflected word is most probable.
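The linked nltk.model API is from an older NLTK; in recent versions the equivalent lives in nltk.lm. A minimal sketch of step 3 (the toy training sentences are placeholders; real use needs a sizable corpus):

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [["the", "sprites", "have", "departed"],
          ["the", "sprites", "departed", "quickly"]]
train_ngrams, vocab = padded_everygram_pipeline(2, corpus)

lm = MLE(2)  # maximum-likelihood bigram model
lm.fit(train_ngrams, vocab)

# score candidate inflections given the left context
for candidate in ["have", "departed", "departure"]:
    print(candidate, lm.score(candidate, ["sprites"]))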
I am asking a related question here, but this question is more general. I have taken a large corpus and annotated some words with their named entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named entities. For example, here is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by (another robot)/Entity.
Given examples like this, is there any way I can train a classifier to recognize new named entities? For instance, a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets of NER systems won't help you (English NER systems tend to rely quite strongly on capitalization and prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect it is going to be quite hard in a machine-learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of two approaches.
The first is pattern matching over the words in a sentence. Something like this (pseudocode, though it is similar to the NLTK chunk-parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These two patterns can (roughly) catch the two parts of your first sentence. This is a good choice if you don't have very many kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend/modify.
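To make the first pattern concrete, here is a rough sketch (not a full implementation) that approximates it with a regular expression over a "word/TAG" rendering of the POS-tagged sentence; the regex only covers the simple "Entity had Incident" case:

import re
import nltk

# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# may be required on first run
sentence = "When the robot had a technical glitch, the object was thrown."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
text = " ".join(f"{word}/{tag}" for word, tag in tagged)

# (<NN|NNS>) <have|has|had> (<NN|NNS>), allowing an optional determiner/adjective
pattern = re.compile(
    r"(\w+)/NNS? (?:have|has|had)/VB\w* (?:\w+/DT )?(?:\w+/JJ )?(\w+)/NNS?")
for entity, incident in pattern.findall(text):
    print("Entity:", entity, "| Incident:", incident)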
Another approach is to mine dependencies between the words in a sentence, for example with the Stanford Dependency Parser. Among other things, it lets you extract the subject, predicate and object, which seems very close to what you want: in your first sentence, "robot" is the subject, "had" is the predicate and "glitch" is the object.
You could try object-role modeling at http://www.ormfoundation.com/ which looks at the semantics (facts) between one or more entities or names and their relationships with other objects. There are also tools to convert ORM models into XML and other languages, and vice versa. See http://orm.sourceforge.net/