I am asking a related question here but this question is more general. I have taken a large corpus and annotated some words with their named entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named entities. For example, the following is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there any way I can train a classifier to recognize new named entities? For instance, given a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of 2 approaches.
The first is pattern matching over the words in a sentence. Something like this (pseudocode, though it is similar to NLTK chunk parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These 2 patterns can (roughly) catch 2 parts of your first sentence. This is a good choice if you don't have many kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend or modify.
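As a rough illustration of the first pattern, here is a minimal sketch that runs a plain regular expression over a POS-tagged string. The tags are written by hand (an assumption: in practice they would come from a tagger), and the `word/TAG` encoding, the `extract` helper and the way labels are assigned are illustrative choices, not part of any library.

```python
import re

# Hand-tagged first clause (assumption: a real POS tagger would
# normally produce these word/TAG pairs).
tagged = "When/WRB the/DT robot/NN had/VBD a/DT technical/JJ glitch/NN"

# (noun) have|has|had [determiner] (optional adjectives + noun)
PATTERN = re.compile(
    r"(?P<entity>\w+)/NNS?\s+"               # subject noun -> Entity
    r"(?:have|has|had)/VB[A-Z]?\s+"          # a form of "have"
    r"(?:\w+/DT\s+)?"                        # optional determiner
    r"(?P<incident>(?:\w+/JJ\s+)*\w+/NNS?)"  # adjectives + noun -> Incident
)

def strip_tags(chunk):
    """Turn 'technical/JJ glitch/NN' back into 'technical glitch'."""
    return " ".join(tok.split("/")[0] for tok in chunk.split())

def extract(tagged_sentence):
    """Return (text, label) pairs for every match of PATTERN."""
    results = []
    for m in PATTERN.finditer(tagged_sentence):
        results.append((strip_tags(m.group("entity")), "Entity"))
        results.append((strip_tags(m.group("incident")), "Incident"))
    return results

print(extract(tagged))
```

The same pattern also picks up "nanobot" and "bug" from a tagged version of the second example sentence, but every new sentence shape needs its own pattern, which is the extensibility drawback mentioned above.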
Another approach is to mine the dependencies between the words in a sentence, for example with the Stanford Dependency Parser. Among other things, it lets you extract the subject, predicate and object, which seems very similar to what you want: in your first sentence "robot" is the subject, "had" is the predicate and "glitch" is the object.
You could try object role modeling (ORM) at http://www.ormfoundation.com/, which looks at the semantics (facts) between one or more entities or names and their relationships with other objects. There are also tools to convert ORM models into XML and other languages and vice versa. See http://orm.sourceforge.net/
Related
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo! There are state-of-the-art results for your problem.
It's called zero-shot learning.
These are state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
The VADER tool is well suited to sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense in these use cases.
Using spaCy would be a good decision: its rule-based matching engines and components not only help you find the terms and phrases you are searching for but, unlike regular expressions, also give you access to the tokens inside a text and their relationships.
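The corpus-level counting step of the workflow in the question can be tried with nothing but the standard library. This is a minimal sketch on made-up feedback strings: the `feedback` data, the stopword list and the underscore-gluing trick for known n-grams are all assumptions for illustration, and the naive `str.replace` would need word-boundary handling on real text.

```python
import re
from collections import Counter

# Hypothetical toy data: (star rating, feedback text) pairs.
feedback = [
    (5, "the hov lane saved me time"),
    (4, "hov lane pickup was easy"),
    (2, "pickup was out of my way"),
    (1, "too much detour time and out of my way"),
]

KNOWN_NGRAMS = ["hov lane", "detour time", "out of my way"]  # seed phrases
STOPWORDS = {"the", "me", "was", "and", "too", "much"}       # tiny assumed list

def tokenize(text):
    """Lowercase, glue known n-grams into single tokens, drop stopwords."""
    text = text.lower()
    for phrase in sorted(KNOWN_NGRAMS, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "_"))
    # Keeping only letters and underscores also strips punctuation.
    return [tok for tok in re.findall(r"[a-z_]+", text) if tok not in STOPWORDS]

def ngrams(tokens, n):
    """Contiguous n-token sequences, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(texts, n, k=3):
    """Count n-grams across the whole corpus, not per document."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)

# Split by rating, then look at each half separately.
high = [text for rating, text in feedback if rating >= 4]
low = [text for rating, text in feedback if rating <= 3]
print(top_ngrams(high, 2))
print(top_ngrams(low, 1))
```

Frequent bigrams found this way can be appended to `KNOWN_NGRAMS` and the corpus tokenized again, which covers the "re-tokenize using the found n-grams" step.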
For my project at work I am tasked with going through a bunch of user generated text, and in some of that text are reasons for cancelling their internet service, as well as how often that reason is occurring. It could be they are moving, just don't like it, or bad service, etc.
While this may not necessarily be a Python question, I am wondering if there is some way I can use NLTK or TextBlob to determine reasons for cancellation. I highly doubt there is anything automated for such a specialized task and I realize that I may have to build a neural net, but any suggestions on how to tackle this problem would be appreciated.
This is what I have thought about so far:
1) Use stemming and tokenization and tally up most frequent words. Easy method, not that accurate.
2) n-grams. Computationally intensive, but may hold some promise.
3) POS tagging and chunking, maybe find words which follow conjunctions such as "because".
4) Go through all text fields manually and keep a note of reasons for cancellation. Not efficient, defeats the whole purpose of finding some algorithm.
5) NN, have absolutely no idea, and I have no idea if it is feasible.
I would really appreciate any advice on this.
Don't worry if this answer is too general or you can't understand something - this is academic stuff and needs some basic preparation. Feel free to contact me with questions if you want (ask for my mail in a comment or something, we'll figure something out).
I think that this question is more suited for CrossValidated.
Anyway, first thing that you need to do is to create some training set. You need to find as many documents with reasons as you can and annotate them, marking phrases specifying reason. The more documents the better.
If you're gonna work with user reports - use example reports, so that training data and real data will come from the same source.
This is how you'll build some kind of corpus for your processing.
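To make the annotation step concrete, here is a minimal sketch of one common way to store such marked phrases: a B/I/O tag per token (begin, inside, outside). The BIO convention, the `REASON` label and the example report are assumptions for illustration and are not prescribed by the answer.

```python
def bio_tags(tokens, spans):
    """Label each token B-REASON / I-REASON / O.

    spans is a list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-REASON"
        for i in range(start + 1, end):
            tags[i] = "I-REASON"
    return tags

# Hypothetical annotated report: the reason phrase "I am moving abroad"
# covers tokens 4..7.
tokens = "I am cancelling because I am moving abroad next month".split()
tags = bio_tags(tokens, [(4, 8)])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A list of such (token, tag) sequences is the kind of training set that sequence models like HMMs or CRFs are usually fitted on.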
Then you have to specify what features you'll need. These may be POS tags, n-gram features, lemmas/stems, etc. This needs experimentation and some practice. Here I'd use some n-gram features (probably 2-grams or 3-grams) and maybe some knowledge based on WordNet.
The last step is building your chunker or annotator. It is a component that will take your training set, analyse it and learn what it should mark.
You'll meet something called the "semantic gap" - this term describes the situation where your program "learned" something other than what you wanted (it's a simplification). For example, you may use a set of features such that your chunker learns to find "I don't " phrases instead of reason phrases. It really depends on your training set and feature set.
If that happens, you should try changing your feature set, and after a while try working on the training set, as it may not be representative.
How do you build such a chunker? For your case I'd use an HMM (Hidden Markov Model) or - even better - a CRF (Conditional Random Field). These two are statistical methods commonly used for stream annotation, and your text is basically a stream of tokens. Another approach could be to take any "standard" classifier (from Naive Bayes, through decision trees and NNs, to SVMs) and apply it to every n-gram in the text.
Of course choosing feature set is highly dependent on chosen method, so read some about them and choose wisely.
PS. This is an oversimplified answer, missing many important things about training set preparation, choosing features, preprocessing your corpora, finding sources for them, etc. This is not a walk-through - these are basic steps that you should explore yourself.
PPS. Not sure, but NLTK may have some CRF or HMM implementation. If not, I can recommend scikit-learn for Markov and CRF++ for CRF. Look out - the latter is powerful, but is a b*tch to install and to use from Java or Python.
==EDIT==
Briefly, about features:
First, what kinds of features can we imagine?
lemma/stem - you find stems or lemmas for each word in your corpus, choose the most important ones (usually those will have the highest frequency, or at least you'll start there) and then represent each word/n-gram as a binary vector, stating whether the represented word or sequence, after stemming/lemmatization, contains that feature lemma/stem
n-grams - similar to the above, but instead of single words you choose the most important sequences of length n. "n-gram" means "sequence of length n", so e.g. bigrams (2-grams) for "I sat on a bench" will be: "I sat", "sat on", "on a", "a bench".
skipgrams - similar to n-grams, but containing "gaps" relative to the original sentence. For example, biskipgrams with gap size 3 for "Quick brown fox jumped over something" (sorry, I can't remember this phrase right now :P ) will be: ["Quick", "over"], ["brown", "something"]. In general, n-skipgrams with gap size m are obtained by taking a word, skipping m, taking a word, etc. until you have n words.
POS tags - I've always mistaken them for "positional" tags, but this is an acronym for "Part Of Speech". They are useful when you need to find phrases that have a common grammatical structure, not common words.
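The n-gram and skipgram definitions in the list above are short enough to write out directly. This is a minimal sketch using plain whitespace tokenization; the function names are made up for illustration.

```python
def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, gap):
    """Sequences of n tokens with `gap` tokens skipped between neighbours."""
    step = gap + 1
    span = (n - 1) * step + 1  # number of positions one skipgram stretches over
    return [tuple(tokens[i:i + span:step])
            for i in range(len(tokens) - span + 1)]

bench = "I sat on a bench".split()
fox = "Quick brown fox jumped over something".split()

print(ngrams(bench, 2))
# -> [('I', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'bench')]
print(skipgrams(fox, 2, 3))
# -> [('Quick', 'over'), ('brown', 'something')]
```

Running the same functions over lemmas or POS tags instead of surface words gives the combined features discussed next.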
Of course you can combine them - for example use skipgrams of lemmas, or POS tags of lemmas, or even *-grams (choose your favourite :P) of POS-tags of lemmas.
What would be the sense of using POS tag of lemma? That would describe part of speech of basic form of word, so it would simplify your feature to facts like "this is a noun" instead of "this is plural female noun".
Remember that choosing features is one of the most important parts of the whole process (the other is data preparation, but that deserves the whole semester of courses, and feature selection can be handled in 3-4 lectures, so I'm trying to put basics here).
You need some kind of intuition while "hunting" for chunks - for example, if I wanted to find all expressions about colors, I'd probably try using 2- or 3-grams of words, represented as a binary vector describing whether such an n-gram contains the most popular color names and modifiers (like "light", "dark", etc.), plus the POS tag. Even if you'd miss some colors (say, "magenta") you could find them in text if your method (I'd go with CRF again, this is a wonderful tool for this kind of task) generalized the learned knowledge enough.
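A minimal sketch of the binary vector for the color example might look like this. The word lists are assumptions picked for illustration, and the POS-tag part of the feature is left out to keep it short.

```python
COLOR_NAMES = ["red", "green", "blue", "yellow"]  # assumed "most popular" colors
MODIFIERS = ["light", "dark", "pale"]             # assumed modifier list
VOCAB = COLOR_NAMES + MODIFIERS

def ngram_features(ngram):
    """Binary vector: 1 if the n-gram contains the vocabulary word, else 0."""
    words = {w.lower() for w in ngram}
    return [1 if term in words else 0 for term in VOCAB]

print(ngram_features(("a", "dark", "blue")))
print(ngram_features(("light", "magenta")))
```

For ("light", "magenta") only the "light" position is set: the color itself is unknown, but the modifier still fires, which is the kind of cue a sequence model could use to generalize to unseen colors.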
While FilipMalczak's answer states the state-of-the-art method to solve your problem, a simpler solution (or maybe a preliminary first step) would be to do simple document clustering. This, done right, should cluster together responses that contain similar reasons. Also for this, you don't need any training data. The following article would be a good place to start: http://brandonrose.org/clustering
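To show the idea of grouping responses without training data, here is a deliberately tiny sketch. It is not the method from the linked article: it uses word-overlap (Jaccard) similarity and a greedy single pass, and the stopword list, the threshold and the sample responses are all assumptions.

```python
STOPWORDS = frozenset({"i", "am", "the", "a", "is", "to", "my", "it", "was"})

def content_words(text):
    """Lowercased set of words with a few stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(texts, threshold=0.15):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of indices into texts
    for i, text in enumerate(texts):
        for members in clusters:
            seed = texts[members[0]]
            if jaccard(content_words(text), content_words(seed)) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing matched, start a new cluster
    return clusters

responses = [
    "I am moving to another city",
    "moving abroad next month",
    "the service is too slow",
    "slow service and bad support",
]
print(cluster(responses))
```

On real feedback, a weighted representation such as tf-idf and a proper clustering algorithm would be a more robust choice than this toy overlap measure.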
I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, generate all valid sentences from it, and then select one at random.
Load in corpora to train ngrams, which then can be used to construct a sentence.
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm getting it right and if the purpose is to test yourself on the vocabulary you already have learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
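The filtering step could be sketched like this, assuming the text has already been fetched from some source. The vocabulary, the sample text and the naive sentence splitter are assumptions for illustration.

```python
import re

# Hypothetical list of words the learner already knows.
known_words = {"the", "cat", "dog", "sees", "a", "big", "small"}

def known_ratio(sentence, vocab):
    """Fraction of the words in the sentence that are in the vocabulary."""
    words = re.findall(r"[\w']+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

def usable_sentences(text, vocab, min_ratio=1.0):
    """Keep sentences made up (almost) only of known words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if known_ratio(s, vocab) >= min_ratio]

text = "The cat sees a dog. The dog chases the cat. A big dog sees the small cat!"
print(usable_sentences(text, known_words))
print(usable_sentences(text, known_words, min_ratio=0.8))
```

Lowering `min_ratio` gives the "nearly only" variant, where a sentence may contain a few unknown words.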
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use ngrams. They stick together words at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There's any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
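A minimal sketch of the template method, with made-up templates and a made-up lexicon standing in for the sentences and word classes you would collect yourself:

```python
import random

# Hypothetical templates and word classes; replace with your own material.
TEMPLATES = [
    "The {noun} {verb} the {noun2}.",
    "A {adj} {noun} {verb} the {noun2}.",
]
LEXICON = {
    "noun": ["cat", "dog", "teacher", "student"],
    "verb": ["sees", "likes", "follows"],
    "adj": ["big", "small", "quiet"],
}

def generate(rng=random):
    """Fill a randomly chosen template with random words of the right class."""
    template = rng.choice(TEMPLATES)
    noun, noun2 = rng.sample(LEXICON["noun"], 2)  # two different nouns
    return template.format(
        noun=noun,
        noun2=noun2,
        verb=rng.choice(LEXICON["verb"]),
        adj=rng.choice(LEXICON["adj"]),  # ignored by templates without {adj}
    )

for _ in range(3):
    print(generate())
```

Because every slot is filled only with words of the intended class, the output stays grammatical as long as the templates themselves are.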
Python provides the NLTK library, which is a vast resource of texts and corpora, along with a slew of text mining and processing methods. Is there any way we can compare sentences based on the meaning they convey for a possible match? That is, an intelligent sentence matcher?
For example, take a sentence like "giggling at bad jokes" and "I like to laugh myself silly at poor jokes". Both convey the same meaning, but the sentences don't remotely match (the words are different, so Levenshtein distance would fail badly!).
Now imagine we have an API which exposes functionality such as found here. So based on that, we have mechanisms to find out that the words giggle and laugh do match in the meaning they convey. Bad won't match up to poor, so we may need to add further layers (like they match in the context of words like joke, since a bad joke is generally the same as a poor joke, although a bad person is not the same as a poor person!).
A major challenge would be to discard stuff that doesn't much alter the meaning of the sentence. So, the algorithm should return the same degree of matchness between the first sentence and this: I like to laugh myself silly at poor jokes, even though they are completely senseless, full of crap and serious chances of heart-attack!
So with that available, is there any algorithm like this that has been conceived yet? Or do I have to reinvent the wheel?
You will need a more advanced topic modeling algorithm, and of course some corpora to train your model, so that you can easily handle synonyms like giggle and laugh !
In Python, you can try this package: http://radimrehurek.com/gensim/
I never used it, but it includes classic semantic vector space methods like LSA/LSI, random projection and even LDA.
My personal favourite is random projection, because it is faster and still very efficient (I'm doing it in Java with another library though).
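Before training a vector-space model, the question's own idea (map synonyms to one form, drop filler words, then compare what is left) can be tried as a crude baseline. This is only a sketch: the synonym table and stopword list are hand-written assumptions, and mapping "poor" to "bad" everywhere ignores exactly the context problem the question points out.

```python
import re

# Hypothetical hand-written tables; a real system would need a lexical
# resource and some handling of context (bad/poor joke vs. bad/poor person).
CANONICAL = {"giggle": "laugh", "giggling": "laugh", "poor": "bad", "jokes": "joke"}
STOPWORDS = {"i", "to", "at", "the", "myself", "a", "of", "and"}

def content_words(sentence):
    """Lowercased words, minus stopwords, mapped to a canonical form."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {CANONICAL.get(w, w) for w in words if w not in STOPWORDS}

def match_score(a, b):
    """Share of the shorter sentence's content words found in the other one."""
    wa, wb = content_words(a), content_words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

s1 = "giggling at bad jokes"
s2 = "I like to laugh myself silly at poor jokes"
s3 = s2 + ", even though they are completely senseless, full of crap and serious chances of heart-attack!"

print(match_score(s1, s2))
print(match_score(s1, s3))
print(match_score(s1, "bad weather ruined the trip"))
```

Normalizing by the shorter sentence means the extra clause in the third sentence does not lower the score, at the cost of treating any superset as a full match.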
I want to parse a text and categorize the sentences according to their grammatical structure, but I have a very small understanding of NLP so I don't even know where to start.
As far as I have read, I need to parse the text and find out (or tag?) the part-of-speech of every word. Then I search for the verb clause or whatever other defining characteristic I want to use to categorize the sentences.
What I don't know is if there is already some method to do this more easily or if I need to define the grammar rules separately or what.
Any resources on NLP that discuss this would be great. Program examples are welcome as well. I have used NLTK before, but not extensively. Other parsers or languages are OK too!
The Python Natural Language Toolkit (NLTK) is a library suitable for this kind of work. As with any NLP library, you will have to download the datasets for training separately; corpora (data) and scripts for training are available too.
There are also example tutorials which will help you identify the parts of speech of words. By all means, I think nltk.org should be the place to go for what you are looking for.
Specific questions could be posted here again.
Maybe you simply need to define patterns like "noun verb noun" etc. for each type of grammatical structure and search for matches in the output sequence of a part-of-speech tagger.
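A minimal sketch of that idea, matching regular expressions against the sequence of tags. The tagged sentences are written by hand here (an assumption: in practice they would come from a tagger such as NLTK's `pos_tag`), and the category names and patterns are made up for illustration.

```python
import re

# Hypothetical structure patterns over Penn Treebank-style tags,
# checked in order, most specific first.
PATTERNS = {
    "noun-verb-noun": r"(DT )?(JJ )*NNS? VB[DZP]? (DT )?(JJ )*NNS?",
    "noun-verb": r"(DT )?(JJ )*NNS? VB[DZP]?",
}

def categorize(tagged):
    """Return the name of the first pattern matching the whole tag sequence."""
    tag_string = " ".join(tag for _, tag in tagged)
    for name, pattern in PATTERNS.items():
        if re.fullmatch(pattern, tag_string):
            return name
    return "other"

sentences = [
    [("The", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")],
    [("Birds", "NNS"), ("sing", "VBP")],
    [("Quickly", "RB")],
]
for tagged in sentences:
    print(" ".join(word for word, _ in tagged), "->", categorize(tagged))
```

Each new grammatical structure needs its own pattern, so this works best when the set of structures you care about is small.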