I'm currently working on a neural network that evaluates students' answers to exam questions, which requires preprocessing the corpora for a Word2Vec network. Hyphenation in German texts is quite common. There are mainly two different types of hyphenation:
1) End of line:
The text reaches the end of the line so the last word is sepa-
rated.
2) Short form of enumeration:
in case of two "elements":
Geistes- und Sozialwissenschaften
more "elements":
Wirtschafts-, Geistes- und Sozialwissenschaften
The de-hyphenated form of these enumerations should be:
Geisteswissenschaften und Sozialwissenschaften
Wirtschaftswissenschaften, Geisteswissenschaften und Sozialwissenschaften
I need to remove all hyphenations and put the words back together. I already found several solutions for the first problem.
But I have absolutely no clue how to get the second part (in the example above, "wissenschaften") of the words in the enumeration case. I don't even know if it is possible at all.
I hope that I have pointed out my problem properly.
So does anyone have an idea how to solve this problem?
Thank you very much in advance!
It's surely possible, as the pattern seems fairly regular. (Something vaguely analogous is sometimes seen in English. For example: The new requirements applied to under-, over-, and average-performing employees.)
The rule seems to be roughly, "when you see word-fragments with a trailing hyphen, and then an und, look for known words that begin with the word-fragments, and end the same as the terminal-word-after-und – and replace the word-fragments with the longer words".
Not being a German speaker and without language-specific knowledge, it wouldn't be possible to know exactly where breaks are appropriate. That is, in your Geistes- und Sozialwissenschaften example, without language-specific knowledge, it's unclear whether the first fragment should become Geisteszialwissenschaften or Geisteswissenschaften or Geistesenschaften or Geistesaften or any other shared-suffix with Sozialwissenschaften. But if you've got a dictionary of word-fragments, or word-frequency info from other text that uses the same full-length word(s) without this particular enumeration-hyphenation, that could help choose.
(If there's more than one plausible suffix based on known words, this might even be a possible application of word2vec: the best suffix to choose might well be the one that creates a known-word that is closest to the terminal-word in word-vector-space.)
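To make that concrete, here is a minimal sketch of the fragment-completion idea, assuming you have a German vocabulary list (from a dictionary or corpus counts) and, optionally, a trained gensim Word2Vec model. The gensim similarity call is the standard Word2Vec API; everything else (vocab, thresholds, function names) is illustrative only.

def common_suffix_len(a, b):
    # length of the longest shared suffix of a and b
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def expand_fragment(fragment, terminal_word, vocab, model=None, min_suffix=4):
    """Expand e.g. 'Geistes-' using 'Sozialwissenschaften' as the terminal word."""
    prefix = fragment.rstrip('-')
    # candidates: known words that start with the fragment and share a suffix
    # of some minimum length with the terminal word
    candidates = [w for w in vocab
                  if w.startswith(prefix)
                  and common_suffix_len(w, terminal_word) >= min_suffix]
    if not candidates:
        return fragment                      # give up; keep the fragment as-is
    if len(candidates) == 1 or model is None:
        return candidates[0]
    # several plausible completions: pick the one closest to the terminal word
    scored = [(model.wv.similarity(c, terminal_word), c)
              for c in candidates
              if c in model.wv and terminal_word in model.wv]
    return max(scored)[1] if scored else candidates[0]

# expand_fragment('Geistes-', 'Sozialwissenschaften', vocab=['Geisteswissenschaften'])
# -> 'Geisteswissenschaften'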
Since this seems a very German-specific issue, I'd try asking in forums specific to German natural-language-processing, or to libraries with specific German support. (Maybe, NLTK or Spacy?)
But also, knowing word2vec, this sort of patch-up may not actually be that important to your end-goals. Training without this logical-reassembly of the intended full words may still let the fragments achieve useful vectors, and the corresponding full words may achieve useful vectors from other usages. The fragments may wind up close enough to the full compound words that they're "good enough" for whatever your next regression/classifier step does. So if this seems a blocker, don't be afraid to just try ignoring it as a non-problem. (Then if you later find an adequate de-hyphenation approach, you can test whether it really helped or not.)
I would like to know if it is possible to group together variants of the same word in the LDA output, i.e. the words generated by
doc_lda = lda_model[corpus]
for example
[(0,
'0.084*"tourism" + 0.013*"touristic" + 0.013*"Madrid" + '
'0.010*"travel" + 0.008*"half" + 0.007*"piare" + '
'0.007*"turism"')]
I would like to group tourism, touristic and turism (misspelled) together.
Would it be possible?
This is some relevant previous code:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha=[0.01]*num_topics,
                                            per_word_topics=True,
                                            eta=[0.01]*len(id2word.keys()))
Thank you
The key thing to understand is that LDA, unlike (say) linear regression, requires a great deal of tuning and iteration to work properly. But it can be useful for a certain set of problems.
Your intuition is right in that 'tourism', 'touristic' and 'turism' should all be one word. The fix, however, is not at the end, where you are presented with their respective loadings, but rather early on: stemming or lemmatization (aka stemming and lemming), adding unwanted words to your stopwords list, and some degree of preprocessing. I'll address these separately rather than as a group, as I think that's fairly obvious. Also, because you only gave the one set of words and loadings, it's not really fruitful to go into choosing the number of topics, as you may be doing that just fine.
Stemming/Lemming (Pick One)
This is where the science-and-experience part starts, as well as the frustration. But this is also where you'll make the biggest and easiest gains. It seems like 'tourism' and 'touristic' might be best combined by stemming (as 'tour'). The truth is a lot less clear-cut, as there are cases where one beats the other. In the example below, the PorterStemmer produces stems that aren't real words, while lemmatizing fails to catch that 'studies' and 'studying' are the same, though it accurately catches 'cry'.
Using PorterStemmer
studies is studi
studying is studi
cries is cri
cry is cri
Lemmatize
studies is study
studying is studying
cries is cry
cry is cry
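For reference, output like the above can be produced with NLTK's PorterStemmer and WordNetLemmatizer along these lines (you may need nltk.download('wordnet') first):

from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ['studies', 'studying', 'cries', 'cry']

stemmer = PorterStemmer()
print("Using PorterStemmer")
for w in words:
    print(f"{w} is {stemmer.stem(w)}")          # studies -> studi, cries -> cri

lemmatizer = WordNetLemmatizer()
print("Lemmatize")
for w in words:
    print(f"{w} is {lemmatizer.lemmatize(w)}")  # studies -> study, studying -> studying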
There are multiple stemmers such as Porter2, Snowball, Hunspell, and Paice-Husk. So, the obvious first step would be to see if any of these is more useful out of the box.
As mentioned above, lemmatization will get you a similar -- but somewhat different -- set of results.
There is no substitute for the work here. This is what separates a data scientist from a hobbyist or a data analyst with a plussed-up title. The best time to do this was in the past, so you would already have an intuition of what works best for this sort of corpus; the second-best time is now.
Iterate But Satisfice
I presume you don't have infinite resources; you have to satisfice. For the above, you might consider preprocessing your text to correct or remove misspelled words. What to do with non-English words is trickier. The easiest solution is to remove them or add them to your stopwords list but that may not be the best solution. Customizing your dictionary is an option too.
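For the misspelling part, one cheap option is to snap tokens onto a known vocabulary with the standard library's difflib; the vocabulary and cutoff below are placeholders:

import difflib

vocab = ['tourism', 'travel', 'hotel', 'madrid']

def correct(token, vocab, cutoff=0.8):
    # return the closest known word, or the token unchanged if nothing is close
    match = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return match[0] if match else token

print(correct('turism', vocab))   # -> 'tourism'
print(correct('piare', vocab))    # -> 'piare' (no close match; remove it or keep it)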
Know The Current Limits
As of 2020, no one is doing a good job with code-switching; certainly not a free and open-source tool. Gridspace is about the best I know of, and while their demo is pretty amazing, they can't handle code-switching well. Now, I'm doing some induction here because I'm assuming 'piare' is Spanish for 'I will', or at least that's what Google Translate says. If that's the case, your results will be confounded. But when you look at the loading (.007), fixing it seems to be more work than it would be worth.
I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand smaller terms in a csv file. What I want to do is identify similar terms in the database to the blacklisted terms in my csv file. Similarity in this case can be construed as misspellings of the blacklisted terms.
I am familiar with libraries in python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?
I'm not sure there's any definitive answer to this problem, so the best I can do is explain how I'd approach it, and hopefully you'll get some ideas from my ramblings. :-)
First.
On an unrelated angle, fuzzy string matching might not be enough. People are going to be using similar-looking characters and non-character symbols to get around any text matches, to the point where there's nearly zero match between a blacklisted word and the actual text, and yet it's still readable for what it is. So perhaps you will need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X', etc. I believe there are libraries and/or conversion references for that purpose. Non-Latin symbols are also a distinct possibility and should be accounted for.
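A rough sketch of that normalization step, using a hand-maintained table of look-alike characters (illustrative, not exhaustive):

# map look-alike characters back to plain letters before matching
LOOKALIKES = str.maketrans({'0': 'o', '1': 'i', '3': 'e', '@': 'a', '$': 's'})

def normalize(text):
    return text.lower().translate(LOOKALIKES)

print(normalize("Z3br@ vide0s"))   # -> 'zebra videos'

Multi-character sequences like '><' would need an extra pass with str.replace or a regex, since str.translate only handles single characters.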
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check whether what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people who are writing acceptable text are probably going to write it in acceptable language without any of the tricks outlined above. Or you could normalize it if you're feeling paranoid. :-)
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify which words are more (or less) frequently used in combination with them, and use those to check your specific case. Using this very paragraph as an example, the word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together, you might want to leave a certain amount of gray area. That is, unless there's high enough certainty in either direction, leave the final decision to a human (screening comments/posts for a time, automatically putting them into a moderation queue, or whatever else your project dictates).
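Putting the second through fourth points together, a rough sketch with fuzzywuzzy might look like this (the word lists, threshold, and margin are placeholders you'd tune on real data):

from fuzzywuzzy import fuzz

blacklist = ["big butt"]
whitelist = ["big but"]

def classify(term, threshold=90, margin=5):
    black = max((fuzz.ratio(term, b) for b in blacklist), default=0)
    white = max((fuzz.ratio(term, w) for w in whitelist), default=0)
    if black < threshold:
        return "ok"          # nothing close enough to the blacklist
    if white >= black:
        return "ok"          # a legal term explains the text at least as well
    if black - white < margin:
        return "review"      # gray area: hand it to a human / context check
    return "flag"

print(classify("big buttt"))   # close to the blacklist only -> 'flag'
print(classify("big but"))     # exact whitelist hit -> 'ok'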
Fifth.
You might try to dabble in learning algorithms, collecting human input from the previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)
For my project at work I am tasked with going through a bunch of user-generated text, and in some of that text are reasons for cancelling their internet service; I also need to determine how often each reason occurs. It could be that they are moving, just don't like it, got bad service, etc.
While this may not necessarily be a Python question, I am wondering if there is some way I can use NLTK or Textblob in some way to determine reasons for cancellation. I highly doubt there is anything automated for such a specialized task and I realize that I may have to build a neural net, but any suggestions on how to tackle this problem would be appreciated.
This is what I have thought about so far:
1) Use stemming and tokenization and tally up most frequent words. Easy method, not that accurate.
2) n-grams. Computationally intensive, but may hold some promise.
3) POS tagging and chunking, maybe find words which follow conjunctions such as "because" (a toy sketch of this follows the list).
4) Go through all text fields manually and keep a note of reasons for cancellation. Not efficient, defeats the whole purpose of finding some algorithm.
5) A neural network. I have absolutely no idea how to approach this, or whether it is feasible.
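For option 3, this is roughly the kind of thing I have in mind (assuming NLTK's default tokenizer and tagger are installed; you may need nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')):

import nltk

text = "I'm cancelling because the service is too slow."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

if 'because' in tokens:
    idx = tokens.index('because')
    # keep only content words (nouns, verbs, adjectives) after the conjunction
    reason = [w for w, t in tagged[idx + 1:] if t.startswith(('NN', 'VB', 'JJ'))]
    print(reason)   # something like ['service', 'is', 'slow']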
I would really appreciate any advice on this.
Don't worry if this answer is too general or you can't understand
something - this is academic stuff and needs some basic preparation.
Feel free to contact me with questions if you want (ask for my mail
in a comment or something, we'll figure it out).
I think that this question is more suited for CrossValidated.
Anyway, the first thing you need to do is create a training set. You need to find as many documents with reasons as you can and annotate them, marking the phrases that specify a reason. The more documents the better.
If you're gonna work with user reports - use example reports, so that training data and real data will come from the same source.
This is how you'll build some kind of corpus for your processing.
Then you have to specify what features you'll need. These may be POS tags, n-gram features, lemmas/stems, etc. This needs experimentation and some practice. Here I'd use some n-gram features (probably 2-grams or 3-grams) and maybe some knowledge based on WordNet.
The last step is building your chunker or annotator. It is a component that will take your training set, analyse it and learn what it should mark.
You'll meet something called the "semantic gap" - this term describes the situation where your program "learned" something other than what you wanted (it's a simplification). For example, you may use such a set of features that your chunker learns to find "I don't" phrases instead of reason phrases. It really depends on your training set and feature set.
If that happens, you should try changing your feature set, and after a while - try working on the training set, as it may not be representative.
How do you build such a chunker? For your case I'd use an HMM (Hidden Markov Model) or - even better - a CRF (Conditional Random Field). These two are statistical methods commonly used for stream annotation, and your text is basically a stream of tokens. Another approach could be using any "standard" classifier (from Naive Bayes, through decision trees and NNs, to SVMs) and applying it to every n-gram in the text.
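I won't sketch the CRF/HMM route here, but the simpler fallback - a "standard" classifier over n-gram features - might look roughly like this in scikit-learn (the tiny labelled set is purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["cancelling because I am moving house",
          "the service keeps dropping out",
          "too expensive for what you get",
          "just calling to update my address"]
labels = ["moving", "bad service", "price", "other"]

# unigram + bigram counts fed into a Naive Bayes classifier
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["internet service is always down"]))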
Of course, choosing the feature set is highly dependent on the chosen method, so read up on them and choose wisely.
PS. This is an oversimplified answer, missing many important things about training set preparation, choosing features, preprocessing your corpora, finding sources for them, etc. This is not a walk-through - these are the basic steps that you should explore yourself.
PPS. Not sure, but NLTK may have some CRF or HMM implementation. If not, I can recommend scikit-learn for Markov models and CRF++ for CRF. Look out - the latter is powerful, but it's a b*tch to install and to use from Java or Python.
==EDIT==
Briefly, about features:
First, what kinds of features can we imagine?
lemma/stem - you find stems or lemmas for each word in your corpus, choose the most important ones (usually those with the highest frequency, or at least that's where you'll start) and then represent each word/n-gram as a binary vector stating whether the represented word or sequence, after stemming/lemmatization, contains that feature lemma/stem
n-grams - similar to the above, but instead of single words you choose the most important sequences of length n. "n-gram" means "sequence of length n", so e.g. bigrams (2-grams) for "I sat on a bench" will be: "I sat", "sat on", "on a", "a bench".
skipgrams - similar to n-grams, but containing "gaps" in the original sentence. For example, biskipgrams with gap size 3 for "Quick brown fox jumped over something" (sorry, I can't remember this phrase right now :P ) will be: ["Quick", "over"], ["brown", "something"]. In general, n-skipgrams with gap size m are obtained by getting a word, skipping m, getting a word, and so on until you have n words. (A short code sketch of n-grams and skipgrams follows this list.)
POS tags - I've always mistaken them for "positional" tags, but it's an acronym for "Part Of Speech". It is useful when you need to find phrases that have a common grammatical structure rather than common words.
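A small sketch of the n-gram and skipgram features, using NLTK's helpers (note that nltk.util.skipgrams allows up to k skipped tokens, which is close to - but not exactly - the fixed-gap definition above):

from nltk.util import ngrams, skipgrams

tokens = "The quick brown fox jumps over the lazy dog".split()

print(list(ngrams(tokens, 2))[:3])        # [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(list(skipgrams(tokens, 2, 3))[:3])  # first few 2-skipgrams with up to 3 skipped words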
Of course you can combine them - for example use skipgrams of lemmas, or POS tags of lemmas, or even *-grams (choose your favourite :P) of POS-tags of lemmas.
What would be the sense of using the POS tag of a lemma? It would describe the part of speech of the basic form of a word, so it would simplify your feature to facts like "this is a noun" instead of "this is a plural feminine noun".
Remember that choosing features is one of the most important parts of the whole process (the other is data preparation, but that deserves the whole semester of courses, and feature selection can be handled in 3-4 lectures, so I'm trying to put basics here).
You need some kind of intuition while "hunting" for chunks - for example, if I wanted to find all expressions about colors, I'd probably try using 2- or 3-grams of words, represented as binary vectors describing whether such an n-gram contains the most popular color names and modifiers (like "light", "dark", etc.) and POS tags. Even if you missed some colors (say, "magenta"), you could find them in text if your method (I'd go with CRF again; it is a wonderful tool for this kind of task) generalized the learned knowledge enough.
While FilipMalczak's answer describes the state-of-the-art method for solving your problem, a simpler solution (or maybe a preliminary first step) would be to do simple document clustering. This, done right, should cluster together responses that contain similar reasons. Also, you don't need any training data for this. The following article would be a good place to start: http://brandonrose.org/clustering
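As a minimal sketch of that idea, scikit-learn's TF-IDF vectorizer plus k-means is enough to get started (the responses and cluster count below are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

responses = ["I'm moving to another city",
             "relocating for a new job",
             "the connection is too slow",
             "constant outages and bad service"]

X = TfidfVectorizer(stop_words='english').fit_transform(responses)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # responses sharing a label should share a (rough) reason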
I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, then generate all valid sentences from it and select one at random (a tiny sketch of this follows the list).
Load in corpora to train n-grams, which can then be used to construct a sentence.
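For the first approach, this is roughly what I have in mind with NLTK (the grammar below is just a toy stand-in for my actual lexicon):

import random
from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'student' | 'book'
V -> 'sees' | 'reads'
""")

# enumerate every sentence the (finite) grammar can produce, then pick one
sentences = [' '.join(s) for s in generate(grammar)]
print(random.choice(sentences))   # e.g. "the student reads a book"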
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm getting it right and if the purpose is to test yourself on the vocabulary you already have learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use ngrams. They stick together words at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There's any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
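A toy sketch of the template method, with slots filled from word classes the learner already knows (the templates and word lists here are made up):

import random

templates = ["The {noun} {verb} the {noun2}.",
             "A {adj} {noun} {verb} a {noun2}."]
nouns = ["teacher", "dog", "student"]
verbs = ["sees", "likes", "follows"]
adjs  = ["small", "friendly", "tired"]

def random_sentence():
    # unused placeholders are simply ignored by str.format
    return random.choice(templates).format(
        noun=random.choice(nouns),
        noun2=random.choice(nouns),
        adj=random.choice(adjs),
        verb=random.choice(verbs))

print(random_sentence())   # e.g. "A tired dog follows a student."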
I've finished gathering my data I plan to use for my corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag & chunk the corpus in the future. Some of NLTK's corpora are all lower case and others aren't.
Can anyone shed some light on this subject, please?
By "normalize" do you just mean making everything lowercase?
The decision about whether to lowercase everything is really dependent of what you plan to do. For some purposes, lowercasing everything is better because it lowers the sparsity of the data (uppercase words are rarer and might confuse the system unless you have a massive corpus such that the statistics on capitalized words are decent). In other tasks, case information might be valuable.
Additionally, there are other considerations you'll have to make that are similar. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"] (I've seen all three in different corpora). What about 7-year-old? Is it one long word? Or three words that should be separated?
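As a quick illustration of how the tokenizer choice plays out, here's what NLTK's default tokenizers typically produce (word_tokenize may need nltk.download('punkt') first):

from nltk.tokenize import word_tokenize, wordpunct_tokenize

text = "I can't carry the 7-year-old."
print(word_tokenize(text))       # ['I', 'ca', "n't", 'carry', 'the', '7-year-old', '.']
print(wordpunct_tokenize(text))  # ['I', 'can', "'", 't', 'carry', 'the', '7', '-', 'year', '-', 'old', '.']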
That said, there's no reason to reformat the corpus. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.