Python NLTK tokenizing text using already found bigrams

Background: I have a lot of text that contains technical expressions, which are not always standard.
I know how to find the bigrams and filter them.
Now I want to use them when tokenizing the sentences, so that words which should stay together (according to the calculated bigrams) are kept together.
I would like to know if there is a correct way of doing this within NLTK. If not, I can think of various inefficient ways of rejoining all the broken words by checking dictionaries.

The way topic modelers usually pre-process text with n-grams is to connect them with an underscore (say, topic_modeling or white_house). You can do that when identifying the bigrams themselves. And don't forget to make sure that your tokenizer does not split on underscores (Mallet does unless you set token-regex explicitly).
P.S. NLTK's native bigram collocation finder is quite slow; if you want something more efficient, look around if you haven't yet, or create your own based on, say, Dunning (1993).
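As a rough illustration of the underscore approach, here is a minimal sketch that scores bigrams with NLTK's BigramCollocationFinder and then keeps the chosen ones together at tokenization time with MWETokenizer; the sample text and the cutoff of 20 bigrams are arbitrary choices for the example:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.tokenize import MWETokenizer, word_tokenize

text = "the white house announced a new take on topic modeling at the white house briefing"
tokens = word_tokenize(text.lower())   # may require nltk.download('punkt')

# Score candidate bigrams (likelihood ratio is the Dunning 1993 measure).
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
best_bigrams = finder.nbest(measures.likelihood_ratio, 20)

# Re-tokenize so the selected bigrams are joined with an underscore.
mwe = MWETokenizer(best_bigrams, separator='_')
print(mwe.tokenize(tokens))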

Related

How to handle bigrams of the same words in a different sequence in topic modeling in python? Ex. 'lease extension' and 'extension lease'

Hello Stackoverflow Community,
I am reaching out to you all for ideas on how to handle bigrams of the same words in a different sequence in topic modeling in python.
I have a topic model where two bigrams which mean the same thing are treated as different features because they are in a different order. I need a way to treat those two bigrams as synonyms.
Ideas and suggestions are welcome.
Ex. ‘lease extension’ and ‘extension lease’
I want to treat them as the same word in a word matrix
Any type of suggestions and ideas are most welcome.
Thank you in advance,
Nikhar
Before you treat these bigrams as interchangeable, you have to make sure that they actually are. If they are not, it will reduce the quality of your analysis. 'foot_doctor' and 'doctor_foot' may not refer to the same thing - especially if you took other preprocessing steps, such as stemming or lemmatizing, i.e. turning 'the doctor's foot' into 'doctor foot'.
Assuming the meaning of these bigrams is interchangeable: Treat them as interchangeable - you can just rewrite one to be the other. Python offers a lot of built-in string functions. In your example, using replace(), we can replace one bigram with another.
oldfakedoc = 'my landlord gave me a lease extension'
newfakedoc = oldfakedoc.replace('lease extension', 'extension lease')
print(newfakedoc)
gives 'my landlord gave me a extension lease'. Loop over all bigrams you want to replace, and then run your model.
You can also use this approach if you do not want to stem or lemmatize all of your documents but have topics that load very heavily on words that are strongly related, such as "jump" and "jumping". Also, make sure you do not overwrite your raw data, so you can go back and reconstruct where these replacements were made if you need to.
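A minimal sketch of that replacement loop, assuming a hypothetical bigram_synonyms mapping from each variant to the canonical form you want to keep:

# Hypothetical mapping from variant bigrams to a single canonical form.
bigram_synonyms = {
    'extension lease': 'lease extension',
    'doctor foot': 'foot doctor',
}

def canonicalize(doc, synonyms=bigram_synonyms):
    # Rewrite every variant bigram as its canonical spelling.
    for variant, canonical in synonyms.items():
        doc = doc.replace(variant, canonical)
    return doc

docs = ['my landlord gave me a lease extension',
        'the extension lease was signed yesterday']
normalized = [canonicalize(d) for d in docs]   # the originals in docs are left untouched
print(normalized)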

Heuristics for determining whether something is a "word" or random data?

I am writing a web crawler in python that downloads a list of URLS, extracts all visible text from the HTML, tokenizes the text (using nltk.tokenize) and then creates a positional inverted index of words in each document for use by a search feature.
However, right now, the index contains a bunch of useless entries like:
1) //roarmag.org/2015/08/water-conflict-turkey-middle-east/
2) ———-
3) ykgnwym+ccybj9z1cgzqovrzu9cni0yf7yycim6ttmjqroz3wwuxiseulphetnu2
4) iazl+xcmwzc3da==
Some of these, like #1, are where URLs appear in the text. Some, like #3, are excerpts from PGP keys, or other random data that is embedded in the text.
I am trying to understand how to filter out useless data like this. But I don't just want to keep words that I would find in an English dictionary, but also things like names, places, nonsense words like "Jabberwocky" or "Rumpelstiltskin", acronyms like "TANSTAAFL", obscure technical/scientific terms, etc ...
That is, I'm looking for a way to heuristically strip out strings that are "gibberish": strings that are (1) exceedingly long, (2) filled with a bunch of punctuation, or (3) composed of random characters like afhdkhfadhkjasdhfkldashfkjahsdkfhdsakfhsadhfasdhfadskhkf. I understand that there is no way to do this with 100% accuracy, but if I could remove even 75% of the junk I'd be happy.
Are there any techniques that I can use to separate "words" from junk data like this?
Excessively long words are trivial to filter. It's pretty easy to filter out URLs, too. I don't know about Python, but other languages have libraries you can use to determine if something is a relative or absolute URL. Or you could just use your "strings with punctuation" filter to filter out anything that contains a slash.
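For the easy cases, a minimal sketch of those first filters (the length cap and punctuation ratio are arbitrary thresholds to tune); note that short base64-like strings such as 'iazl+xcmwzc3da==' still slip through and need the language-model scoring described next:

from urllib.parse import urlparse

def looks_like_junk(token, max_len=25, max_punct_ratio=0.3):
    # Excessively long tokens are almost never real words.
    if len(token) > max_len:
        return True
    # Anything with a URL scheme, a network location, or a slash is a URL or path.
    parsed = urlparse(token)
    if parsed.scheme or parsed.netloc or '/' in token:
        return True
    # Tokens that are mostly punctuation are separators or encoded data.
    punct = sum(1 for ch in token if not ch.isalnum())
    return punct / max(len(token), 1) > max_punct_ratio

tokens = ['water', '//roarmag.org/2015/08/water-conflict-turkey-middle-east/',
          'ykgnwym+ccybj9z1cgzqovrzu9cni0yf7yycim6ttmjqroz3wwuxiseulphetnu2']
print([t for t in tokens if not looks_like_junk(t)])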
Words are trickier, but you can do a good job with n-gram language models. Basically, you build or obtain a language model, and run each string through the model to determine the likelihood of that string being a word in the particular language. For example, "Rumplestiltskin" will have a much higher likelihood of being an English word than, say, "xqjzipdg".
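If a full pre-trained model is more than you need, even a character-level bigram score separates pronounceable strings from random ones. A rough sketch, using the Brown corpus as training text and an arbitrary cutoff you would calibrate on known-good words:

import math
from collections import Counter
from nltk.corpus import brown   # may require nltk.download('brown')

# Character-bigram counts from a reference English text.
training_text = ' '.join(brown.words()).lower()
char_bigrams = Counter(training_text[i:i + 2] for i in range(len(training_text) - 1))
total = sum(char_bigrams.values())

def avg_log_prob(token):
    # Average log-probability of the token's character bigrams (crude add-one smoothing).
    token = token.lower()
    if len(token) < 2:
        return 0.0
    pairs = [token[i:i + 2] for i in range(len(token) - 1)]
    return sum(math.log((char_bigrams[p] + 1) / (total + len(char_bigrams)))
               for p in pairs) / len(pairs)

# Real words score noticeably higher than random strings; pick the cutoff empirically.
print(avg_log_prob('rumpelstiltskin'), avg_log_prob('xqjzipdg'))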
See https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark for a trained model that might be useful to you in determining if a string is an actual word in some language.
See also NLTK and language detection.

Generate Random Sentence From Grammar or Ngrams?

I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, generate all valid sentences from it, and then select one at random.
Load in corpora to train n-grams, which can then be used to construct a sentence.
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm understanding it right, and the purpose is to test yourself on the vocabulary you have already learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language-learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use n-grams. They stick words together at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
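For a sense of what the grammar route looks like, here is a minimal sketch using NLTK's CFG class and its generate() helper; the toy grammar and vocabulary are placeholders for whatever you actually know:

import random
from nltk import CFG
from nltk.parse.generate import generate

# Toy grammar; a real one would need far more rules to stay grammatical.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'student' | 'teacher' | 'book'
V -> 'reads' | 'sees'
""")

sentences = [' '.join(s) for s in generate(grammar, depth=5)]
print(random.choice(sentences))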
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There's any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
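And a rough sketch of the template idea, with the template and word lists as arbitrary examples you would replace with sentence patterns and vocabulary you have learned:

import random

# One template per sentence pattern; slots are filled from word lists you know.
templates = ['the {noun} {verb} the {noun2}.']
nouns = ['dog', 'teacher', 'car']
verbs = ['sees', 'follows', 'likes']

def random_sentence():
    return random.choice(templates).format(
        noun=random.choice(nouns),
        verb=random.choice(verbs),
        noun2=random.choice(nouns),
    )

print(random_sentence())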

Using WordNet to determine semantic similarity between two texts?

How can you determine the semantic similarity between two texts in python using WordNet?
The obvious preprocessing would be removing stop words and stemming, but then what?
The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400 word) texts, that are natural language documents, with words that are not in any particular order or structure (other than those imposed by English grammar). So, which words would you compare between texts? How would you do this in python?
One thing that you can do is:
Kill the stop words
Find as many words as possible that have maximal intersections of synonyms and antonyms with those of other words in the same doc. Let's call these "the important words".
Check to see if the set of the important words of each document is the same. The closer they are, the more semantically similar your documents are.
There is another way: compute sentence trees from the sentences in each doc, then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for a class, so the code is extremely hacky, to say the least).
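A minimal sketch of the path-distance idea from the question, pairing each word in one text with its best WordNet match in the other and averaging; the example word lists are arbitrary:

from nltk.corpus import wordnet as wn   # may require nltk.download('wordnet')

def word_similarity(w1, w2):
    # Best path similarity over all synset pairs for the two words.
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

def text_similarity(words_a, words_b):
    # For each word in text A, take its best match in text B, then average.
    if not words_a or not words_b:
        return 0.0
    best = [max(word_similarity(a, b) for b in words_b) for a in words_a]
    return sum(best) / len(best)

print(text_similarity(['doctor', 'hospital'], ['physician', 'clinic']))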
Hope this helps

NLTK - when to normalize the text?

I've finished gathering the data I plan to use for my corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag & chunk the corpus in the future. Some of NLTK's corpora are all lowercase and others aren't.
Can anyone shed some light on this subject, please?
By "normalize" do you just mean making everything lowercase?
The decision about whether to lowercase everything really depends on what you plan to do. For some purposes, lowercasing everything is better because it reduces the sparsity of the data (uppercase words are rarer and might confuse the system unless you have a corpus massive enough that the statistics on capitalized words are decent). In other tasks, case information might be valuable.
There are other, similar considerations you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"]? (I've seen all three in different corpora.) What about "7-year-old"? Is it one long word, or three words that should be separated?
That said, there's no reason to reformat the corpus. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.
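A minimal sketch of doing the normalization on the fly with NLTK's default tokenizer, so the stored corpus keeps its original casing:

from nltk.tokenize import word_tokenize   # may require nltk.download('punkt')

raw = "The 7-year-old said she can't come."   # original text stays unmodified

def normalized_tokens(text, lowercase=True):
    # Tokenize first, then lowercase copies of the tokens if desired.
    tokens = word_tokenize(text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(normalized_tokens(raw))   # lowercased view for training
print(word_tokenize(raw))       # original casing still recoverable from raw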
