Word tokenizer for Python implementation of hedonometer

I'm studying this paper, "The emotional arcs of stories are dominated by six basic shapes", in which a hedonometer (previously used mainly for sentiment analysis on Twitter) is applied to a book to build its emotional arc.
I'm trying to reproduce the paper's results in Python, but while I understand the hedonometer algorithm, I can't work out how the words were tokenized (the paper says very little on this matter, and I'm not confident about my approach).
In the link above, at the bottom, there is the Electronic Supplementary Material, which claims on page S8 that the book "The Picture of Dorian Gray" (after removing the front and back matter) has 84,591 words.
I tried two tokenizers (suppose the text is in a variable text).
The first is a naive one
words = text.split()
and gets only 78,934 words.
The second one uses a well-known library (the code might be sub-optimal, since I've never used NLTK before, nor done anything related to Natural Language Processing):
from nltk.tokenize import sent_tokenize, word_tokenize
words = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
With this I get an incredible 95,285 words! If I filter out the punctuation with the following code
words2 = [word for word in words if word not in ',-.:\'\"!?;()[]']
it reduces to 83,141 words.
Still, neither count is really close to the paper's figure of 84,591 words. What am I doing wrong? Am I misunderstanding how a tokenizer works?
P.S. I've tried with other books, but the results are similar: I always either underestimate or overestimate the word count by about 2,000 words.
Bonus question: as the paper points out, there is a dictionary of words with associated happiness scores, used to assign an averaged happiness score to chunks of text. That dictionary also contains words such as "can't", while the NLTK tokenizer splits "can't" into "ca" and "n't". Looking at the tokenized text on page S5 of the Electronic Supplementary Material, I assume the NLTK behaviour is the desired one. Yet, given how the dictionary is built, shouldn't it work the other way, i.e. leaving "can't" intact?
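A minimal sketch of a third tokenizer that keeps contractions intact, which might be closer to what the happiness dictionary expects (this is my assumption, not the paper's documented method; the regex and the lowercasing are illustrative choices):
import re

# Assumption: lowercase everything and keep apostrophes inside words, so that
# contractions like "can't" survive as single tokens and can be looked up
# directly in the happiness dictionary.
def simple_tokenize(text):
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

words = simple_tokenize(text)
print(len(words))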

Related

How to find text reuse with fuzzy match?

I'm trying to find, efficiently, the similarity between a short phrase and a large corpus. For example, suppose my corpus is the book Moby Dick, which has tens of thousands of words.
In addition to that, I have a few short phrases, for example:
phrase1 = "Call me Ishmael" # This is the first sentence in the book exactly.
phrase2 = "Call me Isabel" # This is like the previous with changes of few letters from the third word.
phrase3 = "Call me Is mael" #It's a similar sentence but one word split in two.
In addition, I have of course many other sentences that are not similar to sentences from the book.
I wonder what is a generic and efficient way to identify which phrases have a similar sentence in the book.
What I have tried so far, which seems less than ideal to me:
I split all the input sentences into 3/4/5/6 n-grams.
I split all the corpus sentences into 3/4/5/6 n-grams.
Then I tried to find approximate matches (with FuzzySet) between all possible combinations of corpus n-grams and input n-grams (the combinations are needed to catch cases where words have been split or merged).
It is quite clear to me that this method is very wasteful and probably also not the most accurate for my needs. I would love to understand how best to do it.
You can use corpus-based spell correction followed by fuzzyset. For spell correction, you can use a Python implementation of the symspell algorithm. Here you can find a list of repositories implementing symspell. Use symspell with the compounding configuration.
Use a speller trained on the corpus to spell-correct the short sentences.
Use fuzzyset/fuzzywuzzy to find a similarity score between each spell-corrected sentence and each sentence in the corpus.
Define a threshold by experiment; if the similarity is above the threshold, call it a match.
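A rough sketch of the fuzzy-matching step, skipping the spell-correction part (fuzzywuzzy's token_set_ratio and the threshold of 85 are illustrative choices, not prescriptions):
from fuzzywuzzy import fuzz  # rapidfuzz.fuzz offers the same interface

def best_match(phrase, corpus_sentences, threshold=85):
    # Return the corpus sentence most similar to `phrase`,
    # or None if nothing reaches the threshold.
    best_score, best_sentence = 0, None
    for sentence in corpus_sentences:
        score = fuzz.token_set_ratio(phrase, sentence)
        if score > best_score:
            best_score, best_sentence = score, sentence
    return best_sentence if best_score >= threshold else None

corpus = ["Call me Ishmael.", "It was the best of times."]
print(best_match("Call me Isabel", corpus))  # expected: "Call me Ishmael."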

How can I get unique words for each topic in LDA?

I am trying to get unique words for each topic.
I am using gensim and this is the line that helps me generate my model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary)
But I get repeated words in the two different topics; I would like to have different words per topic.
You cannot enforce word uniqueness by topic in LDA, since each topic is a distribution over all the words in the vocabulary. This distribution measures the probability that words co-occur inside a topic. Thus, nothing ensures that a word won't co-occur with different words in different contexts, which leads to that word being represented in different topics.
Let's take an example by considering these two documents:
doc1: The python is a beautiful snake living in the forest.
doc2: Python is a beautiful language used by programmer and data scientist.
In doc1 the word python co-occurs with snake, forest and living, which might give this word a good probability of appearing in a topic about, let's say, biology.
In doc2, the word python co-occurs with language, programmer and data, which, in this case, will associate the word with a topic about computer science.
What you can do, eventually, is look for the words that have the highest probability in each topic in order to achieve what you want.
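A rough sketch of that idea with gensim (illustrative, not the only way): assign each top word to the single topic where it has its highest probability, so the displayed word lists do not overlap. It assumes the ldamodel and dictionary from the question.
from collections import defaultdict

topn = 20  # number of top words to consider per topic (arbitrary choice)
best_topic = {}  # word id -> (topic id, probability)
for topic_id in range(ldamodel.num_topics):
    for word_id, prob in ldamodel.get_topic_terms(topic_id, topn=topn):
        if word_id not in best_topic or prob > best_topic[word_id][1]:
            best_topic[word_id] = (topic_id, prob)

unique_words = defaultdict(list)
for word_id, (topic_id, _) in best_topic.items():
    unique_words[topic_id].append(dictionary[word_id])
print(dict(unique_words))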
Words that are grouped into one topic are not necessarily semantically similar (i.e. at low distance in a space mapped from word2vec); they just co-occur more often.

NLP - When to lowercase text during preprocessing

I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.
Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.
Data: The data is in German and contains a lot of technical jargon.
My text corpus is in German and I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided to use/not use the following preprocessing steps:
no stopword removal
no lemmatization
replace all expressions with numbers by NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
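A rough sketch of the number and rare-word replacements listed above (the regex, the threshold and the token names are illustrative choices on my part):
import re
from collections import Counter

def replace_numbers(text):
    # Replace integers and decimals (e.g. "3", "3,14") with NUMBER.
    return re.sub(r"\d+(?:[.,]\d+)*", "NUMBER", text)

def replace_rare(tokens, min_count=3):
    # Replace tokens seen fewer than min_count times with RARE.
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else "RARE" for t in tokens]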
However, I am not sure whether to convert the corpus to lowercase. When searching the web I found different opinions. Although lowercasing is quite common, it would cause my model to wrongly predict the capitalization of nouns, sentence beginnings, etc.
I also found the idea to convert only the words at the beginning of a sentence to lower-case on the following Stanford page.
What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?
Thanks a lot for any suggestions and experiences!
I think for your particular use-case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context. You probably won't need to predict sentence beginnings in your use-case, and if a noun is predicted you can capitalize it later. However, consider the other way round (assuming your corpus were in English): your model might treat a word at the beginning of a sentence, with a capital letter, differently from the same word appearing later in the sentence without a capital letter. This could lead to a decline in accuracy, whereas I think lowercasing the words is the better trade-off. I did a project on a question-answering system, and converting the text to lowercase was a good trade-off.
Edit: Since your corpus is in German, it would be better to retain the capitalization, since it is an important aspect of the German language.
If it is of any help, spaCy supports German. You can use it to train your model.
In general, tRuEcasIng helps.
Truecasing is the process of restoring case information to badly-cased or noncased text.
See
How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
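A toy sketch of the idea (the linked truecasers are more sophisticated): learn the most frequent surface form of each word from a cased corpus, then map lowercased tokens back to that form.
from collections import Counter, defaultdict

def train_truecaser(cased_tokens):
    # Count surface forms per lowercased word and keep the most frequent one.
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(tokens, model):
    # Fall back to the token itself if it was never seen during training.
    return [model.get(tok.lower(), tok) for tok in tokens]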
Definitely convert the majority of the words to lowercase, but consider the following cases:
Acronyms, e.g. MIT: if you lowercase it to mit, which is a word in German, you'll be in trouble
Initials e.g. J. A. Snow
Enumerations, e.g. (I), (II), (III), APPENDIX A
I would also advise against the <RARE> token: what percentage of your corpus is <RARE>? What about unknown words?
Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further. Thus some sort of lemmatization and tokenization is needed.
I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you Matthew and Ines).
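For example, a minimal spaCy sketch (assuming the German model de_core_news_sm has been downloaded with python -m spacy download de_core_news_sm):
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Die Katzen saßen auf den Matten.")
for token in doc:
    # surface form, lemma and coarse part-of-speech tag
    print(token.text, token.lemma_, token.pos_)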

Extracting only meaningful text from webpages

I am getting a list of URLs and scraping them using NLTK. My end result is a list of all the words on the webpage. The trouble is that I am only looking for keywords and phrases that are not the usual English "sugar" words such as "as, and, like, to, am, for", etc. I know I can construct a file with all common English words and simply remove them from my scraped token list, but is there a built-in feature in some library that does this automatically?
I am essentially looking for useful words on a page that are not fluff and can give some context to what the page is about, almost like the tags on Stack Overflow or the tags Google uses for SEO.
I think what you are looking for is the stopwords.words from nltk.corpus:
>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
Edit: searching for "stopword" gives possible duplicates: Stopword removal with NLTK, How to remove stop words using nltk or python. See the answers to those questions, and consider Effects of Stemming on the term frequency? too.
While you might get robust lists of stopwords in NLTK (and elsewhere), you can easily build your own lists according to the kind of data (register) you process. Most of the words you do not want are so-called grammatical words: they are extremely frequent, so you catch them easily by sorting a frequency list in descending order and discarding the n top items.
In my experience, the first 100 ranks of any moderately large corpus (>10k tokens of running text) hardly contain any content words.
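A minimal sketch of that frequency-based approach (the cut-off of 100 is just a starting point to tune against your own corpus):
from collections import Counter

def build_stopwords(tokens, n_top=100):
    # Treat the n_top most frequent tokens in the corpus as stopwords.
    counts = Counter(t.lower() for t in tokens)
    return {word for word, _ in counts.most_common(n_top)}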
It seems that you are interested in extracting keywords, however. For this task, pure frequency signatures are not very useful. You will need to transform the frequencies into some other value with respect to a reference corpus: this is called weighting, and there are many different ways to achieve it. Tf-idf has been the industry standard since 1972.
If you are going to spend time doing these tasks, get an introductory handbook for corpus linguistics or computational linguistics.
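As a rough illustration of the tf-idf weighting mentioned above, here is a sketch using scikit-learn (one common option; the toy documents are made up):
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day in the sun",
    "quick quick brown foxes hunt at night",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(pages)
terms = vectorizer.get_feature_names_out()

# Highest-weighted terms for the first page: frequent there, rare elsewhere.
row = tfidf[0].toarray().ravel()
print(sorted(zip(terms, row), key=lambda x: -x[1])[:3])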
You can look at available corpus linguistics resources for data on word frequencies (along with other annotations).
You can start from links on wikipedia: http://en.wikipedia.org/wiki/Corpus_linguistics#External_links
More information you can probably find at https://linguistics.stackexchange.com/

Implementing idf with nltk

Given the sentence "the quick brown fox jumped over the lazy dog", I would like to get a score of how frequent each word is, based on an NLTK corpus (whichever corpus is most generic/comprehensive).
EDIT:
This question is in relation to this question: python nltk keyword extraction from sentence, where #adi92 suggested using the technique of idf to calculate the 'rareness' of a word. I would like to see what this would look like in practice. The broader problem here is: how do you calculate the rareness of a word's use in the English language? I appreciate that this is a hard problem to solve, but nonetheless NLTK idf (with something like the Brown or Reuters corpus?) might get us part of the way there.
If you want to know word frequencies you need a table of word frequencies. Words have different frequencies depending on text genre, so the best frequency table might be based on a domain-specific corpus.
If you're just messing around, it's easy enough to pick a corpus at random and count the words: use <corpus>.words() and NLTK's FreqDist, and/or see the NLTK book for details.
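For instance, a quick sketch with the Brown corpus (any NLTK corpus would do; this gives relative frequency rather than idf proper):
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download('brown')
freq = FreqDist(w.lower() for w in brown.words())
total = freq.N()

sentence = "the quick brown fox jumped over the lazy dog"
for word in sentence.split():
    # Relative frequency: rarer words get smaller values.
    print(word, freq[word] / total)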
But for serious use, don't bother counting words yourself: if you're not interested in a specific domain, grab a large word-frequency table. There are gazillions out there (it's evidently the first thing a corpus creator thinks of), and the largest ones are probably the "1-gram" tables compiled by Google. You can download them at http://books.google.com/ngrams/datasets
