Sentence tokenization without relying on punctuation and capitalization - Python

Is there an approach for extracting sentences from paragraphs (sentence tokenization) when the paragraphs have no punctuation and/or are all lowercased? We have a specific need to split paragraphs into sentences while expecting, in the worst case, improperly formatted input.
Example:
this is a sentence this is a sentence this is a sentence this is a sentence this is a sentence
into
["this is a sentence", "this is a sentence", "this is a sentence", "this is a sentence", "this is a sentence"]
The sentence tokenizers we have tried so far seem to rely on punctuation and true casing:
Using nltk.sent_tokenize
"This is a sentence. This is a sentence. This is a sentence"
into
['This is a sentence.', 'This is a sentence.', 'This is a sentence']

This is a hard problem, and you are likely better off trying to figure out how to deal with imperfect sentence segmentation. That said, there are some ways you can deal with this.
You can try to train a sentence segmenter from scratch using a sequence labeller. The trainable senter component in spaCy is one such model (the rule-based sentencizer, by contrast, relies on punctuation). It should be pretty easy to configure, but without punctuation or case I'm not sure how well it would work.
The other thing you can do is use a parser that segments text into sentences. The spaCy parser does this, but its training data is properly cased and punctuated, so you would need to train your own model. You could use the output of the parser on normal sentences, with everything lowercased and punctuation removed, as training data. Normally this kind of training data is inferior to the original, but given your specific needs it should at least be easy to obtain.
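As a rough illustration of that idea, here is a minimal sketch (assuming spaCy and the en_core_web_sm model are installed) that segments well-formed text and then strips case and punctuation, yielding noisy paragraphs paired with gold sentence boundaries:
import spacy

# Assumes spaCy and the en_core_web_sm model are installed; any pipeline with a parser works.
nlp = spacy.load("en_core_web_sm")

def make_degraded_examples(text):
    """Segment well-formed text, then lowercase it and drop punctuation.

    Returns (degraded_paragraph, gold_sentences): training material for a
    segmenter that has to cope with uncased, unpunctuated input.
    """
    doc = nlp(text)
    gold_sentences = []
    for sent in doc.sents:
        tokens = [tok.lower_ for tok in sent if not tok.is_punct]
        if tokens:
            gold_sentences.append(" ".join(tokens))
    return " ".join(gold_sentences), gold_sentences

paragraph, gold = make_degraded_examples(
    "This is a sentence. This is a sentence. This is a sentence."
)
print(paragraph)  # this is a sentence this is a sentence this is a sentence
print(gold)       # ['this is a sentence', 'this is a sentence', 'this is a sentence']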
Other possibilities involve using models to add punctuation and casing back, but then errors in those models compound, so it is probably harder than predicting sentence boundaries directly.

The only thing I can think of is to use a statistical classifier based on words that typically start or end sentences. This will not necessarily work in your example (I think only a full grammatical analysis would be able to identify sentence boundaries in that case), but you might get some way towards your goal.
Simply build a list of words that typically come at the beginning of a sentence. Words like the or this will probably be quite high on that list; count how many times the word occurs in your training text, and how many of these times it is at the beginning of a sentence. Then do the same for the end -- here you should never get the, as it cannot end a sentence in any but the most contrived examples.
With these two lists, go through your text and work out if you have a word that is likely to end a sentence followed by one that is likely to start one; if yes, you have a candidate for a potential sentence boundary. In your example, this would be likely to start a sentence, and sentence would be likely to be the sentence-final word. Obviously it depends on your data whether it works or not. If you're feeling adventurous, use parts-of-speech tags instead of the actual words; then your lists will be much shorter, and it should probably still work just as well.
However, you might find that you also get phrase boundaries (as each sentence will start with a phrase, and the end of the last phrase of a sentence will also coincide with the end of the sentence). It is hard to predict whether it will work without actually trying it out, but it should be quick and easy to implement and is better than nothing.
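If you want to experiment with this, here is a minimal sketch. It assumes you have a punctuated training corpus to count boundary statistics from, and the 0.3 threshold is purely illustrative:
from collections import Counter

import nltk  # used only to segment the punctuated *training* corpus
# nltk.download('punkt') on first run

def boundary_stats(training_text):
    """Count how often each word starts or ends a sentence in well-punctuated text."""
    starts, ends, totals = Counter(), Counter(), Counter()
    for sent in nltk.sent_tokenize(training_text):
        words = [w.lower() for w in nltk.word_tokenize(sent) if w.isalpha()]
        if not words:
            continue
        starts[words[0]] += 1
        ends[words[-1]] += 1
        totals.update(words)
    return starts, ends, totals

def candidate_boundaries(tokens, starts, ends, totals, threshold=0.3):
    """Flag positions where an 'end-ish' word is followed by a 'start-ish' word."""
    cuts = []
    for i in range(len(tokens) - 1):
        p_end = ends[tokens[i]] / max(totals[tokens[i]], 1)
        p_start = starts[tokens[i + 1]] / max(totals[tokens[i + 1]], 1)
        if p_end > threshold and p_start > threshold:
            cuts.append(i + 1)  # a boundary is proposed before token i+1
    return cuts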

Related

How to find text reuse with fuzzy match?

I am trying to efficiently find similarity between a short phrase and a large corpus. For example, suppose my corpus is the book Moby Dick, which has tens of thousands of words.
In addition to that, I have a few short phrases. For example:
phrase1 = "Call me Ishmael" # This is the first sentence in the book exactly.
phrase2 = "Call me Isabel" # Like the previous one, but with a few letters of the third word changed.
phrase3 = "Call me Is mael" # A similar sentence, but with one word split in two.
In addition, I of course have many other sentences that are not similar to sentences from the book.
I wonder what is a generic and effective way to identify sentences that have a similar counterpart in the book.
What have I tried so far, which seems less than ideal to me?
I split all the input sentences into 3/4/5/6 n-grams.
I split all the corpus sentences into 3/4/5/6 n-grams.
Then I tried to find an approximate match (with FuzzySet) between all possible combinations of corpus n-grams and input n-grams (the combinations are required to capture even cases where words have been split or merged).
It is quite clear to me that this method is very wasteful and probably also not the most accurate for my needs. I would love to understand how best to do it.
You can use corpus-based spell correction followed by FuzzySet. For spell correction, you can use a Python implementation of the SymSpell algorithm; here you can find a list of repositories implementing SymSpell. Use SymSpell with the compound configuration.
Use a speller trained on the corpus to spell-correct the short sentences.
Use FuzzySet/fuzzywuzzy to compute a similarity score between each spell-corrected sentence and each sentence in the corpus.
Choose a threshold by experiment; if the similarity is above the threshold, call it a match.
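A rough sketch of that pipeline, assuming symspellpy and fuzzywuzzy are installed and that the book text lives in a file called corpus.txt (the file name and the threshold of 80 are illustrative):
from nltk import sent_tokenize  # nltk.download('punkt') on first run
from symspellpy import SymSpell
from fuzzywuzzy import fuzz

# Build a word-frequency dictionary directly from the corpus text.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
sym_spell.create_dictionary("corpus.txt")

with open("corpus.txt", encoding="utf-8") as f:
    corpus_sentences = sent_tokenize(f.read())

def best_match(phrase, threshold=80):
    """Spell-correct the phrase against the corpus vocabulary, then fuzzy-match it to corpus sentences."""
    suggestions = sym_spell.lookup_compound(phrase, max_edit_distance=2)
    corrected = suggestions[0].term if suggestions else phrase
    best = max(corpus_sentences, key=lambda s: fuzz.ratio(corrected.lower(), s.lower()))
    score = fuzz.ratio(corrected.lower(), best.lower())
    return (best, score) if score >= threshold else (None, score)

print(best_match("Call me Isabel"))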

text segmentation based on punctuation marks, especially at clause level

I want to segment text when I encounter a punctuation mark in a sentence or paragraph. If I use a comma (,) in my regex, it also chunks individual nouns, verbs, or adjectives separated by commas.
Suppose we have "dogs, cats, rats and other animals". "dogs" becomes a separate chunk, which I do not want to happen.
Is there any way I can avoid that, using regex or any other means in NLTK, so that I only get comma-separated clauses as text segments?
Code
from nltk import sent_tokenize
import re

text = ("Peter Mattei's 'Love in the Time of Money' is a visually stunning film to watch. "
        "Mrs. Mattei offers us a vivid portrait about human relations. This is a movie that seems "
        "to be telling us what money, power and success do to people in the different situation we encounter.")
# Protect honorifics (Dr., Mrs., Mr., Ms., Prof.) so their periods are not treated as sentence ends.
text = re.sub(r"(?<=..Dr|.Mrs|..Mr|..Ms|Prof)[.]", "<prd>", text)
txt = re.split(r'\.\s|;|:|\?|\'\s|"\s|!|\s\'|\s\"', text)
print(txt)
This is too complicated to be solved with a regex: there is no way for a regex to know whether there is a predicate (verb) within the clause candidate, and if you expand it, you would break into another clause.
The problem you are trying to solve is called chunking in NLP. Traditionally, there were regex-based algorithms over POS tags (so you need to do POS tagging first). NLTK has a tutorial for that; however, this is a rather outdated approach.
Now that fast and reliable taggers and parsers are available (e.g., in spaCy), I would suggest parsing the sentence first and then finding chunks in the constituency parse.
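For reference, a minimal sketch of the traditional POS-tag-based chunking that NLTK supports (the grammar here is a toy rule, not a production rule set):
from nltk import pos_tag, word_tokenize, RegexpParser
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger') on first run

sentence = "dogs, cats, rats and other animals"
tagged = pos_tag(word_tokenize(sentence))

# Toy grammar: an NP is an optional determiner, any adjectives, then one or more nouns.
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
tree = RegexpParser(grammar).parse(tagged)
print(tree)
Note that this toy rule still splits the coordinated nouns at the commas; for clause-level segments, a constituency parse as suggested above is the more robust route.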

NLP - When to lowercase text during preprocessing

I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.
Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.
Data: The data is in German and contains a lot of technical jargon.
My text corpus is in German and I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided to use/not use the following preprocessing steps:
no stopword removal
no lemmatization
replace all expressions with numbers by NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
However, I am not sure whether to convert the corpus to lowercase. When searching the web I found different opinions. Although lowercasing is quite common, it will cause my model to wrongly predict the capitalization of nouns, sentence beginnings, etc.
I also found the idea of converting only the words at the beginning of a sentence to lowercase on the following Stanford page.
What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?
Thanks a lot for any suggestions and experiences!
I think for your particular use case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context, and you probably won't need to predict sentence beginnings. Also, if a noun is predicted you can capitalize it later. Consider the other way round (assuming your corpus were in English): your model might treat a word that appears capitalized at the beginning of a sentence differently from the same word appearing later in the sentence without a capital letter, which might hurt accuracy. So I think lowercasing the words is the better trade-off. I did a project on a question-answering system, and converting the text to lowercase was a good trade-off there.
Edit: Since your corpus is in German, it would be better to retain the capitalization, since it is an important aspect of the German language.
If it is of any help, spaCy supports German; you can use it to train your model.
In general, tRuEcasIng helps.
Truecasing is the process of restoring case information to badly-cased or noncased text.
See
How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
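For illustration, here is a minimal frequency-based truecaser sketch (not the Moses or nreimers implementations linked above): it records the most frequent casing of each token in a properly cased corpus and restores that casing later.
from collections import Counter, defaultdict

def build_case_model(cased_tokens):
    """Count the surface forms observed for each lowercased token."""
    forms = defaultdict(Counter)
    for tok in cased_tokens:
        forms[tok.lower()][tok] += 1
    return {low: counts.most_common(1)[0][0] for low, counts in forms.items()}

def truecase(tokens, case_model):
    """Restore each token's most frequent casing; leave unknown tokens unchanged."""
    # A real truecaser additionally handles sentence-initial tokens and unseen words.
    return [case_model.get(tok.lower(), tok) for tok in tokens]

model = build_case_model("Wir sehen , dass der Bericht den Umsatz im Quartal zeigt".split())
print(truecase("der bericht zeigt den umsatz".split(), model))
# ['der', 'Bericht', 'zeigt', 'den', 'Umsatz']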
Definitely convert the majority of the words to lowercase, but consider the following cases:
Acronyms, e.g. MIT: if you lowercase it to mit, which is a word in German, you'll be in trouble.
Initials, e.g. J. A. Snow
Enumerations, e.g. (I), (II), (III), APPENDIX A
I would also advise against the <RARE> token: what percentage of your corpus is <RARE>, and what about unknown words?
Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further.
Thus some sort of lemmatization and tokenization is needed.
I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you Matthew and Ines).
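A small sketch of that route, assuming the de_core_news_sm model has been downloaded (the lemmas and POS tags come from the pipeline; splitting compounds further would need an extra tool or a subword tokenizer):
import spacy

# python -m spacy download de_core_news_sm  (once, before running)
nlp = spacy.load("de_core_news_sm")

doc = nlp("Die Geschäftsberichte wurden im ersten Quartal veröffentlicht.")
for tok in doc:
    print(tok.text, tok.lemma_, tok.pos_)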

Word tokenizer for Python implementation of hedonometer

I'm studying this paper, "The emotional arcs of stories are dominated by six basic shapes", in which a hedonometer (previously used mainly for sentiment analysis on Twitter) is applied to a book to produce an emotional arc.
I'm trying to reproduce the results of the paper in Python, but while I understand the hedonometer algorithm, I can't work out how the words were tokenized (the paper says very little on this matter, and I'm not confident about my guess).
In the link above, at the bottom, there is the Electronic Supplementary Material, in which, on page S8, it is claimed that the book "The Picture of Dorian Gray" (after removing the front and back matter) has 84,591 words.
I tried two tokenizers (suppose the text is in a variable text).
The first is a naive one
words = text.split()
and gets only 78,934 words.
The second one uses a well-known library (the code might be sub-optimal, since I've never used NLTK before nor done anything related to natural language processing):
from nltk.tokenize import sent_tokenize, word_tokenize
words = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
I get an incredible 95,285 words! If I filter out the punctuation with the following code
words2 = [word for word in words if word not in ',-.:\'\"!?;()[]']
it reduces to 83,141 words.
Yet, nothing really close to the paper's figure of 84,591 words. What am I doing wrong? Am I not understanding how a tokenizer works?
P.S. I've tried with other books, but the results are similar: I always either underestimate or overestimate the word count by about 2,000 words.
Bonus question: as the paper points out, there is a dictionary of words with associated happiness scores to be used (to assign an averaged happiness score to chunks of text). This dictionary also contains words such as "can't", while the NLTK tokenizer splits "can't" into "ca" and "n't". Looking at the tokenized text on page S5 of the Electronic Supplementary Material, I assume the NLTK behaviour is the desired one. Yet, shouldn't it work the other way, i.e. leave "can't" intact, given how the dictionary is built?
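For what it's worth, a regex-based tokenizer keeps contractions such as "can't" together while discarding punctuation; a minimal sketch (not necessarily what the paper's authors did):
from nltk.tokenize import RegexpTokenizer

# Keep runs of letters with internal apostrophes as single tokens, so "can't" stays intact.
tokenizer = RegexpTokenizer(r"[A-Za-z]+(?:'[A-Za-z]+)*")
print(tokenizer.tokenize("I can't say it's wrong."))
# ['I', "can't", 'say', "it's", 'wrong']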

TF-IDF vectorizer: build vectors from lines instead of words

I'm trying to analyze a text which is given line by line, and I wish to vectorize the lines using scikit-learn's TF-IDF vectorization in Python.
The problem is that the vectorization can be done either by words or by n-grams, but I want it done for lines. I have already ruled out a workaround that just treats each line as a single word (since that way the words and their meanings won't be considered).
Looking through the documentation I didn't find how to do that, so is there any such option?
You seem to be misunderstanding what the TF-IDF vectorization is doing. For each word (or n-gram), it assigns a weight that is a function of both the frequency of the term within a document (TF) and of how rarely the term appears across documents (inverse document frequency, IDF). It makes sense to use it for words (e.g. knowing how often the word "pizza" comes up) or for n-grams (e.g. "cheese pizza" for a 2-gram).
Now, if you do it on lines, what will happen? Unless you happen to have a corpus in which lines are repeated exactly (e.g. "I need help in Python"), your TF-IDF transformation will be garbage, as each sentence will appear exactly once in the document. And if your sentences really are repeated down to the punctuation mark, then for all intents and purposes they are not sentences in your corpus, but words. This is why there is no option to do TF-IDF with sentences: it makes zero practical or theoretical sense.
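To make the contrast concrete, here is the usual word/n-gram usage in scikit-learn (a minimal sketch; the toy documents and parameters are illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I need help in Python",
    "Python help is what I need",
    "cheese pizza is great",
]

# Word unigrams and bigrams; each column is a term weighted by TF-IDF per document.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)                                # (number of documents, number of terms)
print(vectorizer.get_feature_names_out()[:5])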
