I want to segment text at punctuation marks within a sentence or paragraph. If I include the comma (,) in my regex, it also splits off the individual nouns, verbs, or adjectives that are separated by commas.
Suppose we have "dogs, cats, rats and other animals": "dogs" becomes a separate chunk, which I do not want to happen.
Is there any way, using regex or any other means in NLTK, to ignore those commas so that I only get comma-separated clauses as text segments?
Code
from nltk import sent_tokenize
import re
text = "Peter Mattei's 'Love in the Time of Money' is a visually stunning film to watch. Mrs. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situation we encounter."
# Protect the period in honorifics (Dr., Mrs., Mr., Ms., Prof.) so it is not treated as a boundary
text = re.sub("(?<=..Dr|.Mrs|..Mr|..Ms|Prof)[.]", "<prd>", text)
# Split on sentence/clause punctuation followed by whitespace
txt = re.split(r'\.\s|;|:|\?|\'\s|"\s|!|\s\'|\s\"', text)
print(txt)
This is too complicated to solve with a regex: a regex has no way of knowing whether a clause candidate contains a predicate (verb), or whether expanding it would break into another clause.
The problem you are trying to solve is called chunking in NLP. Traditionally, there were regex-based algorithms operating on POS tags (so you need to do POS tagging first). NLTK has a tutorial for that; however, it is a rather outdated approach.
Now that fast and reliable taggers and parsers are available (e.g., in spaCy), I would suggest analyzing the sentence first and then finding chunks in a constituency parse.
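For illustration, here is a rough sketch of that idea using spaCy's tagger (the model name en_core_web_sm and the helper function are just examples, not the only option): treat a comma as a clause boundary only when there is a predicate (verb) on both sides of it, so an enumeration like "dogs, cats, rats and other animals" stays in one segment.
import spacy
nlp = spacy.load("en_core_web_sm")  # any model with a tagger will do
def clause_segments(text):
    # Split at a comma only if both the text before it and the text after it
    # contain a verb, i.e. both sides look like clauses rather than list items.
    doc = nlp(text)
    segments, current = [], []
    for token in doc:
        current.append(token)
        if token.text == ",":
            left_has_verb = any(t.pos_ in ("VERB", "AUX") for t in current)
            right_has_verb = any(t.pos_ in ("VERB", "AUX") for t in doc[token.i + 1:])
            if left_has_verb and right_has_verb:
                segments.append(doc[current[0].i:token.i + 1].text)
                current = []
    if current:
        segments.append(doc[current[0].i:].text)
    return segments
print(clause_segments("Dogs, cats, rats and other animals are pets, but some people prefer birds."))
# expected (model-dependent): ['Dogs, cats, rats and other animals are pets,', 'but some people prefer birds.']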
Related
Is there an approach for extracting sentences from paragraphs (sentence tokenization) when the paragraphs have no punctuation and/or are all lowercased? We specifically need to split paragraphs into sentences while expecting, in the worst case, that the input paragraphs are improperly formatted.
Example:
this is a sentence this is a sentence this is a sentence this is a sentence this is a sentence
into
["this is a sentence", "this is a sentence", "this is a sentence", "this is a sentence", "this is a sentence"]
The sentence tokenizer we have tried so far seems to rely on punctuation and true casing:
Using nltk.sent_tokenize
"This is a sentence. This is a sentence. This is a sentence"
into
['This is a sentence.', 'This is a sentence.', 'This is a sentence']
This is a hard problem, and you are likely better off trying to figure out how to deal with imperfect sentence segmentation. That said, there are some ways you can deal with this.
You can try to train a sentence segmenter from scratch using a sequence labeller; spaCy's trainable senter component is one such model. This should be pretty easy to configure, but without punctuation or case I'm not sure how well it would work.
The other thing you can do is use a parser that segments text into sentences. The spaCy parser does this, but its training data is properly cased and punctuated, so you'd need to train your own model. You could use the parser's output on normal sentences, with everything lowercased and punctuation removed, as training data. Normally this kind of training data is inferior to the original, but given your specific needs it should at least be easy to obtain.
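For example, something like the following sketch (the model name en_core_web_sm and the helper name are illustrative) would give you degraded text whose gold boundaries you know:
import spacy
nlp = spacy.load("en_core_web_sm")
def degraded_examples(text):
    # Each returned item is one gold sentence with casing and punctuation
    # removed; joining them gives degraded input with known boundaries.
    doc = nlp(text)
    return [" ".join(t.text.lower() for t in sent if not t.is_punct)
            for sent in doc.sents]
print(degraded_examples("This is a sentence. This is a sentence. This is a sentence."))
# ['this is a sentence', 'this is a sentence', 'this is a sentence']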
Other possibilities involve using models to add punctuation and casing back, but then errors in those models will compound, so it's probably harder than predicting sentence boundaries directly.
The only thing I can think of is to use a statistical classifier based on words that typically start or end sentences. This will not necessarily work in your example (I think only a full grammatical analysis would be able to identify sentence boundaries in that case), but you might get some way towards your goal.
Simply build a list of words that typically come at the beginning of a sentence. Words like the or this will probably be quite high on that list; count how many times the word occurs in your training text, and how many of these times it is at the beginning of a sentence. Then do the same for the end -- here you should never get the, as it cannot end a sentence in any but the most contrived examples.
With these two lists, go through your text and work out if you have a word that is likely to end a sentence followed by one that is likely to start one; if yes, you have a candidate for a potential sentence boundary. In your example, "this" would be likely to start a sentence, and "sentence" would be likely to be the sentence-final word. Obviously it depends on your data whether it works or not. If you're feeling adventurous, use parts-of-speech tags instead of the actual words; then your lists will be much shorter, and it should probably still work just as well.
However, you might find that you also get phrase boundaries (as each sentence will start with a phrase, and the end of the last phrase of a sentence will also coincide with the end of the sentence). It is hard to predict whether it will work without actually trying it out, but it should be quick and easy to implement and is better than nothing.
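A minimal sketch of what I mean (the function names and the count threshold are arbitrary): learn from punctuated training text how often each word starts or ends a sentence, then flag positions where a likely sentence-ender is followed by a likely sentence-starter.
from collections import Counter
from nltk import sent_tokenize, word_tokenize
def boundary_counts(training_text):
    # Count, per word, how often it appears sentence-initially and sentence-finally.
    starts, ends = Counter(), Counter()
    for sent in sent_tokenize(training_text):
        words = [w.lower() for w in word_tokenize(sent) if w.isalpha()]
        if words:
            starts[words[0]] += 1
            ends[words[-1]] += 1
    return starts, ends
def candidate_boundaries(tokens, starts, ends, min_count=2):
    # Position i+1 is a candidate boundary if tokens[i] often ends sentences
    # and tokens[i+1] often starts them in the training data.
    return [i + 1 for i in range(len(tokens) - 1)
            if ends[tokens[i]] >= min_count and starts[tokens[i + 1]] >= min_count]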
Since Chinese is different from English, how can we split a Chinese paragraph into sentences (in Python)? A sample Chinese paragraph is given as
我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。
To the best of my knowledge,
from nltk import tokenize
tokenize.sent_tokenize(paragraph, "chinese")
does not work because tokenize.sent_tokenize() doesn't support Chinese.
All the methods I found through a Google search rely on regular expressions (such as
re.split('(。|！|\!|\.|？|\?)', paragraph_variable)
). Those methods are not complete enough. It seems that no single regular expression pattern can split a Chinese paragraph into sentences correctly. I guess there should be some learned model to accomplish this task, but I can't find one.
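For reference, the kind of regex baseline I mean, written so that it at least keeps the punctuation attached to its sentence (splitting on a zero-width match needs Python 3.7+, and this is still not a learned model):
import re
def split_zh(paragraph):
    # Split right after Chinese sentence-final punctuation and its ASCII
    # counterparts, so the delimiter stays with the sentence it ends.
    parts = re.split(r'(?<=[。！？!?])', paragraph)
    return [p for p in parts if p.strip()]
print(split_zh('我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。'))
# ['我是中文段落,如何为我分句呢?', '我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。']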
I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.
Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.
Data: The data is in German and contains a lot of technical jargon.
My text corpus is in German and I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided on the following preprocessing steps (a rough sketch of the replacement steps follows the list):
no stopword removal
no lemmatization
replace all expressions with numbers by NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
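Concretely, the last two replacement steps could look roughly like this (the threshold of 3 and the placeholder spellings are arbitrary choices):
import re
from collections import Counter
def replace_numbers_and_rare(sentences, min_count=3):
    # Replace digit sequences (optionally with a decimal part) by NUMBER,
    # then replace any token seen fewer than min_count times by RARE.
    sentences = [re.sub(r'\d+([.,]\d+)?', 'NUMBER', s) for s in sentences]
    counts = Counter(w for s in sentences for w in s.split())
    return [' '.join(w if counts[w] >= min_count else 'RARE' for w in s.split())
            for s in sentences]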
However, I am not sure whether to convert the corpus to lowercase. When searching the web I found different opinions. Although lower-casing is quite common, it will cause my model to wrongly predict the capitalization of nouns, sentence beginnings, etc.
I also found the idea of converting only the words at the beginning of a sentence to lower-case on the following Stanford page.
What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?
Thanks a lot for any suggestions and experiences!
I think for your particular use-case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context, and you probably won't need to predict sentence beginnings. Also, if a noun is predicted, you can capitalize it later. Consider the other way around (assuming your corpus is in English): your model might treat a word that appears at the beginning of a sentence with a capital letter differently from the same word appearing later in the sentence without a capital letter, which might lead to a decline in accuracy. Lowercasing the words is, I think, the better trade-off. I did a project on a question-answering system, and converting the text to lowercase was a good trade-off there.
Edit: Since your corpus is in German, it would be better to retain the capitalization, as it is an important aspect of the German language.
If it is of any help, spaCy supports German. You can use it to train your model.
In general, tRuEcasIng helps.
Truecasing is the process of restoring case information to badly-cased or noncased text.
See
How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
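For intuition, a very naive frequency-based truecaser could look like this (the linked projects are far more sophisticated, e.g. they handle sentence-initial words properly): learn the most frequent surface form of each word from well-cased text, then map lowercased text through that table.
from collections import Counter, defaultdict
from nltk import word_tokenize
def train_truecaser(cased_text):
    # Remember how each lowercased word is most often written in cased text.
    forms = defaultdict(Counter)
    for tok in word_tokenize(cased_text):
        forms[tok.lower()][tok] += 1
    return {low: counts.most_common(1)[0][0] for low, counts in forms.items()}
def truecase(lowercased_text, table):
    return ' '.join(table.get(tok, tok) for tok in lowercased_text.split())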
Definitely convert the majority of the words to lowercase, but consider the following cases (see the sketch after the list):
Acronyms, e.g. MIT: if you lowercase it to mit, which is a word in German, you'll be in trouble
Initials, e.g. J. A. Snow
Enumerations, e.g. (I), (II), (III), APPENDIX A
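A sketch of how those exceptions might be handled during lowercasing (the patterns are heuristics you would need to tune, not a complete rule set):
import re
ROMAN = re.compile(r'^\(?[IVXLC]+\)?$')   # enumerations such as (I), (II), III
INITIAL = re.compile(r'^[A-ZÄÖÜ]\.$')     # initials such as J. A.
def careful_lower(token):
    if token.isupper() and len(token) > 1:   # acronyms such as MIT stay as-is
        return token
    if ROMAN.match(token) or INITIAL.match(token):
        return token
    # a bare enumeration letter, like the A in "APPENDIX A", still needs its own rule
    return token.lower()
print([careful_lower(t) for t in "MIT und J. A. Snow siehe APPENDIX A (II)".split()])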
I would also advise against the <RARE> token: what percentage of your corpus would become <RARE>, and what about unknown words?
Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further.
Thus some form of lemmatization and tokenization is needed.
I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you Matthew and Ines).
I need to split some text sections into sentences and I'm using the NLTK tokenizer for this purpose. These text pieces are all lower-case and generally of low quality, which makes the task more difficult. However, a few errors from time to time are acceptable as long as the general rules of the language are upheld. For instance, I want the sentences to be split after a dot. The text sections may contain many individual sentences as well as abbreviations, etc.
How do I ensure that NLTK ignores capitalization and splits the text below into 2 sentences between "2006." and "though"?
from nltk.tokenize import sent_tokenize
print(sent_tokenize('no drop in its quality as it got nearer to its end, in 2006. though i didn\'t like the movie much.'))
>> ["no drop in its quality as it got nearer to its end, in 2006. though i didn't like the movie much."]
I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:
Identify important entities (people, places, concepts)
Find significant changes in the importance (~=frequency) of those entities over time (using the article sequence number as a proxy for time)
The steps I've taken so far:
Imported the data into a python list:
import json
json_articles=open('articlefile.json')
articlelist = json.load(json_articles)
Selected a single article to test, and concatenated the body text into a single string:
txt = ' '.join(articlelist[10000]['body'])
Loaded a French sentence tokenizer and split the string into a list of sentences:
import nltk
french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
sentences = french_tokenizer.tokenize(txt)
Attempted to split the sentences into words using the WhiteSpaceTokenizer:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
tokens = [wst.tokenize(s) for s in sentences]
This is where I'm stuck, for the following reasons:
NLTK doesn't have a built-in tokenizer which can split French into words. Whitespace doesn't work well, particularly because it won't correctly separate on apostrophes.
Even if I were to use regular expressions to split into individual words, there's no French PoS (part-of-speech) tagger that I can use to tag those words, and no way to chunk them into logical units of meaning.
For English, I could tag and chunk the text like so:
tagged = [nltk.pos_tag(token) for token in tokens]
chunks = nltk.batch_ne_chunk(tagged)
My main options (in order of current preference) seem to be:
Use nltk-trainer to train my own tagger and chunker.
Use the python wrapper for TreeTagger for just this part, as TreeTagger can already tag French, and someone has written a wrapper which calls the TreeTagger binary and parses the results.
Use a different tool altogether.
If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?
If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?
There is also TreeTagger (which supports a French corpus) with a Python wrapper. This is the solution I am currently using, and it works quite well.
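Roughly, using the Python wrapper (the treetaggerwrapper package) looks like the following; the TAGDIR path below is just a placeholder for wherever the TreeTagger binaries are installed locally.
import treetaggerwrapper
# TAGDIR must point at your local TreeTagger installation (assumed path here).
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr', TAGDIR='/opt/treetagger')
tags = tagger.tag_text("Ceci est une phrase en français.")
print(treetaggerwrapper.make_tags(tags))  # (word, pos, lemma) tuples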
As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.
It should be possible to use this French tagger in NLTK, using Nitin Madnani's Interface to the Stanford POS-tagger
I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
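Untested, but I expect the call to look roughly like this with a recent NLTK, where the wrapper class is StanfordPOSTagger; both paths are placeholders for wherever the Stanford tagger distribution is unpacked.
from nltk.tag import StanfordPOSTagger
# Paths below are placeholders for your local copy of the Stanford tagger.
tagger = StanfordPOSTagger(
    'stanford-postagger/models/french.tagger',
    'stanford-postagger/stanford-postagger.jar')
print(tagger.tag('Cette phrase est en français .'.split()))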
Here are some suggestions:
WhitespaceTokenizer is doing what it's meant to. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.
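To make the difference concrete, here is how those options behave on a French string with apostrophes (the pattern given to RegexpTokenizer is just one possibility):
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, RegexpTokenizer
s = "L'art de l'apostrophe, c'est subtil."
print(WhitespaceTokenizer().tokenize(s))   # ["L'art", 'de', "l'apostrophe,", "c'est", 'subtil.']
print(WordPunctTokenizer().tokenize(s))    # splits the apostrophe off: ['L', "'", 'art', ...]
print(RegexpTokenizer(r"\w+['’]|\w+|\S").tokenize(s))  # keeps elisions together: ["L'", 'art', ...]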
Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.
The nltk only comes with the English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.
Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get the best results if the genre of the training text matches your domain (articles). You can use nltk-trainer, but you could also use the NLTK features directly.
You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you must start with XMLCorpusReader and subclass it to provide a tagged_sents() method.
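A very rough sketch of that subclass; the SENT and w element names and the cat attribute are assumptions about the French Treebank XML that you would need to verify against the actual corpus files.
from nltk.corpus.reader.xmldocs import XMLCorpusReader
class FrenchTreebankReader(XMLCorpusReader):
    def tagged_sents(self, fileids=None):
        # Yield one list of (word, tag) pairs per <SENT> element (assumed layout).
        for fileid in (fileids or self.fileids()):
            for sent in self.xml(fileid).iter('SENT'):
                yield [(w.text, w.get('cat')) for w in sent.iter('w')]
# reader = FrenchTreebankReader('/path/to/french-treebank', r'.*\.xml')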
If you're not already on the nltk-users mailing list, I think you'll want to get on it.