I am able to tokenize non-dialogue text into sentences, but when I add quotation marks the NLTK tokenizer doesn't split the sentences up correctly. For example, this works as expected:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
This results in a list of three different sentences:
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
However, if I make it into a dialogue, the same process doesn't work.
text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)
This returns it as a single sentence:
['“Is this one sentence?” “This is separate.” “This is a third” he said.']
How can I make the NLTK tokenizer work in this case?
It seems the tokenizer doesn't know what to do with the directional ("curly") quotes. Replace them with regular ASCII double quotes and the example works fine.
>>> import re
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
I have a list of sentences as below:
sentences = ["I am learning to code", "coding seems to be intresting in python", "how to code in python", "practicing how to code is the key"]
Now I wish to replace a few substrings in this list of sentences using a dictionary of words and their replacements.
word_list = {'intresting': 'interesting', 'how to code': 'learning how to code', 'am learning':'love learning', 'in python': 'using python'}
I tried the following code:
replaced_sentences = [' '.join([word_list.get(w, w) for w in sentence.split()])
for sentence in sentences]
But only the single-word keys are getting replaced, not the keys with more than one word. This is because I am using sentence.split(), which tokenizes the sentences word by word and so misses substrings longer than one word.
How can I replace the substrings with an exact match, using regex or any other approach?
expected output:
sentences = ["I love learning to code", "coding seems to be interesting using python", "learning how to code using python", "practicing learning how to code is the key"]
Thanks in advance.
It's probably easiest to read if you break this into a function that replaces all the words for a single sentence. Then you can apply it to all the sentences in the list. Here we make a single regex by joining all the keys of the dict with '|'. Then re.sub looks up the value associated with each matched key and returns it as the replacement.
import re
def replace_words(s, word_lookup):
    # Build one pattern that matches any key, then swap each match for its dict value
    rx = '|'.join(word_lookup.keys())
    return re.sub(rx, lambda match: word_lookup[match.group(0)], s)
[replace_words(s, word_list) for s in sentences]
This will result in:
['I love learning to code',
'coding seems to be interesting using python',
'learning how to code using python',
'practicing learning how to code is the key']
You could optimize a bit by compiling the regex once instead of building it on every call. That would allow you to do something like:
import re
rx = re.compile('|'.join(word_list.keys()))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]
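One caveat worth noting: if a key ever contains regex metacharacters, or the engine reaches a shorter key before a longer overlapping one, the pattern can misbehave. A defensive variant (a sketch, reusing word_list and sentences from the question) escapes each key and tries longer keys first:
import re
# Escape each key and sort longest-first so longer phrases win over
# shorter overlapping ones; word_list and sentences are as defined above.
pattern = '|'.join(sorted(map(re.escape, word_list), key=len, reverse=True))
rx = re.compile(pattern)
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]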
I'm still new to Python and want to know how I can tokenize a list of strings without every word being separated by a comma.
For example, starting from a list like ['I have to get groceries.','I need some bananas.','Anything else?'], I want to obtain a list like this: ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']. The point is thus not to create a list with separate tokens necessarily, but to create a list with sentences in which all words and punctuation marks are separated from each other.
Any ideas? I only managed to create a list of comma separated tokens, using this code:
nltk.download('punkt')
from nltk import word_tokenize
tokenized = []
for line in unique:
    tokenized.append(word_tokenize(line))
You can join the tokenized lines with a space; just use:
from nltk import word_tokenize
unique = ['I have to get groceries.','I need some bananas.','Anything else?']
tokenized = [" ".join(word_tokenize(line)) for line in unique]
print(tokenized)
# => ['I have to get groceries .', 'I need some bananas .', 'Anything else ?']
I have imported all the books from the NLTK Book library, and I am just trying to figure out how to refer to a text and then to a sentence within it so that it can be printed.
For example, if I wanted to print sentence 1 of text 3, then sentence 2 of text 4
import nltk
from nltk.book import *
print(???)
print(???)
I've tried the below combinations, which do not work:
print(text3.sent1)
print(text4.sent2)
print(sent1.text3)
print(sent2.text4)
print(text3(sent1))
print(text4(sent2))
I am new to Python, so it is likely a very basic question, but I cannot seem to find the solution elsewhere.
Many thanks, in advance!
A simple example can be given as:
from nltk.tokenize import sent_tokenize
# A string containing several sentences
sentences = "This is first sentence. This is second sentence. Let's try to tokenize the sentences. how are you? I am doing good"
# define function
def sentence_tokenizer(sentences):
    sentence_tokenize_list = sent_tokenize(sentences)
    print("tokenized sentences are = ", sentence_tokenize_list)
    return sentence_tokenize_list
# call function
tokenized_sentences = sentence_tokenizer(sentences)
# print first sentence
print(tokenized_sentences[0])
Hope this helps.
You need to split the texts into lists of sentences first.
If you already have text3 and text4:
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(text3)
print(sents[0]) # the first sentence in the list is at position 0
sents = sent_tokenize(text4)
print(sents[1]) # the second sentence in the list is at position 1
print(text3[0]) # prints the first word of text3
You seem to need both an NLTK tutorial and a Python tutorial. Luckily, the NLTK book is both.
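Note that the text3 and text4 that come from from nltk.book import * are Text objects, i.e. sequences of tokens rather than raw strings, while sent_tokenize expects a string. A rough sketch of one way to handle that, assuming text3 is that Text object (the spacing around punctuation in the rebuilt string will be slightly off):
from nltk.book import *   # as in the question; loads text1..text9
from nltk.tokenize import sent_tokenize
raw = ' '.join(text3)          # rebuild a (roughly spaced) string from the tokens
print(sent_tokenize(raw)[0])   # the first sentence of text3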
I know you can use noun extraction to get nouns out of sentences but how can I use sentence overlays/maps to take out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
Input:
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal noun extraction won't work because it won't always be nouns
The way sentences are phrased differs, so it can't be words[x] ... because the positioning of the words changes
You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
...     print(mo.group("action"))
...
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)
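If you have several overlays to try, the conversion can be wrapped in a small helper that returns the named captures of the first overlay that matches. This is only a sketch; the function name match_overlay and its arguments are made up for illustration:
import re

def match_overlay(sentence, overlays):
    # Escape each overlay, turn '#name' placeholders into named groups,
    # and return the captures of the first overlay that matches.
    for overlay in overlays:
        pattern = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(overlay))
        mo = re.search(pattern, sentence)
        if mo:
            return mo.groupdict()
    return None

print(match_overlay("First, Dance and Code; Second, Foobar",
                    ["First, #action; Second, Foobar"]))
# {'action': 'Dance and Code'}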
I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then remove the stopwords. A tokenizer usually splits punctuation into separate tokens.
import nltk
text = 'Son, if you really want something in this life, \
you have to work for it. Now quiet! They are about \
to announce the lottery numbers.'
stopwords = ['in', 'to', 'for', 'the']
sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))
print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
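If you would rather not maintain the stopword list by hand, NLTK also ships an English stopword corpus that can be plugged into the same loop. A sketch, reusing the text variable from above and assuming the stopwords data has been downloaded:
import nltk
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))   # requires nltk.download('stopwords')
sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    # compare case-insensitively, since the NLTK list is lowercase
    sents.append(' '.join(w for w in tokens if w.lower() not in stop))
print(sents)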
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])
This pulls out the leading run of ASCII letters from each token (dropping punctuation like a trailing period) and checks that against your stopwords set or list. Also, it assumes that all of your words in stopwords are lowercase, which is why I converted the word to lowercase. Take that out if I made too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
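A non-regex alternative (a sketch of the same idea, using the example text and stopwords from the other answer): strip punctuation from each token only for the stopword comparison, but keep the original token in the output, so the periods survive for sent_tokenize later.
import string
stopwords = ['in', 'to', 'for', 'the']
text = 'Son, if you really want something in this life, you have to work for it.'
# Compare each token with surrounding punctuation stripped, but keep the
# original token (punctuation and all) when it passes the filter.
clean = ' '.join(word for word in text.split()
                 if word.strip(string.punctuation).lower() not in stopwords)
print(clean)
# Son, if you really want something this life, you have work it.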