How to split a string on commas or periods in nltk - python

I want to separate a string on commas and/or periods in nltk. I've tried with sent_tokenize() but it separates only on periods.
I've also tried this code
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."

class CommaPoint(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', ',')

tokenizer = PunktSentenceTokenizer(lang_vars=CommaPoint())
n_w = tokenizer.tokenize(ex_sent)
print(n_w)
The output for the code above is
['This is an example showing sentence filtration.This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.\n']
When a '.' is not followed by a space, the tokenizer treats it as part of the word instead of splitting there.
I want the output to be:
['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']

How about something simpler with re:
>>> import re
>>> sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
>>> re.split(r'[.,]', sent)
['This is an example showing sentence filtration', 'This is how it is done', ' in case of Python I want to learn more', ' So', ' that i can have some experience over it', ' by it I mean python', '']
To keep the delimiters, you can use a capture group:
>>> re.split(r'([.,])', sent)
['This is an example showing sentence filtration', '.', 'This is how it is done', ',', ' in case of Python I want to learn more', '.', ' So', ',', ' that i can have some experience over it', ',', ' by it I mean python', '.', '']
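To get the output asked for in the question, the text pieces and their delimiters can then be zipped back together and stripped; a rough sketch:
>>> parts = re.split(r'([.,])', sent)
>>> [(piece + delim).strip() for piece, delim in zip(parts[0::2], parts[1::2]) if piece.strip()]
['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']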

In this case you could replace all commas with periods in the string and then tokenize it:
from nltk.tokenize import sent_tokenize
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
ex_sent = ex_sent.replace(",", ".")
n_w = sent_tokenize(ex_sent, 'english')
print(n_w)
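Alternatively, instead of rewriting the commas, a zero-width lookbehind split (a sketch; it assumes Python 3.7+, where splitting on an empty match is allowed) keeps each '.' or ',' attached to its sentence and swallows the following space:

import re

# ex_sent as defined above
n_w = [s for s in re.split(r'(?<=[.,])\s*', ex_sent) if s]
print(n_w)
# ['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']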

Related

Split text but include pattern in the first split part

Looks very obvious, but I couldn't find anything similar. I want to split some text and want the pattern of the split condition to be part of the first split part.
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')
splitted_text = pattern.split(some_text)
returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want is that it returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
btw: I am only interested in an re solution, not an nltk library that does it with other methods.
It would be simpler and more efficient to use re.findall instead of splitting in this case:
re.findall(r'[^.]*\.', some_text)
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
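If the leading spaces are unwanted, the matches can simply be stripped afterwards, e.g.:
[s.strip() for s in re.findall(r'[^.]*\.', some_text)]
This returns:
['Hi there.', "It's a nice weather.", 'Have a great day.']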
You can use capture groups with re.split:
>>> re.split(r'([^.]+\.)', some_text)
['', 'Hi there.', '', " It's a nice weather.", '', ' Have a great day.', '']
If you also want to strip the leading spaces from the last two sentences, you can put \s* outside the capture group:
>>> re.split(r'([^.]+\.)\s*', some_text)
['', 'Hi there.', '', "It's a nice weather.", '', 'Have a great day.', '']
Or (with Python 3.7+, or with the regex module) use a zero-width lookbehind that splits immediately after a .:
>>> re.split(r'(?<=\.)', some_text)
['Hi there.', " It's a nice weather.", ' Have a great day.', '']
That will split the same way even if there is no space after the ..
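For example (a quick sketch with a made-up no-space variant of the text):
>>> re.split(r'(?<=\.)', "Hi there.It's a nice weather.")
['Hi there.', "It's a nice weather.", '']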
And you can filter the '' fields to remove the blank results from splitting:
>>> [field for field in re.split(r'([^.]+\.)', some_text) if field]
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can split on the whitespace with a lookbehind to account for the period. Additionally, to account for the possibility of no whitespace, a lookahead can be used:
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautify day.']
re explanation:
(?<=\.) => positive lookbehind: a . must immediately precede the position for the rest of the pattern to match.
\s => matches a whitespace character.
| => alternation: matches either the expression on its left or the one on its right.
\. => matches a period.
(?=[A-Z]) => positive lookahead: the preceding period matches only if the next character is a capital letter.
If each sentence always ends with a ., it would be simpler and more efficient to use the str.split method instead of using any regular expression at all:
[s + '.' for s in some_text.split('.') if s]
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']

Split a string into its sentences using python

I have this following string:
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
Now, I want to split it into two sentences.
However, when I do:
string.split('.')
I get:
['This is one sentence ${w_{1},',
'',
',w_{i}}$',
' This is another sentence',
' ']
Does anyone have an idea of how to improve this, so that the "." within the $ $ is not detected as a sentence boundary?
Also, how would you go about this:
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
EDIT 1:
The desired outputs would be:
For string 1:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence']
For string 2:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe ! ']
For the more general case, you could use re.split like so:
import re
mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
re.split("[.!?]\s{1,}", mystr)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
re.split("[.!?]\s{1,}", str2)
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']
Where the chars in the brackets are whatever you pick as your punctuation, and the trailing \s{1,} requires at least one space, so the other .'s, which have no spacing, are ignored. This will also handle your exclamation point case.
Here's a (somewhat hacky) way to get the punctuation back:
punct = re.findall(r"[.!?]\s{1,}", str2)
['! ', '. ', '? ', '! ']
sent = [x + y for x, y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
['This is one sentence ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe ! ']
You can use re.findall with an alternation pattern. To ensure that the sentence starts and ends with a non-whitespace, use a positive lookahead pattern at the start and a positive lookbehind pattern at the end:
re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)
This returns, for the first string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
and for the second string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
Use '. ' (with a space after the .) because that only exists when a sentence ends, not mid-sentence.
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string.split('. ')
this returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
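If the trailing empty string is unwanted, it can be filtered out, e.g.:
[s for s in string.split('. ') if s]
which returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']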

Why is & tokenized as "&amp;" in Python NLTK

When trying to use the Toktok word tokenizer from NLTK in Python3
string='&& Test & and L&R '
from nltk.tokenize.toktok import ToktokTokenizer
ToktokTokenizer().tokenize(string)
I obtain the following output:
['&&amp;', 'Test', '&amp;', 'and', 'L&R']
Looks like it escapes the & in a strange way.
I'm using NLTK version 3.3 and Python 3.6.4.
Any guess why this happens and an efficient way of solving it?
I know I can go through the answer with
[tok.replace("&","&") for tok in tokenized_sentence]
but it seems a dirty hack. I would like to know if there is a way of not producing this effect in the first way.
As mentioned by @snakecharmerb, for the &amp; substitution the source states:
# Replace problematic character with numeric character reference.
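For context, the relevant entry in the ToktokTokenizer source looks roughly like this (so every '& ' is rewritten to '&amp; ' during tokenization):
AMPERCENT = re.compile('& '), '&amp; '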
One approach to solve the issue is to override the fields on the ToktokTokenizer instance, for example:
import re
from nltk.tokenize.toktok import ToktokTokenizer
string = '&& Test & and L&R '
tokenizer = ToktokTokenizer()
tokenizer.AMPERCENT = re.compile('& '), '& '
tokenizer.TOKTOK_REGEXES = [(regex, sub) if sub != '&amp; ' else (re.compile('& '), '& ')
                            for (regex, sub) in ToktokTokenizer.TOKTOK_REGEXES]
result = tokenizer.tokenize(string)
print(result)
Output
['&&', 'Test', '&', 'and', 'L&R']

NLTK tokenize text with dialog into sentences

I am able to tokenize non-dialog text into sentences but when I add quotation marks to the sentence the NLTK tokenizer doesn't split them up correctly. For example, this works as expected:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
This results in a list of three different sentences:
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
However, if I make it into a dialogue, the same process doesn't work.
text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)
This returns it as a single sentence:
['“Is this one sentence?” “This is separate.” “This is a third” he said.']
How can I make the NLTK tokenizer work in this case?
It seems the tokenizer doesn't know what to do with the directed quotes. Replace them with regular ASCII double quotes and the example works fine.
>>> import re
>>> import nltk
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']

python regex finding all groups of words

Here is what I have so far
text = "Hello world. It is a nice day today. Don't you think so?"
re.findall('\w{3,}\s{1,}\w{3,}',text)
#['Hello world', 'nice day', 'you think']
The desired output would be ["Hello world", "nice day", "day today", "today Don't", "Don't you", "you think"]
Can this be done with a simple regex pattern?
import itertools as it
import re

three_pat = re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"

for key, group in it.groupby(text.split(), lambda x: bool(three_pat.match(x))):
    if key:
        group = list(group)
        for i in range(0, len(group) - 1):
            print(' '.join(group[i:i+2]))
# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think
It's not clear to me what you want done with all punctuation. On the one hand, it looks like you want periods to be removed, but single quotation marks to be kept. It would be easy to implement the removal of periods, but before I do, would you clarify what you want to happen to all punctuation?
map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text))
Maybe you can rewrite the lambda to be shorter (e.g. just ''.join).
And BTW ' is not part of \w or \s
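For illustration (a sketch; in Python 3 the map needs to be wrapped in list() to materialize it), on the sample text this should give something like:

import re

text = "Hello world. It is a nice day today. Don't you think so?"
list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)))
# ['Hello world', 'nice day', 'day today', 'you think']

The pairs involving "Don't" are missed precisely because ' is matched by neither \w nor \s.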
Something like this with additional checks for list boundaries should do:
>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>>
There are two problems with your approach:
Neither \w nor \s matches punctuation.
When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.
To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.
But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.
re.findall(r'\w{3,}(?=\s{1,}\w{3,})', text)
Here the (?=...) part is the lookahead assertion.
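For example, on the sample text this should return just the first word of each qualifying pair:

import re

text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}(?=\s{1,}\w{3,})', text)
# ['Hello', 'nice', 'day', 'you']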
