Split a string into its sentences using python

I have this following string:
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
Now, I want to split it into two sentences.
However, when I do:
string.split('.')
I get:
['This is one sentence ${w_{1},',
'',
',w_{i}}$',
' This is another sentence',
' ']
Does anyone have an idea of how to improve this, so that the "." inside the $ $ delimiters is not detected as a sentence boundary?
Also, how would you go about this:
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
EDIT 1:
The desired outputs would be:
For string 1:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence']
For string 2:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe ! ']

For the more general case, you could use re.split like so:
import re
mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']
The characters in the brackets are what you pick as your sentence-ending punctuation, and requiring at least one trailing whitespace character (\s{1,}, equivalent to \s+) ignores the other .'s, which have no spacing after them. This also handles your exclamation-point case.
Here's a (somewhat hacky) way to get the punctuation back:
punct = re.findall(r"[.!?]\s{1,}", str2)
punct
['! ', '. ', '? ', '! ']
sent = [x+y for x, y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
['This is one sentence ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe ! ']
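A less hacky alternative, assuming the same punctuation set: capture the punctuation in the split pattern itself, so re.split keeps it in the result list and each sentence can be rejoined with its delimiter in one pass (a sketch, not from the answer above):

```python
import re

str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '

# Capturing ([.!?]) keeps each matched delimiter in the result list,
# alternating with the sentence bodies.
parts = re.split(r"([.!?])\s+", str2)
sentences = [body + punct for body, punct in zip(parts[0::2], parts[1::2])]
print(sentences)
# ['This is one sentence ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```

Because zip stops at the shorter sequence, the empty chunk after the final delimiter is dropped automatically.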

You can use re.findall with an alternation pattern. To ensure that each sentence starts and ends with a non-whitespace character, use a positive lookahead pattern at the start and a positive lookbehind pattern at the end:
re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)
This returns, for the first string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
and for the second string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
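For reference, here is a self-contained run of that pattern against both example strings (note the $ delimiters must be escaped as \$; an unescaped $ is an end-of-string anchor):

```python
import re

pattern = r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]'

string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '

# \$.*?\$ swallows anything between literal $...$, so the dots inside are skipped.
print(re.findall(pattern, string))
print(re.findall(pattern, string2))
```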

Use '. ' (with a space after the .) because that only exists when a sentence ends, not mid-sentence.
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string.split('. ')
This returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
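If the trailing empty string is unwanted, it can be filtered out, e.g.:

```python
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '

# The trailing '. ' produces an empty chunk; keep only non-empty pieces.
sentences = [s for s in string.split('. ') if s]
print(sentences)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
```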

Related

How to split string in numpy.ndarray?

I have a lot of text in a numpy.ndarray that looks like this:
['This is example sentence 1.|This is example sentence 2.'
'This is example sentence 3.'
'This is example sentence 4.'
'This is example sentence 5.'
'This is example sentence 6.|This is example sentence 7.|This is example sentence 8.|This is example sentence 9.|This is example sentence 10.']
The array can have a large and varying number of elements and individual elements can have many sentences separated with "|".
How do I convert the example above into this:
['This is example sentence 1.'
'This is example sentence 2.'
'This is example sentence 3.'
'This is example sentence 4.'
'This is example sentence 5.'
'This is example sentence 6.'
'This is example sentence 7.'
'This is example sentence 8.'
'This is example sentence 9.'
'This is example sentence 10.']
Basically, I'm trying to create a 1-dimensional array that will split elements with "|" into their own separate elements. I've tried many versions of split and can't get them to work for one reason or another.
Thanks!
You can try np.char.split:
# np.concatenate or np.hstack
>>> np.concatenate(np.char.split(arr.astype(str), sep='|'))
array(['This is example sentence 1.', 'This is example sentence 2.',
'This is example sentence 3.', 'This is example sentence 4.',
'This is example sentence 5.', 'This is example sentence 6.',
'This is example sentence 7.', 'This is example sentence 8.',
'This is example sentence 9.', 'This is example sentence 10.'],
dtype='<U28')
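Since the per-element work is plain string splitting, a list comprehension over the array is an equivalent alternative (a sketch using a shortened version of the example array):

```python
import numpy as np

arr = np.array(['This is example sentence 1.|This is example sentence 2.',
                'This is example sentence 3.',
                'This is example sentence 4.|This is example sentence 5.'])

# Split each element on '|' and flatten the pieces into one 1-D array.
flat = np.array([s for element in arr for s in element.split('|')])
print(flat)
```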

Split text but include pattern in the first splitted part

This looks very obvious, but I couldn't find anything similar. I want to split some text and have the pattern of the split condition be part of the preceding split part.
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')
splitted_text = pattern.split(some_text)
returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want is that it returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
btw: I am only interested in an re solution, not in an nltk library that does it with other methods.
It would be simpler and more efficient to use re.findall instead of splitting in this case:
re.findall(r'[^.]*\.', some_text)
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can use capture groups with re.split:
>>> re.split(r'([^.]+\.)', some_text)
['', 'Hi there.', '', " It's a nice weather.", '', ' Have a great day.', '']
If you also want to strip the leading spaces from the last two sentences, you can put the \s* outside the capture group:
>>> re.split(r'([^.]+\.)\s*', some_text)
['', 'Hi there.', '', "It's a nice weather.", '', 'Have a great day.', '']
Or (with Python 3.7+, or with the regex module) use a zero-width lookbehind that will split immediately after a .:
>>> re.split(r'(?<=\.)', some_text)
['Hi there.', " It's a nice weather.", ' Have a great day.', '']
That will split the same way even if there is no space after the ..
And you can filter the '' fields to remove the blank results from splitting:
>>> [field for field in re.split(r'([^.]+\.)', some_text) if field]
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can split on the whitespace with a lookbehind to account for the period. Additionally, to account for the possibility of no whitespace, a lookahead can be used:
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautify day.']
re explanation:
(?<=\.) => positive lookbehind: a . must precede for the rest of the pattern to match.
\s => matches a whitespace character.
| => alternation: matches either the expression on its left or the one on its right.
\. => matches a period.
(?=[A-Z]) => positive lookahead: matches the latter period only if the next character is a capital letter.
If each sentence always ends with a ., it would be simpler and more efficient to use the str.split method instead of using any regular expression at all:
[s + '.' for s in some_text.split('.') if s]
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']

How do I nextline after a dot in a set of next lined words in .txt file

I have a problem with my code. I have a text file, and inside this text file are thousands of tabbed/newlined words which came from sentences. My problem is that I want to turn the words inside this text file back into sentences.
I have thought of a way: a for loop statement that, when it hits the dot ., stores the sentence inside a list.
with open('test', 'r') as f:
    sentence = []
    sentences = []
    for word in f:
        word = word.strip()
        if word != ".":
            sentence.append(word)
        else:
            sentence.append(word)
            sentences.append(sentence)
            sentence = []
#Sample output
#[['This', 'is', 'a', 'sentence', '.'], ['This', 'is', 'the', 'second', 'sentence', '.'],
#['This', 'is', 'the', 'third', 'sentence', '.']],
#This is the text file
This
is
a
sentence
.
This
is
the
second
sentence
.
This
is
thr
third
sentence
.
The code kinda works, but it's a little complicated. I'm looking for a much shorter and less complicated approach. Thank you in advance.
This is pretty straightforward: read from the file, split into chunks on the period, split each chunk on whitespace, rejoin with single spaces, and put the period back at the end of each sentence.
sentences = [' '.join(x.split()) + '.' for x in open('test', 'r').read().split('.') if x.strip()]  # skip the empty chunk after the final period
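A runnable sketch of the same idea, with the file contents inlined for illustration and the empty chunk after the final period skipped:

```python
# Stand-in for open('test').read(): one word per line, '.' on its own line.
text = """This
is
a
sentence
.
This
is
the
second
sentence
.
"""

# Split on periods, collapse the newline-separated words with single spaces,
# and re-attach the period; skip the empty chunk after the final '.'.
sentences = [' '.join(chunk.split()) + '.' for chunk in text.split('.') if chunk.strip()]
print(sentences)
# ['This is a sentence.', 'This is the second sentence.']
```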
You could use str.split().
For example:
text = 'First sentence. Second sentence. This is the third sentence. '
text.split('. ')[:-1]
>>> ['First sentence', 'Second sentence', 'This is the third sentence']
If you want to include the . you have to do it like this:
text = 'First sentence. Second sentence. This is the third sentence. '
split_text = [e+'.' for e in text.split('. ')][:-1]
split_text
>>> ['First sentence.', 'Second sentence.', 'This is the third sentence.']
Below is a one-liner for the same; note that str.split takes a literal separator (not a regex), so use '.' rather than '\.'. Let me know if you need more help:
sentences = open('test', 'r').read().split('.')

How to split a string on commas or periods in nltk

I want to separate a string on commas and/or periods in nltk. I've tried with sent_tokenize() but it separates only on periods.
I've also tried this code
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
class CommaPoint(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', ',')

tokenizer = PunktSentenceTokenizer(lang_vars=CommaPoint())
n_w = tokenizer.tokenize(ex_sent)
print(n_w)
The output for the code above is
['This is an example showing sentence filtration.This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.\n']
When I give '.' without any space after it, it is taken as part of the word.
I want the output as
['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']
How about something simpler with re:
>>> import re
>>> sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
>>> re.split(r'[.,]', sent)
['This is an example showing sentence filtration', 'This is how it is done', ' in case of Python I want to learn more', ' So', ' that i can have some experience over it', ' by it I mean python', '']
To keep the delimiter, you can use group:
>>> re.split(r'([.,])', sent)
['This is an example showing sentence filtration', '.', 'This is how it is done', ',', ' in case of Python I want to learn more', '.', ' So', ',', ' that i can have some experience over it', ',', ' by it I mean python', '.', '']
In this case you could replace all commas with dots in the string and then tokenize it:
from nltk.tokenize import sent_tokenize

ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
ex_sent = ex_sent.replace(",", ".")
n_w = sent_tokenize(ex_sent, 'english')
print(n_w)

Non-consuming regular expression split in Python

How can a string be split on a separator expression while leaving that separator on the preceding string?
>>> text = "This is an example. Is it made up of more than once sentence? Yes, it is."
>>> re.split(r"[\.\?!] ", text)
['This is an example', 'Is it made up of more than once sentence', 'Yes, it is.']
I would like the result to be:
['This is an example.', 'Is it made up of more than once sentence?', 'Yes, it is.']
So far I had only tried a lookahead assertion, which failed to split at all; a lookbehind assertion, however, does the trick:
>>> re.split(r"(?<=[\.\?!]) ", text)
['This is an example.', 'Is it made up of more than once sentence?', 'Yes, it is.']
The crucial thing is the use of a look-behind assertion with ?<=.
import re
text = "This is an example.A particular case.Made up of more "\
       "than once sentence?Yes, it is.But no blank !!!That's"\
       " a problem ????Yes.I think so! :)"

for x in re.split(r"(?<=[\.\?!]) ", text):
    print(repr(x))

print()

for x in re.findall(r"[^.?!]*[.?!]|[^.?!]+(?=\Z)", text):
    print(repr(x))
Result:
"This is an example.A particular case.Made up of more than once sentence?Yes, it is.But no blank !!!That's a problem ????Yes.I think so!"
':)'
'This is an example.'
'A particular case.'
'Made up of more than once sentence?'
'Yes, it is.'
'But no blank !'
'!'
'!'
"That's a problem ?"
'?'
'?'
'?'
'Yes.'
'I think so!'
' :)'
EDIT
Also
import re
text = "! This is an example.A particular case.Made up of more "\
       "than once sentence?Yes, it is.But no blank !!!That's"\
       " a problem ????Yes.I think so! :)"

res = re.split(r'([.?!])', text)
print([''.join(res[i:i + 2]) for i in range(0, len(res), 2)])
gives
['!', ' This is an example.', 'A particular case.', 'Made up of more than once sentence?', 'Yes, it is.', 'But no blank !', '!', '!', "That's a problem ?", '?', '?', '?', 'Yes.', 'I think so!', ' :)']
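On Python 3, the same pairing trick can be written with range; stripping whitespace and the empty leftover gives clean sentences (a sketch with a simpler example string):

```python
import re

text = "This is an example. Is it made up of more than one sentence? Yes, it is."

res = re.split(r'([.?!])', text)
# res alternates sentence bodies and captured delimiters; join them pairwise,
# skipping the empty body left after the final delimiter.
sentences = [''.join(res[i:i + 2]).strip() for i in range(0, len(res), 2) if res[i].strip()]
print(sentences)
# ['This is an example.', 'Is it made up of more than one sentence?', 'Yes, it is.']
```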
