Non-consuming regular expression split in Python - python

How can a string be split on a separator expression while leaving that separator on the preceding string?
>>> text = "This is an example. Is it made up of more than one sentence? Yes, it is."
>>> re.split(r"[.?!] ", text)
['This is an example', 'Is it made up of more than one sentence', 'Yes, it is.']
I would like the result to be:
['This is an example.', 'Is it made up of more than one sentence?', 'Yes, it is.']
So far I have only tried a lookahead assertion, but this fails to split at all.

>>> re.split(r"(?<=[.?!]) ", text)
['This is an example.', 'Is it made up of more than one sentence?', 'Yes, it is.']
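The crucial thing is the look-behind assertion, (?<=...): it requires the punctuation to come right before the space without making it part of the match, so only the space is removed by the split.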
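A minimal sketch of the difference, assuming Python 3 and only the standard re module (the variable names are illustrative): the consuming pattern removes the punctuation together with the space, while the look-behind version removes the space alone.

```python
import re

text = "This is an example. Is it made up of more than one sentence? Yes, it is."

# Consuming pattern: the punctuation is part of the match, so it is removed.
consumed = re.split(r"[.?!] ", text)

# Look-behind: only the space is matched; the punctuation stays on the left piece.
kept = re.split(r"(?<=[.?!]) ", text)

print(consumed)
print(kept)
```

Both calls split at the same places; they differ only in how much text the separator match swallows.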

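import re
text = ("This is an example.A particular case.Made up of more "
        "than once sentence?Yes, it is.But no blank !!!That's"
        " a problem ????Yes.I think so! :)")
# Look-behind split: only splits where punctuation is followed by a space
for x in re.split(r"(?<=[.?!]) ", text):
    print(repr(x))
print('\n')
# findall: each match is a run of text plus the punctuation mark that ends it
for x in re.findall(r"[^.?!]*[.?!]|[^.?!]+(?=\Z)", text):
    print(repr(x))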
Result:
"This is an example.A particular case.Made up of more than once sentence?Yes, it is.But no blank !!!That's a problem ????Yes.I think so!"
':)'
'This is an example.'
'A particular case.'
'Made up of more than once sentence?'
'Yes, it is.'
'But no blank !'
'!'
'!'
"That's a problem ?"
'?'
'?'
'?'
'Yes.'
'I think so!'
' :)'
EDIT
Also
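import re
text = ("! This is an example.A particular case.Made up of more "
        "than once sentence?Yes, it is.But no blank !!!That's"
        " a problem ????Yes.I think so! :)")
# The capturing group keeps each punctuation mark as its own list item
res = re.split('([.?!])', text)
# Re-join every piece of text with the punctuation mark that follows it
print([''.join(res[i:i+2]) for i in range(0, len(res), 2)])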
gives
['!', ' This is an example.', 'A particular case.', 'Made up of more than once sentence?', 'Yes, it is.', 'But no blank !', '!', '!', "That's a problem ?", '?', '?', '?', 'Yes.', 'I think so!', ' :)']
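If a run of punctuation such as !!! should stay attached to its sentence instead of producing one-character items, a variant of the findall pattern can be sketched as follows. This is an illustration and not part of the answer above; the pattern, the shortened sample text and the variable names are assumptions.

```python
import re

text = ("This is an example.A particular case.But no blank !!!"
        "That's a problem ????Yes.I think so! :)")

# [^.?!]*[.?!]+  : any text followed by one or more punctuation marks
# [^.?!]+$       : trailing text that has no closing punctuation
pieces = re.findall(r"[^.?!]*[.?!]+|[^.?!]+$", text)

for piece in pieces:
    print(repr(piece))
```

The only change from the earlier pattern is the + after the punctuation class, which lets one match absorb the whole run.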

Related

Split text but include pattern in the first splitted part

This looks very obvious, but I couldn't find anything similar. I want to split some text so that the pattern I split on stays at the end of the preceding part.
import re
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')
splitted_text = pattern.split(some_text)
This returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want is that it returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
By the way, I am only interested in a solution that uses re, not one that relies on a library such as nltk.
It would be simpler and more efficient to use re.findall instead of splitting in this case:
re.findall(r'[^.]*\.', some_text)
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
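One caveat with this findall pattern is that any text after the last period is silently dropped. A small sketch of the effect and of one possible workaround (making the period optional); the second sample string and the variable names are made up for illustration.

```python
import re

complete = "Hi there. It's a nice weather. Have a great day."
unfinished = "Hi there. Bye"

# The original pattern requires a closing period, so ' Bye' is lost.
strict = re.findall(r"[^.]*\.", unfinished)

# Making the period optional keeps the unterminated tail as well.
lenient = re.findall(r"[^.]+\.?", unfinished)

# On fully terminated text the lenient pattern gives the same result as before.
same_as_before = re.findall(r"[^.]+\.?", complete)

print(strict)
print(lenient)
print(same_as_before)
```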
You can use capture groups with re.split:
>>> re.split(r'([^.]+\.)', some_text)
['', 'Hi there.', '', " It's a nice weather.", '', ' Have a great day.', '']
If you also want to strip the leading spaces from the last two sentences, you can put \s* outside the capture group:
>>> re.split(r'([^.]+\.)\s*', some_text)
['', 'Hi there.', '', "It's a nice weather.", '', 'Have a great day.', '']
Or, with Python 3.7+ (or with the third-party regex module), use a zero-width lookbehind that splits immediately after each .:
>>> re.split(r'(?<=\.)', some_text)
['Hi there.', " It's a nice weather.", ' Have a great day.', '']
This splits the same way even if there is no space after the period.
And you can filter the '' fields to remove the blank results from splitting:
>>> [field for field in re.split(r'([^.]+\.)', some_text) if field]
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can split on the whitespace with a lookbehind to account for the period. Additionally, to account for the possibility of no whitespace, a lookahead can be used:
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautify day.']
Explanation of the regular expression:
(?<=\.) => positive lookbehind: a . must come immediately before the position where the following part matches.
\s => matches a single whitespace character.
| => alternation: at each position the expression on the left is tried first, then the one on the right.
\. => matches a period.
(?=[A-Z]) => positive lookahead: the preceding period matches only if the next character is a capital letter. That period is consumed by the split, which is why 'Have a great day' has no period in the output.
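If the period before a capital letter should be kept as well, one possible variant is to make the whole separator zero-width apart from optional whitespace. This is only a sketch: it assumes Python 3.7 or later, because earlier versions of re.split do not split on empty matches, and the pattern is my own rather than the one from the answer above.

```python
import re

some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."

# Split after a period (look-behind), swallow optional whitespace,
# and require a capital letter next (look-ahead). No period is consumed.
result = re.split(r"(?<=\.)\s*(?=[A-Z])", some_text)
print(result)
```

Requiring a capital letter is a heuristic, so a sentence that starts with a lowercase letter or a digit would not be split off.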
If each sentence always ends with a ., it would be simpler and more efficient to use the str.split method instead of using any regular expression at all:
[s + '.' for s in some_text.split('.') if s]
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']

Python - capture all string between start and end of the string

How do I capture all the strings that lie between given start and end characters into a list?
Here is what I tried:
import re
sequence = "This is start #\n hello word #\n #\n my code#\n this is end"
query = '#\n'
r = re.compile(query)
findall = re.findall(query,sequence)
print(findall)
This gives:
['#\n', '#\n', '#\n', '#\n']
Looking for output like:
[' hello word ',' my code']
Simple split() would be enough:
sequence = "This is start #\n hello word #\n #\n my code#\n this is end"
parts = sequence.split("#\n")[1:-1] # discard 1st and last because it is not between #\n
print(parts)
This gives you the following (the first and last parts are discarded because they are not between two '#\n'):
[' hello word ', ' ', ' my code'] # ' ' is strictly also between two #\n
You can clean this up:
# remove spaces and "empty" hits if it is only whitespace
mod_parts = [p.strip() for p in parts if p.strip()]
print(mod_parts)
to get to:
['hello word', 'my code']
or in short:
shorter = [x.strip() for x in sequence.split("#\n")[1:-1] if x.strip()]
Try:
print(re.findall("#\n(.*?)#\n", sequence))
The regex captures, non-greedily, anything between two '#\n'. Because the closing '#\n' is consumed by the match, it is not reused as the opening of the next capture. If you want '#\n' to act as a plain delimiter instead (like split()), you can use a lookahead:
print(re.findall("#\n(.*?)(?=#\n)", sequence))
and in which case the output will be
[' hello word ', ' ', ' my code']
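The two patterns can be compared side by side. This is a small sketch using the sequence from the question; the variable names are illustrative. Note that . does not match a newline by default, so fields that themselves span several lines would need re.DOTALL.

```python
import re

sequence = "This is start #\n hello word #\n #\n my code#\n this is end"

# The closing marker is consumed, so it cannot open the next match.
consuming = re.findall(r"#\n(.*?)#\n", sequence)

# The closing marker is only looked at, so it is reused as the next opener.
reusing = re.findall(r"#\n(.*?)(?=#\n)", sequence)

print(consuming)
print(reusing)
```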
In this case, it would be better to just use the string method .split() and pass it #\n as the separator. You can filter out empty or whitespace-only pieces by testing s.strip(). If for some reason you don't want the first and last portions, you can use the slice [1:-1] to remove them.
sequence = "This is start #\n hello word #\n #\n my code#\n this is end"
print(sequence.split("#\n"))
# ['This is start ', ' hello word ', ' ', ' my code', ' this is end']
print([s.strip() for s in sequence.split("#\n") if s.strip()])
# ['This is start', 'hello word', 'my code', 'this is end']
print([s.strip() for s in sequence.split("#\n") if s.strip()][1:-1])
# ['hello word', 'my code']
As Brian suggested, you can use the split function. However, if you treat the start and end patterns like a pair of parentheses, the correct way to find the tokens is:
print([s.strip() for s in sequence.split("#\n")][1:-1:2])
This simply skips the strings that lie between one end marker and the following start marker. For example, if the input is
sequence = "This is start #\n hello word #\n BETWEEN END1 AND START2 #\n my code#\n this is end"
the term BETWEEN END1 AND START2 should not be captured, so the correct output is:
['hello word', 'my code']
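The slicing idea can be wrapped in a small helper. This is a hedged sketch: the function name between_markers and its default argument are made up for illustration, and it assumes the markers come in start/end pairs as described above.

```python
def between_markers(text, marker="#\n"):
    """Return the stripped pieces that sit between marker pairs."""
    pieces = text.split(marker)
    # pieces[0] is before the first marker; odd indices are inside a pair,
    # even indices are between one pair's end and the next pair's start.
    # Stopping at -1 also drops a piece after an unmatched final marker.
    return [piece.strip() for piece in pieces[1:-1:2]]


sequence = ("This is start #\n hello word #\n BETWEEN END1 AND START2 "
            "#\n my code#\n this is end")
print(between_markers(sequence))
```

Whether the text between an end marker and the next start marker should be skipped depends on how the markers are meant to be read, so this is only appropriate for the paired interpretation.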
You could use
#\n([\s\S]+?)#\n
As in
import re
rx = re.compile(r'#\n([\s\S]+?)#\n')
text = """This is start #
 hello word #
 #
 my code#
 this is end"""
matches = rx.findall(text)
print(matches)
This yields
[' hello word ', ' my code']
See a demo for the expression on regex101.com.

Split a string into its sentences using python

I have this following string:
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
Now, I want to split it into two sentences.
However, when I do:
string.split('.')
I get:
['This is one sentence ${w_{1},',
'',
',w_{i}}$',
' This is another sentence',
' ']
Does anyone have an idea how to improve this so that the "." inside the $ $ is not treated as a sentence end?
Also, how would you go about this:
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
EDIT 1:
The desired outputs would be:
For string 1:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence']
For string 2:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe ! ']
For the more general case, you could use re.split like so:
import re
mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']
The characters in the brackets are the punctuation marks you choose to split on, and \s{1,} requires at least one whitespace character after them, so the other .'s, which are not followed by whitespace, are ignored. This also handles your exclamation point case.
Here's a (somewhat hacky) way to get the punctuation back:
punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '! ']
sent = [x+y for x,y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
# ['This is one sentence ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe ! ']
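A less hacky alternative, sketched here under the same assumption that a sentence end is punctuation followed by whitespace, is to split on the whitespace alone and use a lookbehind for the punctuation. The variable name sentences and the call to strip() are my additions.

```python
import re

str2 = ('This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. '
        'Is this a sentence? Maybe ! ')

# Split on whitespace that follows ., ! or ?; the punctuation itself
# is only looked at, so it stays on the sentence.
sentences = re.split(r"(?<=[.!?])\s+", str2.strip())
print(sentences)
```

Stripping the string first avoids an empty item caused by the trailing space.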
You can use re.findall with an alternation pattern. To ensure that each sentence starts and ends with a non-whitespace character, use a positive lookahead at the start and a positive lookbehind at the end. The \$.*?\$ alternative consumes a whole $...$ span at once, so the periods inside it are not treated as sentence ends:
re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)
This returns, for the first string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
and for the second string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
Use '. ' (a period followed by a space) as the separator: in this input it occurs only at the end of a sentence, not inside the formula.
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string.split('. ')
this returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']

How to split a string on commas or periods in nltk

I want to separate a string on commas and/or periods in nltk. I've tried with sent_tokenize() but it separates only on periods.
I've also tried this code
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
class CommaPoint(PunktLanguageVars):
    # Treat commas as sentence-ending characters too
    sent_end_chars = ('.', '?', '!', ',')
tokenizer = PunktSentenceTokenizer(lang_vars=CommaPoint())
n_w = tokenizer.tokenize(ex_sent)
print(n_w)
The output for the code above is
['This is an example showing sentence filtration.This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.\n']
When a '.' is not followed by a space, the tokenizer treats it as part of a word and does not split there.
I want the output as
['This is an example showing sentence filtration.', 'This is how it is done,', 'in case of Python I want to learn more.', 'So,', 'that i can have some experience over it,', 'by it I mean python.']
How about something simpler with re:
>>> import re
>>> sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
>>> re.split(r'[.,]', sent)
['This is an example showing sentence filtration', 'This is how it is done', ' in case of Python I want to learn more', ' So', ' that i can have some experience over it', ' by it I mean python', '']
To keep the delimiter, you can use group:
>>> re.split(r'([.,])', sent)
['This is an example showing sentence filtration', '.', 'This is how it is done', ',', ' in case of Python I want to learn more', '.', ' So', ',', ' that i can have some experience over it', ',', ' by it I mean python', '.', '']
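To get from there to the list asked for in the question, with each delimiter attached to the text before it, one option is re.findall instead of re.split. A minimal sketch; the pattern and the name chunks are my own.

```python
import re

sent = ("This is an example showing sentence filtration.This is how it is done, "
        "in case of Python I want to learn more. So, that i can have some "
        "experience over it, by it I mean python.")

# Each match is a run of non-delimiter characters plus the delimiter that ends it.
chunks = [chunk.strip() for chunk in re.findall(r"[^.,]+[.,]", sent)]
print(chunks)
```

Text after the last delimiter, if any, would not be matched by this pattern.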
In this case you could replace all commas with periods in the string and then tokenize it:
from nltk.tokenize import sent_tokenize
ex_sent = "This is an example showing sentence filtration.This is how it is done, in case of Python I want to learn more. So, that i can have some experience over it, by it I mean python."
ex_sent = ex_sent.replace(",", ".")
n_w = sent_tokenize(ex_sent, 'english')
print(n_w)

Split string with delimiters in Python

I have a string such as this:
"[greeting] Hello [me] my name is John."
I want to split it and get a result like this:
('[greeting]', 'Hello', '[me]', 'my name is John')
Can it be done in one line of code?
Here is another example, since it seems that many misunderstood the question:
"[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."
From this I should get
('[greeting]', ' Hello my friends ', '[me]', ' my name is John. ', '[bow]', ' nice to meet you.')
I basically want to send this kind of string to my robot. It will decompose it automatically, perform the motions corresponding to [greeting], [me] and [bow], and speak the other strings in between.
Using regex:
>>> import re
>>> s = "[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."
>>> re.findall(r'\[[\w\s.]+\]|[\w\s.]+', s)
['[greeting]', ' Hello my friends ', '[me]', ' my name is John. ', '[bow]', ' nice to meet you.']
Edit:
>>> s = "I can't see you"
>>> re.findall(r'\[.*?\]|.*?(?=\[|$)', s)[:-1]
["I can't see you"]
>>> s = "[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."
>>> re.findall(r'\[.*?\]|.*?(?=\[|$)', s)[:-1]
['[greeting]', ' Hello my friends ', '[me]', ' my name is John. ', '[bow]', ' nice to meet you.']
The function you're after is split(). It accepts a delimiter as its argument and returns a list made by splitting the string at every occurrence of the delimiter. str.split() only accepts one fixed delimiter, so to split on either "[" or "]" you should use re.split() with a regular expression:
import re
s = "[greeting] Hello [me] my name is John."
re.split(r"\]|\[", s)
# returns ['', 'greeting', ' Hello ', 'me', ' my name is John.']
The pattern reads as follows:
\] # escape the right bracket
| # OR
\[ # escape the left bracket
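Splitting on the brackets alone loses the information about which pieces were tags. If the brackets should be kept, as in the question, a capturing group around the whole tag makes re.split return the tags as well. A minimal sketch; the pattern and the name parts are my own.

```python
import re

s = "[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."

# The capturing group makes re.split return the bracketed tags too;
# the filter drops the empty string before the leading tag.
parts = tuple(p for p in re.split(r"(\[[^\]]*\])", s) if p)
print(parts)
```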
I don't think it can be done in one line; you first need to split by ], then by [:
# Run in the python shell
sentence = "[greeting] Hello [me] my name is John."
for part in sentence.split(']'):
    print(part.split('['))
# Output
['', 'greeting']
[' Hello ', 'me']
[' my name is John.']
