Split string with delimiters in Python - python

I have a string like this:
"[greeting] Hello [me] my name is John."
I want to split it and get this result:
('[greeting]', 'Hello', '[me]', 'my name is John.')
Can it be done in one line of code?
OK, another example, as it seems that many misunderstood the question.
"[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."
Then I should get:
('[greeting]', ' Hello my friends ', '[me]', ' my name is John. ', '[bow]', ' nice to meet you.')
I basically want to send this kind of string to my robot. It will automatically decompose it, perform the motions corresponding to [greeting], [me] and [bow], and speak the other strings in between.

Using regex:
>>> import re
>>> s = "[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."
>>> re.findall(r'\[[\w\s.]+\]|[\w\s.]+', s)
['[greeting]', ' Hello my friends ', '[me]', ' my name is John. ', '[bow]', ' nice to meet you.']
Edit:
>>> s = "I can't see you"
>>> re.findall(r'\[.*?\]|.*?(?=\[|$)', s)[:-1]
["I can't see you"]
>>> s = "[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."
>>> re.findall(r'\[.*?\]|.*?(?=\[|$)', s)[:-1]
['[greeting]', ' Hello my friends ', '[me]', ' my name is John. ', '[bow]', ' nice to meet you.']
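Since the goal is to drive a robot, it may help to label each piece as it is extracted. This is a minimal sketch, assuming tags never nest and never contain "]". The function name tokenize and the 'motion'/'speech' labels are made up for illustration:

```python
import re

# Either a bracketed tag or a run of text containing no "[".
TOKEN = re.compile(r'\[[^\]]*\]|[^\[]+')

def tokenize(s):
    """Return (kind, text) pairs; kind is 'motion' for [tags], 'speech' otherwise."""
    tokens = []
    for piece in TOKEN.findall(s):
        if piece.startswith('['):
            tokens.append(('motion', piece[1:-1]))
        elif piece.strip():  # skip whitespace-only gaps between tags
            tokens.append(('speech', piece.strip()))
    return tokens

print(tokenize("[greeting] Hello my friends [me] my name is John. [bow] nice to meet you."))
```

The robot loop could then dispatch on the first element of each pair.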

The function you're after is split(). str.split() accepts a delimiter as its argument and returns a list made by splitting the string at every occurrence of that delimiter. To split on either "[" or "]", use re.split() with a regular expression instead:
import re
s = "[greeting] Hello [me] my name is John."
re.split(r"\]|\[", s)
# returns ['', 'greeting', ' Hello ', 'me', ' my name is John.']
The pattern breaks down as follows:
\] # escape the right bracket
| # OR
\[ # escape the left bracket
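Note that re.split() drops the brackets. If the tags should stay intact, as in the question, a capturing group around the whole tag keeps them in the result. A sketch, assuming tags do not nest:

```python
import re

s = "[greeting] Hello [me] my name is John."
# A capturing group makes re.split() return the delimiters too.
parts = [p for p in re.split(r'(\[[^\]]*\])', s) if p]
print(parts)
# ['[greeting]', ' Hello ', '[me]', ' my name is John.']
```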

I don't think it can be done with a single str.split() call; you need to split by ] first, then by [:
# Run in the Python shell
sentence = "[greeting] Hello [me] my name is John."
for part in sentence.split(']'):
    print(part.split('['))
# Output
['', 'greeting']
[' Hello ', 'me']
[' my name is John.']
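For what it's worth, the two splits can be folded into one line with a nested comprehension. This is only a sketch of the same idea, and like the loop above it drops the brackets:

```python
sentence = "[greeting] Hello [me] my name is John."
# Split on "]" first, then split each chunk on "[" and flatten the result.
flat = [piece for chunk in sentence.split(']') for piece in chunk.split('[')]
print(flat)
# ['', 'greeting', ' Hello ', 'me', ' my name is John.']
```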

Related

Split text but include pattern in the first splitted part

This looks very obvious, but I couldn't find anything similar. I want to split some text so that the pattern used as the split condition stays at the end of the preceding part.
some_text = "Hi there. It's a nice weather. Have a great day."
pattern = re.compile(r'\.')
splitted_text = pattern.split(some_text)
returns:
['Hi there', " It's a nice weather", ' Have a great day', '']
What I want is that it returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
By the way, I am only interested in a solution using re, not a library such as nltk that does this by other means.
It would be simpler and more efficient to use re.findall instead of splitting in this case:
re.findall(r'[^.]*\.', some_text)
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can use capture groups with re.split:
>>> re.split(r'([^.]+\.)', some_text)
['', 'Hi there.', '', " It's a nice weather.", '', ' Have a great day.', '']
If you want to also strip the leading spaces from the second two sentences, you can have \s* outside the capture group:
>>> re.split(r'([^.]+\.)\s*', some_text)
['', 'Hi there.', '', "It's a nice weather.", '', 'Have a great day.', '']
Or, (with Python 3.7+ or with the regex module) use a zero width lookbehind that will split immediately after a .:
>>> re.split(r'(?<=\.)', some_text)
['Hi there.', " It's a nice weather.", ' Have a great day.', '']
That splits the same way even if there is no space after the period.
And you can filter the '' fields to remove the blank results from splitting:
>>> [field for field in re.split(r'([^.]+\.)', some_text) if field]
['Hi there.', " It's a nice weather.", ' Have a great day.']
You can split on the whitespace with a lookbehind to account for the period. Additionally, to account for the possibility of no whitespace, a lookahead can be used:
import re
some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
result = re.split(r'(?<=\.)\s|\.(?=[A-Z])', some_text)
Output:
['Hi there.', "It's a nice weather.", 'Have a great day', 'It is a beautify day.']
Regex explanation:
(?<=\.) => positive lookbehind: a . must immediately precede the position where the next token matches.
\s => matches a single whitespace character.
| => alternation: matches either the expression to its left or the one to its right, trying the left side first.
\. => matches a period.
(?=[A-Z]) => positive lookahead: the preceding period matches only if the next character is a capital letter. Note that this period is consumed by the split, which is why 'Have a great day' loses its trailing period.
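If the period should survive in every sentence, one option is to split on a zero-width lookbehind followed by optional whitespace, so that nothing but whitespace is consumed. A sketch, assuming Python 3.7+ (where re.split() can split on a pattern that matches an empty string):

```python
import re

some_text = "Hi there. It's a nice weather. Have a great day.It is a beautify day."
# Split right after every period, swallowing any whitespace that follows it.
sentences = [s for s in re.split(r'(?<=\.)\s*', some_text) if s]
print(sentences)
# ['Hi there.', "It's a nice weather.", 'Have a great day.', 'It is a beautify day.']
```

The filter removes the empty string that the split produces after the final period.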
If each sentence always ends with a ., it would be simpler and more efficient to use the str.split method instead of using any regular expression at all:
[s + '.' for s in some_text.split('.') if s]
This returns:
['Hi there.', " It's a nice weather.", ' Have a great day.']

Python - capture all string between start and end of the string

How do I capture all the strings into a list, given start and end characters?
Here is what I tried:
import re
sequence = "This is start #\n hello word #\n #\n my code#\n this is end"
query = '#\n'
r = re.compile(query)
findall = re.findall(query,sequence)
print(findall)
This gives:
['#\n', '#\n', '#\n', '#\n']
Looking for output like:
[' hello word ',' my code']
Simple split() would be enough:
sequence = "This is start #\n hello word #\n #\n my code#\n this is end"
parts = sequence.split("#\n")[1:-1] # discard 1st and last because it is not between #\n
print(parts)
This will give you (the first and last parts are discarded because they are not between two '#\n'):
[' hello word ', ' ', ' my code'] # ' ' is strictly also between two #\n
You can clean this up:
# remove spaces and "empty" hits if it is only whitespace
mod_parts = [p.strip() for p in parts if p.strip()]
print(mod_parts)
to get to:
['hello word', 'my code']
or, in short:
shorter = [x.strip() for x in sequence.split("#\n")[1:-1] if x.strip()]
Try:
print(re.findall("#\n(.*?)#\n", sequence))
The regex captures (non-greedily) anything between two '#\n', but each '#\n' is consumed by the match and is never reused for the next capture. If you want it to act as a delimiter (like split()), you can use a lookahead instead:
print(re.findall("#\n(.*?)(?=#\n)", sequence))
in which case the output will be
[' hello word ', ' ', ' my code']
In this case, it would be better to just use the string function .split() and pass it #\n as what you want to split on. You can check for the length using s.strip() and filter out empty lines. If for some reason you don't want the first and last portions, you can use slices [1:-1] to remove them.
sequence = "This is start #\n hello word #\n #\n my code#\n this is end"
print(sequence.split("#\n"))
# ['This is start ', ' hello word ', ' ', ' my code', ' this is end']
print([s.strip() for s in sequence.split("#\n") if s.strip()])
# ['This is start', 'hello word', 'my code', 'this is end']
print([s.strip() for s in sequence.split("#\n") if s.strip()][1:-1])
# ['hello word', 'my code']
Just as Brian suggested, you can use the split function. However, if you treat the start and end patterns like parentheses, the correct way to find the tokens is:
print([s.strip() for s in sequence.split("#\n")][1:-1:2])
It simply skips the strings between an end marker and the following start marker. For example, if the input is
sequence = "This is start #\n hello word #\n BETWEEN END1 AND START2 #\n my code#\n this is end"
the term BETWEEN END1 AND START2 should not be captured; so, the correct output is:
['hello word', 'my code']
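The two readings of the question (markers as plain delimiters versus markers as start/end pairs) can be put side by side. A minimal sketch; the helper name between_markers and its paired flag are made up for illustration:

```python
def between_markers(text, marker="#\n", paired=False):
    """Return the stripped, non-empty chunks found between markers.

    paired=False: every marker is a delimiter (plain split).
    paired=True:  markers alternate start/end, so every other chunk is skipped.
    """
    inner = text.split(marker)[1:-1]   # drop text before the first / after the last marker
    if paired:
        inner = inner[::2]             # keep only chunks that follow a "start" marker
    return [chunk.strip() for chunk in inner if chunk.strip()]

sequence = "This is start #\n hello word #\n BETWEEN END1 AND START2 #\n my code#\n this is end"
print(between_markers(sequence))               # ['hello word', 'BETWEEN END1 AND START2', 'my code']
print(between_markers(sequence, paired=True))  # ['hello word', 'my code']
```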
You could use
#\n([\s\S]+?)#\n
As in
import re
rx = re.compile(r'#\n([\s\S]+?)#\n')
text = """This is start #
 hello word #
 #
 my code#
 this is end"""
matches = rx.findall(text)
print(matches)
This yields
[' hello word ', ' my code']
See a demo for the expression on regex101.com.

Stripping out \\n plus whitespace using .strip() and regex is not working

I've been attempting to strip out the \n plus the whitespace before and after the words from a string, but it is not working for some reason.
This is what I tried:
my_string.strip()
and
re.sub('\n', '', my_string)
I have tried using .strip and re in order to get it working, but it simply returns the same string.
Example input:
\\n The people who steal our cards already know all of this...\\n
\\n , \\n I\'m sure every fraud minded person in America is taking notes.\\n
\\n
Expected output would be:
The people who steal our cards already know all of this..., I\'m sure every fraud minded person in America is taking notes.
You're probably looking for something like this:
re.sub(r'\s+', r' ', x)
A usage example follows:
In [10]: x
Out[10]: 'hello \n world \n blue'
In [11]: re.sub(r'\s+', r' ', x)
Out[11]: 'hello world blue'
If you'd also like to match the literal two-character sequence \n (a backslash followed by n), add it to the pattern:
re.sub(r'(\s|\\n)+', r' ', x)
And the output:
In [14]: x
Out[14]: 'hello \\n world \n \\n blue'
In [15]: re.sub(r'(\s|\\n)+', r' ', x)
Out[15]: 'hello world blue'
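Note that re.sub('\n', ...) only matches a real newline character, which would explain why the attempts in the question appear to do nothing on input that contains the two characters \ and n. Putting the pieces together, here is a sketch that collapses both kinds, trims the ends, and removes the space left before a comma. The function name normalize is made up, and the comma step is an assumption based on the expected output shown in the question:

```python
import re

def normalize(text):
    # Collapse runs of whitespace and literal backslash-n pairs into one space.
    collapsed = re.sub(r'(?:\\n|\s)+', ' ', text).strip()
    # Assumption: a space before a comma is unwanted, as in the expected output.
    return re.sub(r'\s+,', ',', collapsed)

raw = '\\n The people who steal our cards already know all of this...\\n \\n , \\n I am sure.\\n \\n'
print(normalize(raw))
# The people who steal our cards already know all of this..., I am sure.
```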

How do I replace a word in a string in python?

Let us say I have a string
c = "a string is like this and roberta a a thanks"
I want the output to be as
' string is like this and roberta thanks'
This is what I am trying
c.replace('a', ' ')
' string is like this nd robert thnks'
But this replaces each 'a' in the string
So I tried this
c.replace(' a ', ' ')
'a string is like this and roberta thanks'
But this misses the 'a' at the start of the string.
How do I do this?
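This looks like a job for re:
import re
txt = "a string is like this and roberta a a thanks"
# Repeat until no standalone "a" remains; adjacent ones share whitespace, so one pass can miss some
while re.subn(r'(\s+a\s+|^a\s+)', ' ', txt)[1] != 0:
    txt = re.subn(r'(\s+a\s+|^a\s+)', ' ', txt)[0]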
I figured it out myself.
c = "a string is like this and roberta a a thanks"
import re
re.sub(r'\ba\b', ' ', c)
' string is like this and roberta thanks'
Here you go myself! Enjoy!
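Replacing the word with a space leaves extra spaces behind. A variant sketch that removes the word together with the whitespace that follows it, then trims the ends (so, unlike the output above, the result has no leading space):

```python
import re

c = "a string is like this and roberta a a thanks"
# \b keeps "and", "roberta" and "thanks" intact; \s* eats the gap after each removed "a".
cleaned = re.sub(r'\ba\b\s*', '', c).strip()
print(cleaned)
# string is like this and roberta thanks
```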

Non-consuming regular expression split in Python

How can a string be split on a separator expression while leaving that separator on the preceding string?
>>> text = "This is an example. Is it made up of more than once sentence? Yes, it is."
>>> re.split(r"[.?!] ", text)
['This is an example', 'Is it made up of more than once sentence', 'Yes, it is.']
I would like the result to be:
['This is an example.', 'Is it made up of more than once sentence?', 'Yes, it is.']
So far I have only tried a lookahead assertion but this fails to split at all.
>>> re.split(r"(?<=[.?!]) ", text)
['This is an example.', 'Is it made up of more than once sentence?', 'Yes, it is.']
The crucial thing is the use of a look-behind assertion with ?<=.
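import re
text = "This is an example.A particular case.Made up of more "\
       "than once sentence?Yes, it is.But no blank !!!That's"\
       " a problem ????Yes.I think so! :)"
# Lookbehind split: only splits where a terminator is followed by a space
for x in re.split(r"(?<=[.?!]) ", text):
    print(repr(x))
print('\n')
# findall: each match ends at the first terminator, with or without a following space
for x in re.findall(r"[^.?!]*[.?!]|[^.?!]+(?=\Z)", text):
    print(repr(x))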
Result:
"This is an example.A particular case.Made up of more than once sentence?Yes, it is.But no blank !!!That's a problem ????Yes.I think so!"
':)'
'This is an example.'
'A particular case.'
'Made up of more than once sentence?'
'Yes, it is.'
'But no blank !'
'!'
'!'
"That's a problem ?"
'?'
'?'
'?'
'Yes.'
'I think so!'
' :)'
EDIT
Also
import re
text = "! This is an example.A particular case.Made up of more "\
"than once sentence?Yes, it is.But no blank !!!That's"\
" a problem ????Yes.I think so! :)"
res = re.split('([.?!])', text)
# Re-attach each captured terminator to the chunk that precedes it
print([''.join(res[i:i+2]) for i in range(0, len(res), 2)])
gives
['!', ' This is an example.', 'A particular case.', 'Made up of more than once sentence?', 'Yes, it is.', 'But no blank !', '!', '!', "That's a problem ?", '?', '?', '?', 'Yes.', 'I think so!', ' :)']
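If runs of terminators such as !!! or ???? should stay attached to their sentence rather than become separate items, the findall pattern can be adjusted. A sketch in Python 3, assuming the text does not start with a terminator (a leading one would simply be skipped by this pattern):

```python
import re

text = ("This is an example.A particular case.But no blank !!!"
        "That's a problem ????Yes.I think so! :)")
# One or more non-terminators, followed by any run of terminators.
sentences = re.findall(r'[^.?!]+[.?!]*', text)
for sentence in sentences:
    print(repr(sentence))
```

The trailing ' :)' still comes back as its own item because it contains no terminator.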
