Splitting a sentence by ending characters - python

A recent project has me needing to split incoming phrases (as strings) into their component sentences. For instance, this string:
"Your mother was a hamster, and your father smelt of elderberries! Now go away, or I shall taunt you a second time. You know what, never mind. This entire sentence is far too silly. Wouldn't you agree? I think it is."
Would need to be turned into a list composed of the following elements:
["Your mother was a hamster, and your father smelt of elderberries",
"Now go away, or I shall taunt you a second time",
"You know what, never mind",
"This entire sentence is far too silly",
"Wouldn't you agree",
"I think it is"]
For the purposes of this function, a "sentence" is a string terminated by !, ?, or . Note that punctuation should be removed from the output as shown above.
I've got a working version, but it's quite ugly, leaves leading and trailing spaces, and I can't help but think there's a better way:
from functools import reduce
def split_sentences(st):
    if type(st) is not str:
        raise TypeError("Cannot split non-strings")
    sl = st.split('.')
    sl = [s.split('?') for s in sl]
    sl = reduce(lambda x, y: x+y, sl) #Flatten the list
    sl = [s.split('!') for s in sl]
    return reduce(lambda x, y: x+y, sl)

Use re.split instead to specify a regular expression matching any sentence-ending character (and any following whitespace).
import re
def split_sentences(st):
    sentences = re.split(r'[.?!]\s*', st)
    # A trailing terminator leaves an empty last element; drop it
    if sentences[-1]:
        return sentences
    else:
        return sentences[:-1]
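As a quick check of the re.split approach, here is a minimal sketch run against part of the sample from the question. The name split_sentences_filtered is hypothetical, and it differs slightly from the function above: it drops every empty piece, not just the last one, which is an assumption about how repeated punctuation such as "..." should be treated.

```python
import re

def split_sentences_filtered(st):
    """Split on ., ? or ! (plus trailing whitespace) and drop empty pieces."""
    pieces = re.split(r'[.?!]\s*', st)
    # Filtering all empties also covers runs like "..." (an assumption,
    # not something the question specifies).
    return [p for p in pieces if p]

sample = ("Your mother was a hamster, and your father smelt of elderberries! "
          "Now go away, or I shall taunt you a second time. "
          "Wouldn't you agree? I think it is.")
print(split_sentences_filtered(sample))
```

The filter keeps the function a single expression; if empty sentences in the middle of the input are meaningful to you, keep the original version that only trims the final element.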

You can also do this without regexes:
result = [s.strip() for s in String.replace('!', '.').replace('?', '.').split('.')]
Or, you could've written a bleeding-edge algorithm that doesn't copy data around so much:
String = list(String)
for i in range(len(String)):
    if (String[i] == '?') or (String[i] == '!'):
        String[i] = '.'
# A list has no split(), so join the characters back into a string first
String = [s.strip() for s in ''.join(String).split('.')]

import re
st1 = " Another example!! Let me contribute 0.50 cents here?? \
How about pointer '.' character inside the sentence? \
Uni Mechanical Pencil Kurutoga, Blue, 0.3mm (M310121P.33). \
Maybe there could be a multipoint delimeter?.. Just maybe... "
st2 = "One word"
def split_sentences(st):
    st = st.strip() + '. '
    sentences = re.split(r'[.?!][.?!\s]+', st)
    return sentences[:-1]
print(split_sentences(st1))
print(split_sentences(st2))

You can use regex split to split them at specific special characters.
import re
str = "Your mother was a hamster, and your father smelt of elderberries! Now go away, or I shall taunt you a second time. You know what, never mind. This entire sentence is far too silly. Wouldn't you agree? I think it is."
re.compile(r'[?.!]\s+').split(str)

Related

How to extract specific strings using Python Regex

I have some very challenging strings that I have been struggling with.
For example,
str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
Each string starts with an integer percentage, optionally followed by "for", and then the name of a pokemon. A comma (,) or an & sign may follow, then another integer percentage and finally another pokemon name. (All names start with a capital letter.) I want to extract the two pokemon, for example:
['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']
I could create a list of all pokemon and then use the in operator, but that is not the best way (in case they add more pokemon). Is it possible to extract them using a regex?
Thanks in advance!
EDIT
As requested, I am adding my code,
str_list = [str1, str2, str3, str4, str5]
for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
    print(temp_list)
I know it is not a regular expression. The expression I tried is \d+ to
extract the integer at the start... but I have no idea how to continue from there.
EDIT2
@b_c has a good edge case, so I am adding it here
edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'
result
['Pikachu', 'Pika Pika Pikachu']
Hopefully I didn't over engineer this, but I wanted to cover the edge cases of the slightly-more-complicated named pokemon, such as "Mr. Mime", "Farfetch'd", and/or "Nidoran♂" (only looking at the first 151).
The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).
For a general summary, I'm looking for:
1+ digits followed by a %
A space or the word "for" at least once
(To start the capture) A starting capital letter
At least one of (ending the capture group):
    a word character, a period, the male/female symbols, or an apostrophe
    Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
    OR a space, but only if followed by a capital letter
A comma, space, or ampersand, any number of times
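The breakdown above can also be written into the pattern itself with re.VERBOSE. This is only a sketch: it drops the outer repeated group (which re.findall does not need), and literal spaces are written as [ ] because verbose mode ignores bare whitespace. The name POKEMON is hypothetical.

```python
import re

POKEMON = re.compile(r"""
    \d+%                  # 1+ digits followed by a percent sign
    (?:[ ]|for)+          # a space or the word for, at least once
    (                     # start of the captured name
      [A-Z]               # a starting capital letter
      (?:
          [\w.♀♂']        # word character, period, gender symbol, apostrophe
        | [ ](?=[A-Z])    # or a space, but only if a capital letter follows
      )+
    )                     # end of the captured name
    [, &]*                # commas, spaces, or ampersands, any number of times
""", re.VERBOSE)

print(POKEMON.findall('100% for Pikachu, 29% Pika Pika Pikachu'))
print(POKEMON.findall('95% for Pikachu, 92% for Mr. Mime'))
```

Because each string is searched on its own here, findall already returns one list per input and no pairing step is needed.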
Unless it's changed, Python's builtin re module doesn't support repeated capture groups (which I believe I did correctly), so I just used re.findall and organized them into pairs (I replaced a couple names from your input with the complicated ones):
import re
str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"
# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]
# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])
for pair in pairs:
    print(pair)
This then prints out:
('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)
Since there seems to be no other upper-case letter in your strings you can simply use [A-Z]\w+ as regex.
See regex101
Code:
import re
str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
str_list = [str1, str2, str3, str4, str5]
regex = re.compile(r'[A-Z]\w+')  # raw string so \w is not treated as a string escape
pokemon_list = []
for x in str_list:
    pokemon_list.append(re.findall(regex, x))
print(pokemon_list)
Output:
[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]
An alternate method if you don't want to use regex and you don't want to rely on capitalization:
def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    for word in wordList:
        # Skip words containing any special character, and the word "for"
        if not set('[~!@#$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
            pokeList.append(word.replace(',', ''))
    return pokeList
This won't add words with special chars. It also won't add words that are for. Then it removes commas from the found words.
A print of str3 returns ['Diglett', 'Dugtrio']
EDIT
In light of the fact that there are apparently Pokemon with two words and special chars, I made this slightly more convoluted version of the above code
def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    prevWasWord = False
    for word in wordList:
        if not set('%&').intersection(word) and 'for' not in word:
            clnWord = word.replace(',', '')
            if prevWasWord is True: # 2 poke in a row means same poke
                pokeList[-1] = pokeList[-1] + ' ' + clnWord
            else:
                pokeList.append(clnWord)
            prevWasWord = True
        else:
            prevWasWord = False
    return pokeList
If there's no "three word" pokemon, and the rules OP set remain constant, this should always work. 2 poke matches in a row adds to the previous pokemon.
So printing a string of '30% for Mr. Mime & 20% for Type: Null' gets
['Mr. Mime', 'Type: Null']
Use a positive lookbehind, this will work regardless of capitalization.
(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+
EDIT: Changed it to work in Python.
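A small sketch of how the lookbehind pattern behaves on two of the question's strings. The pattern is copied from the answer; the name LOOKBEHIND is hypothetical. Note that it only finds names preceded by "% for ", so entries written without "for" are not matched.

```python
import re

# Two alternatives are needed because Python lookbehinds must be fixed-width:
# one for two digits before the %, one for a single digit.
LOOKBEHIND = re.compile(r"(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+")

for s in ('95% for Pikachu, 92% for Sandshrew',
          '70% for Paras & 100% Arcanine'):
    print(LOOKBEHIND.findall(s))
```

The second string yields only Paras, because "100% Arcanine" has no "for" for the lookbehind to anchor on.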

Convert negation text to text in python

I have a lot of data which contain the word "not".
For example : "not good".
I want to convert "not good" to "notgood" (without a space).
How can I convert all of the "not"s in the data, erasing the space after "not"?
For example in the below list:
1. I am not beautiful → I am notbeautiful
2. She is not good to be a teacher → She is notgood to be a teacher
3. If I choose A, I think it's not bad decision → If I choose A, I think it's notbad decision
A simple way to do this would be to replace "not " (with its trailing space) with "not", removing the space.
text = "I am not beautiful"
new_text = text.replace("not ", "not")
print(new_text)
Will output:
I am notbeautiful
I suggest that you use a regular expression to match the word with a boundary, in order to avoid matching phrases like "tying the knot with someone":
import re
output = re.sub(r'(?<=\bnot)\s+(?=\w+)', '', text)
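To make the difference between the two approaches concrete, here is a minimal sketch comparing plain str.replace with the word-boundary regex. The names NOT_GAP and join_not are hypothetical; the pattern is the one from the answer, applied with re.sub.

```python
import re

# \b inside the lookbehind requires "not" to start at a word boundary,
# so the "not" inside "knot" is left alone.
NOT_GAP = re.compile(r'(?<=\bnot)\s+(?=\w+)')

def join_not(text):
    """Remove the whitespace between the word "not" and the word after it."""
    return NOT_GAP.sub('', text)

print(join_not("She is not good to be a teacher"))
print(join_not("tying the knot with someone"))
# The plain replace also glues "knot" to the next word:
print("tying the knot with someone".replace("not ", "not"))
```

The pattern is case-sensitive, so a sentence-initial "Not" is left untouched unless re.IGNORECASE is added.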
OR:
s = "I am not beautiful"
news=''.join(i+' ' if i != 'not' else i for i in s.split())
print(news)
Output:
I am notbeautiful
If you care about the space at the end do:
print(news.rstrip())

Python splitting string by parentheses

I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.
If the user types this:
new test (test1 test2 test3) test "test5 test6"
I would like it to look like the output to the variable like this:
["new", "test", "test1 test2 test3", "test", "test5 test6"]
In other words, if it is one word separated by a space then split it from the next word; if it is in parentheses then split off the whole group of words in the parentheses and remove the parentheses. Same goes for the quotation marks.
I currently am using this code which does not meet the above standard (From the answers in the link above):
>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]',strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
This works well but there is a problem, if you have this:
strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"
It combines the Hello and Test as one split instead of two.
It also doesn't allow the use of parentheses and quotation marks splitting at the same time.
The answer was simply:
re.findall(r'\[[^\]]*\]|\([^\)]*\)|"[^"]*"|\S+', strs)
This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:
from pyparsing import *
import string, re
RawWord = Word(re.sub('[()" ]', '', string.printable))
Token = Forward()
Token << ( RawWord |
           Group('"' + OneOrMore(RawWord) + '"') |
           Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)
Phrase.parseString(s, parseAll=True)
This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
I realize you've long since solved your problem, but this is one of the highest google-ranked pages for problems like this, and pyparsing is an under-known library.
Your problem is not well defined.
Your description of the rules is
In other words if it is one word seperated by a space then split it
from the next word, if it is in parentheses then split the whole group
of words in the parentheses and remove them. Same goes for the commas.
I guess with commas you mean inverted commas == quotation marks.
Then with this
strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
you should get that
["Hello (Test1 test2) (Hello1 hello2) other_stuff"]
since everything is surrounded by inverted commas. Most probably, you want to ignore the outermost inverted commas.
I propose this, although it is a bit ugly (updated to Python 3 syntax)
import re, itertools
strs = input("enter a string list ")
print([y for y in itertools.chain(*[re.split(r'"(.*)"', x)
                                    for x in re.split(r'\((.*)\)', strs)])
       if y != ''])
gets
>>>
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
This is doing what you expect
import re, itertools
strs = input("enter a string list ")
res1 = [y for y in itertools.chain(*[re.split(r'"(.*)"', x)
                                     for x in re.split(r'\((.*)\)', strs)])
        if y != '']
set1 = re.search(r'"(.*)"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()
# Keep quoted/parenthesized groups whole, split everything else on whitespace
print([k for k in res1 if k in set1 or k in set2]
      + list(itertools.chain(*[k.split() for k in res1
                               if k not in set1 and k not in set2])))
For python 3.6 - 3.8
I had a similar question, however I like none of those answers, maybe because most of them are from 2013. So I elaborated my own solution.
import re
regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)
Here you are looking for three different groups:
Something that is included inside (); the parentheses must be escaped with backslashes
Something that is included inside ""
Just words
The use of ? makes your search lazy instead of greedy
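The findall patterns in this thread keep the surrounding parentheses and quotation marks in the results, while the question asks for them to be removed. One way to drop them, sketched here under the assumption that groups are never nested, is to put a capture group inside each delimiter pair and keep whichever group matched. The names TOKEN and split_groups are hypothetical.

```python
import re

# One alternative per token type; each captures only the inner text.
TOKEN = re.compile(r'\(([^)]*)\)|"([^"]*)"|(\S+)')

def split_groups(text):
    """Return bare words plus the contents of (...) and "..." groups."""
    # findall yields a 3-tuple per match; exactly one slot is non-empty
    # (an empty "()" or "" group comes back as an empty string).
    return [paren or quoted or word
            for paren, quoted, word in TOKEN.findall(text)]

print(split_groups('new test (test1 test2 test3) test "test5 test6"'))
```

For nested parentheses a parser such as pyparsing, as suggested earlier in the thread, is a better fit than a single regex.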

operating over strings, python

How can I define a function that takes a string (sentence) and inserts an extra space after a period if the period is directly followed by a letter?
sent = "This is a test.Start testing!"
def normal(sent):
list_of_words = sent.split()
...
This should print out
"This is a test. Start testing!"
I suppose I should use split() to break a string into a list, but what next?
P.S. The solution has to be as simple as possible.
Use re.sub. Your regular expression will match a period (\.) followed by a letter ([a-zA-Z]). Your replacement string will contain a reference to the first group (\1), which was the letter matched in the regular expression.
>>> import re
>>> re.sub(r'\.([a-zA-Z])', r'. \1', 'This is a test.This is a test. 4.5 balloons.')
'This is a test. This is a test. 4.5 balloons.'
Note the choice of [a-zA-Z] for the regular expression. This matches just letters. We do not use \w because it would insert spaces into a decimal number.
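Wrapped into the function the question asks for, the re.sub call might look like the sketch below. The name normal comes from the question; the body is just the expression from this answer.

```python
import re

def normal(sent):
    """Insert a space after any period that is directly followed by a letter."""
    # \1 puts back the letter that was captured after the period.
    return re.sub(r'\.([a-zA-Z])', r'. \1', sent)

print(normal("This is a test.Start testing!"))
print(normal("4.5 balloons."))
```

Because the character class only covers ASCII letters, decimals such as 4.5 are left alone; accented letters after a period would also be left alone, which may or may not be what you want.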
One-liner non-regex answer:
def normal(sent):
    # s[:1] avoids an IndexError on the empty piece left by a trailing period
    return ".".join(" " + s if i > 0 and s[:1].isalpha() else s for i, s in enumerate(sent.split(".")))
Here is a multi-line version using a similar approach. You may find it more readable.
def normal(sent):
    sent = sent.split(".")
    result = sent[:1]
    for item in sent[1:]:
        if item[:1].isalpha():  # item[:1] is safe when item is empty
            item = " " + item
        result.append(item)
    return ".".join(result)
Using a regex is probably the better way, though.
Brute force without any checks:
>>> sent = "This is a test.Start testing!"
>>> k = sent.split('.')
>>> ". ".join(k)
'This is a test. Start testing!'
>>>
For removing spaces:
>>> sent = "This is a test. Start testing!"
>>> k = sent.split('.')
>>> l = [x.lstrip(' ') for x in k]
>>> ". ".join(l)
'This is a test. Start testing!'
>>>
Another regex-based solution, might be a tiny bit faster than Steven's (only one pattern match, and a blacklist instead of a whitelist):
import re
re.sub(r'\.([^\s])', r'. \1', some_string)
Improving pyfunc's answer:
sent="This is a test.Start testing!"
k=sent.split('.')
k='. '.join(k)
k.replace('.  ','. ')  # collapse the double space left where a space already followed the period
'This is a test. Start testing!'

Python Regular expression must strip whitespace except between quotes

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace.
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        # Even-indexed items lie outside the quotes
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
print(stripwhite('This is a string with some "text in quotes."'))
Here is a one-liner version, based on @kindall's idea, yet it does not use regex at all! First split on ", then split() every other item and re-join them, which takes care of the whitespace:
stripWS = lambda txt:'"'.join( it if i%2 else ''.join(it.split())
    for i,it in enumerate(txt.split('"')) )
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
import shlex
print(" ".join(shlex.split('Hello "world this is" a test')))
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here is a slightly longer version that checks for a quote without its pair. It only deals with one style of start and end delimiter (adaptable, for example, to start, end = '(', ')').
start, end = '"', '"'
for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''
    while start in test:
        clean, _, test = test.partition(start)
        clean = clean.replace(' ', '') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError('Missing end quote %s' % end)
        else:
            clean += inside + tag  # whitespace inside the quotes is kept
        result += clean
    result += test.replace(' ', '')
    print(result)
